Pyramid Spatial-Temporal Graph Transformer for Skeleton-Based Action Recognition
Abstract
1. Introduction
- A pyramid spatial-temporal graph transformer (PGT) backbone is proposed that uses dynamic attention to learn robust features from skeleton graphs.
- Two kinds of transformer blocks are introduced to capture long-range spatial-temporal correlations in human actions, and the separate convolution in the graph embedding allows the model to go deeper while retaining high-level semantics; a minimal sketch of the attention idea follows this list.
- Extensive experiments are performed, and the ablation study demonstrates the effectiveness of the backbone. Comparisons show performance competitive with state-of-the-art methods and indicate the backbone's potential.
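For intuition, the snippet below is a minimal, hypothetical PyTorch sketch of the dynamic-attention idea applied to skeleton data: self-attention over the V joints of an (N, C, T, V) tensor, so joint-to-joint relations are learned from the data rather than fixed by the skeleton topology. The class name and layer choices are ours, not the paper's exact block design; the ST block of Section 3.2 can be thought of as the same pattern applied to all T×V space-time tokens.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Hypothetical sketch (not the paper's exact design): multi-head
    self-attention over the V joints of an (N, C, T, V) skeleton tensor,
    so joint-to-joint relations are learned dynamically instead of being
    fixed by the skeleton topology."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * 4),
            nn.GELU(),
            nn.Linear(channels * 4, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, v = x.shape
        # Fold time into the batch so attention runs over the V joints.
        tokens = x.permute(0, 2, 3, 1).reshape(n * t, v, c)
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h)
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.reshape(n, t, v, c).permute(0, 3, 1, 2)

# Example: 2 clips, 64 channels, 16 frames, 25 joints (NTU-style skeleton).
x = torch.randn(2, 64, 16, 25)
print(JointAttentionBlock(64)(x).shape)  # torch.Size([2, 64, 16, 25])
```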
2. Related Work
2.1. Skeleton-Based Action Recognition
2.2. Vision Transformer
3. Method
3.1. Graph Embedding
3.2. Transformer Block
3.3. Pyramid Architecture
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Ablation Study
4.3.1. The Effect of the ST-Joint Transformer Block
4.3.2. The Effect of Four-Stream Fusion
4.3.3. The Effect of Separate Convolution in Graph Embedding
4.4. Comparisons with Other Approaches
4.4.1. Experiments on the NTU-RGBD 60 Dataset
4.4.2. Experiments on the NTU-RGBD 120 Dataset
4.4.3. Experiments on the Northwestern-UCLA Dataset
4.4.4. Model Complexity
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Vemulapalli, R.; Arrate, F.; Chellappa, R. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, 23–28 June 2014; pp. 588–595.
- Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362.
- Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops, ICME Workshops, Hong Kong, China, 10–14 July 2017; pp. 601–604.
- Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based action recognition with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops, ICME Workshops, Hong Kong, China, 10–14 July 2017; pp. 597–600.
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1963–1978.
- Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In Proceedings of the Computer Vision—ECCV 2016—14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Lecture Notes in Computer Science, Part III; Volume 9907, pp. 816–833.
- Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4263–4270.
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 2136–2145.
- Zheng, W.; Li, L.; Zhang, Z.; Huang, Y.; Wang, L. Relational Network for Skeleton-Based Action Recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo, ICME 2019, Shanghai, China, 8–12 July 2019; pp. 826–831.
- Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 7444–7452.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 12026–12035.
- Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Shift Graph Convolutional Network. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 180–189.
- Song, Y.; Zhang, Z.; Shan, C.; Wang, L. Richly Activated Graph Convolutional Network for Robust Skeleton-Based Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 1915–1925.
- Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Underst. 2021, 208, 103219.
- Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, online ahead of print.
- Hussein, M.E.; Torki, M.; Gowayyed, M.A.; El-Saban, M. Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, IJCAI 2013, Beijing, China, 3–9 August 2013; pp. 2466–2472.
- Kim, T.S.; Reiter, A. Interpretable 3D Human Action Analysis with Temporal Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 1623–1631.
- Banerjee, A.; Singh, P.K.; Sarkar, R. Fuzzy Integral-Based CNN Classifier Fusion for 3D Skeleton Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2206–2216.
- Wang, H.; Yu, B.; Xia, K.; Li, J.; Zuo, X. Skeleton edge motion networks for human action recognition. Neurocomputing 2021, 423, 1–12.
- Diao, X.; Li, X.; Huang, C. Multi-term attention networks for skeleton-based action recognition. Appl. Sci. 2020, 10, 5326.
- Jiang, X.; Xu, K.; Sun, T. Action Recognition Scheme Based on Skeleton Representation With DS-LSTM Network. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2129–2140.
- Gao, Y.; Li, C.; Li, S.; Cai, X.; Ye, M.; Yuan, H. A Deep Attention Model for Action Recognition from Skeleton Data. Appl. Sci. 2022, 12, 2006.
- Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 140–149.
- Huang, Z.; Shen, X.; Tian, X.; Li, H.; Huang, J.; Hua, X. Spatio-Temporal Inception Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the MM '20: The 28th ACM International Conference on Multimedia, Virtual Event/Seattle, WA, USA, 12–16 October 2020; pp. 2122–2130.
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 3595–3603.
- Chen, S.; Xu, K.; Mi, Z.; Jiang, X.; Sun, T. Dual-domain graph convolutional networks for skeleton-based action recognition. Mach. Learn. 2022, 111, 2381–2406.
- Zheng, Z.; Wang, Y.; Zhang, X.; Wang, J. Multi-Scale Adaptive Aggregate Graph Convolutional Network for Skeleton-Based Action Recognition. Appl. Sci. 2022, 12, 1402.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks. IEEE Trans. Image Process. 2020, 29, 9532–9545.
- Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 13359–13368.
- Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling GCN with DropGraph Module for Skeleton-Based Action Recognition. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Lecture Notes in Computer Science, Part XXIV; Volume 12369, pp. 536–553.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021.
- Srinivas, A.; Lin, T.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for Visual Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual Event, 19–25 June 2021; pp. 16519–16529.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Lecture Notes in Computer Science, Part I; Volume 12346, pp. 213–229.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event, 18–24 July 2021; Proceedings of Machine Learning Research, Volume 139, pp. 10347–10357.
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 558–567.
- Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 548–558.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Zhang, Y.; Wu, B.; Li, W.; Duan, L.; Gan, C. STST: Spatial-Temporal Specialized Transformer for Skeleton-based Action Recognition. In Proceedings of the 29th ACM International Conference on Multimedia, MM '21, Virtual Event, China, 20–24 October 2021; pp. 3229–3237.
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701.
- Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S. Cross-View Action Modeling, Learning, and Recognition. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, 23–28 June 2014; pp. 2649–2656.
- Zhang, P.; Lan, C.; Zeng, W.; Xing, J.; Xue, J.; Zheng, N. Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; pp. 1109–1118.
- Cho, S.; Maqbool, M.H.; Liu, F.; Foroosh, H. Self-Attention Network for Skeleton-based Human Action Recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, 1–5 March 2020; pp. 624–633.
- Liu, X.; Li, Y.; Xia, R. Adaptive multi-view graph convolutional networks for skeleton-based action recognition. Neurocomputing 2021, 444, 288–300.
- Caetano, C.; de Souza, J.S.; Brémond, F.; dos Santos, J.A.; Schwartz, W.R. SkeleMotion: A New Representation of Skeleton Joint Sequences based on Motion Information for 3D Action Recognition. In Proceedings of the 16th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2019, Taipei, Taiwan, 18–21 September 2019; pp. 1–8.
- Caetano, C.; Brémond, F.; Schwartz, W.R. Skeleton image representation for 3D action recognition based on tree structure and reference joints. In Proceedings of the 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Rio de Janeiro, Brazil, 28–30 October 2019; pp. 16–23.
- Das, S.; Dai, R.; Koperski, M.; Minciullo, L.; Garattoni, L.; Bremond, F.; Francesca, G. Toyota Smarthome: Real-world activities of daily living. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 833–842.
- Das, S.; Sharma, S.; Dai, R.; Bremond, F.; Thonnat, M. VPN: Learning video-pose embedding for activities of daily living. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 72–90.
- Lee, I.; Kim, D.; Kang, S.; Lee, S. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1012–1020.
Stage | Block | Channel | Shape |
---|---|---|---|
0 | Graph Embedding | 3 | |
| ST Transformer Block | | |
1 | Graph Embedding | 64 | |
| ST Transformer Block | | |
2 | Graph Embedding | 128 | |
| Joint Transformer Block | | |
3 | Graph Embedding | 256 | |
| Joint Transformer Block | | |
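To make the stage layout concrete, here is an illustrative PyTorch wiring of the four-stage pyramid, with stand-in modules (`GraphEmbedding`, `BlockStub`) in place of the paper's graph embedding and transformer blocks. The widths 3, 64, 128, 256 are read from the Channel column above; the per-stage temporal downsampling and the final width staying at 256 are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class GraphEmbedding(nn.Module):
    """Stand-in embedding: widens channels and halves the temporal
    length with a strided temporal convolution."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=(3, 1),
                              stride=(2, 1), padding=(1, 0))
    def forward(self, x):
        return self.conv(x)

class BlockStub(nn.Module):
    """Placeholder for the ST/Joint transformer blocks of Section 3.2."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)
    def forward(self, x):
        return x + self.mix(x)

class PGTBackbone(nn.Module):
    """Four-stage pyramid wiring following the table: ST blocks in the
    two early stages, Joint blocks in the two later ones."""
    def __init__(self, num_classes: int = 60):
        super().__init__()
        widths = [(3, 64), (64, 128), (128, 256), (256, 256)]
        self.stages = nn.Sequential(*[
            nn.Sequential(GraphEmbedding(ci, co), BlockStub(co))
            for ci, co in widths
        ])
        self.head = nn.Linear(widths[-1][1], num_classes)

    def forward(self, x):
        feat = self.stages(x)                # (N, 256, T/16, V)
        return self.head(feat.mean(dim=(2, 3)))  # global average pooling

# One batch of 2 clips: 3-channel coordinates, 64 frames, 25 joints.
logits = PGTBackbone()(torch.randn(2, 3, 64, 25))
print(logits.shape)  # torch.Size([2, 60])
```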
Transformer Block | Acc. (%) | FLOPs (G) | Param. (M) |
---|---|---|---|
4 Joint | 88.5 | 3.48 | 11.36 |
1 ST + 3 Joint | 89.3 | 3.75 | 11.87 |
2 ST + 2 Joint (PGT) | 90.9 | 4.01 | 12.36 |
3 ST + 1 Joint | 90.6 | 4.37 | 12.87 |
4 ST | 90.7 | 4.64 | 13.36 |
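The FLOPs trend in this table follows from how the two block types tokenize the sequence: a Joint block attends within each frame (T sequences of length V), while an ST block attends across all space-time positions (one sequence of length T·V), and attention cost is quadratic in sequence length. A small sketch of the two tokenizations (the variable names are ours):

```python
import torch
import torch.nn as nn

# Joint block: attention within each frame -> T sequences of length V
# ST block:    attention across space-time -> 1 sequence of length T*V
# Attention cost is quadratic in sequence length, matching the FLOPs trend.
n, c, t, v = 2, 64, 16, 25
x = torch.randn(n, c, t, v)
joint_tokens = x.permute(0, 2, 3, 1).reshape(n * t, v, c)  # (32, 25, 64)
st_tokens = x.permute(0, 2, 3, 1).reshape(n, t * v, c)     # (2, 400, 64)
attn = nn.MultiheadAttention(c, num_heads=4, batch_first=True)
out_joint, _ = attn(joint_tokens, joint_tokens, joint_tokens)
out_st, _ = attn(st_tokens, st_tokens, st_tokens)
print(out_joint.shape, out_st.shape)
```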
Input Branch | Acc. (%) | FLOPs (G) | Param. (M) |
---|---|---|---|
Joint | 87.3 | 1.00 | 3.09 |
Bone | 87.4 | 1.00 | 3.09 |
Joint Motion | 85.3 | 1.00 | 3.09 |
Bone Motion | 84.8 | 1.00 | 3.09 |
4s Fusion (PGT) | 90.9 | 4.01 | 12.36 |
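For context, the four branches are standard derived modalities: a bone vector is the difference between a joint and its parent joint along the skeleton, and the motion branches are frame-to-frame differences; the fused result combines the class scores of the four independently trained streams. A sketch under those assumptions (the parent list and zero-padding convention below are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

# Illustrative parent indices for a 25-joint NTU-style skeleton; the exact
# edge list is an assumption and should follow the dataset definition.
PARENTS = [1, 20, 20, 2, 20, 4, 5, 6, 20, 8, 9, 10,
           0, 12, 13, 14, 0, 16, 17, 18, 1, 7, 7, 11, 11]

def four_streams(joint):
    """Derive the four branches from raw joints of shape (N, 3, T, V)."""
    bone = joint - joint[..., PARENTS]          # joint minus its parent
    joint_motion = torch.zeros_like(joint)      # temporal differences,
    joint_motion[..., :-1, :] = joint[..., 1:, :] - joint[..., :-1, :]
    bone_motion = torch.zeros_like(bone)        # zero-padded at the end
    bone_motion[..., :-1, :] = bone[..., 1:, :] - bone[..., :-1, :]
    return {"joint": joint, "bone": bone,
            "joint_motion": joint_motion, "bone_motion": bone_motion}

def fuse_scores(per_stream_logits):
    """Late fusion: average the softmax scores of the four streams."""
    probs = [F.softmax(l, dim=-1) for l in per_stream_logits]
    return torch.stack(probs).mean(dim=0)

streams = four_streams(torch.randn(2, 3, 64, 25))
print({k: tuple(v.shape) for k, v in streams.items()})
```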
Graph Embedding | Acc. (%) | FLOPs (G) | Param. (M) |
---|---|---|---|
Basic | 89.1 | 4.32 | 13.12 |
Separate (PGT) | 90.9 | 4.01 | 12.36 |
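The gains in this table are consistent with a separable factorization. As a rough account: a dense (3×1) temporal convolution mapping c_in to c_out has 3·c_in·c_out weights, while a per-channel (depthwise) temporal convolution followed by a 1×1 pointwise convolution has only 3·c_in + c_in·c_out. The sketch below assumes "separate convolution" follows this common pattern; the paper's exact design may differ.

```python
import torch
import torch.nn as nn

class SeparateGraphEmbedding(nn.Module):
    """Sketch of a separable embedding, assuming the usual depthwise +
    pointwise split: a per-channel temporal convolution followed by a
    1x1 channel-mixing convolution. Versus one dense (3x1) convolution
    (3*c_in*c_out weights), this costs 3*c_in + c_in*c_out weights."""

    def __init__(self, c_in: int, c_out: int, t_stride: int = 2):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=(3, 1),
                                   stride=(t_stride, 1), padding=(1, 0),
                                   groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.norm = nn.BatchNorm2d(c_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C_in, T, V) -> (N, C_out, T // t_stride, V)
        return self.norm(self.pointwise(self.depthwise(x)))

x = torch.randn(2, 64, 32, 25)
print(SeparateGraphEmbedding(64, 128)(x).shape)  # (2, 128, 16, 25)
```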
Methods | CS (%) | CV (%) | Year |
---|---|---|---|
Lie Group [1] | 50.1 | 82.8 | 2014 |
STA-LSTM [7] | 73.4 | 81.2 | 2017 |
DS-LSTM [21] | 75.5 | 84.2 | 2020 |
Fuzzy-CNN [18] | 84.2 | 89.7 | 2021 |
SEMN [19] | 80.2 | 85.8 | 2021 |
ST-GCN [10] | 81.5 | 88.3 | 2018 |
AS-GCN [25] | 86.8 | 94.2 | 2019 |
2s-AGCN [11] | 88.5 | 95.1 | 2019 |
TS-SAN [44] | 87.2 | 92.7 | 2020 |
MS-AAGCN [28] | 90.0 | 96.2 | 2020 |
4s Shift-GCN [12] | 90.7 | 96.5 | 2020 |
DC-GCN+ADG [30] | 90.8 | 96.6 | 2020 |
AMV-GCN [45] | 83.9 | 92.2 | 2021 |
3s RA-GCN [13] | 87.3 | 93.6 | 2021 |
ST-TR-AGCN [14] | 89.2 | 95.8 | 2021 |
Efficient-GCN [15] | 91.7 | 95.7 | 2021 |
PGT (Ours) | 90.9 | 95.9 | 2022 |
Methods | X-Sub (%) | X-Set (%) | Year |
---|---|---|---|
SkeleMotion [46] | 67.7 | 66.9 | 2019 |
TSRJI [47] | 67.9 | 62.8 | 2019 |
Part-Aware LSTM [41] | 55.7 | 57.9 | 2020 |
SGN [43] | 79.2 | 81.5 | 2020 |
4s Shift-GCN [12] | 85.9 | 87.6 | 2020 |
DC-GCN+ADG [30] | 86.5 | 88.1 | 2020 |
Fuzzy-CNN [18] | 74.8 | 76.9 | 2021 |
AMV-GCN [45] | 76.7 | 79.0 | 2021 |
3s RA-GCN [13] | 81.1 | 82.7 | 2021 |
ST-TR-AGCN [14] | 82.7 | 85.0 | 2021 |
SEMN [19] | 84.2 | 85.5 | 2021 |
Efficient-GCN [15] | 88.3 | 89.1 | 2021 |
PGT (Ours) | 86.5 | 88.8 | 2022 |
Methods | Top-1 (%) | Year |
---|---|---|
Lie Group [1] | 74.2 | 2014 |
Ensemble TS-LSTM [50] | 89.2 | 2017 |
Separable STA [48] | 92.4 | 2019 |
SGN [43] | 92.5 | 2020 |
VPN [49] | 93.5 | 2020 |
4s Shift-GCN [12] | 94.6 | 2020 |
DC-GCN+ADG [30] | 95.3 | 2020 |
PGT (Ours) | 95.4 | 2022 |
Model | Acc. (%) | FLOPs (G) | Param. (M) |
---|---|---|---|
PGT | 90.9 | 4.01 | 12.36 |
PGT (Only Joint) | 87.3 | 1.00 | 4.09 |
ST-GCN [10] | 81.5 | 16.32 | 3.10 |
AS-GCN [25] | 86.8 | 26.76 | 9.50 |
3s RA-GCN [13] | 87.3 | 32.80 | 2.61 |
2s-AGCN [11] | 88.5 | 37.32 | 6.94 |
4s Shift-GCN [12] | 90.7 | 6.12 | 0.79 |
Efficient-GCN [15] | 91.7 | 15.24 | 2.03 |
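For reference, complexity figures like those above can be measured with a profiler. This sketch assumes the third-party thop package and reuses the PGTBackbone sketch from Section 3.3 above; note that thop reports multiply-accumulate counts, which are commonly quoted as FLOPs.

```python
import torch
from thop import profile  # third-party: pip install thop

model = PGTBackbone()            # sketch model defined earlier, an assumption
x = torch.randn(1, 3, 64, 25)    # one clip: 3 channels, 64 frames, 25 joints
macs, params = profile(model, inputs=(x,))
print(f"FLOPs: {macs / 1e9:.2f} G, Params: {params / 1e6:.2f} M")
```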
Citation: Chen, S.; Xu, K.; Jiang, X.; Sun, T. Pyramid Spatial-Temporal Graph Transformer for Skeleton-Based Action Recognition. Appl. Sci. 2022, 12, 9229. https://doi.org/10.3390/app12189229