Dynamic Graph Attention Network for Skeleton-Based Action Recognition
Abstract
1. Introduction
- We propose a Dynamic Graph Attention Network (DGAN) that couples dynamic node selection with a masked attention mechanism, flexibly modeling both local and global node interactions (the standard masked-attention form this builds on is sketched after this list);
- We introduce a novel node-partitioning strategy that lets the network efficiently extract relational dependencies from correlated nodes during attention computation;
- Experimental results demonstrate that the proposed approach consistently outperforms current state-of-the-art methods across multiple benchmark settings.
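Both of the first two contributions rest on masked scaled dot-product attention [28]. For reference, this is the standard form of the mechanism, not necessarily the paper's exact formulation: an additive mask suppresses non-selected node pairs before the softmax,

```latex
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0, & \text{node } j \text{ selected for node } i,\\
-\infty, & \text{otherwise,}
\end{cases}
```

so that attention weights on unselected joints are exactly zero after the softmax.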
2. Related Work
2.1. Skeleton-Based Action Recognition Using Graph Convolutional Networks
2.2. Skeleton-Based Action Recognition Using Transformer
3. Method
3.1. Architecture of DGAN
3.2. DGAN Module
Algorithm 1: PyTorch-like pseudocode of DGAN.
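Since the listing itself is not reproduced above, the following is a minimal PyTorch sketch of a module combining the two ideas named in the contributions: dynamic node selection (here, top-k affinity pruning) and masked attention. The class name `DGANBlock`, the time-pooled queries, and the choice of `k` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGANBlock(nn.Module):
    """Hypothetical sketch: masked attention over dynamically selected joints.

    x follows the common skeleton convention (N, C, T, V) for batch,
    channels, frames, and joints. All design choices here are assumptions
    for illustration, not the paper's implementation.
    """
    def __init__(self, in_channels, out_channels, num_joints=25, k=8):
        super().__init__()
        self.k = k                              # joints kept per query (assumed)
        self.q = nn.Conv2d(in_channels, out_channels, 1)
        self.kproj = nn.Conv2d(in_channels, out_channels, 1)
        self.v = nn.Conv2d(in_channels, out_channels, 1)
        self.scale = out_channels ** -0.5

    def forward(self, x):                       # x: (N, C, T, V)
        q = self.q(x).mean(dim=2)               # (N, C', V): pool over time
        kf = self.kproj(x).mean(dim=2)          # (N, C', V)
        v = self.v(x)                           # (N, C', T, V)

        # Pairwise joint affinities; keep only the top-k most correlated
        # joints per query ("dynamic node selection").
        attn = torch.einsum('ncv,ncw->nvw', q, kf) * self.scale  # (N, V, V)
        topk = attn.topk(self.k, dim=-1).indices
        mask = torch.full_like(attn, float('-inf')).scatter(-1, topk, 0.0)

        # Masked attention: non-selected joints get -inf before the softmax.
        attn = F.softmax(attn + mask, dim=-1)
        return torch.einsum('nvw,nctw->nctv', attn, v)
```

In the full network, a spatial block like this would be paired with a temporal module; the ablations in Section 4.4.1 compare TCN and MSTC in that role.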
3.3. Partition Strategy
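The partition itself is not reproduced in this extract. To make the idea concrete, here is a hedged sketch that groups the 25 NTU RGB+D joints into five conventional body parts and converts the grouping into an additive attention mask; the specific assignment and the helper name `partition_mask` are assumptions for illustration, not necessarily the proposed partition.

```python
import torch

# Hypothetical five-part grouping of the 25 Kinect-v2 joints used in
# NTU RGB+D (0-indexed). The paper's actual partition may differ.
PARTS = {
    'torso':     [0, 1, 2, 3, 20],
    'left_arm':  [4, 5, 6, 7, 21, 22],
    'right_arm': [8, 9, 10, 11, 23, 24],
    'left_leg':  [12, 13, 14, 15],
    'right_leg': [16, 17, 18, 19],
}

def partition_mask(num_joints=25):
    """Additive attention mask: 0 within a part, -inf across parts."""
    mask = torch.full((num_joints, num_joints), float('-inf'))
    for joints in PARTS.values():
        idx = torch.tensor(joints)
        mask[idx.unsqueeze(1), idx] = 0.0   # unblock intra-part pairs
    return mask
```

Added to the joint-affinity logits before the softmax in a block like `DGANBlock`, such a mask confines attention to correlated joints within a part; Section 4.4.2 ablates this choice.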
4. Experiments
4.1. Datasets
- X-Sub: The 40 subjects are split into two disjoint groups of 20, used as the training and testing sets respectively. This setting primarily assesses generalization across subjects.
- X-View: Approximately 37,920 sequences captured by cameras 2 and 3 are used for training, and around 18,960 sequences captured by camera 1 are reserved for testing, assessing robustness under viewpoint variations (a sketch of both split rules follows this list).
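Concretely, NTU RGB+D sample names encode the recording configuration, e.g. S001C002P003R002A013 (setup, camera, performer, repetition, action), so both protocols reduce to filtering on one field. A small sketch, assuming this naming scheme and the cross-subject training IDs published with the dataset:

```python
import re

# NTU RGB+D names look like S001C002P003R002A013: setup, camera,
# performer (subject), repetition, action class.
NAME = re.compile(r'S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})')

# Training subjects for X-Sub as listed in the NTU RGB+D 60 release
# (verify against the dataset documentation).
XSUB_TRAIN = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25,
              27, 28, 31, 34, 35, 38}

def split_of(sample_name: str, protocol: str) -> str:
    """Return 'train' or 'test' for a sample under X-Sub or X-View."""
    m = NAME.search(sample_name)
    camera, subject = int(m.group(2)), int(m.group(3))
    if protocol == 'xsub':
        return 'train' if subject in XSUB_TRAIN else 'test'
    # X-View: cameras 2 and 3 train, camera 1 tests.
    return 'train' if camera in (2, 3) else 'test'

print(split_of('S001C002P003R002A013', 'xview'))  # -> train
```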
4.2. Implementation Details
4.3. Comparison with State-of-the-Art Methods
4.4. Ablation Studies
4.4.1. Effectiveness of Each Component
4.4.2. Importance of Partition Strategy
4.4.3. Visualization Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225.
2. Ren, B.; Liu, M.; Ding, R.; Liu, H. A survey on 3D skeleton-based action recognition using learning method. Cyborg Bionic Syst. 2024, 5, 0100.
3. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the ICML, Online, 18–24 July 2021; Volume 2, p. 4.
4. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211.
5. Nguyen, H.P.; Ribeiro, B. Video action recognition collaborative learning with dynamics via PSO-ConvNet Transformer. Sci. Rep. 2023, 13, 14624.
6. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846.
7. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based action recognition with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 597–600.
8. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal LSTM with trust gates for 3D human action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833.
9. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
10. Liu, H.; Tu, J.; Liu, M. Two-stream 3D convolutional neural network for skeleton-based action recognition. arXiv 2017, arXiv:1705.08106.
11. Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 601–604.
12. Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans. Image Process. 2017, 27, 1586–1599.
13. Huang, L.; Huang, Y.; Ouyang, W.; Wang, L. Part-level graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11045–11052.
14. Chi, H.-G.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. InfoGCN: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20186–20196.
15. Zhou, Y.; Cheng, Z.Q.; Li, C.; Fang, Y.; Geng, Y.; Xie, X.; Keuper, M. Hypergraph transformer for skeleton-based action recognition. arXiv 2022, arXiv:2211.09590.
16. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
17. Park, K.; Patten, T.; Vincze, M. Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7668–7677.
18. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236.
19. Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
20. Ding, Z.; Wang, P.; Ogunbona, P.O.; Li, W. Investigation of different skeleton features for CNN-based 3D action recognition. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 617–622.
21. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
22. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2018, arXiv:1710.10903.
23. Hu, L.; Liu, S.; Feng, W. Spatial temporal graph attention network for skeleton-based action recognition. arXiv 2022, arXiv:2208.08599.
24. Rahevar, M.; Ganatra, A.; Saba, T.; Rehman, A.; Bahaj, S.A. Spatial–temporal dynamic graph attention network for skeleton-based action recognition. IEEE Access 2023, 11, 21546–21553.
25. Plizzari, C.; Cannici, M.; Matteucci, M. Spatial temporal transformer network for skeleton-based action recognition. In Proceedings of the ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Part III; Springer: Berlin/Heidelberg, Germany, 2021; pp. 694–701.
26. Xin, W.; Liu, R.; Liu, Y.; Chen, Y.; Yu, W.; Miao, Q. Transformer for skeleton-based action recognition: A review of recent advances. Neurocomputing 2023, 537, 164–186.
27. Gao, Z.; Wang, P.; Lv, P.; Jiang, X.; Liu, Q.; Wang, P.; Xu, M.; Li, W. Focal and global spatial-temporal transformer for skeleton-based action recognition. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 382–398.
28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
29. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
30. Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic GCN: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 10–16 October 2020; pp. 55–63.
31. Lu, C.; Chen, H.; Li, M.; Jing, L. Attention-guided and topology-enhanced shift graph convolutional network for skeleton-based action recognition. Electronics 2024, 13, 3737.
32. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13359–13368.
33. Cui, H.; Hayama, T. Joint-Partition Group Attention for skeleton-based action recognition. Signal Process. 2024, 224, 109592.
34. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126.
35. Li, S.; Li, W.; Cook, C.; Zhu, C.; Gao, Y. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5457–5466.
36. Soo Kim, T.; Reiter, A. Interpretable 3D human action analysis with temporal convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 20–28.
37. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055.
38. Xu, K.; Ye, F.; Zhong, Q.; Xie, D. Topology-aware convolutional neural network for efficient skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2866–2874.
39. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603.
40. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035.
41. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921.
42. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192.
43. Shao, Y.; Mao, L.; Ye, L.; Li, J.; Yang, P.; Ji, C.; Wu, Z. H2GCN: A hybrid hypergraph convolution network for skeleton-based action recognition. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102072.
44. Cho, S.; Maqbool, M.; Liu, F.; Foroosh, H. Self-attention network for skeleton-based human action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 635–644.
45. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020.
46. Qiu, H.; Hou, B.; Ren, B.; Zhang, X. Spatio-temporal segments attention for skeleton-based action recognition. Neurocomputing 2023, 518, 30–38.
Comparison with state-of-the-art methods on NTU RGB+D 60:

| Type | Method | X-Sub (%) | X-View (%) |
|---|---|---|---|
| RNN | ST-LSTM [8] | 69.2 | 77.7 |
| | VA-LSTM [34] | 79.2 | 87.7 |
| | IndRNN [35] | 81.8 | 88.0 |
| | AGC-LSTM [18] | 89.2 | 95.0 |
| CNN | TCN [36] | 74.3 | 83.1 |
| | HCN [37] | 86.5 | 91.1 |
| | Ta-CNN+ [38] | 90.7 | 95.1 |
| GCN | ST-GCN [21] | 81.5 | 88.3 |
| | AS-GCN [39] | 86.8 | 94.2 |
| | 2s-AGCN [40] | 88.5 | 95.2 |
| | DGNN [41] | 89.9 | 96.1 |
| | Shift-GCN [42] | 90.7 | 96.5 |
| | CTR-GCN [32] | 92.4 | 96.8 |
| | Info-GCN [14] | 92.7 | 96.9 |
| | ST-DGAT [24] | 91.1 | 96.4 |
| | H2GCN [43] | 92.5 | 96.7 |
| | AT-Shift-GCN [31] | 91.7 | 97.1 |
| Transformer | TS-SAN [44] | 87.2 | 92.7 |
| | STTR [25] | 89.9 | 96.1 |
| | DSTA-Net [45] | 91.5 | 96.4 |
| | HyperFormer [15] | 92.9 | 96.5 |
| | STSA-Net [46] | 92.7 | 96.7 |
| | DGAN (Ours) | 93.1 | 96.7 |
Comparison with state-of-the-art methods on NTU RGB+D 120:

| Type | Method | X-Sub (%) | X-Set (%) |
|---|---|---|---|
| RNN | ST-LSTM [8] | 55.7 | 57.9 |
| | GCA-LSTM [12] | 61.2 | 63.3 |
| | AGC-LSTM [18] | 89.2 | 90.3 |
| CNN | Ta-CNN+ [38] | 85.7 | 90.3 |
| GCN | ST-GCN [21] | 70.7 | 73.2 |
| | AS-GCN [39] | 77.7 | 78.9 |
| | 2s-AGCN [40] | 82.9 | 84.9 |
| | Shift-GCN [42] | 85.9 | 87.6 |
| | CTR-GCN [32] | 88.9 | 90.6 |
| | Info-GCN [14] | 89.4 | 90.7 |
| | ST-DGAT [24] | 86.5 | 88.8 |
| | H2GCN [43] | 87.4 | 89.8 |
| | AT-Shift-GCN [31] | 88.5 | 89.0 |
| Transformer | DSTA-Net [45] | 86.6 | 89.0 |
| | STTR [25] | 82.7 | 84.8 |
| | STSA-Net [46] | 88.5 | 90.7 |
| | DGAN (Ours) | 89.7 | 90.9 |
Ablation of the spatial module (Attention, GAT, DGAN) under two temporal modules (TCN, MSTC):

| Model | Accuracy (%) |
|---|---|
| Attention+TCN | 88.93 |
| GAT+TCN | 88.64 |
| DGAN+TCN | 89.21 |
| Attention+MSTC | 90.62 |
| GAT+MSTC | 90.46 |
| DGAN+MSTC | 90.90 |
Ablation of the module arrangement:

| Model | Accuracy (%) |
|---|---|
| DGAN+DGAN | 88.26 |
| DGAN+MSTC+MLP | 89.27 |
| MSTC+DGAN | 88.63 |
| DGAN+MSTC | 90.90 |
Ablation of the partition strategy:

| Partition Strategy | Accuracy (%) |
|---|---|
| None | 90.32 |
| Learned Partition | 90.36 |
| Empirical Partition | 90.68 |
| Proposed Partition | 90.90 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).