Article

Dynamic Graph Attention Network for Skeleton-Based Action Recognition

by Zhenhua Li 1, Fanjia Li 2 and Gang Hua 1,*
1 School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
2 School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou 221018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4929; https://doi.org/10.3390/app15094929
Submission received: 12 March 2025 / Revised: 22 April 2025 / Accepted: 25 April 2025 / Published: 29 April 2025

Abstract:
Skeleton-based human action recognition has garnered significant attention for its robustness to background noise and illumination variations. However, existing methods relying on Graph Convolutional Networks (GCNs) and Transformers exhibit inherent limitations: GCNs struggle to model interactions between non-adjacent joints due to predefined skeletal topology, while Transformers accumulate noise through unrestricted global dependency modeling. To address these challenges, we propose a Dynamic Graph Attention Network (DGAN) that dynamically integrates local structural features and global spatiotemporal dependencies. DGAN employs a masked attention mechanism to adaptively adjust node connectivity, forming a dynamic adjacency matrix that extends beyond physical skeletal constraints by selectively incorporating highly correlated joints. Additionally, a node-partition bias strategy is introduced to prioritize attention on collaboratively moving body parts, thereby enhancing discriminative feature extraction. Extensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets validate the effectiveness of DGAN, which outperforms state-of-the-art methods by achieving a balance between local topology preservation and global interaction modeling. Our approach provides a robust framework for skeleton-driven action recognition, demonstrating superior generalization across diverse scenarios.

1. Introduction

In recent years, there has been an increasing demand for human action recognition (HAR) in various domains such as video surveillance, smart home systems, and human–computer interaction [1,2]. Compared to RGB-based approaches [3,4,5,6], skeleton-based methods have received extensive attention owing to their robustness against background clutter, scale variations, and illumination changes [7,8,9,10,11,12,13,14,15]. Skeleton data represent human bodies using 2D or 3D joint coordinates, which can either be directly captured by depth sensors [16] or estimated from image sequences using pose estimation algorithms [17]. This skeleton-based representation simplifies human actions into graph structures composed of joints and their connecting edges, effectively eliminating irrelevant background information and providing a high-level semantic abstraction. Early studies on skeleton-based HAR primarily employed Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). Long Short-Term Memory (LSTM) networks were widely utilized for modeling temporal dynamics in sequential data [8,18,19]. Meanwhile, CNN-based methods transformed skeleton sequences into pseudo-images and extracted spatial-temporal features using 2D CNNs, achieving improved performance [7,10,11,20]. However, both RNN- and CNN-based approaches have limitations in fully leveraging the inherent topological structure of skeleton data, as skeleton data intrinsically represent graph structures rather than simple sequential vectors or 2D grids. Although RNNs and CNNs provided effective solutions for skeleton-based action recognition, they still face challenges in effectively modeling complex spatiotemporal characteristics.
Given that skeleton data naturally fits a graph structure, GCNs have rapidly gained popularity in the HAR domain [21]. GCNs effectively capture joint movements and interactions among joints through convolution operations on graph-structured data, thus accurately modeling the posture variations and dynamic evolution in human actions. To overcome the limitations associated with predefined weighting mechanisms of standard GCNs and to enhance the flexibility of weight learning, Graph Attention Networks (GATs) were subsequently introduced for skeleton-based action recognition [22,23,24]. GATs dynamically learn node importance via attention mechanisms, thus efficiently aggregating neighboring node information. However, due to the constraints of fixed-topology graphs, GAT-based methods only utilize information from physically connected joints, making it challenging to fully explore and extract latent features between non-adjacent nodes, such as those involved in actions like clapping. To address these issues, Transformer-based approaches have recently emerged and been increasingly adopted in the HAR field [15,25,26,27].
Unlike methods relying on local convolutional operations, Transformers directly model the relationships between nodes through attention mechanisms, calculating attention weights among all nodes [28]. This global-awareness capability enables Transformers to flexibly model long-range dependencies, overcoming the limited receptive field inherent in traditional GCNs. With multi-head self-attention mechanisms, Transformers effectively capture complex, multi-level dependencies between joints, including cooperative movements among non-adjacent joints as seen in actions like clapping or jumping. Nonetheless, due to the absence of inductive biases, Transformers typically require large amounts of training data to learn relational biases, and they can also introduce noise through global dependency computations, leading to computational inefficiencies [26].
To address the limitations of existing GCN- and Transformer-based methods, this paper proposes a DGAN that effectively integrates local neighborhood information with long-range dependencies. Specifically, DGAN constructs dynamic node relationships by allowing additional nodes beyond the original adjacency matrix, thereby enabling flexible modeling of long-distance dependencies. During the attention computation, nodes exhibiting high correlation with the target node are identified based on attention scores; nodes that exceed the minimum attention score with respect to the target node and its neighbors are dynamically added into the adjacency matrix, thereby forming a dynamic adjacency matrix. As illustrated in Figure 1, blue nodes indicate original adjacency connections, whereas red nodes denote newly added nodes through dynamic updates. Subsequently, a mask matrix is generated from the dynamic adjacency matrix to achieve a dynamic balance between local and global attentions via a masked attention mechanism. Furthermore, to effectively model cooperative interactions among multiple joints in human actions, a node-partition bias is introduced in the attention module, enabling the network to more effectively extract localized features from relevant body regions. Extensive evaluations on the NTU RGB+D datasets validate that the proposed DGAN, through its dynamic modeling of node dependencies, surpasses state-of-the-art methods in terms of accuracy.
The main contributions of this paper include the following:
  • We propose a Dynamic Graph Attention Network, which leverages dynamic node selection and a masked attention mechanism to effectively model both local and global node interactions in a flexible manner;
  • We introduce a novel node-partitioning strategy, which enables the network to efficiently extract relational dependencies from correlated nodes during the attention computation process;
  • Experimental results demonstrate that the proposed approach consistently outperforms current state-of-the-art methods across multiple key metrics.

2. Related Work

This section reviews two categories of skeleton-based human action recognition methods closely related to our work.

2.1. Skeleton-Based Action Recognition Using Graph Convolutional Networks

GCNs were initially proposed by Kipf and Welling [29]; their core idea is to treat each node and its connected edges as a local neighborhood and to aggregate node features through weighted convolution operations, thereby extracting spatial structural information.
In the scenario of skeleton-based action recognition, joints are treated as graph nodes, while connections between joints correspond to edges derived from the physical topology of human skeletons, resulting in the adjacency matrix. Because GCNs effectively leverage such topological structure information, they can derive highly discriminative node features. Compared to traditional CNN- and RNN-based methods, GCN-based approaches demonstrate superior capabilities in processing high-dimensional structured data.
Yan et al. [21] first introduced the Spatial-Temporal Graph Convolutional Network (ST-GCN) to model skeleton sequences, significantly improving recognition accuracy. With the advancement of GCNs, several improved methods have emerged. For example, Dynamic GCN integrates dynamic graph structures that adaptively modify joint connections, effectively capturing features specific to different actions [30]. Additionally, attention mechanisms were introduced into GCNs, such as Graph Attention Networks (GATs), which dynamically learn node importance within graph structures, assigning varying attention weights to neighboring nodes and efficiently aggregating information from neighborhoods of variable sizes [22]. Furthermore, Info-GCN integrates an attention mechanism within GCN frameworks, enhancing the model’s ability to focus on critical joints, thereby improving fine-grained action recognition performance [14]. AT-Shift-GCN proposes a lightweight graph convolutional network that integrates non-shared topology structures with attention-guided multi-scale modeling. This design significantly improves the accuracy of skeleton-based action recognition while maintaining low computational cost [31].

2.2. Skeleton-Based Action Recognition Using Transformer

The Transformer, originally proposed by Vaswani et al. [28], employs a multi-head self-attention mechanism to replace conventional recurrent structures for sequential data modeling. Due to its capability of parallelizing computation along the sequence dimension, the Transformer greatly enhances both training and inference efficiency. Additionally, its self-attention mechanism captures global dependencies across different temporal or spatial positions, making it particularly suitable for extracting features from long sequential data.
In skeleton-based action recognition, the introduction of Transformer architectures not only effectively captures long-term dependencies and global spatial relationships but also facilitates effective multi-scale feature extraction from skeleton data. Particularly for action sequences that are longer or structurally more complex, the multi-head self-attention mechanism of Transformers simultaneously attends to interactions across various time steps and crucial joints, significantly enhancing the representation and fusion of local and global information.
Due to their powerful sequential modeling capabilities and computational parallelism, Transformers have progressively gained prominence in the domain of skeleton-based human action recognition. Transformer-based approaches, leveraging self-attention mechanisms, effectively capture global inter-node relationships [25]. For instance, HyperFormer employs relative position encoding based on shortest graph paths and a hypergraph self-attention mechanism, enabling the attention module to recognize high-order relationships shared among joint groups connected via hyperedges [15]. FG-STFormer combines local and global spatial transformers alongside a joint-part cross-attention mechanism, allowing the attention module to discern discriminative local relationships between joints and cooperative global spatial relationships connected through body parts, significantly improving spatiotemporal feature modeling capability [27].

3. Method

This section provides a detailed description of the overall concept and implementation specifics of the proposed DGAN. First, we summarize the general architecture of our method. Next, we introduce each component of DGAN, explaining the network’s spatiotemporal encoding strategy. Finally, we comprehensively describe the node-partitioning strategy employed in the proposed method.

3.1. Architecture of DGAN

The overall framework of DGAN is illustrated in Figure 2. Each skeleton action sequence of variable length is normalized to a fixed length $T$, where each frame contains $M$ persons, each person has $V$ joints, and each joint has a feature dimension of $C_{in}$. The skeleton sequence is represented as $X \in \mathbb{R}^{C_{in} \times T \times V \times M}$. After initial preprocessing, $X$ is reshaped to $(M V C_{in}, T)$, passed through batch normalization, and reshaped again to $(M, C_{in}, T, V)$. The reshaped sequence is then fed into the DGAN backbone to extract spatiotemporal features, where residual connections and layer-wise downsampling are applied to obtain multi-scale temporal receptive fields. The resulting features are finally passed to the classification module.
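To make the data layout concrete, the following is a minimal PyTorch sketch of this preprocessing step; the function name, the exact permutation order, and folding the person dimension into the batch are illustrative assumptions consistent with the shapes stated above, not the paper's exact code.

```python
import torch
import torch.nn as nn

def preprocess(x: torch.Tensor, bn: nn.BatchNorm1d) -> torch.Tensor:
    """Normalize a skeleton batch before the DGAN backbone (sketch).

    x: (N, C_in, T, V, M) -- batch, channels, frames, joints, persons.
    Returns a tensor of shape (N * M, C_in, T, V).
    """
    N, C, T, V, M = x.shape
    # Flatten persons, joints and channels into one axis: (N, M*V*C, T).
    x = x.permute(0, 4, 3, 1, 2).contiguous().view(N, M * V * C, T)
    x = bn(x)                                   # batch normalization over joint-channel features
    # Reshape back so each person is processed as one graph sequence: (N*M, C, T, V).
    x = x.view(N, M, V, C, T).permute(0, 1, 3, 4, 2).contiguous().view(N * M, C, T, V)
    return x

bn = nn.BatchNorm1d(2 * 25 * 3)                 # M=2 persons, V=25 joints, C_in=3 coordinates
out = preprocess(torch.randn(4, 3, 64, 25, 2), bn)
print(out.shape)                                # torch.Size([8, 3, 64, 25])
```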
In the DGAN module, the data are first processed by the dynamic attention graph learning module, which calculates attention weights between joints to dynamically learn spatial correlations. Meanwhile, point-wise attention is computed to enhance the model’s ability to focus on key joints, allowing the model to adaptively capture both local and global spatial dependencies within the skeleton structure.
For the MSTC module, the input features are fed into several temporal convolution branches with different dilation rates and kernel sizes, enabling the model to capture temporal dependencies at different scales. This design enhances the model’s ability to extract both short- and long-range temporal features. The extracted features are fused through a summation operation, and global average pooling is applied to reduce the dimensionality. The output is then passed through a fully connected layer and softmax activation function for final classification.
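As an illustration of this multi-branch design, below is a minimal PyTorch sketch of a multi-scale temporal convolution block; the number of branches, kernel size, and dilation rates are assumed values for demonstration, not the exact MSTC configuration of [32].

```python
import torch
import torch.nn as nn

class MSTCSketch(nn.Module):
    """Illustrative multi-scale temporal convolution block (not the exact MSTC of [32]).

    Input/output shape: (N, C, T, V) -- batch, channels, frames, joints.
    Each branch applies a temporal convolution with its own dilation rate,
    and the branch outputs are fused by summation, as described in the text.
    """
    def __init__(self, channels: int, kernel_size: int = 5, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList()
        for d in dilations:
            pad = (kernel_size - 1) * d // 2      # keep the temporal length unchanged
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=(kernel_size, 1),
                          padding=(pad, 0), dilation=(d, 1)),
                nn.BatchNorm2d(channels),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the multi-scale branch outputs.
        return torch.stack([branch(x) for branch in self.branches], dim=0).sum(dim=0)

x = torch.randn(2, 240, 64, 25)                  # 240 channels, 64 frames, 25 joints
print(MSTCSketch(240)(x).shape)                  # torch.Size([2, 240, 64, 25])
```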
The entire DGAN model is trained under a cross-entropy loss function in a supervised manner, using the predicted action category and the ground-truth label for supervision.

3.2. DGAN Module

To address the limitations of GCN- and Transformer-based approaches, we propose DGAN, a Dynamic Graph Attention Network, which integrates local and global feature modeling while adapting to the dynamic spatiotemporal structure of skeleton data. GCNs capture local topology but struggle with non-adjacent interactions because they rely on a predefined graph; Transformers model long-range dependencies but are prone to noise from unrestricted global self-attention. DGAN balances these strengths and dynamically adapts to the evolving structure of skeleton data.
As shown in Algorithm 1, DGAN initializes its adjacency information using the basic skeletal topology, ensuring that connections between neighboring joints are prioritized, thereby preserving the continuity of local structures. Concurrently, DGAN employs attention mechanisms to dynamically determine the correlations between nodes within a specific action scenario, adaptively adding or suppressing node connections accordingly.
Formally, for a given graph $G = (V, E)$, where $V$ denotes the set of nodes and $E$ the set of edges, the adjacency matrix $A \in \mathbb{R}^{|V| \times |V|}$ encodes neighbor relations. The node attention scores are computed as $\mathrm{Attention}(G) \in \mathbb{R}^{|V| \times |V|}$, from which each node identifies its most correlated distant neighbors. For each node, the minimum attention score among its adjacent neighbors is taken as a threshold, forming a threshold matrix $V_{min} \in \mathbb{R}^{|V| \times 1}$. For any node $u \in V$, if its attention score with respect to the current node exceeds that node's threshold in $V_{min}$, the influence of $u$ is preserved; otherwise, it is masked and treated as a negligible contextual connection.
Algorithm 1 PyTorch-like pseudocode of DGAN
  • Input:
  •     A: Initial adjacency matrix, shape (25, 25)
  •     sim: Similarity score matrix, shape (N, 25, 25)
  • Output:
  •     Modified similarity matrix sim, shape (N, 25, 25)
1: function soft_A(A, sim)
2:     A ← A.expand(sim.shape[0], −1, −1)              ▹ Expand to batch size N
3:     inf ← max_value(sim.dtype)
4:     masked_C ← where(A == 1, sim, inf)              ▹ Hide non-adjacent nodes from the minimum
5:     min_values ← masked_C.min(dim=2)
6:     min_expanded ← min_values.unsqueeze(1)          ▹ Minimum neighbor similarity
7:     sim_ge_min ← (sim ≥ min_expanded)
8:     mask ← where(sim_ge_min == 0, −∞, 0)
9:     sim ← sim + mask                                ▹ Suppress weak connections
10:    return sim
11: end function
Based on this mechanism, the node attention values below the threshold are masked, preserving only the most discriminative interactions, thus dynamically forming adaptive connections. Specifically, nodes with attention scores greater than V m i n are regarded as softly connected to the current node, thereby mitigating irrelevant noise and redundancy commonly encountered in the global attention mechanism of Transformer models.
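To make the thresholding concrete, the following PyTorch sketch implements the soft adjacency masking described above, with the per-node threshold taken over physically adjacent joints; the tensor layout (queries along rows) and the function name are assumptions rather than the authors' exact implementation.

```python
import torch

def soft_adjacency_mask(A: torch.Tensor, sim: torch.Tensor) -> torch.Tensor:
    """Sketch of the dynamic masking described above.

    A:   (V, V) binary skeletal adjacency matrix.
    sim: (N, V, V) attention scores, sim[n, i, j] = score of joint j for query joint i.
    Entries falling below the query joint's minimum-neighbor score are set to -inf,
    so a subsequent softmax over the last dimension ignores them.
    """
    N, V, _ = sim.shape
    adj = A.unsqueeze(0).expand(N, V, V)                      # broadcast adjacency to the batch
    big = torch.finfo(sim.dtype).max
    neighbor_scores = torch.where(adj == 1, sim, torch.full_like(sim, big))
    v_min = neighbor_scores.min(dim=2, keepdim=True).values   # per-query-joint threshold V_min
    return sim.masked_fill(sim < v_min, float("-inf"))        # suppress weak connections

# Usage: attn = torch.softmax(soft_adjacency_mask(A, Q @ K.transpose(-2, -1) / dk ** 0.5), dim=-1)
```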
DGAN adjusts the attention matrix as follows:
$$\mathrm{MaskedAttention}(Q, K, V) = \mathrm{softmax}\!\left(M + \frac{Q K^{\top}}{\sqrt{d_k}}\right) V \qquad (1)$$
$M$ denotes the attention mask, whose value at each position $(x, y)$ is defined as:
$$M(x, y) = \begin{cases} -\infty, & \text{if } (Q K^{\top})(x, y) < V_{min}(x, 0), \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$
Here, $Q \in \mathbb{R}^{|V| \times d_k}$, $K \in \mathbb{R}^{|V| \times d_k}$, and $V \in \mathbb{R}^{|V| \times d_v}$ are obtained by projecting the initial features $X \in \mathbb{R}^{|V| \times d}$ through the projection matrices $W_q$, $W_k$, and $W_v$, and $\sqrt{d_k}$ serves as the scaling factor. Such threshold-based attention enables more targeted node screening than purely global attention.
The multi-head attention mechanism in DGAN inherits the computational advantages of the Transformer, effectively attending to spatial–temporal relationships across multiple projected spaces and varying node dependencies. This comprehensively captures interactions inherent in skeleton data while introducing residual connections to further enhance the representation capability of the model.
We denote this multi-head masked self-attention as:
$$X^{l} = \mathrm{MaskedAttention}(X^{l-1}) = \mathrm{softmax}\!\left(M^{l} + \frac{Q^{l} (K^{l})^{\top}}{\sqrt{d_k}}\right) V^{l} + X^{l-1}. \qquad (3)$$
For temporal encoding, we employ Multi-Scale Temporal Convolution (MSTC) [32], which, combined with the previously described dynamic graph attention mechanism, achieves coordinated encoding of spatiotemporal features. Specifically, the spatiotemporal encoding strategy is formulated as follows:
$$X^{l} = \mathrm{ReLU}\!\left(\mathrm{MSTC}\!\left(\mathrm{softmax}\!\left(M^{l} + \frac{Q^{l} (K^{l})^{\top}}{\sqrt{d_k}}\right) V^{l} + X^{l-1}\right)\right) + X^{l-1}. \qquad (4)$$
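Putting these pieces together, the sketch below shows how a single DGAN layer could compose the masked attention of Equation (3) with MSTC and the residuals of Equation (4); it uses one attention head and a single-branch stand-in for MSTC purely for brevity, so the module names and these simplifications are ours rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DGANLayerSketch(nn.Module):
    """Single-head sketch of Eqs. (3)-(4): masked spatial attention, then temporal convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        # Stand-in for MSTC: a single temporal convolution branch.
        self.mstc = nn.Conv2d(channels, channels, kernel_size=(5, 1), padding=(2, 0))

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V); mask: (V, V) with 0 for kept links and -inf for suppressed ones.
        N, C, T, V = x.shape
        h = x.permute(0, 2, 3, 1)                                     # (N, T, V, C)
        scores = self.q(h) @ self.k(h).transpose(-2, -1) / C ** 0.5   # (N, T, V, V)
        attn = torch.softmax(scores + mask, dim=-1)                   # masked attention, Eq. (3)
        h = (attn @ self.v(h)).permute(0, 3, 1, 2) + x                # attention output plus residual
        return torch.relu(self.mstc(h)) + x                           # temporal encoding, Eq. (4)

layer = DGANLayerSketch(240)
out = layer(torch.randn(2, 240, 64, 25), torch.zeros(25, 25))
print(out.shape)                                                      # torch.Size([2, 240, 64, 25])
```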

3.3. Partition Strategy

In skeleton-based human action recognition, human actions typically involve coordinated movements of multiple joints [15,21,33]. To effectively capture interactions among critical joints and incorporate spatial priors into the attention mechanism, we employ a spatial encoding strategy based on node partitioning. Specifically, we define a binary partition matrix $P \in \mathbb{R}^{|V| \times P_{num}}$, where $P_{num}$ denotes the number of partitions. Each row of $P$ contains exactly one element equal to 1, with all others set to zero, indicating the spatial partition assignment of each node. Based on this matrix, we construct the corresponding degree matrix $D \in \mathbb{R}^{P_{num} \times P_{num}}$ to derive the partition bias $K_P$:
$$K_P = P D^{-1} P^{\top} X W_p, \qquad (5)$$
where $W_p \in \mathbb{R}^{d \times d}$ is a projection matrix for the node features $X \in \mathbb{R}^{|V| \times d}$ from the previous layer, and $d$ denotes the feature dimensionality. $K_P$ encodes the spatial bias of each node within its corresponding partition, highlighting the most collaborative joints or body parts relevant to the action context.
Figure 3a illustrates the node partition strategy employed in our method. To incorporate this partition bias into the masked attention mechanism, we extend Equations (3) and (4) by introducing the partition bias K P into the original key vectors. This extension explicitly differentiates between partitions during attention computation.
The final spatial encoding strategy and the corresponding spatiotemporal encoding strategy are formulated as follows:
$$X^{l} = \mathrm{softmax}\!\left(M^{l} + \frac{Q^{l} (K^{l} + K_P^{l})^{\top}}{\sqrt{d_k}}\right) V^{l} + X^{l-1}, \qquad (6)$$
$$X^{l} = \mathrm{ReLU}\!\left(\mathrm{MSTC}\!\left(\mathrm{softmax}\!\left(M^{l} + \frac{Q^{l} (K^{l} + K_P^{l})^{\top}}{\sqrt{d_k}}\right) V^{l} + X^{l-1}\right)\right) + X^{l-1}, \qquad (7)$$
where $Q^{l}$, $K^{l}$, and $V^{l}$ denote the query, key, and value matrices at layer $l$, $d_k$ represents the key dimension, and $M^{l}$ denotes the attention mask matrix. $X^{l}$ represents the node features after attention encoding at layer $l$.
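For illustration, the partition bias of Equation (5) can be computed as follows; the helper name and the example grouping of 25 joints into five parts are hypothetical, and the resulting bias is simply added to the keys as in Equation (6).

```python
import torch
import torch.nn as nn

def partition_bias(part_ids: torch.Tensor, x: torch.Tensor, w_p: nn.Linear) -> torch.Tensor:
    """Compute K_P = P D^{-1} P^T X W_p for one sample (sketch of Equation (5)).

    part_ids: (V,) integer partition index of each joint.
    x:        (V, d) node features from the previous layer.
    w_p:      linear projection standing in for W_p.
    Returns:  (V, d) partition bias; every joint receives the mean projected
              feature of its partition, i.e. a shared body-part descriptor.
    """
    V = part_ids.shape[0]
    num_parts = int(part_ids.max().item()) + 1
    P = torch.zeros(V, num_parts, dtype=x.dtype)
    P[torch.arange(V), part_ids] = 1.0                      # one-hot partition matrix
    D_inv = torch.diag(1.0 / P.sum(dim=0))                  # inverse degree matrix
    return P @ D_inv @ P.t() @ w_p(x)                       # K_P, added to the keys in Eq. (6)

# Example with 25 joints grouped into 5 hypothetical body parts:
part_ids = torch.arange(25) % 5
k_p = partition_bias(part_ids, torch.randn(25, 240), nn.Linear(240, 240))
print(k_p.shape)                                            # torch.Size([25, 240])
```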

4. Experiments

This section first introduces the datasets and the experimental setup. We then compare our approach with state-of-the-art methods on the NTU RGB+D benchmarks. Finally, ablation studies are conducted to gain deeper insight into the proposed DGAN.

4.1. Datasets

The NTU RGB+D 60 dataset is one of the most widely used indoor benchmarks for skeleton-based human action recognition. It contains a total of 56,880 action sequences across 60 action categories, collected using Kinect v2 depth sensors (Microsoft Inc., Redmond, WA, USA). These sequences were performed by 40 volunteers, where the last 10 action categories involve interactions between two subjects, while the remaining are single-person actions. Actions were captured simultaneously from three different viewpoints using three Kinect v2 sensors, enabling multi-view analysis of each action sequence. To evaluate the generalization capability of algorithms under various conditions, the dataset authors provide two standard evaluation benchmarks:
  • X-Sub: The 40 subjects are split into two distinct groups, used respectively as training and testing sets. This setting primarily assesses the generalization capability across different subjects.
  • X-View: Approximately 37,920 sequences captured from cameras 2 and 3 are used for training, and around 18,960 sequences captured by camera 1 are reserved for testing, thereby assessing model robustness under viewpoint variations.
The NTU RGB+D 120 dataset is an extended version of NTU RGB+D 60 and currently represents the largest indoor skeleton-based action recognition dataset. It includes 120 action categories and comprises a total of 114,480 action sequences, recorded by 106 subjects in diverse scenarios. Like the original dataset, actions are synchronously captured using three Kinect v2 sensors, but NTU RGB+D 120 significantly improves upon NTU RGB+D 60 in terms of quantity and action diversity, including a broader variety of everyday activities and two-person interactions.
Similar to NTU RGB+D 60, when utilizing this dataset, specific abnormal samples are excluded, and sequences with insufficient frames are either zero-padded or looped to reach the target length. Additionally, if the second subject is missing from a sequence, zero-padding is applied accordingly.
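As a concrete example of this length normalization, a sequence can be brought to a fixed number of frames roughly as follows; looping by tiling (rather than zero-padding) is one of the two options mentioned above and is assumed here for short sequences.

```python
import numpy as np

def to_fixed_length(seq: np.ndarray, target_len: int = 64) -> np.ndarray:
    """Loop or truncate a skeleton sequence along the frame axis (sketch).

    seq: (T, V, C) array of joint coordinates; returns (target_len, V, C).
    """
    T = seq.shape[0]
    if T >= target_len:
        return seq[:target_len]
    reps = int(np.ceil(target_len / T))
    return np.tile(seq, (reps, 1, 1))[:target_len]     # loop the sequence until long enough

print(to_fixed_length(np.zeros((40, 25, 3))).shape)    # (64, 25, 3)
```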

4.2. Implementation Details

The network utilized in this study comprises ten layers, arranged into basic blocks to integrate spatial–temporal feature extraction. The channel dimension is set to 240, and the number of attention heads is set to 10.
All experiments in this study are implemented using the PyTorch 2.0 deep learning framework. The maximum number of training epochs is set to 150, with a warmup strategy applied during the first five epochs to stabilize early training. The optimizer is stochastic gradient descent (SGD) with Nesterov momentum and a weight decay of $1 \times 10^{-4}$. The initial learning rate is 0.025, decayed by a factor of 0.1 at epochs 100 and 120. Cross-entropy loss is used for model training via backpropagation.
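The optimizer and schedule described above can be reproduced with a configuration along the following lines; the momentum value of 0.9 and the linear form of the warmup are assumptions, since the paper does not state them explicitly.

```python
import torch

model = torch.nn.Linear(240, 60)   # placeholder for the DGAN model
optimizer = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)

warmup_epochs = 5

def lr_lambda(epoch: int) -> float:
    # Linear warmup over the first 5 epochs, then step decay by 0.1 at epochs 100 and 120.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return 0.1 ** sum(epoch >= m for m in (100, 120))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
criterion = torch.nn.CrossEntropyLoss()
```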
For skeleton sequences in both NTU RGB+D 60 and NTU RGB+D 120 datasets, each sample is typically standardized to 64 frames, with a batch size of 64. Sequences shorter than this length are padded with zeros or looped until reaching the desired length. Missing data for the second subject in two-person sequences is replaced with zero-padding. All experiments are conducted using three NVIDIA GeForce GTX 1080Ti GPUs.

4.3. Comparison with State-of-the-Art Methods

To comprehensively evaluate the effectiveness of the proposed DGAN, we conduct experiments on two widely used benchmark datasets for skeleton-based action recognition, NTU RGB+D 60 and NTU RGB+D 120, and compare our approach with current state-of-the-art methods. Experimental results are summarized in Table 1 and Table 2. For objective analysis, we categorize the comparative methods into three groups: CNN/RNN-based methods, GCN-based methods, and Transformer-based methods.
On the NTU RGB+D 60 dataset, our proposed DGAN achieves accuracies of 93.1% and 96.7% on the Cross-Subject (X-Sub) and Cross-View (X-View) benchmarks, respectively. These results significantly outperform traditional CNN/RNN methods, which typically fail to exploit the structural characteristics of skeleton data. Our model also surpasses classical GCN models such as 2s-AGCN (88.5%) and CTR-GCN (92.4%), primarily because DGAN dynamically adjusts node connections, overcoming the fixed-topology limitation of traditional GCN methods and effectively modeling long-range node dependencies. Although ST-DGAT has achieved competitive results on benchmark datasets, it still adopts a fixed human-skeletal topology within each attention head, which limits its ability to model cross-topology dependencies. In contrast, DGAN employs dynamically learned topologies for each attention head and incorporates a partition-aware bias mechanism, which offers not only higher accuracy but also greater structural flexibility and interpretability. Furthermore, DGAN attains 96.7% on the X-View benchmark, matching or slightly exceeding advanced Transformer-based methods such as STSA-Net (96.7%) and HyperFormer (96.5%), demonstrating DGAN's effectiveness in integrating local and global information.
On the more challenging NTU RGB+D 120 dataset, DGAN achieves accuracies of 89.7% and 90.9% in the X-Sub and X-Set benchmarks, respectively. This performance surpasses traditional CNN/LSTM models, such as Ta-CNN+ (85.7%, 90.3%), and exceeds classical graph convolutional models like Shift-GCN (85.9%, 87.6%) and 2s-AGCN (82.9%, 84.9%). These results suggest that, through dynamic attention and node partitioning strategies, DGAN can effectively extract action semantics from different nodes, particularly demonstrating superior performance on the NTU RGB+D 120 dataset, which features larger-scale and more complex actions.
Overall, the analysis demonstrates that our proposed DGAN, through integrating dynamic graph attention mechanisms and node-partitioning strategies, effectively combines the advantages of GCN and Transformer approaches. It achieves superior accuracy across multiple public benchmark datasets for skeleton-based human action recognition, thereby validating its competitiveness and generalization capabilities in the field.

4.4. Ablation Studies

In this subsection, we perform ablation studies to validate the effectiveness of different components and node-partitioning strategies of the proposed DGAN.

4.4.1. Effectiveness of Each Component

Firstly, we evaluate the model under various combinations of the two main components: the spatial and temporal modules. As shown in Table 3, we compare attention with temporal convolution (TCN), Graph Attention Network (GAT) with TCN, DGAN with TCN, and their corresponding variants using MSTC. The results indicate that employing MSTC significantly enhances performance compared with conventional single-scale temporal convolution. Regardless of the spatial attention mechanism used, accuracy improves by approximately 1.7–1.8% after integrating MSTC, highlighting MSTC's effectiveness in capturing both short-term and long-term dependencies within action sequences. On the other hand, given the same temporal module, different spatial modeling strategies also exhibit performance variations. The proposed DGAN achieves the best spatial relationship modeling by employing a dynamically learned adjacency matrix; combined with the MSTC module, it reaches an accuracy of 90.90%. In comparison, the GAT+MSTC model, which relies on the static baseline adjacency matrix, achieves 90.46%, while the Attention+MSTC model, based on global self-attention, achieves 90.62%. Similarly, when using standard TCN alone, DGAN+TCN outperforms GAT+TCN and Attention+TCN by approximately 0.6% and 0.3%, respectively. These findings suggest that DGAN can more effectively exploit the spatial dependencies inherent in skeleton data than conventional graph convolution or pure self-attention mechanisms, thereby enhancing discriminative feature representation.
To determine the optimal model combination, we conducted additional experiments based on the DGAN+MSTC combination, recording the performance variations in Table 4. First, introducing an additional MLP module (DGAN+MSTC+MLP) reduced the accuracy from 90.90% to 89.27%, indicating that simply stacking fully connected layers yields no additional benefit and instead introduces unnecessary parameters and a risk of overfitting. Second, replacing MSTC with a DGAN-like mechanism for temporal modeling (DGAN+DGAN) caused a significant accuracy drop to 88.26%, underscoring that multi-scale convolution along the temporal dimension is essential for capturing action sequences; graph attention alone is insufficient for temporal feature extraction. Therefore, DGAN and MSTC complement each other, synergistically enhancing overall performance. Inspired by the TimeSformer model [3], we also experimented with reversing the order of temporal and spatial modeling, i.e., performing temporal modeling before spatial attention (MSTC+DGAN), but observed a notable accuracy decline to 88.63%. Consequently, this study adopts the spatial-before-temporal modeling strategy, which empirically proves more effective.

4.4.2. Importance of Partition Strategy

To assess the effectiveness of the proposed node-partition bias strategy introduced into the attention module, we design three comparative schemes: (a) no partition, which removes all partition-based biasing; (b) empirical partition, which relies on prior knowledge of human skeleton structure to divide joints into several predefined body parts, injecting corresponding biases during attention computation; and (c) learned partition, where partitions are adaptively learned during training, allowing flexible node grouping.
As illustrated in Figure 3, Figure 3a presents the final adopted partition strategy (proposed partition), Figure 3b shows the learned partition, and Figure 3c depicts the empirical partition strategy, which yields relatively inferior performance. As shown in Table 5, the proposed partition achieves the highest accuracy. Compared with the empirical partition, the proposed partition assigns the shoulders and hips to the corresponding arm and leg regions, which enhances the representation of motion coordination and benefits action recognition. In contrast, the learned partition lacks explicit prior constraints, making it more susceptible to data noise during training. This often leads to ambiguous partition boundaries or incorrect semantic grouping, thereby limiting the model’s potential for performance improvement. These observations confirm that introducing a partition bias effectively enables the model to focus specifically on dynamic interactions between distinct body parts, thus enhancing performance. Different partition implementations offer varying trade-offs between accuracy and flexibility, and thus their selection should depend on the specific application scenario and data characteristics.

4.4.3. Visualization Analysis

To qualitatively assess the effectiveness of the proposed node-partition bias within the attention mechanism, we visualize the learned attention weights between joints. As shown in Figure 4, DGAN assigns significantly higher attention weights between nodes representing the left hand and head during the action “drinking," while assigning lower attention scores to joints in the lower body. This behavior allows the model to concentrate effectively on coordinated movements between the hand and head, capturing crucial discriminative features of the drinking action. Similar patterns are observed for other actions, demonstrating that the dynamic attention mechanism of DGAN selectively captures essential global dependencies while maintaining local structural integrity. This selective attention pattern provides intuitive interpretability for model decisions, confirming that DGAN indeed learns to emphasize critical body parts and interactions crucial for accurate action classification.
As shown in Figure 5, the model achieves high recognition accuracy across most action categories. For the few cases where errors occur, such as in the reading or writing actions, occlusion of key joints or significant noise in the skeleton data may lead the dynamic attention mechanism in DGAN to mistakenly focus on less informative joints, resulting in misclassification. For instance, inaccurate localization of limb keypoints in some samples may cause the model to introduce spurious connections due to noise. Nevertheless, the overall impact of such errors is limited, owing to the presence of the masking mechanism which effectively suppresses unreliable connections.
Short-term actions are typically characterized by rapid, instantaneous completion (e.g., drop), whereas long-term actions often exhibit longer temporal duration (e.g., drink). From the accuracy distribution across different action types in Figure 5, it can be observed that there is no significant performance gap between the recognition of short-term and long-term actions. This indicates that the model maintains consistent recognition performance for both high-dynamic, short-duration movements and stable, temporally extended actions. Therefore, action duration is not a dominant factor affecting the model’s recognition performance. Instead, the model appears to rely more heavily on spatial structural features and discriminative pose patterns, rather than the temporal extent of the actions.

5. Conclusions

In this paper, we propose a novel Dynamic Graph Attention Network to address the limitations associated with predefined graph topology in GCNs and the noise accumulation issue inherent in global dependency modeling of Transformer-based methods. Specifically, our proposed DGAN dynamically constructs adjacency matrices by adaptively adjusting node connections, thereby alleviating the constraints imposed by predefined topologies in traditional GCNs, as well as mitigating the noise accumulation associated with global dependency modeling in Transformers. Additionally, we introduce an explicit node-partitioning strategy designed to reinforce the learning of critical joint and body-part information, enabling the network to more precisely focus on crucial spatial features associated with human actions.
Extensive experiments on two mainstream benchmark datasets, NTU RGB+D 60 and NTU RGB+D 120, demonstrate that the proposed DGAN consistently outperforms existing state-of-the-art methods under various evaluation protocols. The results indicate that DGAN robustly models dynamic spatial–temporal dependencies, effectively balancing local structural preservation with selective global attention, and thus achieves superior accuracy and strong generalization capabilities.
In future work, we will further investigate dynamic graph attention strategies to enhance model adaptability and explore practical applications in broader scenarios.

Author Contributions

Conceptualization, Z.L. and G.H.; methodology, Z.L.; software, Z.L.; validation, Z.L., F.L. and G.H.; formal analysis, Z.L.; investigation, Z.L.; resources, G.H.; data curation, Z.L.; writing—original draft preparation, Z.L.; writing—review and editing, G.H. and F.L.; visualization, Z.L.; supervision, G.H.; project administration, F.L.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Xuzhou Science and Technology Plan Project [Grant Number KC23317].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available at https://rose1.ntu.edu.sg/dataset/actionRecognition.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable and helpful comments, which substantially improved this paper. Finally, we would also like to thank all of the editors for their professional advice and help.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef] [PubMed]
  2. Ren, B.; Liu, M.; Ding, R.; Liu, H. A survey on 3d skeleton-based action recognition using learning method. Cyborg Bionic Syst. 2024, 5, 0100. [Google Scholar] [CrossRef] [PubMed]
  3. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the ICML, Online, 18–24 July 2021; Volume 2, p. 4. [Google Scholar]
  4. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
  5. Nguyen, H.P.; Ribeiro, B. Video action recognition collaborative learning with dynamics via PSO-ConvNet Transformer. Sci. Rep. 2023, 13, 14624. [Google Scholar] [CrossRef] [PubMed]
  6. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar]
  7. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based action recognition with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 597–600. [Google Scholar]
  8. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833. [Google Scholar]
  9. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  10. Liu, H.; Tu, J.; Liu, M. Two-stream 3d convolutional neural network for skeleton-based action recognition. arXiv 2017, arXiv:1705.08106. [Google Scholar]
  11. Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 601–604. [Google Scholar]
  12. Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans. Image Process. 2017, 27, 1586–1599. [Google Scholar] [CrossRef] [PubMed]
  13. Huang, L.; Huang, Y.; Ouyang, W.; Wang, L. Part-level graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11045–11052. [Google Scholar]
  14. Chi, H.-G.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20186–20196. [Google Scholar]
  15. Zhou, Y.; Cheng, Z.Q.; Li, C.; Fang, Y.; Geng, Y.; Xie, X.; Keuper, M. Hypergraph transformer for skeleton-based action recognition. arXiv 2022, arXiv:2211.09590. [Google Scholar]
  16. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  17. Park, K.; Patten, T.; Vincze, M. Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7668–7677. [Google Scholar]
  18. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar]
  19. Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  20. Ding, Z.; Wang, P.; Ogunbona, P.O.; Li, W. Investigation of different skeleton features for cnn-based 3d action recognition. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, 10–14 July 2017; IEEE: New York, NY, USA, 2017; pp. 617–622. [Google Scholar]
  21. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LO, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  22. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2018, arXiv:1710.10903. [Google Scholar]
  23. Hu, L.; Liu, S.; Feng, W. Spatial temporal graph attention network for skeleton-based action recognition. arXiv 2022, arXiv:2208.08599. [Google Scholar]
  24. Rahevar, M.; Ganatra, A.; Saba, T.; Rehman, A.; Bahaj, S.A. Spatial–temporal dynamic graph attention network for skeleton-based action recognition. IEEE Access 2023, 11, 21546–21553. [Google Scholar] [CrossRef]
  25. Plizzari, C.; Cannici, M.; Matteucci, M. Spatial temporal transformer network for skeleton-based action recognition. In Proceedings of the Pattern Recognition. ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Proceedings, Part III. Springer: Berlin/Heidelberg, Germany, 2021; pp. 694–701. [Google Scholar]
  26. Xin, W.; Liu, R.; Liu, Y.; Chen, Y.; Yu, W.; Miao, Q. Transformer for skeleton-based action recognition: A review of recent advances. Neurocomputing 2023, 537, 164–186. [Google Scholar] [CrossRef]
  27. Gao, Z.; Wang, P.; Lv, P.; Jiang, X.; Liu, Q.; Wang, P.; Xu, M.; Li, W. Focal and global spatial-temporal transformer for skeleton-based action recognition. In Proceedings of the Asian Conference on Computer Vision, Macao, China, 4–8 December 2022; pp. 382–398. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  29. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  30. Ye, F.; Pu, S.; Zhong, Q.; Li, C.; Xie, D.; Tang, H. Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 10–16 October 2020; pp. 55–63. [Google Scholar]
  31. Lu, C.; Chen, H.; Li, M.; Jing, L. Attention-guided and topology-enhanced shift graph convolutional network for skeleton-based action recognition. Electronics 2024, 13, 3737. [Google Scholar] [CrossRef]
  32. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13359–13368. [Google Scholar]
  33. Cui, H.; Hayama, T. Joint-Partition Group Attention for skeleton-based action recognition. Signal Process. 2024, 224, 109592. [Google Scholar] [CrossRef]
  34. Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126. [Google Scholar]
  35. Li, S.; Li, W.; Cook, C.; Zhu, C.; Gao, Y. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5457–5466. [Google Scholar]
  36. Soo Kim, T.; Reiter, A. Interpretable 3d human action analysis with temporal convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 20–28. [Google Scholar]
  37. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055. [Google Scholar]
  38. Xu, K.; Ye, F.; Zhong, Q.; Xie, D. Topology-aware convolutional neural network for efficient skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2866–2874. [Google Scholar]
  39. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  40. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
  41. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921. [Google Scholar]
  42. Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
  43. Shao, Y.; Mao, L.; Ye, L.; Li, J.; Yang, P.; Ji, C.; Wu, Z. H2GCN: A hybrid hypergraph convolution network for skeleton-based action recognition. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102072. [Google Scholar] [CrossRef]
  44. Cho, S.; Maqbool, M.; Liu, F.; Foroosh, H. Self-attention network for skeleton-based human action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 635–644. [Google Scholar]
  45. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  46. Qiu, H.; Hou, B.; Ren, B.; Zhang, X. Spatio-temporal segments attention for skeleton-based action recognition. Neurocomputing 2023, 518, 30–38. [Google Scholar] [CrossRef]
Figure 1. Comparison between (a) the baseline adjacency matrix and (b) the proposed dynamic adjacency matrix.
Figure 2. Model architecture of DGAN. The DGAN model is composed of L stacked DGAN modules and L stacked MSTC modules, followed by a final classification module.
Figure 3. Comparison of different partition strategies: (a) proposed partition, (b) learned partition, (c) empirical partition based on symmetry.
Figure 4. Attention visualization analysis of the "drinking" action.
Figure 5. Recognition accuracy across the NTU RGB+D 60 action classes.
Table 1. Comparison with state-of-the-art methods on NTU RGB+D dataset.
Type | Method | X-Sub (%) | X-View (%)
RNN | ST-LSTM [8] | 69.2 | 77.7
RNN | VA-LSTM [34] | 79.2 | 87.7
RNN | IndRNN [35] | 81.8 | 88.0
RNN | AGC-LSTM [18] | 89.2 | 95.0
CNN | TCN [36] | 74.3 | 83.1
CNN | HCN [37] | 86.5 | 91.1
CNN | Ta-CNN+ [38] | 90.7 | 95.1
GCN | ST-GCN [21] | 81.5 | 88.3
GCN | AS-GCN [39] | 86.8 | 94.2
GCN | 2s-AGCN [40] | 88.5 | 95.2
GCN | DGNN [41] | 89.9 | 96.1
GCN | Shift-GCN [42] | 90.7 | 96.5
GCN | CTR-GCN [32] | 92.4 | 96.8
GCN | Info-GCN [14] | 92.7 | 96.9
GCN | ST-DGAT [24] | 91.1 | 96.4
GCN | H2GCN [43] | 92.5 | 96.7
GCN | AT-Shift-GCN [31] | 91.7 | 97.1
Transformer | TS-SAN [44] | 87.2 | 92.7
Transformer | STTR [25] | 89.9 | 96.1
Transformer | DSTA-Net [45] | 91.5 | 96.4
Transformer | HyperFormer [15] | 92.9 | 96.5
Transformer | STSA-Net [46] | 92.7 | 96.7
— | DGAN (Ours) | 93.1 | 96.7
Table 2. Comparison with state-of-the-art methods on NTU RGB+D 120 dataset.
Type | Method | X-Sub (%) | X-Set (%)
CNN/RNN | ST-LSTM [8] | 55.7 | 57.9
CNN/RNN | GCA-LSTM [12] | 61.2 | 63.3
CNN/RNN | AGC-LSTM [18] | 89.2 | 90.3
CNN/RNN | Ta-CNN+ [38] | 85.7 | 90.3
GCN | ST-GCN [21] | 70.7 | 73.2
GCN | AS-GCN [39] | 77.7 | 78.9
GCN | 2s-AGCN [40] | 82.9 | 84.9
GCN | Shift-GCN [42] | 85.9 | 87.6
GCN | CTR-GCN [32] | 88.9 | 90.6
GCN | Info-GCN [14] | 89.4 | 90.7
GCN | ST-DGAT [24] | 86.5 | 88.8
GCN | H2GCN [43] | 87.4 | 89.8
GCN | AT-Shift-GCN [31] | 88.5 | 89.0
Transformer | DSTA-Net [45] | 86.6 | 89.0
Transformer | STTR [25] | 82.7 | 84.8
Transformer | STSA-Net [46] | 88.5 | 90.7
— | DGAN (Ours) | 89.7 | 90.9
Table 3. Effectiveness of each component of DGAN on NTU RGB+D bone modality X-Sub.
Model | Accuracy (%)
Attention+TCN | 88.93
GAT+TCN | 88.64
DGAN+TCN | 89.21
Attention+MSTC | 90.62
GAT+MSTC | 90.46
DGAN+MSTC | 90.90
Table 4. Optimization exploration based on DGAN.
Model | Accuracy (%)
DGAN+DGAN | 88.26
DGAN+MSTC+MLP | 89.27
MSTC+DGAN | 88.63
DGAN+MSTC | 90.90
Table 5. Comparison of different partition strategies.
Partition Strategy | Accuracy (%)
None | 90.32
Learned Partition | 90.36
Empirical Partition | 90.68
Proposed Partition | 90.90
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
