Dual Attention-Guided Multiscale Dynamic Aggregate Graph Convolutional Networks for Skeleton-Based Human Action Recognition

Abstract: Traditional convolutional neural networks have achieved great success in human action recognition. However, it is challenging to establish effective associations between different human bone nodes to capture detailed information. In this paper, we propose a dual attention-guided multiscale dynamic aggregate graph convolutional network (DAG-GCN) for skeleton-based human action recognition. Our goal is to explore the best correlation and determine high-level semantic features. First, a multiscale dynamic aggregate GCN module is used to capture important semantic information and to establish dependence relationships for different bone nodes. Second, the higher level semantic features are further refined, and the semantic relevance is emphasized, through a dual attention guidance module. In addition, we exploit the relationship of joints hierarchically and the spatial-temporal correlations through the two modules. Experiments with the DAG-GCN method result in good performance on the NTU60-RGB+D and NTU120-RGB+D datasets. The accuracy is 95.76% and 90.01%, respectively, for the cross-view (X-View) and cross-subject (X-Sub) benchmarks on the NTU60 dataset.


Introduction
Human action recognition is widely used in many scenarios, such as human-computer interaction [1], video retrieval [2], and medical treatment security [3]. In recent years, with the development of deep learning technology, human skeleton action recognition based on joint type, frame index, and 3D position has been widely studied. The continuous progress of pose estimation technology [4] makes it easier to acquire bone posture and action data. Furthermore, compared with RGB human action video, skeleton data are more robust and computationally efficient. In addition, skeletal action recognition complements RGB video action recognition [5] and is a higher level abstract representation of human posture and action.
Although these methods achieve good recognition results, they cannot effectively model long-term dependencies and deep semantic information due to their limited receptive fields. Moreover, these methods increase the complexity of the model. In short, the key to human motion recognition lies in the joint type and frame index, namely the semantic information. Semantic information can effectively reveal the temporal and spatial structure of human joints; that is, the semantic meaning of two different joints is very different, e.g., sitting and standing can vary in the sequence of frames.
Most of the above research methods ignore the importance of semantic information (such as previous CNN-based works [12,20,21]). These methods typically overlook the semantics by implicitly hiding them in the 2D skeleton map (e.g., with rows corresponding to different types of joints and columns corresponding to frame indexes).
To address the limitations of these methods, we propose a dual attention-guided multiscale dynamic aggregate graph convolutional network (DAG-GCN), which makes full use of high-level semantics to realize skeleton-based human motion recognition. First, a hierarchical framework is constructed by exploring the association and motion frame correlation between bone nodes. To better model the skeleton node association, in addition to dynamics, a graph convolutional layer is superimposed to aggregate the deep semantics of key nodes. This approach realizes the adaptive updating of nodes and the information transmission between nodes. Second, in order to better model the correlation of action frames, we perform a space maximum pooling operation on all related node features in the same frame to obtain the frame-level feature representation. Finally, the frame indexing information is embedded in the dual attention-guiding mechanism; the frame sequence is indexed, and thus, the higher level semantics are determined.
To summarize, the main contributions of this paper are as follows: (1) We propose a dual attention-guided multiscale dynamic aggregate graph convolutional network for skeleton-based human action recognition. We aim to explore the importance of joint semantics and enhance the dependencies between the semantics of different modules. (2) The proposed DAG-GCN uses a node-level module and guided-level module to hierarchically mine the spatial-temporal correlation of frames and strengthen the semantic dependency between them. (3) The node-level module performs multilayer graph convolution, which captures the position and velocity information of bone joints through the graph nodes. This information passes through the multilayer transmission to constitute the deep-layer semantics of bone joints. The guided-level module is composed of positional attention and channel attention, and this module refines the joint semantics step-by-step and establishes and strengthens the dependencies between frames.
The rest of this paper is organized as follows: In Section 2, we review related work on human action recognition using deep learning techniques. In Section 3, we introduce the dual attention-guided multiscale dynamic aggregate graph convolutional network (DAG-GCN). Section 4 gives the test results of the DAG-GCN human action recognition framework on datasets. A summary and future research plan are given in Section 5.

Related Work
Human skeleton action recognition methods based on deep learning clearly outperform traditional handcrafted methods. For example, Amir Shahroudy et al. [22] divided LSTM cells into five subcells corresponding to five parts of the human body (trunk, two arms, and two legs); the recognition results were relatively good. Liu Jun et al. [23] designed a spatiotemporal LSTM network to capture the context dependence of joints in the spatiotemporal domain, in which the joints provide information for different types of joints at each step; to some extent, this was effective at distinguishing between different types of joints. Majd M et al. [24] proposed a CLSTM framework to perceive motion data as well as spatial features and temporal dependencies. Gemmule H et al. [25] focused on learning salient spatial features using a CNN and then mapped their temporal relationships with the aid of LSTM networks. Zufan Zhang et al. [26] addressed the human action recognition issue by using a Conv-LSTM and a fully connected LSTM with different attentions. However, convolutional neural networks have demonstrated their superiority in both accuracy and parallelism [27][28][29]. These CNN-based networks transform the skeleton sequence into a skeleton map of some target size and explore the spatial and temporal features. For example, Torpey D et al. [30] separately extracted local appearance and motion features using a 3D-CNN from sampled snippets of a video.
In recent years, graph convolutional networks (GCNs) have proven effective at processing non-Euclidean structured data and have been widely used to model structured human skeleton action data. Yan et al. [15] proposed a spatial-temporal graph convolutional network and treated each bone joint as a node of the graph. Tang et al. [31] enhanced a predefined graph by redefining its edges to better construct the graph. To consider the interactions of different human parts, Liu R et al. [18] proposed a novel structure-induced graph convolutional network (Si-GCN) framework to enhance the performance of the skeleton-based action recognition task.
Li F et al. [32] stated that traditional graph convolution cannot completely cover each action execution and proposed a novel multi-stream and enhanced spatial-temporal graph convolutional network to aggregate more temporal features. Shiraki K et al. [33] proposed an acquisition of optimal connection patterns for skeleton-based action recognition with graph convolutional networks. To capture more semantics, Xiaolu Ding et al. [19] proposed a semantics-guided graph convolutional network with three types of semantic graph modules to capture action-specific latent dependencies.
Although the above GCN methods [34][35][36][37] effectively capture the semantic information of the skeleton joints, they do not refine and fuse this information; the resulting reuse of redundant information reduces the operating efficiency of the network and the accuracy of its recognition of human skeleton movements. For the proposed DAG-GCN, we design two modules. First, a node-level module better captures the semantic information of the skeleton's key points and passes it through superimposed multilayer graph convolutions, realizing the transfer of node semantic information between layers. Second, a guided-level module analyzes the frame index and gradually refines the semantics of the bone joints, filtering out redundant information. This approach preserves the important spatial and temporal structural information and builds dependencies.

Dual Attention-Guided Dynamic Aggregate Graph Convolution
Overview. In this section, we consider the correlation between the joints and the dependence between the motion frames. We propose a dual attention-guided multiscale dynamic aggregate graph convolution model (DAG-GCN) that is composed of bone joint-level modules and guided-level modules. The goal is to exploit the semantics of bone joints and movement frames for skeleton-based human action recognition. This approach removes redundant dependencies between node features from different neighborhoods, which allows powerful multiscale aggregators to effectively capture graph-wide joint relationships on human skeletons. A guided-level module operator facilitates direct information flow across space and time for effective feature learning. Integrating the dynamic aggregation scheme with a guided-level module provides a powerful feature extractor (the DAG-GCN) with multiscale receptive fields across both the spatial and temporal dimensions. The direct multiscale aggregation of features across space and time further enhances the model performance. Figure 1 shows the DAG-GCN model recognition framework.
The skeleton-based human recognition framework of the proposed dual attention-guided graph convolutional network consists of a bone joint-level module and a frame guided-level module. In "concatenation", we learn the two-stream representation of a bone joint by concatenating the spatial and temporal information of a skeleton joint. In the bone joint-level module, we use three multiscale graph convolution layers (MS-GCNs) to build the dependencies of different bone joints. To gradually refine the two-stream spatial and temporal semantic information of bone joints and strengthen the dependence between motion frames, we use dual attention for guidance. In the recognition framework of the DAG-GCN model, BJL is a bone joint-level module based on multiscale dynamic aggregate graph convolutional networks that describes and aggregates the bone joint semantic information. GL is a guided-level module based on dual attention that describes and gradually refines the semantic information of the bone joints and motion frames. "CA" is a channel attention module; "PA" is a position attention module. MS, multiscale.

Multi-Scale Dynamic Aggregates
Traditional GCNs. Graph convolutional networks [36], which have been proven effective for processing structured data, have also been used to model structured skeleton data. The human skeleton graph is denoted as ζ = (ϑ, ε), where ϑ = {ϑ_1, ϑ_2, ..., ϑ_N} is the set of N bone joint nodes and ε is the edge set of bone joints. A ∈ R^{N×N} is the adjacency matrix of graph ζ, where A_{i,j} = 0 means there is no edge from ϑ_i to ϑ_j and A_{i,j} = 1 means there is an edge; in addition, each node is self-connecting. For a human action skeleton sequence, we denote the set of all bone joint node features as χ = {χ_{t,k} ∈ R^C | t = 1, ..., T; k = 1, ..., N}, where C is the dimension of the feature vector for bone joint node ϑ_k at time t and T is the total number of movement frames. Thus, the input skeleton action sequence is adequately expressed structurally by the graph; the layer-wise update rule of traditional GCN layers [34][35][36][37] describes the features at time t as:

χ_t^{(l+1)} = σ( D^{-1/2} Ã D^{-1/2} χ_t^{(l)} W^{(l)} ),   (1)

where D is the diagonal degree matrix, Ã = A + I is the self-connecting adjacency matrix (so that D^{-1/2} Ã D^{-1/2} is its Laplacian normalization), σ(·) is the activation function of the graph convolution layers, and W^{(l)} is a weight matrix. The traditional GCN captures the semantic information of bone joint nodes by aggregating and transmitting the features of neighbor nodes.
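The layer-wise update rule above can be sketched numerically. The following is a minimal illustration (not the paper's implementation): a single GCN layer over per-frame joint features with symmetric Laplacian normalization and ReLU as σ; the toy 3-joint chain graph, the identity weight matrix, and the constant features are all illustrative assumptions.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-convolution layer over bone-joint features at a single frame.

    X: (N, C) node features; A: (N, N) 0/1 adjacency without self-loops;
    W: (C, C_out) weight matrix. Computes sigma(D^-1/2 (A + I) D^-1/2 X W)
    with ReLU as the activation sigma.
    """
    N = A.shape[0]
    A_hat = A + np.eye(N)                       # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # Laplacian norm.
    return np.maximum(0.0, A_norm @ X @ W)      # aggregate neighbors, then ReLU

# toy 3-joint chain graph (e.g., shoulder-elbow-wrist)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.ones((3, 2))   # constant 2-dimensional features per joint
W = np.eye(2)
out = gcn_layer(X, A, W)
print(out.shape)  # (3, 2)
```

Each output row mixes a joint's own features with its neighbors' features, which is exactly the aggregation-and-transmission behavior described above.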
Dynamic aggregate GCNs. Although this method effectively aggregates the structural information of neighbor nodes, the captured semantic information is transmitted only in local regions, making it impossible to capture the long-distance dependence relationships and multiscale structure information of moving joints. In addition, in traditional GCNs, the graph is fixed throughout the convolution process, which degrades the features if the input graph structure is not suitable. Thus, we propose a multiscale dynamic aggregate GCN [36][37][38] in which the graph structure is gradually refined during the process. The ability of the graph structure to capture multiscale features is improved by fusing the currently embedded features with the graph information used in the previous layer. The dynamic aggregate process is as follows. Step 1. We define the adjacency matrix A as:

A_{i,j} = 1 if ϑ_j ∈ N(ϑ_i), and A_{i,j} = 0 otherwise,   (2)

where ϑ_i represents a bone joint node and N(ϑ_i) represents the set of neighbors of the node ϑ_i.
Step 2. Based on this definition, we denote the aggregate kernel ψ of the lth layer as:

ψ^{(l)} = A^{(l)} W_ψ^{(l)},   (3)

where A^{(l)} is the adjacency matrix of the lth layer and W_ψ^{(l)} is the weight of the kernel. Step 3. We can see the advantages of the aggregate strategy. The introduction of embedded information aims to explore more accurate graph structures, and the detailed information contained in the graph structure is fully utilized. Thus, the graph structure can be dynamically updated, starting from:

A^{(1)} = A,   (4)

where A is the adjacency matrix of the initial graph convolution layer. According to Equation (1), we can denote the adjacency matrix A^{(l+1)} as:

A_{i,j}^{(l+1)} = θ^{(l)} < χ_i^{(l)}, χ_j^{(l)} > + I_{i,j} A_{i,j}^{(l)},   (5)

where < ·, · > is the inner product of vectors, and θ^{(l)} and I_{i,j} are, respectively, the relative importance and the identity matrix for bone joint nodes ϑ_i and ϑ_j.
Step 4. The contextual information revealed at different scales helps describe the rich local features of the skeleton action region at different levels. We capture multiscale spatial-temporal information by constructing graph structures of different neighborhood scales. Specifically, at scale S, each bone joint node ϑ_i is connected to its S-hop neighbors. The bone joint nodes at scale S can then be denoted as:

V_S(ϑ_i) = { ϑ_j | d(ϑ_i, ϑ_j) ≤ S },   (6)

where d(ϑ_i, ϑ_j) is the hop distance between ϑ_i and ϑ_j, V_S(ϑ_i) is the set of S-hop neighbor bone joint nodes, and the values of S are 1, 2, and 3.
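The S-hop neighborhoods of Step 4 can be built from the base adjacency matrix via powers of (A + I). The sketch below is an illustrative construction (the paper does not specify its implementation); the 5-joint chain is a toy graph standing in for a skeleton.

```python
import numpy as np

def s_hop_adjacency(A, S):
    """For each scale s = 1..S, build a 0/1 matrix connecting every joint to all
    neighbors reachable within s hops (self-loops excluded), i.e. the set
    V_s(joint) from the text realized through powers of (A + I)."""
    N = A.shape[0]
    M = A + np.eye(N)
    P = np.eye(N)
    scales = []
    for s in range(1, S + 1):
        P = P @ M                                    # entries of (A + I)^s
        A_s = (P > 0).astype(float) - np.eye(N)      # reachable within s hops, drop self
        scales.append(A_s)
    return scales

# toy 5-joint chain: 0-1-2-3-4
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
A1, A2, A3 = s_hop_adjacency(A, 3)
print(A2[0, 2], A2[0, 3])  # joint 2 is within 2 hops of joint 0; joint 3 is not
```

At scale 1 the construction recovers the original adjacency, and each larger scale widens the receptive field by one hop, matching the S = 1, 2, 3 setting above.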

MS-GCN module aggregation. To obtain more effective multiscale spatial and temporal semantic information of skeleton-based actions, the dependence between different scales of semantic information and the ability to represent relevant information are strengthened, which reduces the use of redundant information. We reaggregate multiple multiscale graph convolutional network (MS-GCN) modules at time t as:

χ_t^{(l+1)} = Σ_{s=1}^{S} MG_s( χ_t^{(l)} ),   (7)

where MG_s is the multiscale graph convolution module at scale s. According to Equation (1), the multiscale graph convolutional network (MS-GCN) can be denoted as:

MG_s( χ_t^{(l)} ) = σ( D_s^{-1/2} Ã_s D_s^{-1/2} χ_t^{(l)} W_s^{(l)} ),   (8)

where D_{i,i} = Σ_j A_{i,j} and χ_t^{(l)} ∈ R^{T×N×C^{(l)}}. To summarize, the association between joints is captured by the joint node-level module, and the contextual multiscale structure semantics of different levels are captured. In addition, the aggregated MS-GCN is used to explore the correlation of structural skeleton actions and the dependence between joints. We use the semantics of the joint type and the dynamics to learn the graph connections among the nodes (different joints) within a frame. The joint type information is helpful for learning a suitable adjacency matrix (i.e., relations between joints in terms of connection weights). Take two source joints, foot and hand, and a target joint, head, as an example: intuitively, the connection weight from foot to head should differ from that from hand to head even when the dynamics of foot and hand are the same. Second, as part of the information of a joint, the semantics of joint types takes part in the message passing process in MS-GCN layers.
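The reaggregation over scales can be sketched as a sum of per-scale propagations followed by a shared nonlinearity. This is an illustrative simplification (summation is one plausible aggregation; the scale adjacencies, weights, and toy data below are assumptions, not the paper's parameters):

```python
import numpy as np

def normalize(A_s):
    """Symmetric Laplacian normalization of a self-connected adjacency matrix."""
    A_hat = A_s + np.eye(A_s.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def ms_gcn(X, scale_adjs, scale_weights):
    """Sum the normalized propagation results over all neighborhood scales,
    then apply a ReLU: one MS-GCN layer over per-frame joint features."""
    out = np.zeros((X.shape[0], scale_weights[0].shape[1]))
    for A_s, W_s in zip(scale_adjs, scale_weights):
        out += normalize(A_s) @ X @ W_s
    return np.maximum(0.0, out)

# toy example: 4 joints in a chain, with 1-hop and 2-hop scale adjacencies
A1 = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
A2 = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
X = np.random.default_rng(0).normal(size=(4, 3))
Ws = [np.eye(3), np.eye(3)]
Y = ms_gcn(X, [A1, A2], Ws)
print(Y.shape)  # (4, 3)
```

Each scale contributes a differently sized receptive field, so the summed output mixes local and longer-range joint context in a single layer.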

Guided-Level Module
When the node-level module encodes the skeleton data, it ignores the correlation between motion frames, weakening the representation of related information and strengthening that of unrelated features. In addition, to further refine the multiscale semantic information and strengthen the dependency between frames, we introduce self-guided attention to gradually refine the multiscale semantic information of the joint nodes. This helps encode local and global semantics, adaptively establish long-term dependency relationships between motion frames, eliminate redundant information, and highlight the representation of relevant information on joint nodes.
First, we transfer the multiscale semantics captured by the node-level module to the maximum pooling layer. Second, we input the pooled feature map into the self-guided attention module to generate detailed attention features. We denote the attention feature GAtt_t at time t as:

GAtt_t = GAttModule( MP(χ_t) ),   (9)

where GAttModule(·) is the attention-guided module and MP(·) is the maximum pooling layer. The attention-guided module is composed of positional attention (PA) and channel attention (CA) [39,40]. Positional attention can further obtain rich context representations, while channel attention can be regarded as a class-specific response that strengthens the correlation and dependence between different semantic information and motion frames. Suppose the attention input feature map is F = MP(χ_t) ∈ R^{T×N×C}, where T is the frame dimension, N is the bone joint dimension, and C is the channel dimension.
Positional attention. The input feature map F is passed through a convolution block and reshaped to generate a new feature map F_0^{PA} ∈ R^{(T×N)×C}. In another branch, F follows the same operations and is then reshaped, generating a second feature map F_1^{PA} ∈ R^{C×(T×N)}. The two new feature maps are multiplied to generate the positional attention coefficient α_{i,j}^{PA}:

α_{i,j}^{PA} = exp( F_{0,i}^{PA} · F_{1,j}^{PA} ) / Σ_{i=1}^{T×N} exp( F_{0,i}^{PA} · F_{1,j}^{PA} ),   (10)

where α_{i,j}^{PA} measures the effect of the ith position on the jth position. In the third branch, we again reshape the input feature F and generate a new feature map F_2^{PA} ∈ R^{C×(T×N)}. Thus, the positional attention feature map F^{PA} is:

F_j^{PA} = λ Σ_{i=1}^{T×N} ( α_{i,j}^{PA} F_{2,i}^{PA} ) + F_j,   (11)

where the value of λ is gradually increased by learning.
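A minimal sketch of position attention in this dual-attention style follows. It is an illustrative assumption-laden simplification: the three convolution branches are replaced by identity projections, the positions are the flattened T×N grid, and λ is a fixed scalar rather than a learned parameter.

```python
import numpy as np

def positional_attention(F, lam=1.0):
    """Position attention over a flattened (P, C) feature map, P = T*N.
    Softmax over source positions i weighs every position's contribution to
    each target position j, followed by a residual connection scaled by lam."""
    logits = F @ F.T                                 # (P, P) pairwise similarities
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=0, keepdims=True)        # alpha[i, j]: effect of i on j
    return lam * (alpha.T @ F) + F                   # weighted sum + residual

T, N, C = 2, 3, 4                                    # toy dimensions
F = np.random.default_rng(1).normal(size=(T * N, C))
out = positional_attention(F, lam=0.5)
print(out.shape)  # (6, 4)
```

With lam = 0 the module reduces to the identity, mirroring how the learned λ starts near zero and gradually admits more global context.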
In summary, positional attention selectively aggregates the global context information of joints and frames into the captured features.
Channel attention. Channel attention reweights the channels of the input feature map F. The channel attention coefficient α_{i,j}^{CA} is:

α_{i,j}^{CA} = exp( F_{0,i}^{CA} · F_{1,j}^{CA} ) / Σ_{i=1}^{C} exp( F_{0,i}^{CA} · F_{1,j}^{CA} ),   (12)

where F_0^{CA} ∈ R^{(T×N)×C} and F_1^{CA} ∈ R^{C×(T×N)} are the feature maps reshaped by different branches. Thus, the channel attention feature map F_j^{CA} is:

F_j^{CA} = π Σ_{i=1}^{C} ( α_{i,j}^{CA} F_{2,i}^{CA} ) + F_j,   (13)

where the value of π is gradually increased by learning, and F_2^{CA} ∈ R^{C×(T×N)} is the feature map reshaped by a third branch.
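Channel attention can be sketched analogously, with the affinity computed between the C channel maps instead of the T×N positions. As before, this is a simplified illustration: identity projections replace the reshaping branches, and π is a fixed scalar rather than a learned parameter.

```python
import numpy as np

def channel_attention(F, pi=1.0):
    """Channel attention over a flattened (P, C) feature map, P = T*N.
    A (C, C) softmax affinity mixes channel maps, followed by a residual
    connection scaled by pi."""
    logits = F.T @ F                                 # (C, C) channel similarities
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    beta = np.exp(logits)
    beta /= beta.sum(axis=0, keepdims=True)          # beta[i, j]: effect of channel i on j
    return (F @ beta) * pi + F                       # reweighted channels + residual

F = np.random.default_rng(2).normal(size=(6, 4))     # toy (T*N, C) feature map
out = channel_attention(F, pi=0.5)
print(out.shape)  # (6, 4)
```

As with position attention, pi = 0 recovers the input unchanged, so the module can interpolate between the original features and the class-specific channel response.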
In summary, channel attention reaggregates captured semantic information into original features, highlights dependencies between same-type nodes and motion frames, and increases the ability to distinguish properties between classes.
Finally, we transfer and transform the captured multiscale spatial and temporal semantic information through a global average pooling (GAP) layer, and we input the result into a fully connected (FC) layer to complete the skeleton action recognition.
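The final GAP-plus-FC head can be sketched as follows; the tensor shapes, the channel count, and the random weights are illustrative assumptions (only N = 25 joints and the 60 NTU60 classes come from the text).

```python
import numpy as np

def recognition_head(feat, W_fc, b_fc):
    """Global average pooling over the frame and joint axes of a (T, N, C)
    feature tensor, followed by a fully connected classifier layer."""
    pooled = feat.mean(axis=(0, 1))    # GAP: (T, N, C) -> (C,)
    return pooled @ W_fc + b_fc        # FC: (C,) -> (num_classes,) class scores

T, N, C, K = 4, 25, 8, 60              # N = 25 joints, K = 60 NTU60 action classes
feat = np.random.default_rng(3).normal(size=(T, N, C))
W_fc = np.random.default_rng(4).normal(size=(C, K))
b_fc = np.zeros(K)
scores = recognition_head(feat, W_fc, b_fc)
print(scores.shape)  # (60,)
```

The class with the highest score is taken as the recognized action.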

Experimental Results and Analysis
In this section, we conduct correlation experiments to verify the effectiveness of the proposed DAG-GCN framework for skeleton-based human action recognition, and we provide the experimental results and analysis. More precisely, we compare the DAG-GCN method with other state-of-the-art approaches on the NTU60-RGB+D, NTU120-RGB+D, and other public datasets. Then, we present the results of ablation experiments on the multiscale dynamic aggregate operations and show the efficiency of the DAG-GCN recognition framework.

Datasets
NTU60-RGB+D (NTU60) dataset [22,31]. This dataset contains 60 performed action classes with 56,880 skeleton sequences from 40 different subjects. Each human skeleton graph is represented by 25 (in this paper, N = 25) body bone joint nodes. Each movement frame contains one or two action subjects. In the cross-view (X-View) evaluation, the sequences are split by camera view into training and test sets. For the cross-subject (X-Sub) evaluation, the sequences of the 40 action subjects are divided into 40,091 training and 16,487 testing examples.
NTU120-RGB+D (NTU120) dataset [23,32]. This dataset was collected with Kinect cameras and extends NTU60-RGB+D to 113,945 examples and 120 action classes. For the cross-view (X-View) and cross-subject (X-Sub) evaluations, half of the 106 human action subjects are used for training, and the rest are used for testing.

Training and Implementation Details
Framework settings. To gather as much multiscale and contextual semantic information as possible, in the bone joint node-level module the number of MS-GCN layers was set to two, and the scale S was set to three (1 ≤ S ≤ 3). Before each MS-GCN block, the number of neurons in the input CNN layer was set to 64, and the number of MS-GCN blocks was set to three (r = 3). The guided-level module contains position attention and channel attention. Finally, we used the Adam optimizer. The initial learning rate was set to 0.005, and the batch size was 64. The weight decay was set to 0.0005, and the number of epochs was set to 400.

Training environment.
All experiments were run with PyTorch 1.3.0 and Python 3.6 on two NVIDIA Tesla P100 GPUs. To improve accuracy, we used label smoothing with a smoothing factor of 0.1. We then used the cross-entropy loss to train the DAG-GCN model for human skeleton action recognition.
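The label-smoothed cross-entropy used in training can be sketched as follows. This is one common formulation (the true class keeps probability 1 − ε and the other classes share ε uniformly); the exact variant used in the paper is not specified, and the toy logits are illustrative.

```python
import numpy as np

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy with label smoothing, factor eps = 0.1 as in the setup.
    The one-hot target is softened: 1 - eps for the true class, eps spread
    uniformly over the remaining K - 1 classes."""
    K = logits.shape[0]
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))  # stable log-softmax
    q = np.full(K, eps / (K - 1))    # soft target distribution
    q[target] = 1.0 - eps
    return -(q * log_probs).sum()    # cross-entropy against the soft target

logits = np.array([3.0, 1.0, 0.2])   # toy 3-class scores
loss = smoothed_cross_entropy(logits, target=0, eps=0.1)
print(float(loss))
```

Compared with the hard one-hot target (ε = 0), the smoothed target penalizes overconfident predictions, which typically improves generalization.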

Ablation Experiments
To verify the influence of each component in the DAG-GCN model with regard to recognition accuracy, we present and analyze the experimental results of each component on the NTU60 and NTU120 public datasets.

Ablation Study on the Proposed Node Module
Multiscale structure features contain important semantic information of a skeleton action sequence. To further verify the effectiveness of different numbers of scales in the dynamic graph structure, we built multiple graph convolution models and discuss the experimental results on the NTU60 and NTU120 datasets. Table 1 lists the comparison results for graph models of different scales, where the evaluation index is accuracy (%). Table 1. The comparison results on NTU60/120 with the cross-view (X-View) and X-Sub (%). In Table 1, S = 1, 2, 3 is the scale of the multiscale graph convolution layers (MS-GCN layers), m = 1, 3, 6 is the number of MS-GCN blocks, and GL is a guided-level module that contains position attention and channel attention.

Model X-View (NTU60) X-Sub (NTU60) X-View (NTU120) X-Sub (NTU120)
According to the experimental results in Table 1, we draw the following conclusions: (1) To learn the multiscale structure information of a skeleton sequence, the number of scales and blocks of the bone joint node module are varied. We find that (m = 1, S = 3)/GL outperforms (m = 1, S = 1)/GL by 3.89% and 9.34%, respectively, for the X-View and X-Sub accuracy on the NTU60 dataset, while the accuracy on NTU120 improved by 14.59% and 23.98%, respectively. The reason is that more scales of structural information can be captured, and the structural information at different scales is complementary. This alleviates the deficiency of single-scale information in describing the bone joints and motion frames.
(2) Multiscale local and global context semantic information is beneficial for message passing in MS-GCN blocks. This information also improves the ability of the semantic information to represent bone joints and motion frames. For example, the experimental results of (m = 3, S = 3)/GL are superior to those of (m = 1, S = 1)/GL; the accuracy improved by 11.28% and 15.29%, respectively, for the X-View and X-Sub accuracy on the NTU120 dataset.
The main reason is that it is difficult for single MS-GCN blocks to explore semantic features of the skeleton actions with high-order structural information. For example, in the messaging process, the semantic information expressed should be different even if the 3D coordinates corresponding to different joints are the same. Introducing a multiscale graph convolution (MS-GCN) block allows the semantic information to represent bone joint and movement frames.
(3) When using larger scales and two-stream joint semantics for the expression of bone joints and movement frames at the same time, neither (m = 6, S = 3)/GL nor (m = 6, S = 6)/GL produces further benefits compared with (m = 3, S = 3)/GL. The main reason is that the large-scale structure contains too much redundant information and loses the detailed information of the graph structure, although multiple scales capture bone joints and motion frames at more levels.

Comparison to the State-of-the-Art
To further verify that the proposed model can better capture the multiscale and global contextual semantic information of bone nodes and motion frames compared with the state-of-the-art skeleton recognition model, we used the NTU60 and NTU120 public datasets in a comparison study. The experimental results obtained by many state-of-the-art human skeleton action recognition models are listed in Tables 2 and 3. At present, the state-of-the-art recognition methods contain graph and other (non-graph) models. Table 2. Comparison of the results on NTU60/120 with X-View and X-Sub (%).

Models: X-View (NTU60), X-Sub (NTU60), X-View (NTU120), X-Sub (NTU120). In Table 2, a dash "-" indicates that the method was not experimentally tested on that dataset. A bold value indicates the highest value on the dataset.
According to the experimental results in Table 2, we draw the following conclusions: (1) For the NTU60 dataset, the proposed DAG-GCN recognition method obtained the optimal results. DAG-GCN outperformed 2S-GCN and SGN by 0.66% and 1.26%, respectively, in the X-View accuracy. However, 2S-GCN outperformed DAG-GCN by 0.46% and 3.87%, respectively, for the X-View and X-Sub accuracy on the NTU120 dataset. This result indicates that the adaptive graph structure (2S-GCN) has a better embedding effect on large-scale input data. Moreover, the second-order information of the skeleton data is explicitly formulated and combined with the first-order information using a two-stream framework, which brings a notable improvement in recognition performance. The topology of the graph is adaptively learned for different GCN layers and skeleton samples in an end-to-end manner, which better suits the action recognition task and the hierarchical structure of the GCNs.
(2) For the overall experimental results, we observe that the graph structure methods outperform non-graph methods with regard to accuracy on the NTU120 and NTU60 datasets. The main reason is that the graph-based structure methods can better obtain the deep semantic information of bone joints by aggregating and transferring neighbor node information. In addition, the correlation between adjacent nodes is used to strengthen the dependence between moving frames.

Performance of the Guided-Level Module
In this section, we discuss the performance of the different attention modules in the skeleton action recognition framework. The performance results are listed in Table 3. Table 3. Performance results of guided-level modules on NTU60/120 with the X-View and X-Sub (%). In Table 3, BJL indicates the bone joint-level module, NG the non-guided module, and PA and CA indicate position attention and channel attention, respectively.

Models X-View (NTU60) X-Sub (NTU60) X-View (NTU120) X-Sub (NTU120)
According to the performance results in Table 3, we draw the following conclusions: (1) For the NTU60 and NTU120 datasets, the proposed BJL/GL (DAG-GCN) recognition method obtained the optimal results, and compared to the non-guided module methods, we observe that by integrating either a position attention (PA) or channel attention (CA) module in the multiscale aggregate graph convolution (BJL), the performance improves by 11.13% and 16.46%, respectively, for the X-Sub accuracy, and by 5.07% and 16.95%, respectively, for the X-View accuracy. This result occurs because features generated by the proposed multiscale aggregate graph convolution are refined in the guided-level module frameworks in the proposed DAG-GCN, and the dependencies between motion frames are highlighted.
(2) On both datasets, the channel attention (BJL/CA) module achieves better recognition accuracy than the position attention (BJL/PA) module. Although both attention modules improve recognition accuracy, channel attention outperforms position attention, both on its own and when the two are combined.

Visualization of the Recognition Results
Showing that the performance results differ through ablation experiments may not be enough to fully understand the advantages of the proposed DAG-GCN recognition framework. Although this framework improves the performance of skeleton action recognition (the results are shown in Tables 1-3), the training process and the impacts of the different components are ignored. To this end, we visualize the impact of the different components, and the training and testing process on the NTU60 dataset is shown below.
According to the training and testing process in Figure 2, we observe that the loss value for the X-View benchmark on the NTU60 dataset tends to stabilize after decreasing by a certain degree.
The visualization recognition results for consecutive frames from the NTU120 dataset are shown in Figure 3. According to these results, both the "check time" and "clapping" actions are accurately recognized. Similarly, the semantics of consecutive frames (in red) are better captured by the proposed recognition framework (DAG-GCN); that is, we can more effectively explore the detailed changes of the bone joints, especially when a region has a complex or subtle change (in green). These visualization results indicate that the DAG-GCN can capture the finer semantic details of skeleton actions while avoiding semantic loss, and it enhances the dependencies of consecutive frames. The selective aggregation of temporal and spatial information by the dual attention guided-level module helps capture contextual information, and the semantic information of the joints is gradually refined. The visualization results of the different attention modules (PA and CA) on the NTU60 dataset are shown in Figure 4. We observe that the response of semantic classes under position attention (PA) is more noticeable than that under channel attention (CA). Although position attention can effectively enhance specific semantics and locate key bone joints, some non-key bone joint regions are still highlighted on the semantic maps. By contrast, our guided-level (GL) module captures semantic information that better focuses on the specific bone joints in the key regions of interest. In particular, we observe that captured semantic maps whose highlighted areas focus on a few key consecutive bone joints can avoid the misrecognition of actions with similar continuous movements.

Conclusions and Future Work
We propose a dual attention-guided graph convolutional network for the task of human skeleton action recognition. The recognition framework incorporates a multiscale aggregate graph convolution strategy to combine multiscale semantic information of bone joints and consecutive frames at different levels and use guided attention to gradually aggregate and refine relevant local and global context semantic information. Then, the guided-level module filters irrelevant noisy information and improves the frameworks to focus on relevant class-specific bone joints in the skeleton. To validate our proposed DAG-GCN method, we conducted experiments on the NTU60 and NTU120 datasets in an action recognition task, and we presented the results of extensive ablation experiments to evaluate the effectiveness of the proposed methods.
The experimental results showed that the proposed recognition framework outperformed other models on the NTU60 dataset, mainly due to the rich multiscale contextual dependencies of local and global information. In addition, we enhanced the correlation between frames and bone joints, making the model more suitable for small-scale data. However, this approach was not superior to the 2S-GCN method on the NTU120 data, mainly because that approach employs three types of semantic graph modules (a structural graph extraction module, an actional graph inference module, and an attention graph iteration module) to aggregate L-hop joint neighbor information, capturing action-specific latent dependencies and distributing importance levels. Compared with DAG-GCN, this approach is more suitable for large-scale data. In the future, we will explore a faster and more powerful semantic network for human skeleton action recognition.

Abbreviations: DAG-GCN, dual attention-guided multiscale dynamic aggregate graph convolutional network; GL, guided-level module; BJL, bone joint-level module; NG, non-guided module.