1. Introduction
Research on human action recognition has become one of the most active topics in computer vision in recent years. It has been widely used in security surveillance [1], robot vision [2], motion-sensing games, virtual reality [3], etc. The existing methods [4,5,6] for action recognition are often conducted on RGB video or skeleton data. Although these methods achieve impressive performance, they are often affected by dynamic circumstances, viewpoints, illumination changes, complicated backgrounds, etc.
A human skeleton sequence can express discriminative and robust dynamic information for action recognition tasks. Skeleton data is composed of the coordinates of key joints of the human body; it can effectively describe the body structure and the dynamics of actions. Due to the absence of color information, skeleton data is more robust to variations in appearance and environment and more efficient to compute [7,8]. In addition, skeleton data is extremely compact in terms of data size. It is not only an effective representation of the human body structure but also easily obtained by pose estimation algorithms [9,10] or depth sensors such as Microsoft Kinect, Asus Xtion and Intel RealSense. Owing to the development of these low-cost sensors, they are widely used in many fields to capture the motion of human joints. These sensors can also acquire other data modalities, such as depth maps and RGB video.
Earlier skeleton-based methods [11,12,13,14,15,16] usually arranged the human body joints as a chain and employed their coordinates as a vector sequence. Considering this sequence property, Recurrent Neural Networks (RNNs) were naturally applied to the action recognition task [11,12,15]. On the other hand, some works treated the skeleton data as a pseudo-image, and Convolutional Neural Networks (CNNs) were then employed to recognize the actions [13]. However, these methods ignore the latent spatial structure of the body, which is beneficial for action recognition. Given the topological structure of the human skeleton, modeling skeleton data as a graph is more suitable than a vector sequence or pseudo-image [17,18,19,20]. Deep learning methods [21,22,23] that conduct convolution on graphs have shown impressive success in many applications. Motivated by the fact that human skeleton data is naturally a topological graph, Yan et al. [20] first introduced Graph Convolutional Networks (GCNs) [21] and proposed the spatial-temporal graph convolutional network (ST-GCN) for skeleton-based action recognition. They constructed a spatial-temporal graph and conducted convolution on it. Subsequently, many GCN-based methods [17,18,19,24,25,26] were proposed and achieved performance superior to traditional methods. Nevertheless, the graph structures in most GCN-based methods are constructed manually and remain fixed during the training process. For instance, the skeleton graph in [20] merely represented the physical structure of the human body, so relations between joints were built only in a local context. This may not be suitable for actions that involve joints without a direct physical connection, such as “clapping”, “talking on the phone” and “combing one’s hair”. To capture relations between distant joints, some approaches extracted multi-scale structural features via higher-order polynomials of the skeleton adjacency matrix. Li et al. [17] introduced multiple-hop modules to establish correlations between distant joints. However, this still had limitations in establishing long-range joint dependencies, and it introduced redundant connections between further and closer neighbors. To capture both nearby joint dependencies and distant joint relations, we design a simple yet efficient adaptive adjacency matrix.
In addition, some works [19,20,27] simply conducted 2D convolution along the temporal dimension, which cannot effectively capture long-range temporal features. Actions such as “put on hat” and “take off hat” are very ambiguous under short-time observation; distinguishing them requires capturing long-range temporal information. Therefore, we construct a memory module that remembers long-range temporal information via GRU. Furthermore, not only is what happened in the past critical, but information about the future is also important for the action recognition task, so it is necessary to memorize this periodic information. Moreover, there is always redundant information in an action sequence; not every frame in a video sequence is important for action recognition. Thus, how to select key frames and extract discriminative features also needs to be carefully considered.
In this paper, we propose an end-to-end network architecture, termed the Adaptive Attention Memory Graph Convolutional Network (AAM-GCN), for skeleton-based human action recognition. It employs graph convolution to adaptively construct the spatial configuration within one frame and uses multiple bidirectional GRU layers to extract temporal information. Concretely, we construct an adaptive spatial graph convolution (ASGC) module to explore the spatial configuration features. When constructing the spatial graph, we design an adaptive matrix that can infer relations between any pair of joints in a data-driven manner according to the actions, without weakening the relations of physically connected joints. The topology of the constructed graph is thus established dynamically during the training process. The attention memory module is implemented via multiple bidirectional GRU layers to build an attention-enhanced memory, which can remember the long-range temporal context before and after the actions. In addition, not all frames in the action sequence are informative, and there is much redundancy; the key frames carrying more discriminative information for action recognition should be enhanced during training. Therefore, we incorporate an attention mechanism into the memory to select the important frames. Moreover, the attention-weighted memory is refined progressively over multiple iterations for recognition.
The advantages of our proposed method can be described as follows: (1) The constructed adaptive graph can effectively capture the latent dependencies between arbitrary joints, including those that have no physical connection but are strongly correlated in the actions; this is more suitable for real actions that need the collaboration of different body parts. (2) The memory module is capable of modeling long-range temporal relationships over distant frames, which is beneficial for extracting discriminative features because it provides a temporal context for the actions. (3) Owing to its forward and backward directions, the bidirectional GRU can encode both what happened in the past and what will happen in the future, which eliminates ambiguities between actions such as “put on hat” and “take off hat” via long-time observation. (4) The attention mechanism helps the model select key frames in the action sequence.
The main contributions of our work are summarized as follows:
We propose an AAM-GCN network to model dynamic skeletons for action recognition, which can construct the graph structure adaptively during the training process and explicitly explore the latent dependency among the joints.
By constructing an attention-enhanced memory, AAM-GCN can selectively focus on key frames and capture long-range discriminative temporal features from both the past and the future.
We evaluate the proposed model on three large datasets: NTU RGB+D [28], Kinetics [29] and the Motion Capture Dataset HDM05 [30]. It exhibits performance superior to several state-of-the-art methods in both constrained and unconstrained environments. Furthermore, we conduct an ablation study to demonstrate the effectiveness of each individual part of our model.
2. Related Works
The human skeleton can effectively represent the human body structure and movement because it is robust to dynamic circumstances and irrelevant to appearance. For action recognition tasks, early works [14,16] often used handcrafted features to design the model. In recent years, deep learning has received more attention, and many Convolutional Neural Network (CNN)- and Recurrent Neural Network (RNN)-based methods have been proposed for skeleton action recognition [13,31,32]. CNN-based models usually treat the joint coordinates as a grid structure and apply convolution to extract discriminative features [13,33,34,35,36]. Xie et al. [13] proposed a memory attention network that took each element in the coordinate vectors of all joints as 2D grid data and then conducted convolution on it. Li et al. [33] designed a hierarchical structure to extract co-occurrence features of skeleton joints using a CNN.
Compared with CNNs, RNNs have a stronger ability to model sequence data, and various RNN-based models have been applied to action recognition [37]. RNN-based models usually use the joint coordinates and treat the skeleton data as a sequence of vectors along the temporal dimension [6,11,28,38,39]. As a variant of RNN, Long Short-Term Memory (LSTM) has also been widely employed to learn the temporal dynamics of skeleton sequences [12,18,40]. Liu et al. [12] arranged the joints as a chain along the spatial dimension and then employed the coordinates with LSTM to build a global context memory for action recognition. Du et al. [11] constructed a part-based hierarchical RNN architecture in which the skeleton was divided into several parts that were then fed to a hierarchical Recurrent Neural Network. Song et al. [38] adopted an attention mechanism to select key joints and focus on discriminative features via LSTM. However, due to their fully connected structure, LSTM-based networks suffer from high computational complexity. Consequently, the Gated Recurrent Unit (GRU), a simplified and computationally inexpensive variant of LSTM, was proposed by Cho et al. [41].
Although the existing CNN- and RNN-based methods have achieved remarkable results, there are still limitations in using CNNs and RNNs to model skeleton data. First, CNNs and RNNs cannot exactly describe the spatial-temporal structure of the body. Second, the aforementioned CNN and RNN methods all characterize the topology of the human skeleton joints as pseudo-images or vector sequences rather than a graph. Thus, the spatial structure of the body skeleton is neglected.
Graph Convolutional Networks (GCNs) have been widely used in many areas, such as social networks and biological data analysis [2]. Since the topology of the human skeleton can be naturally represented by a graph, graph convolution has recently been applied to skeleton action recognition. Yan et al. [20] first applied GCNs to skeleton data for human action recognition, and a series of GCN-based methods have since emerged. Shi et al. [19] proposed a Two-stream Adaptive Graph Convolutional Network (2s-AGCN); they used joints and bones to build two separate streams and fused the scores for final recognition. Li et al. [17] introduced the Actional-Structural Graph Convolutional Network (AS-GCN) and designed two types of links for the human skeleton: actional links and structural links. Combining GCN and LSTM, Si et al. [18] presented an attention-based human action recognition model built on hierarchical spatial reasoning and a temporal stack learning network.
More recently, Liu et al. [42] proposed a feature extractor (MS-G3D) that modeled cross-space-time joint dependencies. However, for each central node only the neighbor nodes in a few adjacent frames are convolved, so the information in the temporal field is relatively local. Dynamic GCN [43] is a hybrid GCN-CNN framework that designed dynamic graph topologies for different input samples as well as graph convolutional layers of various depths. Nevertheless, the generality of the graph structure and the collaboration of joints still need to be considered. Plizzari et al. [44] proposed a Spatial-Temporal Transformer network (ST-TR), which modeled dependencies between joints using the transformer self-attention operator. It combined a Spatial Self-Attention module and a Temporal Self-Attention module in a two-stream fashion and obtained better results than models that used both joint and bone information as input. Note that the input data of ST-TR were pre-processed following [25]. Chen et al. [45] proposed Channel-wise Topology Refinement Graph Convolution (CTR-GC) to dynamically learn different topologies and effectively aggregate joint features in different channels; they employed data of four modalities (joint, bone, joint motion and bone motion) as inputs. The input data of both CTR-GC and ST-TR are pre-processed before training, and CTR-GC additionally adopted a warm-up strategy at the beginning of training.
Although the previous methods have remarkable performance, how to construct a specific graph that effectively represents the human skeleton data and how to extract discriminative features remain challenging issues.
The attention mechanism has received extensive attention in many applications [46,47]. It helps a network focus on key information while neglecting inessential information. Xu et al. [48] employed an attention mechanism in their image caption generation work. Yao et al. [49] applied attention along the temporal dimension for video captioning. Stollenga et al. [50] adopted an attention mechanism for image classification. Luong et al. [47] combined global and local attention to compute different attention weights for neural machine translation. Some deep learning-based works have adopted attention mechanisms in action recognition [46,51]; the hidden state of the previous time step in an LSTM is often used to calculate attention. Liu et al. [12] proposed a global context-aware attention LSTM network (GCA-LSTM) for skeleton action recognition, introducing an iterative attention mechanism that uses global context information to choose the informative joints in an action video clip. Similarly, Song et al. [38] computed attention weights to focus on key frames and the more informative joints. Xie et al. [13] designed a temporal attention recalibration module to weight each frame in the skeleton action sequence. Si et al. [18] proposed an attention-enhanced graph convolutional LSTM network that incorporated soft attention to automatically measure the importance of each joint and adaptively select the key ones.
3. Graph Convolutional Networks
The skeleton data can be acquired by depth sensors or generated by pose estimation algorithms from a raw RGB video clip. The position of each joint is represented as a 2D or 3D coordinate vector, so the skeleton in a frame is represented as a set of joint coordinates. In GCN-based methods [18,20], the human skeleton dynamics in an action video clip are often modeled as a spatial-temporal graph. In this graph, the joints are nodes, the physical connections between joints within a frame are spatial edges, and temporal edges connect the same joint across consecutive frames. The joint coordinates are taken as the attributes of the corresponding nodes. We follow [18,20] as our basic approach for constructing the graph. The detailed process is described below:
In GCN-based action recognition works [18,20], the dynamics of a human skeleton sequence with $N$ joints and $T$ frames are denoted as a spatial-temporal graph $G=(V,E)$. The node set $V=\{v_{ti}\mid t=1,\dots,T,\ i=1,\dots,N\}$ contains all the joints in the skeleton sequence. The edge set $E$ contains two subsets: the intra-frame edge set $E_S$ and the inter-frame edge set $E_F$. For clear demonstration, an example of a spatial-temporal graph is illustrated in Figure 1. An action clip is composed of $T$ frames, the person in each frame has $N$ joints, and each joint is represented by 2D or 3D coordinates. Thus, the action clip can be represented by the joint-coordinate tensor $X\in\mathbb{R}^{C\times T\times N}$, where $C$ denotes the dimension of the coordinates.
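The layout of this tensor can be illustrated with a small NumPy sketch (the concrete dimensions below are hypothetical choices for illustration, not values fixed by the paper):

```python
import numpy as np

# Hypothetical dimensions: 3D coordinates (C=3), a 300-frame clip (T=300)
# and a 25-joint skeleton (N=25), as captured by a depth sensor.
C, T, N = 3, 300, 25
X = np.zeros((C, T, N))          # joint-coordinate tensor X in R^{C x T x N}
X[:, 0, 0] = [0.1, 1.2, 2.3]     # e.g. the 3D position of joint 0 in frame 0
```

Each slice `X[:, t, i]` is then the coordinate vector of joint $i$ at frame $t$, i.e. the attribute of node $v_{ti}$.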
For an arbitrary node $v_{ti}$ in $V$, its neighbor set in the current frame is defined as $B(v_{ti})=\{v_{tj}\mid d(v_{tj},v_{ti})\le D\}$, where $d(v_{tj},v_{ti})$ is the minimum path length from $v_{tj}$ to $v_{ti}$ and $D$ is set to 1. According to the strategies designed for partitioning the neighbor set [20], a labeling function $l_{ti}$ is designed to partition the neighbor set $B(v_{ti})$ of node $v_{ti}$ into $K$ subsets at the $t$-th frame. Each node in the neighbor set is assigned a label from $\{0,\dots,K-1\}$. The nodes in the same subset have the same label and share the convolution weights, and the graph convolution is conducted as:

$$f_{out}(v_{ti})=\sum_{v_{tj}\in B(v_{ti})}\frac{1}{Z_{ti}(v_{tj})}f_{in}(v_{tj})\cdot w\big(l_{ti}(v_{tj})\big) \quad (1)$$

where $f_{in}(v_{tj})$ represents the feature of neighbor node $v_{tj}$, and $w(\cdot)$ is a weight function that allocates weights according to the labels. The label of each node $v_{tj}$ in the neighbor set $B(v_{ti})$ is computed by $l_{ti}(v_{tj})$ according to the partitioning strategy, and neighbor nodes with the same label are allocated the same weights. $Z_{ti}(v_{tj})$ is the number of nodes in the subset containing $v_{tj}$, which serves as a normalizing term, and $f_{out}(v_{ti})$ represents the result of the graph convolution at node $v_{ti}$. The graph convolution can be implemented as in [21]:

$$f_{out}=\Lambda^{-\frac{1}{2}}(A+I)\Lambda^{-\frac{1}{2}}f_{in}W \quad (2)$$

where $\Lambda$ is the degree matrix with diagonal elements $\Lambda^{ii}=\sum_{j}(A^{ij}+I^{ij})$, $A$ is the adjacency matrix encoding the connections of nodes before partitioning the neighbor sets, and $I$ is the identity matrix representing self-connections. The topology of the human skeleton graph in a frame is described by $A$ and $I$. According to the partitioning strategy, the labeling function divides each neighbor set into $K$ subsets. Therefore, the adjacency matrix can be dismantled as $A+I=\sum_{k}A_k$, $k\in\{0,\dots,K-1\}$, where $A_k$ represents the adjacency matrix of each subset and $k$ is the label. Each $A_k$ has the same size as the original $N\times N$ normalized adjacency matrix, where $N$ is the number of joints. Given $A_k$, Equation (2) can be represented as:

$$f_{out}=\sum_{k}\Lambda_k^{-\frac{1}{2}}A_k\Lambda_k^{-\frac{1}{2}}f_{in}W_k \quad (3)$$
Taking the spatial configuration partitioning for instance (see Figure 2), the neighbor set of the root node is partitioned into $K=3$ subsets: (1) the root node itself; (2) the centrifugal group: the neighbor nodes that are further from the gravity center of the skeleton than the root node; and (3) the centripetal group: the neighbor nodes that are closer to the gravity center. Therefore, the dismantled adjacency matrix $A_k$ represents the correlations of nodes in each subset.
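A minimal NumPy sketch of the per-subset graph convolution in Equation (3) for a single frame may clarify the computation. All names here are ours, the partition matrices are toy examples, and the small $\varepsilon$ in the normalization follows the common convention of guarding subsets whose rows are empty; this is an illustration, not the authors' implementation:

```python
import numpy as np

def sym_normalize(A, eps=1e-6):
    # Lambda_k^{-1/2} A_k Lambda_k^{-1/2}; eps guards rows with no connections
    d = A.sum(axis=1) + eps
    d_inv_sqrt = d ** -0.5
    return (A * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]

def graph_conv(x, A_subsets, W_subsets):
    """Spatial graph convolution in the form of Eq. (3):
    f_out = sum_k Lambda_k^{-1/2} A_k Lambda_k^{-1/2} f_in W_k
    x: (N, C_in) node features; A_subsets/W_subsets: one entry per label k."""
    out = np.zeros((x.shape[0], W_subsets[0].shape[1]))
    for A_k, W_k in zip(A_subsets, W_subsets):
        out += sym_normalize(A_k) @ x @ W_k
    return out

# Toy 3-joint chain 0-1-2, partitioned into the root subset (self-loops)
# and one neighbor subset, each with its own weight matrix.
x = np.ones((3, 2))
A_root = np.eye(3)
A_nbr = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
features = graph_conv(x, [A_root, A_nbr], [np.ones((2, 4)), np.ones((2, 4))])
```

Each subset contributes its own normalized aggregation, so nodes in different partitions are transformed with different weights, exactly as the labeling function prescribes.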
For extracting temporal information, Yan et al. [20] simply conducted temporal convolution along the temporal dimension with a kernel size of nine. Nevertheless, this ignores the correlation of the temporal context and cannot focus on discriminative information.
4. Adaptive Attention Memory Graph Convolutional Network
In view of the disadvantages of the methods analyzed above, we propose a novel Adaptive Attention Memory Graph Convolutional Network (AAM-GCN). In this section, we introduce each component of our model in detail. The overall architecture of AAM-GCN is shown in Figure 3. The skeleton action sequence is represented as a multi-frame sequence of joint-coordinate vectors and fed to the adaptive spatial graph convolution (ASGC) module to extract the spatial features. After Batch Normalization (BN) and ReLU, the spatial features are sent to the attention memory module to obtain the attention-enhanced temporal features. To extract effective temporal information without weakening the spatial configuration information, we concatenate the spatial and temporal features as the final discriminative feature for action recognition. More specifically, AAM-GCN not only uses the temporal attention-enhanced memory information but also delivers the spatial configuration information of ASGC through an identity shortcut.
4.1. Adaptive Spatial Graph Convolution
The adaptive spatial graph convolution (ASGC) is designed to adaptively extract the spatial information of the human skeleton sequence. The spatial-temporal graph is often designed to describe the structure of the human body and the dynamics of actions. The spatial graph structure [18,20] introduced in Section 3 is predefined and set manually, and it remains fixed during the training process. The adjacency matrix $A_k$ encodes the connections of nodes and decides the topology of the spatial graph according to the partitioning strategy. However, it can only capture relations between joints connected by physical connections [20]. If the element $A^{ij}=0$ in $A$, joints $i$ and $j$ will never establish a correlation in the action representation. This may not be suitable for real actions that happen through the collaboration of arbitrary joints. For example, the two hands have strong correlations when clapping, and the hand and head move collaboratively when brushing teeth or talking on the phone. These joints are not directly connected by physical connections but have strong correlations.
To solve this problem and construct a more generalized skeleton graph, we design an adaptive data-driven graph structure to capture rich dependencies among arbitrary joints. That is, we add an adaptive matrix $B_k$ to the original adjacency matrix $A_k$ to obtain an adaptive adjacency matrix $A_k+B_k$ without weakening the correlations of physically connected joints. Consequently, the original term $A_k$ in Equation (3) is replaced by $A_k+B_k$, and Equation (3) is modified as:

$$f_{out}=\sum_{k}\Lambda_k^{-\frac{1}{2}}(A_k+B_k)\Lambda_k^{-\frac{1}{2}}f_{in}W_k \quad (4)$$

where $B_k$ is a learnable parameter matrix initialized to all zeros; it has the same size as $A_k$. In contrast to $A_k$, the elements of $B_k$ are parameterized and optimized according to the training data. During the training process, $B_k$ updates its elements according to different actions and captures the correlations between any pair of joints; the graph is thus learned in a data-driven manner. The elements of $B_k$ not only represent the connection between each pair of joints but also indicate the intensity of the correlation.
As a result, the adaptive graph adjacency matrix can establish correlations between arbitrary joints performing collaboratively in actions. It is more flexible than the predefined fixed one.
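The effect of the adaptive term can be sketched in a few lines of NumPy. The 5-joint skeleton and the hand-updated entry of $B$ below are purely illustrative (in practice $B_k$ is updated by gradient descent, which this sketch only simulates):

```python
import numpy as np

# Hypothetical 5-joint skeleton with four physical bones.
N = 5
A = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (0, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

B = np.zeros_like(A)            # learnable adaptive matrix, initialized to zeros
assert (A + B == A).all()       # before training: the pure physical topology

# Simulated training update: joints 2 and 4 (e.g. the two hands) become
# correlated even though no bone connects them.
B[2, 4] = B[4, 2] = 0.3
A_adapt = A + B                 # adaptive adjacency A_k + B_k
```

Because $B$ starts at zero, training begins from the physical skeleton graph; the learned entries then add weighted links between physically unconnected joints without weakening the original bones.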
To eliminate inter-frame redundancy and increase the temporal receptive field, we conduct a temporal average pooling operation. It can not only improve the perception ability of the module but also reduce the computation cost. After temporal pooling, we obtain a multi-frame clip feature instead of a single frame, which will be more effective for perceiving temporal dynamics.
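The temporal average pooling described above can be sketched as a windowed mean over the frame axis of the $(C, T, N)$ feature tensor. The function name and the divisibility assumption are ours:

```python
import numpy as np

def temporal_avg_pool(x, stride=2):
    """Average-pool a (C, T, N) spatial feature along the temporal axis
    with non-overlapping windows. Assumes T is divisible by `stride`."""
    C, T, N = x.shape
    return x.reshape(C, T // stride, stride, N).mean(axis=2)

pooled = temporal_avg_pool(np.ones((4, 8, 5)), stride=2)  # (4, 4, 5) clip feature
```

Each output step now summarizes a short clip of frames rather than a single frame, which reduces inter-frame redundancy and enlarges the temporal receptive field of the following module.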
In the computation process, the input skeleton data of the ASGC module is a 2D or 3D joint-coordinate vector sequence represented by $X\in\mathbb{R}^{C\times T\times N}$, where $C$ is the input coordinate dimension, $T$ is the number of frames and $N$ is the number of joints. After spatial graph convolution and temporal pooling, we acquire the spatial configuration feature $F_s\in\mathbb{R}^{C'\times T'\times N}$, where $C'$ is the number of channels and $T'$ is the temporal dimension after ASGC. The acquired spatial feature is fed into the attention memory module to learn the temporal relations. The attention memory module is introduced in the next section.
4.2. Attention Memory Module
The attention memory module is implemented via multi-bidirectional GRU layers to establish an attention-enhanced memory. It can effectively capture the temporal dynamics of the human skeleton action sequence. Due to the forward and backward directions, bidirectional GRU can not only remember what happened in the past but also employ the information in the future. In addition, not all frames in the action sequence are informative and there are lots of redundancies. The key frames carrying more discriminative information for action recognition should be enhanced during training. Therefore, we incorporate the attention mechanism to the memory for choosing the important frames.
The structure of the attention memory module is shown in Figure 4; it contains two bidirectional GRU layers. The first layer encodes the spatial feature produced by ASGC and generates the initial memory. This initial memory matrix is then weighted by the attention procedure. Once the attention-enhanced memory is obtained, we feed it to the second layer to refine the memory. We propose a recurrent mechanism to optimize the memory: multiple iterations can be carried out to refine it progressively. Finally, the output of the last bidirectional GRU layer is concatenated with the spatial structure acquired in ASGC to predict the action label. The detailed process is described below:
After the spatial graph convolution operation introduced in Section 4.1, we obtain the spatial feature matrix $F_s$. It is then fed into the attention memory module to learn the temporal features. The attention memory module is composed of multiple bidirectional GRU layers; the number of hidden neurons in each GRU is denoted by $d$. The intermediate hidden states of the forward and backward GRUs are full of spatial and temporal information, which is useful for selecting the key frames. In particular, we adopt a soft attention mechanism to adaptively focus on the key frames that are beneficial for action recognition.

Let $H^{(l)}=(h_1^{(l)},\dots,h_{T'}^{(l)})$ denote the output hidden states of the $l$-th bidirectional GRU layer. For each time step, we aggregate the rows of $H^{(l)}$ and generate a query vector $q^{(l)}$ as:

$$q^{(l)}=\frac{1}{T'}\sum_{t=1}^{T'}h_t^{(l)} \quad (5)$$

where $q^{(l)}\in\mathbb{R}^{2d}$. The attention score $\alpha^{(l)}$ of the $l$-th layer is then calculated as:

$$\alpha_t^{(l)}=\mathrm{softmax}\Big(u^{\top}\tanh\big(W_1 h_t^{(l)}+W_2 q^{(l)}+b_1\big)+b_2\Big) \quad (6)$$

where $W_1$, $W_2$ and $u$ are learnable parameter matrices and $b_1$, $b_2$ are the biases.
The memory $M$ is computed from the hidden states of the GRUs in the two directions. At first, the initial memory $M^{(1)}$ in the first layer is computed as:

$$M^{(1)}=\big[\overrightarrow{\mathrm{GRU}}(F_s)\ \|\ \overleftarrow{\mathrm{GRU}}(F_s)\big] \quad (7)$$

where $\overrightarrow{\mathrm{GRU}}(\cdot)$ and $\overleftarrow{\mathrm{GRU}}(\cdot)$ represent the computation of hidden states by the forward and backward GRU, respectively. We concatenate the output hidden states of the two directions to construct the initial memory matrix $M^{(1)}$ in Equation (7), where $\|$ denotes the concatenation operation.
Let $M^{(l)}$ denote the memory of the $l$-th bidirectional GRU layer. After the attention matrix $\alpha^{(l)}$ is obtained in Equation (6), the attention-enhanced memory $\tilde{M}^{(l)}$ is calculated as:

$$\tilde{M}^{(l)}=\alpha^{(l)}\odot M^{(l)} \quad (8)$$

where $\odot$ denotes element-wise multiplication. $\alpha^{(l)}$ and $M^{(l)}$ in the attention memory module are jointly learned during training.
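The soft temporal attention step can be illustrated with a simplified NumPy stand-in for Equations (5)-(8). To keep the sketch self-contained we score each frame by its dot product with the mean-pooled query rather than with the learned matrices $W_1$, $W_2$ and $u$ of Equation (6); the structure, not the exact parameterization, is the point:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_enhance(M):
    """Soft temporal attention over a (T, d) memory matrix M.
    The query is the tanh of the mean hidden state (Eq. (5)-style),
    scores are dot products with each frame, and the resulting softmax
    weights rescale the frames as in Eq. (8)."""
    q = np.tanh(M.mean(axis=0))       # query vector aggregated over time
    scores = M @ q                    # one scalar score per frame
    alpha = softmax(scores)           # attention weights over frames
    return alpha[:, None] * M, alpha  # attention-weighted memory and weights

M = np.arange(12.0).reshape(4, 3)     # toy memory: 4 frames, 3 hidden units
M_att, alpha = attention_enhance(M)
```

Frames with large scores receive weights near 1 while redundant frames are suppressed, which is exactly the key-frame selection the module is designed for.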
To obtain a more reliable memory, we propose a recurrent mechanism that iteratively refines the memory. The attention-enhanced memory $\tilde{M}^{(l)}$ of the current layer is fed into the next bidirectional GRU layer to refine the memory iteratively:

$$M^{(l+1)}=\big[\overrightarrow{\mathrm{GRU}}(\tilde{M}^{(l)})\ \|\ \overleftarrow{\mathrm{GRU}}(\tilde{M}^{(l)})\big] \quad (9)$$

where $M^{(l+1)}$ represents the refined memory computed by the next layer. After $L$ iterations, the final memory of the skeleton joints $M^{(L)}$ is achieved. For simplicity, we use $M$ instead of $M^{(L)}$ to represent the final memory. Note that the output memory of the last layer aggregates all node features for classification. Consequently, the multiple bidirectional GRU layers extract the temporal information from the spatial features of the skeleton and summarize the memory $M$ across the input sequence.
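The overall encode-attend-refine loop can be sketched as follows. The bidirectional GRU is replaced by a fixed random linear map per direction (a stub, not a trained recurrent cell), and the simplified dot-product attention from above is inlined; only the data flow of the iteration mirrors the module:

```python
import numpy as np

rng = np.random.default_rng(0)

def bigru_stub(x, d):
    """Placeholder for a trained bidirectional GRU layer: one random linear
    map per direction, tanh-squashed, concatenated to a (T, 2d) output."""
    Wf = rng.standard_normal((x.shape[1], d))
    Wb = rng.standard_normal((x.shape[1], d))
    fwd = np.tanh(x @ Wf)
    bwd = np.tanh(x[::-1] @ Wb)[::-1]   # process reversed sequence, realign
    return np.concatenate([fwd, bwd], axis=1)

def refine_memory(F_s, d=8, iterations=2):
    """Iterative refinement mirroring Eqs. (7)-(9): each layer re-encodes
    the attention-weighted memory produced by the previous one."""
    M = bigru_stub(F_s, d)                        # initial memory, Eq. (7)
    for _ in range(iterations - 1):
        q = np.tanh(M.mean(axis=0))               # simplified query
        scores = M @ q
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        M = bigru_stub(alpha[:, None] * M, d)     # Eqs. (8) and (9)
    return M

memory = refine_memory(np.ones((10, 6)), d=8, iterations=2)  # (10, 16)
```

In the real module the two directions let each refined frame summarize both its past and its future context, which is what resolves pairs such as "put on hat" versus "take off hat".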
To extract the temporal information of key frames without weakening the information of the spatial structure, the spatial features produced by ASGC are delivered by the identity shortcut to concatenate with temporal features. Finally, the concatenated features are used for action recognition.
4.3. Model Architecture and Training Detail
To integrally capture temporal features and the spatial relationships among arbitrary joints, we combine the adaptive spatial graph convolution (ASGC) and attention memory module to develop the Adaptive Attention Memory Graph Convolutional Network (AAM-GCN) for skeleton action recognition.
We first normalize the input data via a batch normalization (BN) layer and then adopt ASGC to obtain the spatial features of the joints. The ASGC module is composed of nine graph convolution blocks; the feature dimensions are 64, 128 and 256 in the first, middle and last three blocks, respectively. In each block, the graph convolution layer is followed by a BN layer and a non-linear ReLU activation. We also add a temporal pooling operation after the convolution layer to improve efficiency. To avoid overfitting, we place a dropout layer with a probability of 0.5 in each block. For the neighbor selection in ASGC, the nodes connected directly to the root node ($D=1$) constitute the neighbor set. The spatial configuration partitioning strategy is adopted as the graph labeling function, as in [20]; therefore, the neighbor set is partitioned into $K=3$ subsets: the centrifugal group, the centripetal group and the root node itself. The numbers of hidden neurons in the first and second bidirectional layers are 128 and 64, respectively. Extensive experiments are conducted in PyTorch 1.2.0 on one GTX-1080 GPU. The model is trained for 50 epochs on each dataset with a batch size of 32. Stochastic Gradient Descent (SGD) is used as the optimizer with an initial learning rate of 0.01, which is divided by 10 after every 10 epochs. To avoid overfitting, we adopt the same data-augmentation method as ST-GCN [20]. Cross-entropy [52] is chosen as the loss function.
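The stated learning-rate schedule (initial rate 0.01, divided by 10 after every 10 epochs, over 50 epochs) amounts to a simple step decay, sketched here as a plain function rather than the authors' training script:

```python
def learning_rate(epoch, base_lr=0.01, decay_every=10):
    """Step-decay schedule: lr = base_lr * 0.1^(epoch // decay_every).
    Epochs 0-9 train at 0.01, epochs 10-19 at 0.001, and so on."""
    return base_lr * (0.1 ** (epoch // decay_every))
```

The same schedule could equivalently be expressed with PyTorch's `torch.optim.lr_scheduler.StepLR` using `step_size=10` and `gamma=0.1`.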
6. Conclusions and Future Work
In this paper, we proposed an Adaptive Attention Memory Graph Convolutional Network (AAM-GCN) for skeleton action recognition. We designed an adaptive adjacency matrix to construct the spatial graph and adopted graph convolution to extract spatial features. The adaptive adjacency matrix is optimized in a data-driven manner according to the actions during the training process and thus better describes actions in the real world. To extract the inter-frame action features, we constructed an attention-enhanced memory via bidirectional GRU, which effectively captures temporal relations. Due to its bidirectional nature, it can not only remember what happened in the past but also exploit information about the future. We evaluated our model on three large datasets: NTU RGB+D, Kinetics and HDM05. The experimental results demonstrate that AAM-GCN achieves better performance than several other state-of-the-art methods.
However, our method still has some limitations. For the action recognition task, the scene and the objects the person interacts with play an important role; this useful information is ignored by skeleton data, which only encodes the locations of joints. There are also redundant joints in the skeleton that could be merged to reduce complexity and make the graph structure more robust. Future work includes the following issues: constructing a more general network architecture for different types of action recognition; incorporating auxiliary information, such as the objects the person interacts with and the scene where the action takes place, which would contribute to the action recognition task; and further study of action anticipation.