1. Introduction
With the development of Internet technology and the popularization of video acquisition equipment, video has become a major carrier of information. The amount of video data is exploding, and analyzing and understanding video content has become increasingly important. As one of the important topics in video understanding, human action recognition has become a focus of research in the field of computer vision.
Action recognition learns appearance and motion representations by modeling the spatial and temporal information of the sequences in a video, which establishes a mapping between video content and action categories and enables computers to perform video understanding tasks effectively. Action recognition has extensive application prospects in motion analysis, intelligent surveillance, human-computer interaction, video retrieval and so on.
In recent years, skeleton data has been used more and more widely in human action recognition. Skeleton data is a compact representation of human motion, usually composed of time series of the 3D coordinates of human body joints, which significantly reduces redundant information in computation. Additionally, in complex scenes, human skeleton data is more robust than traditional RGB video or optical flow data because it ignores the interference of background information, so the recognition performance is better.
Earlier skeleton-based action recognition methods mainly used joint coordinates to form feature vectors through hand-designed approaches [1,2,3] and then aggregated these features for action recognition. However, these methods ignored the internal associations among joints, resulting in insufficient recognition accuracy.
In traditional deep learning methods, the skeleton sequence is usually fed into recurrent neural networks (RNNs) or convolutional neural networks (CNNs), whose features are captured to predict action labels [4,5,6,7,8,9,10,11,12]. However, representing skeleton data as pseudo-images or coordinate vectors cannot well express the interdependence between the joints of the human body, which is particularly important for action recognition. In fact, it is more natural to regard human skeleton information as a graph, with human joints as nodes and human bones as edges. In recent years, with the development of graph convolutional networks (GCNs), they have been successfully applied to various scenarios.
Yan et al. [13] first used a graph structure to model skeleton data, constructing a series of spatio-temporal graph convolutions on the skeleton sequence to extract features for skeleton-based action recognition. Initially, graph convolution lets each joint capture the local structure information of its neighboring nodes. As the number of GCN layers increases, the receptive field of each joint expands to capture a larger range of neighborhood information. In this way, the spatio-temporal GCN finally obtains the global structure information of the skeleton data.
However, this method has some disadvantages: (1) It only pays attention to the physical connections between joints but ignores long-range dependencies; there may also be potential connections between disconnected joints. For example, there is no direct connection between the hand and the head in the action of making a call, yet this relationship is very important for recognizing the action, and ignoring such connections between disconnected joints harms recognition performance. (2) It uses a fixed 1D convolution to perform the temporal convolution operation; the receptive field of the temporal graph is predefined and not flexible enough, which also affects the accuracy of action recognition. (3) A skeleton action sequence contains many frames, and each frame contains multiple joints. For action recognition, we often only need to focus on the key frames and joints, and too much redundant information leads to a decrease in recognition performance.
To solve the above problems, we use high-order polynomials of the skeleton adjacency matrix to perform graph convolution, extracting the long-distance dependencies between joints and multi-scale structural features. The adjacency polynomial makes distant neighbor joints reachable. We introduce a high-order approximation to increase the receptive field of the graph convolution and build a multi-scale adjacency matrix according to the hop distances between each joint and its neighbors, which helps to better explore the spatial feature information of the joints. For modeling the skeleton sequence in the temporal dimension, the temporal graph is composed of corresponding joints in consecutive frames connected by temporal edges. Existing methods [4,13,14,15,16] usually set a fixed temporal window length and use 1D convolution to complete the temporal convolution operation. Since the time window is predefined, it lacks flexibility when recognizing different actions. Thus, we propose a multi-scale temporal convolution module that contains three scales, long, medium and short, to obtain temporal features with different receptive fields. Additionally, we introduce weights to adaptively adjust and dynamically combine the temporal receptive fields, which captures the temporal features of the skeleton data more effectively.
With the rise of the attention mechanism [17,18,19,20], we can introduce attention modules into the network. From the spatial dimension, an action often has close connections with only a few joints. From the temporal dimension, an action sequence has a series of frames, each of which has a different importance for the final recognition accuracy. From the feature dimension, each channel also contains different semantic information. Therefore, we add spatial attention, temporal attention and channel attention modules to focus on the key joints, frames and channels that have a great impact on action recognition.
We propose a novel architecture for skeleton-based action recognition; the basic structure of our proposed model is the spatio-temporal-attention module (STAM), which is composed of a multi-scale aggregate graph convolution network (MSAGCN) module, a multi-scale adaptive temporal convolution network (MSATCN) module and a spatio-temporal-channel attention (STCAtt) module, as shown in Figure 1. We propose a multi-scale GCN that constructs a multi-order adjacency matrix to capture the dependencies between joints and their further neighbors in the spatial domain, thereby effectively extracting wider spatial features. Then we propose the MSATCN, which contains three temporal convolution kernels to model temporal features in three different ranges: long, medium and short. We introduce the STCAtt module to focus on more meaningful frames and joints from the spatio-temporal and channel dimensions, respectively. Additionally, we add residual links between the modules, which effectively reuse the spatio-temporal and channel features to capture the local and global features of the skeleton sequence. This architecture can better model the spatial and temporal features of human actions for skeleton-based action recognition. To verify the effectiveness of our proposed GCN model, extensive experiments were performed on three large-scale datasets, NTU-RGBD, NTU-RGBD120 and Kinetics, and our model achieved great performance on all of them.
The main contributions of our work are as follows: (1) We propose a multi-scale aggregate GCN, which adds connections between disconnected joints to better extract the spatial features of the skeleton data. (2) We design a TCN with multi-scale convolution kernels to adaptively capture the temporal correlations of the skeleton sequence over different temporal ranges and improve the generalization ability of the model. (3) We introduce an attention mechanism to focus on more meaningful joint, frame and channel features. (4) Extensive experiments have been performed on three public datasets, NTU-RGBD, NTU-RGBD120 and Kinetics, and our MSAAGCN has achieved very good performance.
3. Multi-Scale Adaptive Aggregate Graph Convolutional Network
Human skeleton data is usually obtained from video by multimodal sensors or a pose estimation algorithm. The skeleton data is a series of frames containing the corresponding joint coordinates. The 2D or 3D coordinate sequences of the joints are fed into the network, and we construct a spatio-temporal graph: the joints are regarded as nodes, and the bones connecting adjacent joints are regarded as edges. We design a multi-scale adaptive GCN, which can model the long-range spatial dependencies of the human body structure and extract richer spatial action features. We then concatenate the multi-order feature maps obtained by the spatial graph convolution and feed them into the temporal module, whose purpose is to process temporal information and learn the joint movement information in the temporal sequences. Additionally, the proposed MSATCN extracts feature information over three different temporal ranges and aggregates them to obtain high-level feature maps. Moreover, we apply the STCAtt module to enhance the feature representation.
3.1. Multi-Scale Aggregate Graph Convolution Module
For better action recognition based on skeleton graphs, multi-scale spatial structural features need attention, while traditional methods only consider the physical connections between joints, ignoring the potential correlations between structurally disconnected joints. For example, the left hand and the right hand are only connected through multiple hops, yet their relationship is very important for recognizing the action of clapping. Therefore, we propose a multi-scale aggregate graph convolution module to explore the relevance of human skeleton data within the same frame. As shown in Figure 2, we build multi-order adjacency matrices and adopt an adaptive method in each adjacency matrix to learn the topological structure of the graph, improving the flexibility of modeling semantic information; finally, we aggregate the obtained multi-order information to explore the spatial features of the skeleton sequence.
In the graph convolutional network, we construct a graph $G = (V, E)$ to represent human skeleton data, where $V$ is the set of $N$ nodes that symbolize the joints of the human body; $E$ is the set of edges of the skeleton connecting adjacent joints; and $A \in \{0, 1\}^{N \times N}$ represents the adjacency matrix, which encodes the connections of the edges, that is, if the nodes $v_i$ and $v_j$ are connected, then $A_{ij} = 1$; otherwise, $A_{ij} = 0$. We define $\tilde{A} = A + I_N$, where $I_N$ represents the identity matrix, and $\tilde{A}$ is the adjacency matrix with self-connections. Since $A$ cannot be multiplied directly, we first need to normalize it. According to [39], the normalized Laplacian $L$ can be formulated as $L = I_N - D^{-1/2} A D^{-1/2}$, where the diagonal degree matrix $D$ satisfies $D_{ii} = \sum_{j} A_{ij}$.
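As a concrete illustration, the normalized Laplacian above can be computed with a few lines of NumPy; the 3-joint chain below is a toy skeleton for illustration only, not one of the datasets used in this paper:

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a binary adjacency matrix A."""
    deg = A.sum(axis=1)
    # Guard against isolated nodes (degree 0) before inverting.
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return np.eye(A.shape[0]) - d_inv_sqrt @ A @ d_inv_sqrt

# Toy 3-joint chain (e.g., shoulder-elbow-wrist): edges 0-1 and 1-2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = normalized_laplacian(A)
```

As expected for a normalized Laplacian, the result is symmetric with eigenvalues in the range [0, 2].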
For the input signal $x \in \mathbb{R}^{N}$, the graph convolution with filter $g_\theta$ can be expressed as $g_\theta * x = U g_\theta U^{T} x$, where $U$ represents the eigenvector matrix of $L$; $U^{T} x$ is the graph Fourier transform of $x$; and $g_\theta$ can be regarded as a function of the eigenvalues of $L$. Hammond et al. [40] mentioned that $g_\theta$ can be approximated by Chebyshev polynomials, which is formulated as
$$g_\theta * x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x,$$
where $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$, and $\theta_k$ represents the $k$-order Chebyshev coefficient. Moreover, the Chebyshev polynomials are defined recursively as $T_0(x) = 1$, $T_1(x) = x$ and $T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)$.
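The Chebyshev recurrence $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$ carries over to matrix arguments; a minimal NumPy sketch (an illustration, not the paper's implementation) is:

```python
import numpy as np

def cheb_polynomials(L_tilde, K):
    """Return [T_0, ..., T_{K-1}] of the scaled Laplacian via the recurrence
    T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2}."""
    n = L_tilde.shape[0]
    polys = [np.eye(n), L_tilde.copy()]
    while len(polys) < K:
        polys.append(2 * L_tilde @ polys[-1] - polys[-2])
    return polys[:K]
```

For a 1x1 "matrix" this reduces to the scalar polynomials, e.g. $T_2(0.5) = 2 \cdot 0.25 - 1 = -0.5$.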
According to [39], for the signal $x$, it satisfies
$$(L x)_i = x_i - \sum_{j} \frac{A_{ij}}{\sqrt{D_{ii} D_{jj}}}\, x_j,$$
which means that $L x$ calculates the difference between each vertex and its one-hop neighbor nodes. When $k > 1$, $L^{k}$ means the $k$-order connection, and $L^{k} x$ means the difference between each vertex and its $k$-hop neighbor nodes, so it is possible to perform convolution operations on more neighbor nodes in a larger receptive field. Moreover, the spatial graph convolution with a receptive field of $k$ can also be regarded as a linear transformation of the $k$-order Chebyshev polynomial. Finally, we define the spatial graph convolution operation as follows:
$$Y = \sigma\Big(\sum_{k=0}^{K} L^{k} X W_k + b\Big),$$
where $W_k$ is the weight matrix of the $k$-order term, which can be learned by training the network, and $b$ represents the bias; $\sigma$ denotes the ReLU activation function.
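A minimal sketch of this k-order spatial convolution, summing powers of a normalized adjacency with a per-order weight matrix; all shapes, names and the toy adjacency below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def multi_hop_gcn(X, A_hat, weights, bias):
    """X: (V, C_in) node features; A_hat: (V, V) normalized adjacency;
    weights: list of (C_in, C_out) matrices, one per hop order k."""
    A_power = np.eye(A_hat.shape[0])      # k = 0: identity (self features)
    out = np.zeros((X.shape[0], weights[0].shape[1]))
    for W_k in weights:
        out += A_power @ X @ W_k          # aggregate the k-hop neighborhood
        A_power = A_power @ A_hat         # move to the next hop order
    return np.maximum(out + bias, 0.0)    # ReLU activation

rng = np.random.default_rng(0)
V, C_in, C_out, K = 5, 3, 4, 3
A_hat = np.ones((V, V)) / V               # toy normalized adjacency
X = rng.standard_normal((V, C_in))
Ws = [rng.standard_normal((C_in, C_out)) * 0.1 for _ in range(K)]
Y = multi_hop_gcn(X, A_hat, Ws, np.zeros(C_out))
```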
We have observed that after each convolution operation in a GCN, the network aggregates the information of each joint and its neighbors connected via bones. This means that after multiple convolution operations, each node finally contains global information, but the early local information will be lost. He et al. [41] proposed the residual link, introducing identity mappings that connect different layers of the network so that the input can be added directly to the output of the residual function. We add residual links to reuse the local information of each node and obtain richer semantic information.
As shown in Figure 2, we define graph transformations of different orders according to the hop distance between nodes, where a 1-hop connection indicates a physical connection between nodes, and we aggregate multi-scale information through concatenation. The identity transformation is passed through the residual connection.
To improve the flexibility of the convolutional layer, inspired by the data-driven method of [15], we add a global graph and an individual graph learned from the data to each order of the adjacency matrix and apply data-dependent and layer-dependent biases to the graph transformations of the matrix for more flexible connections, so as to better learn an adaptive graph topology. Moreover, we add a graph mask, which is initialized to 1 and updated during training.
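The adaptive combination described above might be sketched as follows; the names `A_k`, `B_k`, `C_k` and `mask` are hypothetical stand-ins for the fixed k-order adjacency, the learned global graph, the data-dependent individual graph and the learnable mask:

```python
import numpy as np

def adaptive_adjacency(A_k, B_k, C_k, mask):
    """Hypothetical combination of a fixed k-order adjacency A_k, a learned
    global graph B_k, a data-dependent individual graph C_k, and a learnable
    element-wise mask (all names illustrative)."""
    return (A_k + B_k + C_k) * mask

V = 4
A_k = np.eye(V)               # toy fixed adjacency (self-loops only)
B_k = np.zeros((V, V))        # learned global offsets, zero-initialized here
C_k = np.zeros((V, V))        # data-dependent similarity, omitted in this toy
mask = np.ones((V, V))        # initialized to 1, updated during training
A_adapt = adaptive_adjacency(A_k, B_k, C_k, mask)
```

With zero-initialized learned terms and a unit mask, the adaptive graph starts out equal to the fixed adjacency, which matches the intuition that training only gradually departs from the physical topology.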
Feature aggregation: After feature extraction is performed on the adjacency matrix of each order through the graph convolution transformation, the number of output channels of each branch becomes $1/n$ of the original, and then we aggregate the branches through a concatenation layer, which consists of a batch normalization layer, a ReLU layer and a convolutional layer, to maintain the original number of features.
3.2. Multi-Scale Adaptive Temporal Convolution Module
In the defined graph structure, the same joints in consecutive frames are connected by temporal edges (green lines in Figure 1). We choose to decompose the spatio-temporal graph into two sub-modules, spatial and temporal, instead of directly performing graph convolution operations on the spatio-temporal graph, because the spatio-temporal decomposition strategy is easier to learn [14]. For the graph of frame $t$, we first feed it into the spatial graph convolution and then stack its output along the time axis to obtain a 3D tensor, which is fed into the temporal module for action recognition.
For temporal convolution, current methods [13] usually use 1D convolution to obtain temporal features. Since the convolution kernel size is fixed, it is often difficult to flexibly capture some temporal correlations of the skeleton data. Moreover, different action classes in the dataset may require different temporal receptive fields. For example, some actions can be judged within a short time range, such as falling down, while others require a longer time range, such as carrying a plate. Therefore, we design a multi-scale temporal convolution module to adaptively extract the feature correlations in the temporal sequence. We use three convolution kernels of different temporal sizes to model the features at three different temporal scales, short, middle and long term, respectively, and add weight coefficients to adaptively adjust the scales of different importance. The proposed multi-scale temporal convolution operation can be formulated as
$$Y = \big\Vert_{i=1}^{3}\; \alpha_i \cdot \mathrm{Conv}_{k_i \times 1}(X),$$
where $\Vert$ denotes the concatenate operation, and $\alpha_i$ denotes the weight used to adjust the importance of the temporal features at different scales. Through the dynamic adjustment of the weights, the model can adaptively extract temporal features at different scales to improve the generalization ability of the temporal convolution module.
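A toy NumPy sketch of the weighted multi-scale temporal branches; the kernel sizes 3/5/7 and the moving-average kernels are illustrative assumptions, not the paper's choices:

```python
import numpy as np

def temporal_conv(X, kernel):
    """Depthwise 1D convolution along the frame axis with 'same' padding.
    X: (C, T) per-joint feature sequence; kernel: (k,) shared across channels."""
    pad = len(kernel) // 2
    Xp = np.pad(X, ((0, 0), (pad, pad)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in Xp])

def multi_scale_temporal(X, kernels, alphas):
    """Weighted concatenation of short/medium/long temporal branches."""
    branches = [a * temporal_conv(X, k) for a, k in zip(alphas, kernels)]
    return np.concatenate(branches, axis=0)   # concat along the channel axis

X = np.arange(12, dtype=float).reshape(2, 6)  # 2 channels, 6 frames
kernels = [np.ones(3) / 3, np.ones(5) / 5, np.ones(7) / 7]  # hypothetical sizes
Y = multi_scale_temporal(X, kernels, alphas=[1.0, 1.0, 1.0])
```

Each branch preserves the frame count, so the three branches stack into a feature map with three times the channels, matching the concatenation in the formula above.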
3.3. Spatial-Temporal-Channel Attention Module
The attention mechanism is a resource allocation mechanism, which redistributes originally evenly allocated resources according to the importance of the focus. Inspired by [42,43], we propose an attention mechanism that contains three modules: a spatial attention, a temporal attention and a channel attention mask.
Spatial attention module: In the spatial dimension, the movement range and dynamics of each joint differ between actions. We can use the attention mechanism to help the model adaptively capture the dynamic correlation of each joint in the spatial dimension based on different levels of attention. The spatial relationships of the features are used to generate the spatial attention mask $M_s$: average pooling can learn from the target object and compute spatial statistics [19,44], while max pooling can extract another important clue about object features to infer more refined attention; using both types of pooling improves the representation ability of the network [42]. It is formulated as
$$M_s = \sigma\big(f^{2D}([\mathrm{AvgPool}(X);\, \mathrm{MaxPool}(X)])\big),$$
where $X \in \mathbb{R}^{C \times T \times V}$ is the input feature map; $C$, $T$ and $V$ represent the number of channels, frames and joints, respectively; $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ represent the average-pooling and max-pooling operations; $[\cdot\,;\cdot]$ represents the concatenation operation; $f^{2D}$ represents the 2D convolution operation; and $\sigma$ represents the sigmoid activation function. Then, as shown in the attention module in Figure 1, we calculate the product of the attention map $M_s$ and the input feature map, and the input is connected with the result to refine the features:
$$X' = X + X \otimes M_s,$$
where $\otimes$ denotes element-wise multiplication.
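A minimal sketch of the pooling-based spatial mask and the residual refinement; the learned 2D convolution is reduced here to two scalar weights, which is a hypothetical simplification for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(X, w_avg=1.0, w_max=1.0, b=0.0):
    """X: (C, T, V). Pool over channels, combine the avg/max descriptors with
    two scalar weights standing in for a learned convolution, then refine the
    input with a residual connection: X + X * M."""
    avg = X.mean(axis=0)                        # (T, V) spatial statistics
    mx = X.max(axis=0)                          # (T, V) salient responses
    M = sigmoid(w_avg * avg + w_max * mx + b)   # attention mask in (0, 1)
    return X + X * M[None, :, :]

X = np.random.default_rng(1).standard_normal((4, 3, 5))
Y = spatial_attention(X)
```

Because the mask is strictly positive, the residual term always amplifies the input in the direction of its own sign, which is what the refinement formula above expresses.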
Temporal attention module: In the temporal dimension, there are correlations between the joints of different frames, and the correlations between frames also differ between actions. Similarly, we can use the attention mechanism to adaptively assign them different weights. It is formulated as
$$M_t = \sigma\big(f^{2D}([\mathrm{AvgPool}(X);\, \mathrm{MaxPool}(X)])\big), \qquad X' = X + X \otimes M_t,$$
where $M_t$ is the temporal attention mask obtained by pooling along the temporal dimension, and the other symbols are the same as in the spatial attention module.
Channel attention module: According to [19], each layer of a convolutional network has multiple convolution kernels, and each convolution kernel corresponds to a feature channel. The channel attention mechanism allocates resources among the convolution channels. We first perform feature compression through the squeeze operation, using global average pooling to generate the statistics of each channel; we then utilize the excitation operation to explicitly model the correlations between the feature channels through parameters and generate a weight representing the importance of each feature channel. Finally, we reweight the feature map, multiplying the weights channel by channel onto the previous features. Similarly, we can formulate it as
$$M_c = \sigma\big(W_2\, \delta(W_1\, \mathrm{AvgPool}(X))\big), \qquad X' = X + X \otimes M_c,$$
where $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$ are learnable weights; $W_1$ is used for dimensionality reduction; $W_2$ is used to calculate the attention weights; $\delta$ denotes the ReLU activation function; and $\sigma$ represents the sigmoid activation function.
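A squeeze-and-excitation-style sketch of the channel attention described above; the weight shapes follow the reduction ratio r, and the residual refinement used in the other modules is omitted here for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W1, W2):
    """X: (C, T, V); W1: (C//r, C) reduces dims; W2: (C, C//r) restores them."""
    z = X.mean(axis=(1, 2))                      # squeeze: global avg pool -> (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))    # excitation: ReLU then sigmoid
    return X * s[:, None, None]                  # reweight each channel

rng = np.random.default_rng(2)
C, r = 8, 4
X = rng.standard_normal((C, 6, 5))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
Y = channel_attention(X, W1, W2)
```

Each output channel is a scalar multiple of the corresponding input channel, which is exactly the per-channel reweighting the excitation step produces.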
3.4. Multi-Stream Framework
Skeleton-based action recognition methods usually utilize the joint coordinates of the input skeleton sequence. Shi et al. [15] found that the bone information connecting the joints is also very helpful for action recognition and proposed a two-stream framework using joint and bone information. Moreover, the motion information of the joints has a potential effect on action recognition. For example, standing up and sitting down are indistinguishable from the joint coordinates and bone information alone, but they are easy to recognize once motion information is added. In our work, the input data is divided into joint information, bone information and motion information.
First, we define the original 3D coordinate set of a skeleton sequence as $X = \{x_{t,v} \in \mathbb{R}^{3} \mid t = 1, \dots, T;\ v = 1, \dots, V\}$, where $x_{t,v}$ represents the joint coordinates, and $T$ and $V$ denote the numbers of frames and joints, respectively.
We suppose that the joint feature set is $J = \{j_{t,v}\}$, where $j_{t,v} = x_{t,v}$ represents the coordinates of the $v$-th joint in the $t$-th frame. Additionally, the relative coordinate set is $R = \{r_{t,v}\}$, where $r_{t,v} = j_{t,v} - j_{t,c}$, and $j_{t,c}$ denotes the center joint of the skeleton sequence. We concatenate these two sets into a sequence as the input of the joint information stream.
Then we calculate the bone and motion information from the joint information. Additionally, these two information streams pad the matrix with 0 elements to compensate for the change in dimensionality. Suppose that the bone length information set is $B = \{b_{t,v}\}$, where $b_{t,v} = j_{t,v} - j_{t,v'}$, and $v'$ denotes the neighbor of joint $v$. The bone angle set is $A = \{a_{t,v}\}$, and the angle can be calculated by
$$a_{t,v,d} = \arccos\left(\frac{b_{t,v,d}}{\sqrt{b_{t,v,x}^{2} + b_{t,v,y}^{2} + b_{t,v,z}^{2}}}\right), \quad d \in \{x, y, z\},$$
where $b_{t,v,d}$ represents the component of the bone vector along each axis of the joint coordinates. We concatenate these two sets into a sequence as the input of the bone information stream.
Additionally, we suppose that the motion velocity information set is $M = \{m_{t,v}\}$, where $m_{t,v} = j_{t+1,v} - j_{t,v}$, and the motion velocity difference information set is $M' = \{m'_{t,v}\}$, where $m'_{t,v} = m_{t+1,v} - m_{t,v}$. We concatenate these two sets into a sequence as the input of the motion information stream.
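Under the assumption that "neighbor" means the parent joint in the skeleton tree and that the trailing frame is zero-padded, the bone and motion streams can be sketched as:

```python
import numpy as np

def bone_and_motion(J, parents):
    """J: (T, V, 3) joint coordinates; parents[v] is the neighbor (e.g. parent)
    joint of v in the skeleton tree (hypothetical indexing convention).
    Returns bone vectors and frame-to-frame motion, zero-padded at the end."""
    B = J - J[:, parents, :]          # bone vector = joint minus its neighbor
    M = np.zeros_like(J)
    M[:-1] = J[1:] - J[:-1]           # velocity; last frame padded with zeros
    return B, M

T, V = 4, 3
J = np.arange(T * V * 3, dtype=float).reshape(T, V, 3)
parents = [0, 0, 1]                   # toy chain: joint 0 is its own parent
B, M = bone_and_motion(J, parents)
```

In this toy sequence every coordinate grows by 9 per frame, so the velocity is constant and the root joint's bone vector is zero.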
We separately input the joints, bones and motion data into our network, and we add the scores of these streams to obtain the final classification result.
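The late score fusion of the three streams amounts to a simple sum; the scores below are made-up numbers for illustration only:

```python
import numpy as np

# Per-stream classification scores for 3 hypothetical action classes.
joint_scores = np.array([0.1, 0.7, 0.2])
bone_scores = np.array([0.2, 0.6, 0.2])
motion_scores = np.array([0.3, 0.5, 0.2])

# Late fusion: element-wise sum of the stream scores, then argmax.
fused = joint_scores + bone_scores + motion_scores
predicted_class = int(np.argmax(fused))
```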
3.5. Network Architecture
As shown in Figure 3, the main architecture of our network is a stack of 9 basic modules named STAM (as shown in Figure 1), each of which contains MSAGCN, MSATCN and STCAtt blocks. Additionally, we introduce residual links to connect each module and each block within a module to capture more abundant context information. First, the original skeleton data passes through a batch normalization (BN) layer for data normalization, and then the normalized data is input into the MSAGCN block. The input and output channels of the 9 modules are (6, 64), (64, 64), (64, 64), (64, 128), (128, 128), (128, 128), (128, 256), (256, 256) and (256, 256). The features extracted from the last module are sent to the global average-pooling layer, and then the softmax classifier is used to obtain the final classification result.
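The listed channel configuration can be sanity-checked programmatically; this snippet only restates the numbers from the text:

```python
# Input/output channel configuration of the 9 stacked STAM modules listed above.
channels = [(6, 64), (64, 64), (64, 64), (64, 128), (128, 128),
            (128, 128), (128, 256), (256, 256), (256, 256)]

# Consecutive modules must chain: the output channels of one module
# equal the input channels of the next.
chained = all(c_out == channels[i + 1][0]
              for i, (_, c_out) in enumerate(channels[:-1]))
```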