Multi-Term Attention Networks for Skeleton-Based Action Recognition

Abstract: The same action can take different amounts of time in different cases, and this difference affects the accuracy of action recognition to a certain extent. We propose an end-to-end deep neural network called “Multi-Term Attention Networks” (MTANs), which addresses this problem by extracting temporal features at different time scales. The network consists of a Multi-Term Attention Recurrent Neural Network (MTA-RNN) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). In the MTA-RNN, a method for fusing multi-term temporal features is proposed to extract temporal dependencies at different time scales, and the weighted fusion of temporal features is recalibrated by an attention mechanism. An ablation study shows that this network has powerful spatio-temporal dynamic modeling capabilities for actions with different time scales. We perform extensive experiments on four challenging benchmark datasets: the NTU RGB+D, UT-Kinect, Northwestern-UCLA, and UWA3DII datasets. Our method achieves better results than state-of-the-art benchmarks, which demonstrates the effectiveness of MTANs.


Introduction
Human Action Recognition (HAR) has attracted the attention of research communities in the computer vision area in recent years. Researchers have made significant achievements in action recognition with video-based methods. Generally, RGB videos and RGB-D videos are the most commonly used for this task. In methods based on RGB videos [1], researchers mainly extract features directly from video frames with image processing techniques and then apply various machine learning algorithms for modeling. However, color information in video is susceptible to external factors unrelated to human actions, such as shooting environments, illumination conditions, and clothing textures [2]. Depth information is less influenced by noise because it is independent of appearance and lighting, and it can be obtained in real time by depth cameras such as the Microsoft Kinect [3]. In addition, RGB-D video-based methods consider the spatial position relationships of the human body, which helps to extract more spatial features for action recognition. For these reasons, more and more researchers [4] apply depth information to action recognition.
Skeleton-based methods have also been widely applied to action recognition in recent years. Compared with RGB-D videos, which still contain noise from the video background, skeleton data contains only the position coordinates of joints, so it avoids not only the influence of variations in illumination and appearance but also the influence of the background. Meanwhile, because skeleton data encodes the depth information of human joints, skeleton-based methods can extract depth features from human actions that RGB-based methods cannot. Skeleton-based methods are also more lightweight than RGB-D video-based methods: skeleton datasets take up less memory than RGB-D video datasets, and the cost of training and deploying skeleton-based action recognition methods is lower.
Because the spatial position relationships are already included in the skeleton sequence, skeleton-based action recognition [5] was initially implemented using only Recurrent Neural Networks (RNNs). Later, some researchers [6] obtained spatial features by adding Convolutional Neural Networks (CNNs). It is worth noting that straightforward RNN algorithms [7] generally extract only the long-term temporal dependence of human actions, so they cannot handle the fact that different subjects take different amounts of time to complete the same action.
In practice, the time scales over which different people complete the same action may vary greatly. Take "stand up" and "put on a shoe" as examples: samples of the first action have almost the same time scale, and its recognition accuracy is higher; samples of the second action differ much more in time scale, and its classification accuracy is lower. This difference affects the accuracy of action recognition to a certain extent, so the temporal variation of actions should be treated as a feature to improve recognition accuracy. The problem is that general action recognition methods can only extract single-term temporal features, so their ability to model the spatio-temporal dynamics of actions with different time scales is limited, resulting in lower accuracy.
To resolve this problem, we propose a new general deep learning framework called Multi-Term Attention Networks (MTANs; see the supplementary materials) for skeleton-based action recognition, inspired by the Ensemble Temporal Sliding Long Short Term Memory networks (Ensemble TS-LSTM) [8] and Memory Attention Networks (MAN) [9]. An overview of the proposed framework is shown in Figure 1. The MTANs consist of three Multi-Term Attention Recurrent Neural Networks (MTA-RNNs) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). First, the three MTA-RNNs are trained on three skeleton coordinate sequence subsets, aligned with the X-, Y-, and Z-axes respectively, to obtain temporal features. The Multi-Term Temporal Sliding LSTM (MT-TS-LSTM) is introduced in the MTA-RNN to extract features at different time scales, and the Attention Recalibration Module (ARM) is employed to find keyframes. The outputs of the three MTA-RNNs are then treated as a 3-channel image and fed to the ST-CNN to extract spatio-temporal features. The MT-TS-LSTM comprises long-term, medium-term, and short-term TS-LSTM networks, which helps to obtain temporal dependencies from different types of continuous actions. After a weighted fusion of temporal features based on the Analytic Hierarchy Process (AHP) [10], the fused features are recalibrated by the attention mechanism. When different persons take different amounts of time to complete the same action, the MT-TS-LSTM can extract richer temporal features than straightforward RNNs, and the attention mechanism also makes the obtained temporal features more robust.
The main contributions of this work are summarized as follows:
• We propose the general Multi-Term Attention Networks (MTANs) for skeleton-based action recognition, which introduce a method for fusing multi-term temporal features to solve the recognition problem for actions with large time-scale differences.
• We propose a novel strategy that applies attention recalibration to fused features. The extracted temporal features with different time scales are fused according to corresponding weights and recalibrated by the attention mechanism. Networks with this strategy are able to reinforce temporal features for classifying actions.

Research Significance
The main purpose of action recognition is to classify human actions from media such as videos. This is an important and challenging task with applications in many scenarios, such as virtual reality, human-computer interaction, video surveillance, military training, self-driving, and so forth.
Compared with video-based action recognition, skeleton-based action recognition is free of the influence of background, lighting, texture, and other factors, which improves recognition accuracy. Meanwhile, its training cost is lower, which helps it achieve better results in real-time action recognition. Therefore, it has received extensive attention from researchers.
Since the same action takes different amounts of time in different situations, action recognition becomes more difficult, and this difference affects recognition accuracy to a certain extent. How to model various temporal dynamics to extract effective temporal dependencies is one of the main challenges in this task, and it is the problem our study addresses in depth.

Related Work
Thanks to highly accurate depth sensors and pose estimation algorithms, the skeleton information of human actions is easy to obtain [11]. Considering these advantages, skeleton-based action recognition methods have gained popularity in recent years. In this section, we review the literature most closely related to the methods proposed in this paper.
Skeleton information is a time sequence of position coordinates of skeleton joints, so a straightforward idea is to apply recurrent neural networks, such as the RNN [12], LSTM [13], and Gated Recurrent Unit (GRU) [14], to extract actions from these sequences. Researchers have tried to apply these methods to skeleton-based action recognition. A hierarchical RNN [5] was proposed to classify actions based on skeleton data. An advanced fully connected LSTM network with an added regularization scheme [13] was designed to obtain high-level temporal features of skeleton information. A novel gating mechanism was introduced by Liu et al. [15] to improve the reliability of sequence learning in the LSTM, which adjusts how long-term context information stored in memory is updated. The methods above are helpful when dealing with short skeleton sequences. However, they still have some drawbacks: (i) the dependencies between poses are not considered, which leads to low recognition accuracy on long skeleton sequences; (ii) the positional dependencies among joints within a single pose are not considered.
To overcome drawback (i), the attention mechanism, widely used in deep-learning-based machine translation [16], was introduced for human action recognition. Xie et al. [9] introduced the attention mechanism into an RNN-based model to recalibrate the temporal attention of frames in the skeleton sequence. Song et al. [17] used the attention mechanism for skeleton-based action recognition to obtain spatio-temporal features by assigning different weights to each joint and to each frame. Tang et al. [18] selected the most informative frames through the attention mechanism and discarded the ambiguous frames in the sequence. Because a model with the attention mechanism can build a distribution over the frames of a skeleton sequence, different weights can be assigned to different frames, and the dependencies among frames can be extracted to increase recognition accuracy.
As a solution to drawback (ii), there are two classes of methods for extracting the spatial features of human actions from skeleton information. One class designs handcrafted features to capture the dynamics of joints, such as covariance matrices of joint trajectories [19], relative positions of joints [20], or rotations and translations between body parts [21]. However, these methods are limited by the specific positional relationships in each action, which not only reduces their expressive power and accuracy but also makes them hard to transfer to different environments. The other class applies CNNs to extract the positional dependencies among joints. For example, Xie et al. [9] proposed a framework called MAN that combines CNNs and RNNs to extract spatio-temporal features from skeletons and calibrates the features with an attention mechanism. Li et al. [22] proposed a CNN-based framework for action classification that determines the importance of skeleton joints and rearranges them automatically. Ke et al. [6] introduced an algorithm that uses deep neural networks to extract spatio-temporal features for skeleton-based 3D action recognition.
The above works are dedicated to extracting effective features from skeleton data. We analyzed the experimental results of MAN [9] and found that although it effectively extracts the spatio-temporal features of actions, it still does not perform well on actions with large time-scale differences. Taking the NTU RGB+D dataset [23] as an example, its recognition accuracy for "stand up" and "sit down" is close to 100%, but its accuracy for "put on jacket" and "take off jacket" is lower. Inspecting the dataset shows that actions such as "stand up" and "sit down" vary little in time scale: they usually take about 2 s and never more than 3 s. Actions such as "put on jacket" and "take off jacket" vary much more: the shortest samples take less than 2 s, while the longest exceed 8 s. This shows that the ability of previous models to learn the spatio-temporal dynamics of actions with different time scales is limited. To resolve this problem, we propose a novel method for skeleton-based action recognition that extracts diverse and effective features, including long-term, medium-term, and short-term temporal features.

Methods
In this section, we elaborate on the proposed MTANs, which contain three Multi-Term Attention Recurrent Neural Networks (MTA-RNNs) and a Spatio-Temporal Convolutional Neural Network (ST-CNN). The overall structure of the MTANs is shown in Figure 1, and Figure 2 shows the structure of the MTA-RNN. We introduce our method in detail in the following.

Multi-Term Attention Recurrent Neural Network
In the MTA-RNN, the input skeleton data is a sequence of 3D joint coordinates over the multiple frames constituting an action. It is divided into three subsets aligned with the X-, Y-, and Z-axes, each trained with its own MTA-RNN. The MT-TS-LSTM is introduced in the MTA-RNN to extract various temporal features and fuse them according to weights, and the ARM is employed to enhance the key temporal features. For the input skeleton sequences, a multi-LSTM module based on temporal sliding captures temporal features of the input action sequences at different terms, and the resulting temporal feature is calibrated by the ARM.
We process the input data in the MTA-RNN module to better capture the temporal features of the skeleton information. A sequence of multi-frame 3D joint coordinates is divided into three skeleton coordinate sequence subsets aligned with the X-, Y-, and Z-axes; learning from these three channels yields the temporal feature information of all three dimensions. The input of the MTANs is set as S = {X, Y, Z}, where T denotes the number of frames in an action, N denotes the number of human joints in the input data, and X, Y, Z ∈ R^(T×N) are the sets of X, Y, and Z coordinates of the N joints along the T frames. The following takes X as an example to explain the MTA-RNN; X ∈ R^(T×N) can be expressed as X = [x(1), x(2), . . . , x(T)]^T, where x(t) ∈ R^N collects the joint coordinates of the t-th frame.

The design of the MTA-RNN is based on the residual block X = X + F(X), in which F(X) is a recalibration function. Taking the original residual block in ResNets as the reference, the MTA-RNN maps the input X_i to the output X_o to capture richer temporal information:

X_o = X_i + F(X_i),

where X_o is the final output of the MTA-RNN and F(X_i) is the temporal feature recalibrated by the ARM. As shown in Figure 2, F(X_i) is computed from two branches, F_MT(X_i) and F_A(X_i): the MT-TS-LSTM and the ARM are integrated into a unified framework, where F_MT(X_i) represents the fused temporal information of the MT-TS-LSTM and F_A(X_i) represents the attention weights. The recalibrated temporal feature is therefore:

F(X_i) = F_MT(X_i) ⊙ F_A(X_i),

where ⊙ denotes element-wise multiplication.

The structure of the TS-LSTM is shown in Figure 3. A TS-LSTM can be adapted to various dependencies by controlling its temporal stride and internal LSTM time-step size. To minimize the complexity of the network and reduce GPU occupancy, we apply three parallel TS-LSTM networks (long-term, medium-term, and short-term) to obtain temporal features from skeleton sequences simultaneously. Compared with a straightforward LSTM, the MT-TS-LSTM can capture various temporal dependencies, including long-term, medium-term, and short-term dependencies, which is effective for classifying various actions. For the l-th TS-LSTM we can set several parameters: the number of LSTM networks (N_l), the LSTM window size (W_l), and the temporal stride (TS_l). For variable-length action sequences, more temporal features can be obtained by adjusting the window size and temporal stride of the TS-LSTM. For example, when N_l = 3, W_l = 5, and TS_l = 2, the conceptual structure of the TS-LSTM is shown in Figure 3b.
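To make the windowing concrete, the following minimal Python sketch enumerates the frame windows one TS-LSTM branch would cover for a given window size W_l and temporal stride TS_l; the parameter names follow the paper, but the exact windowing scheme of [8] may differ in detail:

```python
# Illustrative sketch: enumerate the frame windows covered by one TS-LSTM
# branch, given its window size W_l and temporal stride TS_l.

def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame index pairs, one per LSTM in the branch."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]

# The example from the text: N_l = 3, W_l = 5, TS_l = 2.
# Over a 9-frame segment this yields exactly three windows.
for start, end in sliding_windows(9, window=5, stride=2):
    print(f"LSTM window covers frames {start}..{end - 1}")
```

With W_l = 5 and TS_l = 2, consecutive windows overlap by three frames, so neighboring LSTMs within one branch share temporal context.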
The input X_i of the TS-LSTM is expressed as X_t^l in the MT-TS-LSTM, where l is the TS-LSTM module number and t indexes the t-th frame. Lee et al. [8] give the detailed formulas of this method, including the memory cell, input gate, output gate, forget gate, and output vector for the t-th frame of the n-th LSTM of the l-th TS-LSTM.
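For reference, the standard LSTM cell equations that [8] instantiates for each window can be written as follows (our notation; the superscripts (l, n) indexing the l-th TS-LSTM's n-th LSTM are omitted for readability):

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) && \text{(input gate)}\\
f_t &= \sigma\!\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) && \text{(forget gate)}\\
o_t &= \sigma\!\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) && \text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) && \text{(memory cell)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(output vector)}
\end{aligned}
```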
In the MTA-RNN, X ∈ R^(T×N) is first reshaped through an FC layer as X_i ← FC(X) ∈ R^(T×K), where K represents the number of neuron units in a TS-LSTM, and n denotes the number of TS-LSTMs in the MT-TS-LSTM. X_i ∈ R^(T×K) can be expressed as X_i = [x(1), x(2), . . . , x(T)]^T with x(t) ∈ R^K.
Different from other models, we extract a variety of action features at different time scales through multiple LSTM networks with different parameters, enriching the types and quantity of features. Three parallel TS-LSTM networks are applied in one MT-TS-LSTM: a long-term, a medium-term, and a short-term TS-LSTM network. The output of the long-term TS-LSTM network is F_L(X_i) ∈ R^(T×K), which summarizes the memory information of the skeleton joints across the whole sequence of the TS-LSTM. Similarly, the final outputs of the medium-term and short-term TS-LSTM networks are F_M(X_i) ∈ R^(T×K) and F_S(X_i) ∈ R^(T×K), respectively.
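The parallel-branch structure can be sketched in Keras (the framework used in this work) roughly as follows. The shapes T, N, K are assumed values, and each TS-LSTM branch is approximated here by a single full-sequence LSTM for brevity; in the actual MT-TS-LSTM the branches differ in their window size W_l and stride TS_l:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative sketch of one MTA-RNN's parallel temporal branches.
T, N, K = 75, 25, 128  # frames, joints, LSTM units (assumed)

x = layers.Input(shape=(T, N), name="x_subset")   # X in R^(T x N)
x_i = layers.Dense(K, name="fc_embed")(x)         # X_i = FC(X) in R^(T x K)

# return_sequences=True keeps one K-dim feature per frame, i.e. R^(T x K).
f_long = layers.LSTM(K, return_sequences=True, name="long_term")(x_i)
f_medium = layers.LSTM(K, return_sequences=True, name="medium_term")(x_i)
f_short = layers.LSTM(K, return_sequences=True, name="short_term")(x_i)

branches = models.Model(x, [f_long, f_medium, f_short])
branches.summary()
```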

Attention Recalibration Based on Fusion Features
After the MTA-RNN obtains the multi-scale temporal features captured by the long-term, medium-term, and short-term TS-LSTM networks, we fuse these three features by weights, aiming to obtain robust temporal features from the multi-scale information.
To find reasonable weights for the temporal features and obtain better experimental results, we evaluated multiple sets of weights and analyzed them with the Analytic Hierarchy Process (AHP) [10]. Based on the experiments, p_L = 3, p_M = 2, and p_S = 1 are selected as the weights of F_L(X_i), F_M(X_i), and F_S(X_i), respectively. This allocation among the three LSTMs with different terms yields more effective fused temporal features and lays the foundation for the subsequent attention recalibration of the fused features. Their weighted average is then computed as the final output F_MT(X_i):

F_MT(X_i) = (p_L F_L(X_i) + p_M F_M(X_i) + p_S F_S(X_i)) / (p_L + p_M + p_S).

Considering action features of different time scales through this weighted fusion is what distinguishes our method from others. F_MT(X_i) ∈ R^(T×K) summarizes the temporal information of the skeleton joints across the sequence in the MTA-RNN and can be expressed as F_MT(X_i) = [x(1), x(2), . . . , x(T)]^T, where x(t) is the row vector of the t-th frame. Each frame has a different effect on whether the action can be correctly classified: most frames in the sequence provide contextual information, and only keyframes contain the discriminative information that plays a crucial role in action recognition. Based on this observation, we add an attention mechanism to the MTA-RNN, called the ARM, to pay different levels of attention to different frames of the fused temporal features.
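A standalone sketch of the weighted fusion step with the AHP-selected weights p_L = 3, p_M = 2, p_S = 1 follows; the branch outputs are random placeholders here, and the shapes are assumed:

```python
import tensorflow as tf

# Stand-ins for the three branch outputs F_L, F_M, F_S (batch, T, K).
T, K = 75, 128                                   # assumed shapes
f_long = tf.random.normal((1, T, K))             # F_L(X_i), placeholder
f_medium = tf.random.normal((1, T, K))           # F_M(X_i), placeholder
f_short = tf.random.normal((1, T, K))            # F_S(X_i), placeholder

p_l, p_m, p_s = 3.0, 2.0, 1.0
# F_MT = (p_L*F_L + p_M*F_M + p_S*F_S) / (p_L + p_M + p_S)
f_mt = (p_l * f_long + p_m * f_medium + p_s * f_short) / (p_l + p_m + p_s)
print(f_mt.shape)                                # (1, 75, 128)
```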
To recalibrate the fused temporal feature F_MT, the attention weight F_A is exploited as F(X_i) = F_MT(X_i) ⊙ F_A(X_i). The attention recalibration scheme can capture global frame-wise dependence across the T frames. First, each row vector of F_MT(X_i) is summed up through an average pooling operation, generating a T × 1 vector:

g = AvgPool(F_MT(X_i)) ∈ R^(T×1).

As shown in Figure 2, the lower branch of the MTA-RNN is the attention module. Two FC layers provide non-linear interaction between frames: a dimensionality-reduction layer and a dimensionality-increasing layer carry out denoising and excitation operations, respectively, enhancing the discriminative ability of the features. The output of the attention branch F_A(X_i) is:

F_A(X_i) = σ(W_2 θ(W_1 g)),

where θ(·) denotes the ReLU activation function, σ(·) the sigmoid function, W_1 the parameters of the dimensionality-reduction layer, and W_2 the parameters of the dimensionality-increasing layer. Finally, F(X_i) is obtained by the element-wise multiplication of F_MT and F_A. As shown in Figure 2, to compute the output feature map of the MTA-RNN, F(X_i) is updated by an FC layer as X_FC ← FC(X_i) ∈ R^(T×K). F_MT and F_A are jointly learned during training, similar to a residual block.
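A minimal Keras sketch of the ARM follows. The reduction ratio r of the bottleneck FC layer is our assumption (the paper does not state it), and placing the final FC layer before the residual addition is our reading of Figure 2:

```python
import tensorflow as tf
from tensorflow.keras import layers

T, K, r = 75, 128, 8                              # assumed shapes and ratio

x_i = layers.Input(shape=(T, K), name="x_i")      # MTA-RNN input X_i
f_mt = layers.Input(shape=(T, K), name="f_mt")    # fused feature F_MT(X_i)

# Average pooling over the K feature channels -> one scalar per frame.
g = layers.Lambda(lambda z: tf.reduce_mean(z, axis=-1), name="avg_pool")(f_mt)
w = layers.Dense(T // r, activation="relu", name="reduce")(g)    # theta(W1 g)
w = layers.Dense(T, activation="sigmoid", name="increase")(w)    # sigma(W2 .)
f_a = layers.Reshape((T, 1), name="f_a")(w)       # per-frame weights F_A(X_i)

f = layers.Multiply(name="recalibrate")([f_mt, f_a])  # F = F_MT (*) F_A
f = layers.Dense(K, name="fc_out")(f)             # FC update of F(X_i)
x_o = layers.Add(name="residual")([x_i, f])       # X_o = X_i + F(X_i)
```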
Similarly, the outputs Y_o and Z_o of the Y-axis and Z-axis temporal features are obtained from Y_i and Z_i by their corresponding MTA-RNNs, and the final output is the three feature maps X_o, Y_o, and Z_o stacked as a 3-channel image O ∈ R^(T×K×3).
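A minimal example of assembling the 3-channel input for the ST-CNN (shapes assumed, outputs as random placeholders):

```python
import numpy as np

# Stack the three MTA-RNN outputs into one 3-channel "image" O for the ST-CNN.
T, K = 75, 128                      # assumed shapes
x_o = np.random.randn(T, K)         # X-axis MTA-RNN output (placeholder)
y_o = np.random.randn(T, K)         # Y-axis output (placeholder)
z_o = np.random.randn(T, K)         # Z-axis output (placeholder)

image = np.stack([x_o, y_o, z_o], axis=-1)
print(image.shape)                  # (75, 128, 3): a T x K map with 3 channels
```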

Spatio-Temporal Convolution Neural Network
Some skeleton-based action recognition methods are limited in modeling capacity because they apply only RNNs, learning temporal features while leaving the inherent spatial relations in the joint coordinates underexploited. The ST-CNN is therefore proposed to further extract the spatio-temporal features of the skeleton information.
In theory, any CNN can be selected for the ST-CNN. In our work, ResNets [24] are used for training, and ReLU is selected as the activation function. O_C denotes the output of the ST-CNN that is fed to the softmax classifier; its high-level spatial structure can be written as:

O_C = FC(Pool(Conv(Conv(. . . (Conv(O)) . . .)))),

where O is the 3-channel input assembled from the MTA-RNN outputs.
Then, O_C is fed into the softmax classifier to predict the action class:

ŷ = softmax(W_C O_C),

where W_C represents the weights of the softmax layer and ŷ represents the predicted action label. Finally, the cross-entropy loss function [25] is adopted to measure the difference between the true class label y and the prediction ŷ.
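The classification head can be sketched as follows; a small plain conv stack stands in for ResNet-18 here only to keep the sketch short, and the shapes are assumed:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal stand-in for the ST-CNN plus the softmax classifier,
# trained with the cross-entropy loss.
T, K, num_classes = 75, 128, 60

model = models.Sequential([
    layers.Input(shape=(T, K, 3)),              # stacked X_o/Y_o/Z_o maps
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),            # Pool(...) in O_C
    layers.Dense(num_classes, activation="softmax"),  # y_hat = softmax(W_C O_C)
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
```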

Experiments and Analysis
In this section, we evaluate the proposed method and compare it with several recent methods on four benchmark datasets: the NTU RGB+D dataset [23], UT-Kinect Action dataset [26], Northwestern-UCLA dataset [27], and UWA3DII dataset [28]. We also analyze the relationship between the recognized actions and the features obtained by the MTANs, and we perform ablation experiments to verify that each module performs well.

NTU RGB+D Dataset
The NTU RGB+D dataset [23] is currently one of the largest skeleton-based action recognition datasets. It contains 56,880 sequences with 4 million frames, captured by three Microsoft Kinect v2 cameras, and covers 60 action classes in three categories. Each skeleton sequence has 25 joints. The many viewpoints and the large number of action classes make it a challenging dataset.
The Cross-Subject (CS) and Cross-View (CV) protocols are the two standard evaluation protocols defined by Shahroudy et al. [23] for this dataset, and our evaluations follow them. In the CS evaluation, subjects are assigned to the training and test groups by designated IDs, with 20 subjects in each group. In the CV evaluation, samples from cameras 2 and 3 are used for training, and samples from camera 1 are used for testing.
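A hypothetical helper for assigning samples under the two protocols might look as follows; the training-subject IDs are the standard split defined by Shahroudy et al. [23]:

```python
# Training-subject IDs for the NTU RGB+D Cross-Subject protocol [23].
CS_TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16,
                     17, 18, 19, 25, 27, 28, 31, 34, 35, 38}

def is_training_sample(subject_id: int, camera_id: int, protocol: str) -> bool:
    """Return True if a sample belongs to the training set under CS or CV."""
    if protocol == "CS":
        return subject_id in CS_TRAIN_SUBJECTS
    if protocol == "CV":
        return camera_id in (2, 3)   # cameras 2 and 3 train, camera 1 tests
    raise ValueError(f"unknown protocol: {protocol}")
```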

UT-Kinect Action Dataset
The UT-Kinect Action dataset [26] was captured by a single stationary Kinect. Its skeleton sequences are very noisy, and it contains 199 action sequences in total: 10 subjects perform 10 action classes, with each subject performing each action twice. In accordance with the protocol of [26], half of the subjects are used for training and the other half for testing.

Northwestern-UCLA Dataset
The Northwestern-UCLA dataset [27] contains 10 action classes with a total of 1494 sequences, captured simultaneously by three Microsoft Kinect v1 cameras from different viewpoints. Ten subjects perform each action one to six times. Following the evaluation protocol of [27], samples from cameras 1 and 2 are used for training, and samples from camera 3 are used for testing.

UWA3DII Dataset
The UWA3DII dataset [28] contains 30 human actions, with 10 subjects performing each action four times. Each action is observed from four viewpoints (front, left, right, and top) captured by four Microsoft Kinect v1 cameras. Following the cross-view protocol of [28], the samples of two views are used for training, and the samples of the other two views are used for testing.

Experiment Design
For all datasets, the frames of each skeleton sequence are divided into three subsets that form the three input matrices X_i, Y_i, and Z_i, from which we obtain the three temporal features X_o, Y_o, and Z_o. The number of units in the last FC layer (the output layer) equals the number of action classes in each dataset. The Root Mean Square Prop (RMSprop) algorithm is selected to train the MTANs; it alleviates excessive oscillation and further accelerates convergence. The attenuation values of the learning rate and the attenuation factor are set to 0.01 and 0.9, respectively. For a fair comparison, the performance of the MTANs on each dataset is compared with existing methods that use the same evaluation protocol. All experiments are carried out on an NVIDIA Quadro P5000 GPU with Keras 2 and the TensorFlow backend.
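In the Keras 2 API used here, the optimizer configuration might be set up as follows; we read "attenuation factor" as RMSprop's rho = 0.9 and the learning-rate attenuation as decay = 0.01, and the base learning rate itself is not given in the text and is assumed:

```python
from keras.optimizers import RMSprop  # Keras 2 API, as used in this work

# lr is an assumed base learning rate; decay and rho follow the stated
# attenuation values of 0.01 and 0.9.
optimizer = RMSprop(lr=1e-3, rho=0.9, decay=0.01)
# The optimizer would then be passed to model.compile(...) as usual.
```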

Ablation Experiment
A wide range of ablation studies, in five parts, has been carried out on the NTU RGB+D dataset for the different units in the MTANs; the settings and results are listed in Table 3. "1C (one channel)" means the input is not divided into three subsets aligned with the X-, Y-, and Z-axes and is trained together, while "3C (three channels)" means the input is divided into three axis-aligned subsets and trained separately. "Single" indicates that there is only a single long-term TS-LSTM in the MTA-RNN, while "Multiple" applies the multi-term TS-LSTM with long-term, medium-term, and short-term branches. The last setting is whether the ARM module is used to recalibrate the temporal features. We apply ResNet-18 in the ST-CNN for all experiments because, compared with deeper residual networks such as ResNet-50, ResNet-18 achieves similar classification accuracy with relatively faster training. As shown in Table 3: (1) the overall accuracy of one channel is lower than that of three channels, indicating that modeling the three axis-aligned subsets separately is better than modeling all skeleton coordinates together; (2) the accuracy of a single long-term TS-LSTM is lower than that of the multi-term TS-LSTM, indicating that the multi-term TS-LSTM in the MTA-RNN captures better temporal dependencies between actions; (3) adding the ARM module yields higher recognition accuracy, revealing the effectiveness of attention recalibration for temporal features. These experiments validate the superiority of the MTANs and show that the training of the three axis-aligned subsets, the MT-TS-LSTM, and the ARM are all indispensable.
To better show the effect of the ARM module, we visualize the attention weights to analyze which frames contribute to the recognition result. Figure 4 shows the temporal attention weights and a visual skeleton diagram of the keyframes. In the attention weight heatmap, the green color from light to dark represents temporal attention weight from low to high. We can see that the keyframes (the frames with the top-3 weights), "lift the hand", "move hand left", and "move hand right", play a key role in action recognition.
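This kind of visualization can be reproduced with a few lines of Python; the attention vector below is a random placeholder standing in for the ARM output F_A over T frames:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch of the attention-weight visualization: plot per-frame weights as a
# heatmap and report the top-3 frames (the "keyframes" discussed above).
T = 75
attn = np.random.rand(T)                     # placeholder attention weights

top3 = np.argsort(attn)[-3:][::-1]
print("keyframe indices (top-3 weights):", top3)

plt.imshow(attn[np.newaxis, :], aspect="auto", cmap="Greens")
plt.yticks([])
plt.xlabel("frame")
plt.colorbar(label="temporal attention weight")
plt.show()
```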

Feature Concatenation and Weighted Feature Fusion
In the MTANs, we select appropriate weights to fuse the temporal features with different time scales, which are extracted from the multiple LSTMs with different terms. We designed an experiment to compare the feature concatenation method with the weighted feature fusion method. Table 4 shows the performance of both variants on the NTU RGB+D dataset when processing temporal features with different time scales: concatenating the multiple temporal features yields lower accuracy than the weighted feature fusion, which proves the superiority of the weighted fusion method.

Direct Attention Recalibration and Fusion-Feature-Based Attention Recalibration
In the MTANs, we apply the attention mechanism to recalibrate the fused features extracted from the LSTMs with different time scales. We designed a set of experiments to compare this with directly recalibrating the temporal features obtained by each LSTM.
As shown in Table 5, the accuracy of the model with fusion-feature-based attention recalibration is higher than that of the model with direct attention recalibration, which demonstrates the feasibility and good performance of applying attention recalibration to fused features.

Combined Model and MTANs
The framework of the MTANs is inspired by MAN [9] and Ensemble TS-LSTM [8], but it is not a simple combination of the two networks: we introduce new designs in the multi-scale network settings and the feature fusion scheme, and we apply attention recalibration to the fused features. To prove the effectiveness of the MTANs, we conducted an experiment on the simple combination of the unchanged Ensemble TS-LSTM and MAN, called the combined model.
We compared the accuracy of the combined model and the MTANs. As Table 6 shows, the simple combination of the two networks cannot obtain satisfactory results, and training the combined model is also time-consuming. This shows that the changes we made are meaningful.

NTU RGB+D Dataset Results
As shown in Table 7, the MTANs achieve better results than other methods under both CS and CV evaluation: 84.74% accuracy for CS and 93.23% for CV, outperforming MAN [9] by 1.73% and 2.57%, respectively. These experiments show that the multi-term TS-LSTM in the MTA-RNN can extract the temporal features of actions with different time scales, and that the attention recalibration of the fused temporal features significantly improves the acquisition of temporal dependencies. This proves the superiority of the MTANs in skeleton-based action recognition.
Table 7. Comparison on the NTU RGB+D dataset with CS and CV evaluation in accuracy.

UWA3DII Dataset Results
The UWA3DII dataset is challenging due to viewpoint variations and high similarity among actions, but the proposed MTANs still perform well. Compared with the previous best method, Ensemble TS-LSTM (75.6%) [8], the average accuracy of our method improves to 81%.
The fact that the proposed MTANs work well on all four datasets demonstrates the effectiveness of the proposed MT-TS-LSTM and the fusion-feature-based attention recalibration module.

Analysis of Results
To evaluate the proposed MTANs in detail, we take the Northwestern-UCLA dataset as an example. Figure 5 shows the confusion matrix obtained from a random test run; the value in each column is the probability that the true class is predicted as that category. Although the dataset contains actions with different time scales, and many participants completed the same action in different amounts of time, almost all actions are classified correctly. Actions such as "walk around", "sit down", and "stand up" reach 100% accuracy. The accuracy for "donning" and "doffing" is 95% for both, far exceeding the 75% and 84% obtained by MAN [9] (accuracies from our reproduction of MAN). This shows that, for actions with different time scales, our method extracts high-level temporal and spatial features and achieves excellent results, and it confirms that the weighted fusion of action features at different time scales helps to improve recognition accuracy.
For the actions that are classified less accurately, one reason is that when two side views are used as the training set, self-occlusion of the human body easily occurs; for example, "pick up with two hands" is easily confused with "pick up with one hand". The other reason is that some actions overlap considerably in the skeleton sequence; for example, "drop trash" and "throw" are easy to confuse, even for human perception. The ST-CNN in the proposed MTANs extracts richer spatio-temporal features than methods containing only RNNs, so the MTANs can, to some extent, classify these similar actions correctly.

Conclusions
In this paper, we propose a novel method termed Multi-Term Attention Networks (MTANs) for skeleton-based action recognition. Three MTA-RNNs are designed to extract detailed temporal dynamics. In each MTA-RNN, the Multi-Term Temporal Sliding LSTM (MT-TS-LSTM) captures high-level fused temporal features, including long-term, medium-term, and short-term temporal dependencies of actions, and the Attention Recalibration Module (ARM) recalibrates the weighted fused features to enhance the temporal features. The ST-CNN further models the spatial structure and temporal dependence of the skeleton sequences. By extracting temporal features at different time scales, the MTANs solve the recognition problem for actions with large time-scale differences. Experimental results demonstrate the effectiveness of the proposed method, especially in handling actions with different time scales; it achieves better results than state-of-the-art methods on four challenging benchmarks.