A Multi-Feature Representation of Skeleton Sequences for Human Interaction Recognition

Abstract: Inspired by the promising performance achieved by recurrent neural networks (RNN) and convolutional neural networks (CNN) in skeleton-based action recognition, this paper presents a deep network structure that combines a CNN for classification with an RNN that implements an attention mechanism for human interaction recognition. Specifically, the attention module assigns different weights to different frames, and the CNN extracts the high-level spatial and temporal information of the skeleton data. These two modules form a single network architecture. In addition, to eliminate the impact of different locations and orientations, a coordinate transformation is conducted from the original coordinate system to a human-centric coordinate system. Furthermore, three different features are extracted from the skeleton data as the inputs of three subnetworks, respectively. Finally, these subnetworks fed with different features are fused into an integrated network. The experimental results show the validity of the proposed approach on two widely used human interaction datasets.


Introduction
Human action recognition and interaction recognition have recently attracted intensive attention from researchers in the computer vision field due to their extensive application prospects, such as intelligent surveillance and human-machine interaction. Most previous methods are devoted to human action recognition in two-dimensional RGB data [1,2]. However, because RGB data are highly sensitive to environmental variability, precise action recognition remains a challenging task. Some previous works have contributed to overcoming these challenges [3,4]. In [3], Rezazadegan et al. proposed an action region proposal method that is informed by optical flow and extracts image regions likely to contain actions, which can eliminate the influence of the background. This problem can also be addressed by using cost-efficient RGB-D (i.e., color plus depth) sensors [5]. Generally, depth sensors can provide three-dimensional (3D) information of the human body in more detail and with high robustness to variations in perspective [6]. Therefore, human action recognition based on 3D skeletons has become a hot research topic.
A human action can be described by a time sequence of skeletons. Both temporal and spatial information are significant for skeleton based action recognition. Several types of recurrent neural networks (RNN), including long short-term memory (LSTM) [7] and gated recurrent units (GRU) [8], have shown great advantages in processing sequence data. However, temporal modeling alone is not always suitable for skeleton sequences. Moreover, the RNN lacks the ability of spatial modeling, while convolutional neural networks (CNN) have natural advantages in extracting spatial features. In this work, considering that RNN and CNN have their own advantages in skeleton based action recognition, we construct a deep neural network by merging the RNN with the CNN.
To be specific, bidirectional gated recurrent units (BiGRU) are used to implement the attention mechanism, and a convolutional network is used for classification. An ensemble network including three subnets with the same structure is presented to learn diverse features for better accuracy. The proposed method is assessed on two classic benchmark datasets, namely, the SBU Interaction Dataset [9] and the NTU RGB + D Dataset [10], and achieves promising performance.
The main contributions of this paper are listed in the following two aspects:

1. A multi-feature representation method for interaction skeleton sequences is proposed for extracting various and complementary features. Specifically, three subnets fed with these features are fused into an ensemble network for recognition.
2. A framework combining RNN with CNN is designed for skeleton based interaction recognition, which can model the complex spatio-temporal variations in skeleton joints.

Related Work
In this section, the work related to the proposed method is briefly reviewed, including RNN based methods and CNN based methods for skeleton-based 3-D action recognition and interaction recognition.
RNN based methods: Some previous works have successfully applied RNN to skeleton based action recognition [11][12][13]. In [11], Du et al. divided the whole skeleton into five parts according to the physical structure of the human body and fed them into five bidirectional LSTMs jointly to make the recognition decision. Zhu et al. [12] proposed an end-to-end fully connected deep LSTM network with a novel regularization scheme to learn the co-occurrence features of skeleton joints. In addition, they applied a new dropout algorithm to train the network. Liu et al. [13] proposed a spatio-temporal LSTM network which can model both temporal and spatial information. Based on LSTM, Song et al. [14] proposed a spatio-temporal attention model, which can automatically focus on the discriminative joints and assign different attention weights to each frame.
CNN based methods: Some previous works have employed CNN for skeleton based action recognition and achieved great success [15][16][17][18][19][20]. Ke et al. [19] represented the sequence as three clips, one for each channel of the 3D coordinates, which reflect the temporal information of the skeleton sequence and the spatial relationships. Li et al. [21] generated multiple views from skeleton sequences to learn discriminative features, including spatial domain features and temporal domain features, and adopted a multi-stream CNN fusion method to combine the recognition scores of all views. To exploit the spatio-temporal information from skeleton sequences, Kim and Reiter [22] used temporal convolutional neural networks (TCN) for skeleton based action recognition, which provided a way to explicitly learn readily interpretable spatio-temporal representations for 3D human action recognition.
Research on human action recognition has produced many impressive results with the development of RGB-D sensors [23]. However, few works address human interaction recognition. Compared with single-person action recognition, two-person interaction recognition is more complex and difficult [24,25]. Some early works [24] proposed decomposing a human interaction into single-person actions for recognition. In fact, human interaction behavior contains many features, including both individual action information and mutual relations, which can be utilized to obtain better recognition results.

Proposed Method
As shown in Figure 1, the basic framework of the proposed subnet is composed of two modules: an attention module and a classification module. The input skeleton sequence consists of multiple frames, and one column vector in the image matrix denotes one frame. Every frame consists of 3-dimensional joint coordinates. We separate these coordinates into the x, y, and z dimensions, which correspond to the R, G, and B channels, respectively. For each channel, the attention mechanism is used to learn the temporal weights of the frames. After that, the three channels are concatenated into one tensor, which is fed into the classification module. This section is organized as follows. Firstly, we introduce the different processes of transforming skeleton sequences into color images with three RGB channels; these images are used as the inputs of the three subnetworks. Then the attention mechanism for action recognition is presented. Finally, an ensemble network with the attention mechanism is constructed for interaction recognition.
Electronics 2020, 9, 187

Multi-Feature Representation from Skeleton Sequences
Like the previous work in [26,27], the coordinates of the joints in one skeleton sequence can be arranged in a matrix, and the three coordinates x, y, z of each joint represent the corresponding channels R, G, B of a color image. The coordinates of the j-th joint in each frame can be denoted as Equation (1)

$$P_j = (P_{xj}, P_{yj}, P_{zj}) \tag{1}$$

where $j \in \{1, 2, \cdots, m\}$ and m denotes the number of joints in each frame. For each frame, assume that there are two subjects and each subject has m joints, so there are 2 × m joints in total for all subjects in each frame. Let the j-th joint of the i-th performer be $P^i_j$; then the joints at the t-th frame can be represented as Equation (2)

$$P_t = \{P^1_1, P^1_2, \cdots, P^1_m, P^2_1, P^2_2, \cdots, P^2_m\} \tag{2}$$

In order not to destroy the relation among these joints, the joints are numbered in a fixed order which determines the arrangement of all the joints. Considering that the human skeleton consists of five parts (one trunk, two arms, and two legs), we adopt two kinds of orders for skeleton arrangement: part-based order and traversal-based order [26]. Figure 2 shows the two orders on the NTU RGB + D dataset [10], where one skeleton is composed of 25 joints. The whole interaction sequence can then be represented as Equation (3)

$$P = \{P_1, P_2, \cdots, P_T\} \tag{3}$$

where T denotes the number of frames in an interaction sequence. These coordinate elements can be regarded as RGB elements in images. In this way, we transform the original skeletal data into 3D tensors which can be sent to neural networks for training. We process the converted skeleton data into three different features for better performance.
Feature 1: we separate the two subjects and arrange them in one matrix as an image. For each subject, we adopt the part-based order. Figure 3 shows some feature images generated from sample actions.

Feature 2: for different kinds of interaction, the relationship information between the two subjects is distinct, and it can be represented by the distances between corresponding skeleton joints of the two subjects in the same frame. Let $P^1_i(t)$ and $P^2_i(t)$ denote the i-th joint coordinates of the first and second subject at the t-th frame. The Euclidean distance $D_i(t)$ between the corresponding joints of the two subjects is denoted as Equation (4)

$$D_i(t) = \|P^1_i(t) - P^2_i(t)\|, \quad i \in \{1, 2, \ldots, m\},\ t \in \{1, 2, \ldots, T\} \tag{4}$$

where m denotes the number of joints and T refers to the total number of frames. In this feature mode, we adopt the traversal-based order. An interaction instance D of all T frames can then be represented as Equation (5)

$$D = \{D(1), D(2), \cdots, D(T)\} \tag{5}$$

where $D(t)$ collects the m distances $D_i(t)$ at frame t.

Feature 3: we enhance the relationship information by using the distances between different skeleton joints of the two subjects in the same frame. The enhanced cross joint distance $D_{ij}(t)$ between joint i of one performer and joint j of the other at frame t can be represented as Equation (6)

$$D_{ij}(t) = \|P^1_i(t) - P^2_j(t)\|, \quad i, j \in \{1, 2, \ldots, m\} \tag{6}$$

where i and j denote the joint indices of the two performers independently. For this feature, there are m × m joint distances for each frame. An interaction instance D of T frames can then be represented as Equation (7)

$$D = \{D(1), D(2), \cdots, D(T)\} \tag{7}$$

where $D(t)$ now collects the m × m distances $D_{ij}(t)$ at frame t. It can be seen that all these features include the single-person action features as well as the relationship of the human interaction.
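As an illustration of the three features, the following NumPy sketch builds them from two skeleton arrays of shape (T, m, 3). The function names, the array layout, and the random test data are assumptions made for this example. The part-based and traversal-based joint reordering is omitted, and since the mapping of the distance features onto three image channels is not specified here, the sketch returns single-channel distance maps.

```python
import numpy as np

def feature1(p1, p2):
    """Stack both subjects' joints: two (T, m, 3) arrays -> (3, 2m, T)."""
    joints = np.concatenate([p1, p2], axis=1)      # (T, 2m, 3)
    return joints.transpose(2, 1, 0)               # channels x joints x frames

def feature2(p1, p2):
    """Distances between corresponding joints, as in Equation (4) -> (m, T)."""
    return np.linalg.norm(p1 - p2, axis=-1).T      # (T, m) -> (m, T)

def feature3(p1, p2):
    """Distances between every cross-subject joint pair, as in Equation (6) -> (m*m, T)."""
    diff = p1[:, :, None, :] - p2[:, None, :, :]   # (T, m, m, 3) via broadcasting
    dist = np.linalg.norm(diff, axis=-1)           # (T, m, m)
    return dist.reshape(dist.shape[0], -1).T       # row index is i * m + j

# Hypothetical data: T = 4 frames, m = 15 joints per subject (as in SBU).
rng = np.random.default_rng(0)
T, m = 4, 15
p1 = rng.normal(size=(T, m, 3))
p2 = rng.normal(size=(T, m, 3))
f1, f2, f3 = feature1(p1, p2), feature2(p1, p2), feature3(p1, p2)
```

With m = 15, feature 3 has 15 × 15 = 225 rows per frame, and the entries with i = j reproduce feature 2.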

Attention Mechanism
Motivated by the human attention mechanism, we design an attention mechanism in our ensemble network. Humans usually focus on the specific parts of a visual subject that interest them and on the critical moments when a behavior occurs. The skeleton data are a time sequence of multi-frame 3D joint coordinates forming an action, and different frames have different levels of significance for recognition. For example, for the interactions punching and handshaking, the actions in most of the frames are similar, so we should pay more attention to the key frames, which carry more effective information. Inspired by the attention mechanisms in [14,28], we assign each frame a different attention weight in order to emphasize key frames that contain important and discriminative information.
The attention module learns frame-level attention with a BiGRU, whose memory cells capture temporal information across the input interaction sequence. As shown in Figure 3, the output of the BiGRU is determined by the forward GRU and the backward GRU, so it can attend to the skeleton sequence using context from both directions. More specifically, the output of the attention module can be represented by Equation (8)

$$F(X) = F_A(X) \circ X \tag{8}$$

where X is a column vector in Equations (3), (5), and (7) for the three features, i.e., one frame of the action, $F_A(X)$ is the weight of the frame vector X used to enhance temporal information, and $\circ$ refers to element-wise multiplication. $F_A(X)$ can be computed as Equation (9), where $\sigma(\cdot)$ refers to the sigmoid activation function, and $\overrightarrow{GRU}(x_t)$ and $\overleftarrow{GRU}(x_t)$ denote the hidden variables of the forward GRU and backward GRU at frame t. The attention module automatically learns the attention weight $F_A$ of different frames from the BiGRU output $F_A(X_t)$. Among these frames, the larger the value of the activation function, the more important the frame is for determining the category of interaction.
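The frame weighting described above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the GRU cell uses one common gate convention, the parameters are random rather than trained, and the way the forward and backward hidden states are combined (concatenation followed by a hypothetical linear projection `w_out` to one scalar per frame) is an assumption, since that combination is not spelled out here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru(rng, n_in, n_hid):
    """Random (untrained) GRU parameters: one (W, U, b) triple per gate."""
    def triple():
        return (rng.normal(scale=0.1, size=(n_hid, n_in)),
                rng.normal(scale=0.1, size=(n_hid, n_hid)),
                np.zeros(n_hid))
    return {"z": triple(), "r": triple(), "h": triple()}

def gru_pass(params, xs):
    """Run a GRU over frames xs of shape (T, n_in); return states of shape (T, n_hid)."""
    Wz, Uz, bz = params["z"]
    Wr, Ur, br = params["r"]
    Wh, Uh, bh = params["h"]
    h = np.zeros(bz.shape[0])
    out = []
    for x in xs:
        z = sigmoid(Wz @ x + Uz @ h + bz)            # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)            # reset gate
        h_new = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
        h = (1.0 - z) * h + z * h_new
        out.append(h)
    return np.stack(out)

def frame_attention(xs, fwd, bwd, w_out):
    """Weight each frame by a sigmoid score computed from both GRU directions."""
    h_fwd = gru_pass(fwd, xs)
    h_bwd = gru_pass(bwd, xs[::-1])[::-1]            # reverse back to forward time order
    hidden = np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2 * n_hid)
    weights = sigmoid(hidden @ w_out)                # (T,), assumed scalar weight per frame
    return weights, xs * weights[:, None]            # weight applied to every frame vector

# Hypothetical sizes: 6 frames, 30 values per frame, 8 hidden units per direction.
rng = np.random.default_rng(0)
T, n_in, n_hid = 6, 30, 8
xs = rng.normal(size=(T, n_in))
fwd, bwd = init_gru(rng, n_in, n_hid), init_gru(rng, n_in, n_hid)
w_out = rng.normal(scale=0.1, size=2 * n_hid)
weights, attended = frame_attention(xs, fwd, bwd, w_out)
```

Because the sigmoid keeps every weight between 0 and 1, the module can only attenuate frames relative to one another; frames with weights near 1 pass through almost unchanged.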

Ensemble Network
Our network consists of two modules: bidirectional gated recurrent units for attention module and convolutional neural networks for classification module. For the attention module, the number of units in BiGRU is set to be 128, and the recurrent dropout rate is set to be 0.5.
By utilizing the robustness of CNNs to deformation, high-level feature representations can be extracted in the classification module to better cope with the spatio-temporal variations of skeleton joints. In principle, any CNN can be used in the classification module, e.g., DenseNet or ResNet. In our method, we use AlexNet [15] as our basic convolutional network, which is a very simple but effective network structure. Figure 4 shows the proposed convolutional module. We stack 3 Conv-ReLU-BN blocks. The convolutional strides and pooling strides are (2,2). In the convolutional layers, we use the ReLU activation function. After the blocks, we add a dropout layer and two FC layers; the number of units in the last FC layer (i.e., the output layer) is the number of action classes in each dataset. With different features as the inputs of the three subnetworks, we train these subnetworks both independently and globally. Cross-entropy is taken as the cost function, which can be described as Equation (10)

$$L = -\sum_{i=1}^{n} y_i \log \hat{y}_i \tag{10}$$

where $y_i$ is the i-th element of the one-hot vector of the true label, $\hat{y}_i$ is the corresponding element of the prediction vector, and n is the number of interaction classes.
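For a single sample, the cross-entropy cost can be evaluated as in the short sketch below. The logits are hypothetical values standing in for a subnetwork's output layer, and the softmax normalization and the small `eps` guard against log(0) are assumptions of this example.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability vector (numerically stable form)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy -sum_i y_i * log(y_hat_i) for one sample."""
    return float(-np.sum(y_true * np.log(y_pred + eps)))

logits = np.array([2.0, 0.5, -1.0])   # hypothetical output-layer scores, n = 3 classes
y_hat = softmax(logits)               # prediction vector
y = np.array([1.0, 0.0, 0.0])         # one-hot vector of the true label
loss = cross_entropy(y, y_hat)
```

Since y is one-hot, the sum reduces to the negative log-probability that the network assigns to the true class.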
The ensemble network framework is shown in Figure 5. We train the ensemble network end to end, and all outputs of the three subnetworks are joined to determine the recognition result of interaction classes. We apply two fusion methods [29] in our ensemble network.

(i) Product fusion
Based on the product rule, the three subnetworks' output score vectors are element-wise multiplied. Based on the highest score, the predicted class can be expressed as Equation (11)

$$label = \mathrm{argmax}(v_1 \circ v_2 \circ v_3) \tag{11}$$

where $v_1$, $v_2$, and $v_3$ are the score vectors output by the three subnetworks, $\circ$ means element-wise multiplication, and argmax(·) returns the class index of the element with the highest score.

(ii) Sum fusion
In a similar way to the afore-mentioned product rule, the class label is assigned as Equation (12) through the sum rule

$$label = \mathrm{argmax}(v_1 + v_2 + v_3) \tag{12}$$

where + denotes element-wise addition.
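The two fusion rules can be compared on a small example. The score vectors below are made-up values chosen so that the rules disagree; they are not results from the paper.

```python
import numpy as np

def product_fusion(scores):
    """Product rule: argmax of the element-wise product of the score vectors."""
    return int(np.argmax(np.prod(scores, axis=0)))

def sum_fusion(scores):
    """Sum rule: argmax of the element-wise sum of the score vectors."""
    return int(np.argmax(np.sum(scores, axis=0)))

# Hypothetical scores of three subnetworks over three classes (one row each).
scores = np.array([
    [0.01, 0.98, 0.01],   # subnetwork 1 almost rules out class 0
    [0.89, 0.10, 0.01],
    [0.89, 0.10, 0.01],
])
product_label = product_fusion(scores)
sum_label = sum_fusion(scores)
```

Here the sum rule selects class 0, which two subnetworks favor, whereas the product rule selects class 1, because one near-zero score suppresses class 0 in the product. The product rule therefore lets a single confident subnetwork veto a class.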

Dataset
In the following experiments, we assessed our proposed method on two widely used public datasets: the SBU Interaction Dataset [9] and the NTU RGB + D Dataset [10].
SBU-Kinect dataset. This dataset is a human interaction recognition dataset captured by Kinect and consists of two-person interactions. It contains 282 skeleton sequences and 6822 frames of eight classes. There are 15 joints for each skeleton. For fair comparison, we adopt the five-fold cross validation protocol suggested in [9].
NTU RGB + D Dataset. The NTU dataset is a high-quality action recognition dataset consisting of more than 56,000 action samples. It provides 3-dimensional coordinates of skeleton joints. There are 60 classes of actions in total, carried out by 40 subjects, and 11 of the 60 classes are interaction behaviors. The large variations in viewpoint, intra-class appearance, and sequence length make it challenging. For fair comparison, we follow the standard cross-subject and cross-view evaluation protocols in [10].

Implementation Details
For the original NTU RGB + D dataset, we transformed the original coordinate system to a human-centric coordinate system. Different from [30], we always chose the first person's body center as the origin of the coordinate system in order to better express the relative position between the two subjects. Furthermore, the coordinate transformation can eliminate the influence of different perspectives of actions. Figure 6 shows the proposed human-centric coordinate system. The coordinate transformation is calculated as Equation (13)

$$\vec{H}' = R\,(\vec{H} - C) \tag{13}$$

where $\vec{H}$ and $\vec{H}'$ are the original coordinate and the converted coordinate, C is the coordinate of the body center of the first person, and R is the rotation matrix.
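A human-centric transformation of this kind can be sketched as a translation to the body center followed by a rotation. The translate-then-rotate form is assumed here because it is what places the body center at the new origin. How the rotation matrix is derived from the skeleton is not described in this section, so the rotation about the z-axis used below is purely a hypothetical example.

```python
import numpy as np

def to_human_centric(joints, center, rotation):
    """Apply H' = R (H - C) to every joint; joints has shape (..., 3)."""
    return (joints - center) @ rotation.T   # row-vector form of R @ (H - C)

# Hypothetical rotation: 90 degrees about the z-axis.
theta = np.pi / 2
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0,            0.0,           1.0]])
center = np.array([1.0, 2.0, 3.0])          # body center of the first person
joints = np.array([[1.0, 2.0, 3.0],         # a joint located at the body center
                   [2.0, 2.0, 3.0]])        # a joint one unit along the original x-axis
converted = to_human_centric(joints, center, rotation)
```

After the transformation the body center maps to the origin, and the second joint lies one unit along the new y-axis. Both subjects would be transformed with the same C and R, so their relative position is preserved.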

For the NTU RGB + D dataset, the matrices were obtained from all the frames of a skeleton sequence, since each person has m = 25 body joints and every interaction was considered to be performed by two subjects. For single-person actions, we set the second performer's joint coordinates to zeros. Then in each frame there were N = 2 × m joints, so the original image size was 3 × N × T for feature 1 using the part-based arrangement, where T is the length of the sample. For feature 2, we adopted the traversal-based arrangement, and the size of the generated image was 3 × N × T. For feature 3, we chose 16 rather than 25 key body joints for the NTU RGB + D dataset to decrease the amount of calculation and reduce model complexity, so the size of the image was 3 × 256 × T. In order to meet the input requirements of the network, we fixed the length T of the images by scaling the column number of the matrix from t to a fixed t' through a bilinear interpolation scheme. Some early works [26] confirmed that it is better to resize the images to a square size for recognition. Therefore, we resized the images to 3 × 50 × 50, 3 × 50 × 50, and 3 × 256 × 256 for the three features, respectively.
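The resizing step can be illustrated with a small bilinear interpolation routine in NumPy. The corner-aligned sampling grid is an assumption of this sketch (image libraries differ in their conventions), and the sequence length T = 73 is a hypothetical value.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Resize a (C, H, W) array with bilinear interpolation on a corner-aligned grid."""
    _, h, w = img.shape
    ys = np.linspace(0.0, h - 1.0, out_h)   # source row position of each output row
    xs = np.linspace(0.0, w - 1.0, out_w)   # source column position of each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]           # vertical interpolation weights
    wx = (xs - x0)[None, None, :]           # horizontal interpolation weights
    top = img[:, y0][:, :, x0] * (1 - wx) + img[:, y0][:, :, x1] * wx
    bottom = img[:, y1][:, :, x0] * (1 - wx) + img[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

m, T = 25, 73                               # NTU joints per person; hypothetical length T
N = 2 * m                                   # joints of both subjects per frame
ramp = np.tile(np.arange(T, dtype=float), (3, N, 1))   # (3, 50, 73); value = frame index
resized = resize_bilinear(ramp, 50, 50)     # square 3 x 50 x 50 input image
```

On this ramp image the resized columns are evenly spaced samples between frame 0 and frame 72, which shows that the temporal axis is compressed uniformly rather than cropped.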
We used a similar method on the SBU dataset and resized the images to 3 × 30 × 30, 3 × 30 × 30, and 3 × 225 × 225 for the three features, respectively, since the SBU interaction dataset has fewer body joints and shorter interaction sequences.
For the training of the model, the stochastic gradient descent algorithm with Nesterov acceleration and a momentum of 0.8 was adopted for optimization. The initial learning rate was set to 0.01 and decreased by a factor of 0.1 every 25 epochs. The batch size was 64 and the dropout rate was 0.3. Training stopped after 100 epochs. Figures 7-9 show the training loss and test accuracy curves of the best performance of our method, acquired by product fusion, for the NTU cross-subject protocol, the NTU cross-view protocol, and the SBU dataset, respectively. As can be seen, the model converged quickly.
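The learning-rate schedule can be written as a step decay, as in the sketch below. It reads "decreased by a factor of 0.1" as multiplying the rate by 0.1, and it counts epochs from 0; both are assumptions of this example.

```python
def learning_rate(epoch, base_lr=0.01, factor=0.1, step=25):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# One learning rate per epoch for the 100 training epochs.
schedule = [learning_rate(epoch) for epoch in range(100)]
```

Under these assumptions the rate is 0.01 for epochs 0 to 24 and is then reduced tenfold at epochs 25, 50, and 75.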
Figure 9. Training and test curves on the SBU dataset: (a) training loss curve; (b) test accuracy curve.

Results
Experimental results of the three subnets and the ensemble networks on the two datasets are listed in Table 1.
For the NTU RGB + D dataset, all the subnetworks achieved good performance under both the cross-subject and cross-view evaluation protocols. Under the cross-view protocol, our method performed better because there is less variety among action performers.
Furthermore, the human-centric coordinate system could eliminate the influence of different perspectives of actions, which verifies the effectiveness of the coordinate transformation. The best subnetwork performance was achieved by feature 3, because it carries more information. In addition, the score fusion strategy improved the final accuracy by almost 3%, and the product fusion method performed better than sum fusion, which exhibits the effectiveness of our approach.

For the SBU dataset, our three subnetworks and the ensemble network also achieved relatively good performance. The fusion strategy improved the accuracy from 92.25% to 93.58%, which demonstrates the generalization ability of our method. Table 2 shows the comparison between our method and other methods on the SBU dataset. Compared with other methods, including hand-crafted feature-based methods and deep learning methods, our approach achieved comparable performance and was outperformed only by [27,31]. Reference [27] generated different clips from skeleton sequences and proposed a multitask convolutional neural network to learn from the generated clips; it achieved 94.17% accuracy, at the cost of increased computational complexity and time consumption. In [31], Li et al. proposed an end-to-end convolutional co-occurrence feature learning framework with hierarchical aggregation, which can encode the spatial and temporal contextual information simultaneously, and achieved state-of-the-art results. However, our proposed model has fewer layers and thus requires fewer parameters than [31].
Table 2. Performance comparison of different methods on SBU dataset.

Method                          Accuracy
Raw Skeleton [9]                49.70%
Joint Feature [32]              86.90%
CHARM [33]                      83.90%
Hierarchical RNN [11]           80.35%
Deep LSTM [12]                  86.03%
Deep LSTM+Co-occurrence [12]    90.41%
ST-LSTM [13]                    88.60%
ST-LSTM+Trust Gate [13]         93.30%
RotClips+MTCNN [27]             94.17%
HCN [31]                        98.60%
Proposed Method                 93.58%

Table 3 lists the performance comparison of the proposed method with other state-of-the-art approaches on the NTU dataset. Our proposed model achieved accuracies of 82.53% and 91.75% under the cross-subject and cross-view protocols, respectively. Especially under the cross-view evaluation protocol, our method performed better than the others, which demonstrates the effectiveness of the coordinate transformation. Under the cross-subject evaluation protocol, our method also achieved good results; however, there is still a gap to the state-of-the-art methods. One reason is that our method is mainly designed for human interaction recognition, so the features of single-person actions are weakened by the side effect of zero padding, which affects our recognition results.
Table 3. Performance comparison of different methods on NTU RGB + D dataset.

Conclusions
In this paper, we propose an ensemble network for skeleton based interaction recognition. In our model, diverse and complementary features are extracted from the original skeleton data as the inputs of three sub-networks. The three subnets are fused as one ensemble network. To learn different levels of significance of different frames adaptively, we design an attention mechanism based on BiGRU where each frame is assigned a different attention weight in order to emphasize key frames which contain important and discriminative information. Excellent results have been achieved on two widely used datasets and the results have shown that our proposed method is effective for feature extraction and recognition.
However, the proposed method was only evaluated in human action recognition and interaction recognition. In the future, we will focus on multiple-person related group activity.
Author Contributions: Conceptualization, methodology and writing original draft preparation, X.W.; writing-review, editing and supervision, H.D. All authors have read and agreed to the published version of the manuscript.