Integrally Cooperative Spatio-Temporal Feature Representation of Motion Joints for Action Recognition

In contemporary research on human action recognition, most methods consider the movement features of each joint separately, ignoring the fact that a human action results from the integrally cooperative movement of all joints. To address this problem, this paper proposes two action feature representations: the Motion Collaborative Spatio-Temporal Vector (MCSTV) and the Motion Spatio-Temporal Map (MSTM). MCSTV comprehensively considers the integrity and cooperativity of the motion joints by accumulating the weighted motion vectors of the limbs into a new vector that accounts for the movement features of a human action. To describe actions more comprehensively and accurately, we extract the key motion energy through key information extraction based on inter-frame energy fluctuation, project the energy onto three orthogonal axes and stitch the projections in temporal series to construct the MSTM. To combine the advantages of MSTM and MCSTV, we propose Multi-Target Subspace Learning (MTSL), which projects MSTM and MCSTV into a common subspace and makes them complement each other. The results on MSR-Action3D and UTD-MHAD show that our method achieves higher recognition accuracy than most existing human action recognition algorithms.


Introduction
Human action recognition [1] is a research hotspot in the fields of artificial intelligence and pattern recognition. Its research achievements have been applied in many aspects of life [2], such as human-computer interaction, biometrics, health monitoring, video surveillance systems, somatosensory games and robotics [3].
Due to the development of lower-cost depth sensors, depth cameras have been widely used in action recognition. Compared with traditional red-green-blue (RGB) cameras, a depth camera is not sensitive to lighting conditions [4], easily distinguishes background from foreground, and provides human depth data. In addition, human skeletal information can also be obtained from the depth map.
So far, many scholars have used depth sequences for human action recognition. Li et al. [5] selected representative 3D points to depict human actions. Oreifej et al. [6] proposed the histogram of oriented 4D normals (HON4D) to capture the structural information of actions. Yang et al. [7] proposed the depth motion map (DMM) to accumulate the motion energy of neighboring moments.
Since skeletal information can directly describe the movement of human joints, many scholars have begun to use skeleton sequences to recognize human actions. Xia et al. [8] proposed histograms of 3D joints (HOJ3D) as a compact representation of human action. Vemulapalli et al. [9] represented human actions as curves containing skeletal action sequences, modeling the 3D geometric correlations among different body parts with rotations and translations. Luvizon et al. [10] proposed a robust method based on the vector of locally aggregated descriptors (VLAD) algorithm and a clustering library, which extracts spatio-temporal local feature sets from joint subgroups to represent human actions.
In the field of action recognition, using single-modal data is one-sided, and it is necessary to integrate data from different modalities for comprehensive decision-making. Many fusion methods have been successfully applied to action recognition, among which feature fusion is the most widespread. Canonical Correlation Analysis (CCA) [11] and Discriminant Correlation Analysis (DCA) [12] are commonly used feature fusion methods.
Although significant progress has been made in human action recognition, shortcomings remain. Many studies perform feature extraction [13] and recognition on individual motion joints independently, without considering the comprehensiveness, integrity and collaboration between the joints. For example, in throwing, the main moving part of the body is the right hand. If we only consider the movement features of the right hand, this action is likely to be recognized as waving. If we combine the movements of the left hand and both legs with those of the right hand, the possibility of the action being correctly recognized as throwing is greatly enhanced. To address this problem, this paper proposes action feature representations that consider the integrally cooperative movement features of human action: the Motion Collaborative Spatio-Temporal Vector (MCSTV) and the Motion Spatio-Temporal Map (MSTM). MCSTV reflects the integrity and cooperativity of human action by accumulating the weighted motion vectors of the limbs. MSTM accurately describes the spatial structure [14] and temporal information [15] of actions: we extract key motion energy through key information extraction based on inter-frame energy fluctuation, project the key energy of the body onto three orthogonal axes and stitch the projections in temporal series to form three-axis MSTMs. To exploit the advantages of both MSTM and MCSTV, we propose Multi-Target Subspace Learning (MTSL). MTSL projects MSTM and MCSTV into a common subspace, enlarges the inter-class distance between samples of different categories and reduces the dimension of the projection target area by constructing multiple projection target centers for the samples of each category. The workflow of our method is shown in Figure 1.
Our contributions are as follows:
• Motion Collaborative Spatio-Temporal Vector, a feature representation method that comprehensively considers the integrity and cooperativity of human action.
• Motion Spatio-Temporal Map, a feature representation method that fully preserves the spatial structure and temporal information of human action.
• Key information extraction based on inter-frame energy fluctuation, a method that extracts the key information in motion energy.
• Multi-Target Subspace Learning, which is used to fuse MCSTV and MSTM.
This paper is organized as follows. Section 2 briefly reviews related work. Section 3 describes the method in detail. Section 4 presents the experimental results and discussion. Finally, Section 5 summarizes the paper.

Action Recognition Based on Skeleton Approach
Human action recognition based on the skeleton approach is a research hotspot. Lv et al. [16] decomposed the high-dimensional 3D joint space into a set of feature spaces, then used a hidden Markov model (HMM) combined with the multi-class AdaBoost (AdaBoost.M2) algorithm and dynamic programming to segment and recognize actions. Hussein et al. [17] presented an action recognition algorithm based on 3D skeleton sequences extracted from depth data, using the covariance matrix of skeleton joint locations over time as a discriminative descriptor of a sequence. Zanfir et al. [18] proposed a non-parametric Moving Pose (MP) framework for human action recognition, which captures both pose information and the speed and acceleration of human joints, and finally used a modified K-Nearest Neighbor (KNN) classifier with these descriptors to classify human actions. However, these methods do not consider the comprehensiveness, integrity and collaboration between the joints.

Motion Energy Image and Motion History Image
In the early stage of human action recognition research, various methods were used to extract features from color video collected by RGB cameras. Bobick et al. [19] proposed the Motion Energy Image (MEI) and the Motion History Image (MHI). The MEI algorithm first extracts the foreground area of the movement, binarizes the foreground region to obtain a binary image sequence of the action, and finally takes the union of the binary image sequence to obtain the MEI of the action. The calculation of MEI is expressed as:

MEI_δ(x, y, t) = ∪_{i=0}^{δ−1} B(x, y, t − i)

where MEI_δ(x, y, t) denotes the MEI generated from δ images up to the t-th frame of the video sequence, B(x, y, t) denotes the t-th frame of the binary image sequence, x and y respectively denote the height and width coordinates of a pixel, and t is the frame index. The MEI describes the largest contour of an action; it loses motion information inside the contour and cannot express the temporal information of the action.
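The union above can be sketched in a few lines of NumPy; the function name and array layout are illustrative, not from the paper:

```python
import numpy as np

def motion_energy_image(binary_seq, delta, t):
    """Union of the delta binary silhouettes ending at frame t (0-indexed).

    binary_seq: array of shape (num_frames, H, W) with values in {0, 1}.
    """
    frames = binary_seq[max(0, t - delta + 1):t + 1]
    # Logical OR over the temporal axis implements the union.
    return np.any(frames, axis=0).astype(np.uint8)
```

Any pixel that was foreground in at least one of the last δ frames is set in the MEI, which is exactly why the interior motion ordering is lost.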
Unlike MEI, MHI is a grayscale image whose pixel intensity is a function of the temporal history of motion at that point. A simple replacement-and-decay operator can be used:

MHI_τ(x, y, t) = τ, if B(x, y, t) = 1; max(0, MHI_τ(x, y, t − 1) − 1), otherwise

where τ is the initial brightness and MHI_τ(x, y, t) is the MHI generated from the first t frames of the video sequence. Compared with MEI, MHI retains part of the temporal information of an action through brightness attenuation. However, MHI still cannot fully express the spatial structure.
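A minimal sketch of the replacement-and-decay update, assuming the input is a list of binary silhouettes (names are illustrative):

```python
import numpy as np

def motion_history_image(binary_seq, tau):
    """Replacement-and-decay MHI: moving pixels reset to tau, others fade by 1."""
    mhi = np.zeros_like(binary_seq[0], dtype=np.float32)
    for frame in binary_seq:
        mhi = np.where(frame == 1, float(tau), np.maximum(0.0, mhi - 1.0))
    return mhi
```

Because recent motion is brighter, playing a sequence forwards or backwards yields different MHIs, which is what gives MHI its (partial) temporal discrimination.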

Action Recognition Based on Depth Approach
With the development of depth sensors, scholars began to use depth sequences to study human action recognition. Numerous scholars [20] use the depth motion map (DMM). Each frame in the depth sequence is projected onto three orthogonal Cartesian planes to form three 2D projection maps corresponding to the front, side and top views, denoted map_f, map_s and map_t. The motion energy of consecutive projection images is accumulated to form the DMM of each view. The calculation of DMM is expressed as:

DMM_v = Σ_{i=1}^{N−1} |map_v^{i+1} − map_v^{i}|

where v ∈ {f, s, t} denotes the projection view (f front, s side, t top), DMM_v is the DMM of view v, map_v^{i+1} − map_v^{i} denotes the image difference, i.e., the motion energy, between the i-th and the (i+1)-th frames, N is the number of frames of the depth sequence, and ε is a difference threshold below which pixel differences are discarded.
Compared with MEI, DMM fully reflects the depth information of actions. However, DMM still cannot express temporal information.
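A minimal sketch of this accumulation for one view, assuming the projections are given as a list of 2D arrays; zeroing differences at or below ε is our reading of the threshold, not a quote of the paper's formula:

```python
import numpy as np

def depth_motion_map(proj_maps, eps=0.0):
    """Accumulate thresholded inter-frame differences of one view's projections."""
    dmm = np.zeros_like(proj_maps[0], dtype=np.float32)
    for a, b in zip(proj_maps[:-1], proj_maps[1:]):
        diff = np.abs(b.astype(np.float32) - a.astype(np.float32))
        diff[diff <= eps] = 0.0  # discard differences below the threshold
        dmm += diff
    return dmm
```

Since addition is commutative, reversing the frame order leaves the DMM unchanged, which illustrates why DMM carries no temporal information.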
Yang et al. [7] computed Histograms of Oriented Gradients (HOG) from DMM as the representation of an action sequence. Chen et al. [20] extracted Local Binary Patterns (LBP) from DMM, and Chen et al. [21] extracted Gradient Local Auto-Correlations (GLAC) from DMM. Zhang et al. [22] presented 3D histograms of texture (3DHoTs) to extract discriminative features from depth video sequences. Wang et al. [23] extracted random occupancy pattern (ROP) features from depth video sequences and used a sparse coding approach to encode them. Xia et al. [24] built depth cuboid similarity features (DCSF) around the local spatio-temporal interest points (STIPs) extracted from depth video sequences to describe local 3D depth cuboids. Vieira et al. [25] presented Space-Time Occupancy Patterns (STOP) to preserve spatial and temporal information between space-time cells.

Feature Fusion
Feature fusion allows individual features to complement each other and improves recognition accuracy, and many achievements have been made in this field. Hardoon et al. [11] proposed Canonical Correlation Analysis (CCA), which maximizes the correlation between two different features. Zhang et al. [26] proposed group sparse canonical correlation analysis, which preserves the group-sparse characteristics of the data within each set in addition to maximizing the global covariance. Haghighat et al. [12] proposed DCA, which performs feature fusion by maximizing the pairwise correlations across the two feature sets while eliminating inter-class correlations and restricting the correlations to be within classes. Wang et al. [27] proposed Joint Feature Selection and Subspace Learning (JFSSL) for cross-modal retrieval, in which multimodal features are projected into a common subspace by learning projection matrices. However, these methods generally describe the correlation between different features and do not consider the correlation between samples of different categories.

Method
As shown in Figure 1, we extract MCSTV from skeleton sequences and MSTM from depth video sequences. Next, we extract the Histogram of Oriented Gradients (HOG) features of MSTM. Then, we fuse MCSTV and MSTM through MTSL and use the fused features for human action recognition.

Motion Collaborative Spatio-Temporal Vector
Human action is the result of the integrally cooperative movement of all joints, not the individual movement of some joints. Therefore, this paper proposes the Motion Collaborative Spatio-Temporal Vector, which considers the integrally cooperative movement of the limbs' joints.
Most actions are the result of multiple joints moving together. In this paper, the scattered, separate motion information of the joints is spliced by vector superposition to form a comprehensive vector that describes the action and highlights the integrity and cooperativity of the movement. This comprehensive vector is called MCSTV, and its basic principle is shown in Figure 2. Since MCSTV is stitched from the limbs' motion vectors, it reflects, to a certain degree, the comprehensive effect of multiple motion vectors. The human skeleton is shown in Figure 3a, and the skeletal frames of the high wave are shown in Figure 3b. As shown in Figure 3b, we select the motion vector from the spine to the left hand to represent the motion of the left upper limb, that from the spine to the right hand for the right upper limb, that from the spine to the left foot for the left lower limb, and that from the spine to the right foot for the right lower limb, denoted SLH, SRH, SLF and SRF, respectively.

For different actions, the contribution degree of each joint is different. In Figure 2, the motion vectors of the limbs are accumulated directly; because the contribution degree of each limb differs, direct accumulation cannot describe the action accurately, so we must first obtain the contribution degree of each limb. These motion vectors are three-dimensional, and the change of a vector in space is the result of its changes in multiple views. Therefore, the motion vectors are projected onto the three orthogonal Cartesian planes xoy, yoz and xoz, and the offset of SLH on each plane is computed from these projections, where θ_xy^SLH(i), θ_yz^SLH(i) and θ_xz^SLH(i) respectively denote the offsets of the i-th frame of SLH on xoy, yoz and xoz.
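The exact per-plane offset formula is given in the paper's equations; as a rough illustration only, one plausible way to measure a per-frame offset on a plane is the angle between the projections of consecutive frames. Everything below (function names, the arccos-based angle) is our assumption, not the paper's definition:

```python
import numpy as np

def plane_projections(v):
    """Project a 3D vector onto the xoy, yoz and xoz planes."""
    x, y, z = v
    return {"xoy": np.array([x, y]),
            "yoz": np.array([y, z]),
            "xoz": np.array([x, z])}

def angular_offset(p, q, eps=1e-12):
    """Angle (radians) between two 2D projections from consecutive frames."""
    cos = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + eps)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```

Summing such per-plane offsets over frames then gives a scalar measure of how much each limb vector moves, which is what the contribution degree below is built from.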
The offset of SLH in each frame is the sum of its offsets on the three orthogonal planes:

θ_SLH(i) = θ_xy^SLH(i) + θ_yz^SLH(i) + θ_xz^SLH(i)

where θ_SLH(i) denotes the offset of the i-th frame of SLH. Each action consists of N frames, so the total offset of SLH is the sum of the offsets of all frames:

sum_SLH = Σ_{i=1}^{N} θ_SLH(i)

where sum_SLH denotes the total offset of SLH. Similarly, according to Equations (4)-(6), the total offsets of SRH, SLF and SRF are obtained as sum_SRH, sum_SLF and sum_SRF. The contribution degree of each limb is the ratio of that limb's offset to the total offset of all limbs, e.g.,

W_SLH = sum_SLH / (sum_SLH + sum_SRH + sum_SLF + sum_SRF)

where W_SLH, W_SRH, W_SLF and W_SRF respectively denote the contribution degrees of the four limbs. Finally, the contribution degree of each limb is used to weight its motion vector in each frame, and the weighted motion vectors of the limbs are accumulated to form the MCSTV:

MCSTV(i) = W_SLH·SLH(i) + W_SRH·SRH(i) + W_SLF·SLF(i) + W_SRF·SRF(i)
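The weighting scheme above can be sketched as follows; the dictionary keys and array shapes are illustrative:

```python
import numpy as np

def contribution_degrees(total_offsets):
    """Each limb's share of the total offset, e.g. {'SLH': 1.0, 'SRH': 3.0, ...}."""
    total = sum(total_offsets.values())
    return {limb: off / total for limb, off in total_offsets.items()}

def mcstv(limb_vectors, weights):
    """Weighted accumulation of per-frame limb motion vectors.

    limb_vectors: dict limb -> array of shape (N, 3), one 3D vector per frame.
    """
    return sum(weights[limb] * limb_vectors[limb] for limb in limb_vectors)
```

A limb that sweeps through a larger total offset receives a larger weight, so the resulting MCSTV is dominated by the main moving limb, as Figure 4 shows.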
The MCSTV obtained by weighted accumulation of each limb's motion vector is shown in Figure 4. As can be seen from the figure, after weighted accumulation the MCSTV of this action is dominated by SRH. Compared with the direct accumulation of motion vectors in Figure 2, this method more directly reflects the major motion joints of the action.

Motion Spatio-Temporal Map
To describe action information more comprehensively and accurately, this paper proposes a feature representation algorithm called the Motion Spatio-Temporal Map (MSTM), which can completely express spatial structure and temporal information. The algorithm computes the difference between adjacent frames of the depth sequence to obtain the motion energy. Next, key information is extracted from the motion energy by key information extraction based on inter-frame energy fluctuation. Then, the key energy is projected onto three orthogonal axes to obtain motion energy lists for the three axes. Finally, the motion energy lists are spliced in temporal series to form the MSTM. The flow of MSTM construction is shown in Figure 5.

As shown in Figure 5, the motion energy of the action is obtained through the difference between two adjacent frames of the depth sequence:

E_k(x, y, z) = |I_{k+1}(x, y, z) − I_k(x, y, z)|

where I_k(x, y, z) and I_{k+1}(x, y, z) respectively represent the body energy at the k-th and (k+1)-th moments, i.e., the k-th and (k+1)-th frames of the depth video sequence, and E_k(x, y, z) represents the motion energy at the k-th moment. Due to the habitual shaking of some joints, the motion energy obtained by Equation (9) contains much redundancy. To address this problem, we propose an energy selection algorithm, i.e., key information extraction based on inter-frame energy fluctuation, and use it to remove the redundancy of the motion energy at each moment. The main idea is to divide the human body into four areas according to the height and width ranges at the initial moment of the action, calculate the proportion of the motion energy of each region in the whole body, select a certain amount of motion energy as the main energy, and remove the remaining energy as redundancy.
The detailed steps of the algorithm are as follows. First, we calculate the height range [h_min, h_max] and width range [w_min, w_max] of the human activity area at the initial moment of the action. The body is divided into upper and lower halves at the hip center, and into left and right halves by symmetry, yielding four regions: left upper body (LU), right upper body (RU), left lower body (LL) and right lower body (RL). The motion energy of the body is the sum of the motion energy of the four regions; the division of the activity area is shown in Figure 6. Next, we calculate the proportion of each region's motion energy in the whole body:

η_ψ = Σ_{x=1}^{W_ψ} Σ_{y=1}^{H_ψ} E_ψ(x, y) / Σ_{x=1}^{W} Σ_{y=1}^{H} E(x, y)

where ψ ∈ {LU, RU, LL, RL}, η_ψ represents the motion energy proportion of region ψ, H and W respectively represent the height and width of the whole body, and H_ψ and W_ψ the height and width of region ψ. Then, we rank the motion energy proportions of the regions from largest to smallest, denoting the maximum as η_1 and the minimum as η_4, so that η_1 > η_2 > η_3 > η_4. We select a proportion ξ of the whole-body energy as the key energy; the remaining energy is considered redundant and removed from the original motion energy. The value of ξ is determined by the experimental results and recognition accuracies in Section 4.2.1. The selection of key energy follows the criteria below.
If η_1 > ξ, the motion energy of the region corresponding to η_1 is retained as the key energy, and that of the other three regions is considered redundant. If η_1 < ξ and η_1 + η_2 > ξ, the motion energy of the regions corresponding to η_1 and η_2 is retained, and that of the other two regions is considered redundant. If η_1 + η_2 < ξ and η_1 + η_2 + η_3 > ξ, the motion energy of the regions corresponding to η_1, η_2 and η_3 is retained, and that of η_4 is considered redundant. If none of the above conditions is met, the whole-body motion energy is retained as the key energy.
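The four criteria reduce to a greedy rule: keep the highest-proportion regions until their cumulative share exceeds ξ. A sketch under that reading (the function name is ours; region names follow the paper):

```python
def select_key_regions(region_energy, xi=0.8):
    """Greedily keep highest-energy regions until their cumulative share exceeds xi.

    region_energy: dict such as {'LU': ..., 'RU': ..., 'LL': ..., 'RL': ...}.
    Returns the list of region names whose energy is kept as key energy.
    """
    total = sum(region_energy.values())
    kept, share = [], 0.0
    for name, energy in sorted(region_energy.items(), key=lambda kv: -kv[1]):
        kept.append(name)
        share += energy / total
        if share > xi:
            break
    return kept
```

When no partial sum exceeds ξ before the last region, the loop keeps all four regions, matching the paper's fall-through case.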
The key energy is projected onto three orthogonal Cartesian planes to form three 2D projection maps according to the front, side and top views, denoted map_f, map_s and map_t. The 2D projection maps are expressed as map_f(x, y) = E(x, y), map_s(x, z) = E(x, z) and map_t(y, z) = E(y, z), where (x, y), (x, z) and (y, z) respectively denote the coordinates of a pixel on map_f, map_s and map_t, and E(x, y), E(x, z) and E(y, z) respectively denote the value of that pixel.
To obtain the energy distribution along the width, height and depth axes, map_f and map_t are further projected onto the corresponding orthogonal axes by taking row or column sums of the 2D energy projection maps. Three 1D motion energy lists are generated along the width, height and depth axes, denoted M_w, M_h and M_d respectively, where M_w(j), M_h(j) and M_d(j) denote the j-th element of the energy list on the width, height and depth axes, and W_m and H_m denote the width and height of the 2D energy projection map. In temporal order, the per-frame lists M_u are spliced to form the MSTM of each of the three axes, represented as MSTM_w, MSTM_h and MSTM_d. For a depth sequence of N frames, the calculation of MSTM is expressed as:

MSTM_u(k) = M_u^k, k = 1, 2, ..., N

where u ∈ {w, h, d} (w is the width axis, h the height axis and d the depth axis), M_u^k represents the 1D motion energy list of the k-th frame of the action sequence on axis u, MSTM_u represents the MSTM on axis u, and MSTM_u(k) represents the k-th row of MSTM_u.
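Assuming each frame's key energy has already been projected to a 2D map, stitching the per-frame 1D lists into an MSTM can be sketched as below (names and the axis convention are illustrative):

```python
import numpy as np

def mstm_axis(energy_maps, axis):
    """Build one MSTM: each frame's 2D energy map is summed along `axis`
    to a 1D energy list, and the lists are stacked in temporal order
    (one row of the MSTM per frame)."""
    rows = [m.sum(axis=axis) for m in energy_maps]
    return np.stack(rows, axis=0)
```

Because row k of the result comes from frame k, reversing the frame order flips the image vertically, which is what lets MSTM distinguish positive-order from reverse-order actions.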
With the maximum width W_max, minimum width W_min, maximum height H_max and minimum height H_min of the body's activity area as the bounds, the MSTM is processed with a region of interest (ROI) [28], i.e., the image is cropped and normalized.
The actions in the original depth sequence are defined as positive-order actions; the positive-order high throw is shown in Figure 7a. Actions whose order is the reverse of the original depth sequence are defined as reverse-order actions; the reverse-order high throw is shown in Figure 7b. The various feature maps of the positive- and reverse-order high throw are shown in Figure 8. MHI retains part of the temporal information of an action and can distinguish positive-order from reverse-order actions. However, due to the coverage of the trajectory and the absence of depth information, MHI cannot fully express the spatial information. Figure 8c,d show the MEI of the positive- and reverse-order high throw, respectively, and Figure 8g,h show the corresponding DMM. MEI and DMM do not involve temporal information, so the two orders cannot be distinguished. MEI does not involve depth information, so its spatial information is incomplete; DMM contains depth information and expresses spatial information fully.

Feature Fusion
To make the description of action information more accurate, fused features are usually used in human action recognition. Therefore, this paper fuses the skeleton feature MCSTV and the image feature MSTM, which not only reflects the integrity and cooperativity of the action, but also expresses the spatial structure and temporal information more completely.
Let {Γ_1, Γ_2, ..., Γ_N} denote N action samples, where the i-th sample Γ_i = {x_i^1, x_i^2, ..., x_i^M} contains features from M different modalities that correspond to the same representation y_i in a common space. Here x_i denotes a sample of the i-th category, y_i is the target projection center of the i-th category, and M is the number of modalities.
In this paper, we propose Multi-Target Subspace Learning (MTSL) to learn a common subspace for different features. The minimization problem is as follows: where U_p, p = 1, 2, ..., M, is the projection matrix of the p-th modality, X_p contains the sample features of the p-th modality before projection, X_p^T U_p contains the sample features of the p-th modality after projection, Y = {y_1, y_2, ..., y_N}^T is the primary target projection matrix in the subspace, L is the number of categories, λ_1 and λ_2 are weighting parameters, and G_c is the c-th auxiliary projection-target-center matrix for the samples of each category. An auxiliary projection target center is the symmetric point of another category's projection target center with respect to the projection target center of the current category. The selection of G_c is shown in Algorithm 1.
In Equation (14), the first term learns the projection matrices, the second term performs feature selection, and the third term enlarges the inter-class distance between samples of different categories and reduces the dimension of the projection target area.
Following the analysis of the l_21-norm by He et al. [29], the second term is optimized: ||U_p||_21 is relaxed to Tr(U_p^T R_p U_p), where R_p = Diag(r_p) and r_p is an auxiliary vector of the l_21-norm whose i-th entry is r_p^i = 1 / (2||u_p^i||_2), with u_p^i the i-th row vector of U_p. To keep the denominator from being 0, an infinitesimal α > 0 is introduced, and r_p^i is rewritten as:

r_p^i = 1 / (2 sqrt(||u_p^i||_2^2 + α))

Differentiating Equation (14) then yields the computational formula of the projection matrix, Equation (16). The projection matrices of the different modalities are obtained through Equation (16), and the test samples of each modality are projected into the common subspace to acquire the fused features X_p^T U_p, which are used for human action recognition.
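The reweighting step can be sketched directly from the definition of r_p (α plays the role of the small smoothing constant). A useful sanity check, used in the assertion below, is that for small α, Tr(U^T R U) ≈ ½ ||U||_2,1:

```python
import numpy as np

def l21_reweight(U, alpha=1e-8):
    """Diagonal matrix R with entries 1 / (2 * sqrt(||u_i||^2 + alpha)),
    the surrogate used so that Tr(U^T R U) stands in for ||U||_21."""
    row_norms = np.sqrt((U ** 2).sum(axis=1) + alpha)
    return np.diag(1.0 / (2.0 * row_norms))
```

In an alternating scheme, R_p is recomputed from the current U_p, then U_p is updated with R_p held fixed, which is the standard way such l_21 surrogates are handled.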

Experiment
In this paper, the experiments run on a desktop with the following hardware: an X370 Taichi motherboard, a 3.4 GHz R7 1700X CPU, a GTX 1660 graphics card and 16 GB of memory; the software environment is Python 3. The proposed method was tested on Microsoft Research Action3D (MSR-Action3D) [5] and the University of Texas at Dallas Multimodal Human Action Dataset (UTD-MHAD) [30]. A support vector machine (SVM) is used as the classifier to obtain the final recognition accuracy. The dimension of the skeleton feature MCSTV is related to the number of frames N, so we use the Fisher Vector (FV) [31] to normalize MCSTV and make it linearly separable. After processing, the size of MCSTV becomes 2pK × 1, where p is the number of rows of MCSTV and K is the number of clustering centers of the FV. In this paper, p = 3 and K = 128, so the size of MCSTV is 768 × 1.

MSR-Action3D
There are 557 depth sequences in MSR-Action3D. The dataset includes 20 actions performed by 10 subjects, and each subject performs each action 2 to 3 times. The 20 actions are high wave, horizontal wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, and pick up throw. The positive-order actions of the dataset are denoted D1; the positive-order and reverse-order actions together are denoted D2. Two different settings are employed on this dataset. Setting 1. Similar to Li et al. [5], D1 and D2 are divided into 3 groups (AS1, AS2, AS3); the MSR-Action3D subsets are shown in Table 1. Actions with high similarity are placed in the same group. To evaluate how model performance changes with the training set size, each group of samples undergoes three experiments. In Test1, 1/3 of the samples are used for training and the rest for testing. In Test2, 2/3 of the samples are used for training and the rest for testing. In Test3, half of the subjects are used for training and the rest for testing. Setting 2. Similar to Chen et al. [20], all samples in MSR-Action3D are classified at the same time: the samples of subjects 1, 3, 5, 7 and 9 are used for training, and those of subjects 2, 4, 6, 8 and 10 for testing.

UTD-MHAD
There are 861 depth sequences in UTD-MHAD. The dataset includes 27 actions performed by 8 subjects (4 females and 4 males), and each subject performs each action 4 times. The 27 actions are swipe left, swipe right, wave, clap, throw, arm cross, basketball shoot, draw x, draw circle clockwise, draw circle counter-clockwise, draw triangle, bowling, boxing, baseball swing, tennis swing, arm curl, tennis serve, push, knock, catch, pickup throw, jog, walk, sit2stand, stand2sit, lunge and squat. The positive-order actions of the dataset are denoted D3; the positive-order and reverse-order actions together are denoted D4. Two different settings are employed on this dataset.
Setting 3. To evaluate how model performance changes with the training set size, the samples in D3 and D4 undergo three experiments. In Test1, 1/3 of the samples are used for training and the rest for testing. In Test2, 2/3 of the samples are used for training and the rest for testing. In Test3, we divide the samples into 5 parts and take turns using 4 parts for training and 1 part for testing; the final result is the average of the 5 recognition rates.
Setting 4. Similar to Xu et al. [30], all samples in UTD-MHAD are classified at the same time: the samples of subjects 1, 3, 5 and 7 are used for training, and those of subjects 2, 4, 6 and 8 for testing.

Parameter Selection of ξ
In the calculation of MSTM, we must use the key information extraction based on inter-frame energy fluctuation algorithm to extract the key information in the motion energy. The amount of key information directly affects the descriptive ability of MSTM. Therefore, we need to set an appropriate ξ that removes redundant information while completely retaining key information. The key motion energy retained under different settings of ξ is shown in Figure 9. Figure 9a shows the key motion energy of the tennis serve retained when ξ = 50%: the value of ξ is too small, much key information is mistaken for redundancy, and the possibility of this action being recognized as a tennis serve is reduced. Figure 9b shows the key motion energy retained when ξ = 80%; compared with the original motion energy in Figure 9c, it not only retains the key information but also removes the energy of areas where the motion is not obvious. Figure 9d shows the recognition rate of MSTM-HOG under different values of ξ according to Setting 2.
It can be seen that MSTM-HOG achieves the highest recognition rate when ξ = 80%, so we set ξ to 80% in the following experiments.

Selection of Image Features
Due to excessive noise, directly classifying the proposed MSTM degrades the recognition results. In this paper, we select the Histogram of Oriented Gradients (HOG) [32] and Local Binary Pattern (LBP) [33] operators for image feature extraction because they are insensitive to lighting. The HOG operator segments the image into cells of 10 × 10 pixels, combines every 2 × 2 cells into a block, and slides the block with a stride of 10 pixels to extract the HOG features of the image. The LBP operator uses a sampling radius of 2 and 8 sampling points to extract the LBP features of the image. According to Setting 1, the results on D1 when extracting HOG and LBP features from MSTM are shown in Figure 10. As shown in Figure 10, the recognition accuracy of MSTM with HOG features is higher than with LBP. LBP features mainly reflect the texture information around a pixel, whereas HOG features capture the image contour, and the main information of MSTM is distributed along the contour. Therefore, HOG features are more suitable for MSTM than LBP, and in the following experiments we only extract the HOG features of the images.

Parameter Selection of λ 1 and λ 2
When using MTSL to fuse image and skeleton features, we must select appropriate parameters λ1 and λ2. Fixing λ2 = 0.01, the optimal λ1 is selected by enumerating different values of λ1 and taking the recognition accuracy of our method as the evaluation criterion. As can be seen from Figure 11a, our method achieves the best result when λ1 = 15. To obtain the optimal λ2, we enumerate different values of λ2 with λ1 = 15, again taking the recognition accuracy as the evaluation criterion. As can be seen from Figure 11b, our method achieves the best result when λ2 = 0.05. In this experiment, the results are the recognition rates of our method on MSR-Action3D according to Setting 2.

As shown in Figure 12a,b, MCSTV achieves the highest recognition rate on both datasets: 77.42% on MSR-Action3D and 74.5% on UTD-MHAD, both higher than the other movement features.
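The coordinate-wise enumeration used above to pick λ1 and λ2 can be sketched as follows; `evaluate` is a hypothetical callback returning the recognition accuracy for a given parameter pair:

```python
def select_lambdas(evaluate, l1_grid, l2_grid):
    """Coordinate-wise enumeration: fix lambda2 at an initial guess,
    sweep lambda1, then sweep lambda2 at the best lambda1.
    `evaluate(l1, l2)` returns an accuracy to maximize."""
    l2_init = l2_grid[0]                  # e.g. 0.01 as in the text
    best_l1 = max(l1_grid, key=lambda l1: evaluate(l1, l2_init))
    best_l2 = max(l2_grid, key=lambda l2: evaluate(best_l1, l2))
    return best_l1, best_l2
```

This one-pass scheme is cheaper than a full grid search but assumes the two parameters are only weakly coupled.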
Next, we compare the variation of the various movement features along the three axes. We select actions whose main moving limb is the right hand and compare the expressive effect of SRH, MCSTV1, and MCSTV2, where MCSTV1 is formed by direct accumulation of each limb's motion vectors and MCSTV2 is formed by weighted accumulation of each limb's motion vectors. We select high wave and tennis serve from MSR-Action3D, and baseball swing from UTD-MHAD. The various movement features of each action are shown in Figure 13.
The main moving limb of high wave is the right upper limb, so the trajectory of MCSTV should be similar to SRH. Tennis serve and baseball swing are actions in which every limb moves noticeably, with the right upper limb as the main moving limb, so the final MCSTV trajectory should be dominated by SRH. As shown in Figure 13, the trajectory of MCSTV2 is closer to SRH than that of MCSTV1. In particular, the trajectory of baseball swing's MCSTV2 is similar to SRH, whereas that of MCSTV1 is not. This indicates that the MCSTV formed by weighted accumulation is more accurate and highlights the main moving limbs. We then verify that the MCSTV formed by weighted accumulation describes actions more accurately, using the recognition rate as the criterion to compare MCSTV2 with SRH and MCSTV1. The results on MSR-Action3D are shown in Figure 14a, and the results on UTD-MHAD are shown in Figure 14b.
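The difference between direct and weighted accumulation can be sketched as below. This is a minimal illustration: the input shape and the fixed per-limb weights are assumptions, and the paper's actual contribution-based weighting scheme is not reproduced:

```python
import numpy as np

def mcstv(limb_vectors, weights=None):
    """Accumulate per-limb motion vectors into one collaborative vector
    per frame. With weights=None each limb contributes equally (MCSTV1
    style); passing contribution weights gives the MCSTV2-style weighted
    accumulation that emphasizes the main moving limb."""
    limb_vectors = np.asarray(limb_vectors, float)   # (limbs, frames, 3)
    n = limb_vectors.shape[0]
    if weights is None:
        weights = np.ones(n)                         # direct accumulation
    w = np.asarray(weights, float).reshape(n, 1, 1)
    return (w * limb_vectors).sum(axis=0)            # (frames, 3)
```

Down-weighting limbs with small contributions is what keeps the resulting trajectory close to the main limb's trajectory (SRH in the examples above).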
The data in Figure 14a,b are the recognition rates of the various features over 20 runs. The figure shows the mean, maximum, minimum, median, outliers, and upper and lower quartiles of the results. The two figures show that the recognition rate of MCSTV2 is higher than that of MCSTV1 and slightly higher than that of SRH. The main reason is that the MCSTV formed by weighted accumulation not only considers the motion vector of each limb, but also weights each limb according to its contribution, which highlights the information of the main moving limbs and describes actions more accurately.

Evaluation of MSTM
MSTM expresses the change of motion energy over time on three orthogonal axes, retaining the spatial structure and temporal information of actions. To verify that MSTM completely retains the spatial structure of actions, the recognition rate of MSTM is compared with MEI, MHI, and DMM when only positive-order actions exist. In this experiment, according to Setting 1, the results of different methods on D1 are shown in Table 2; according to Setting 3, the results of different methods on D3 are shown in Table 3. Tables 2 and 3 show that MSTM-HOG achieves or approaches the highest recognition accuracy in most tests. The reason is that MSTM represents the projection of motion energy onto three orthogonal axes, which completely preserves the spatial structure of actions. By contrast, the recognition accuracy of MEI and MHI is lower in multiple tests. Because MEI and MHI describe only the largest contour of an action, motion information in front of and behind the contour overlaps, and the motion information inside the contour is lost. In addition, MEI and MHI do not involve the depth information of actions. DMM achieves the highest recognition accuracy in some tests, mainly because it accumulates the motion energy from three views and completely retains the spatial information.
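The construction described above, projecting each frame's motion energy onto the three orthogonal axes and stitching the projections in temporal order, can be sketched as follows. This assumes the per-frame energy is available as a 3D volume; the paper's exact projection geometry may differ:

```python
import numpy as np

def mstm(energy_seq):
    """Motion Spatio-Temporal Map sketch: project each frame's 3D motion
    energy onto the X, Y, and Z axes and stack the 1D projections along
    the temporal axis, yielding one 2D map (time x axis bins) per axis."""
    maps = {axis: [] for axis in "xyz"}
    for vol in energy_seq:
        vol = np.asarray(vol, float)
        maps["x"].append(vol.sum(axis=(1, 2)))   # project onto X
        maps["y"].append(vol.sum(axis=(0, 2)))   # project onto Y
        maps["z"].append(vol.sum(axis=(0, 1)))   # project onto Z
    return {k: np.stack(v) for k, v in maps.items()}
```

One image dimension of each map is spatial and the other is time, which is why the maps preserve both the spatial structure and the temporal order of the action.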
Next, we verify that MSTM has a strong ability to represent temporal information: the recognition rate of MSTM is compared with MEI, MHI, and DMM when both positive- and reverse-order actions exist. According to Setting 1, the results of different methods on D2 are shown in Table 4; according to Setting 3, the results of different methods on D4 are shown in Table 5. In this paper, positive-order high throw and reverse-order high throw are considered two different actions: they have the same spatial trajectory but opposite temporal order. Hence the number of actions in D2 is twice that in D1, and the number of actions in D4 is twice that in D3. Tables 4 and 5 show that the recognition rate of each method is lower than in Tables 2 and 3, while MSTM maintains the highest recognition rate in all tests. The main reason is that MSTM stitches the motion energy in temporal series, which fully expresses the temporal information of actions. The MSTMs of positive- and reverse-order actions are symmetric along the time axis, so the two actions can be accurately classified. By contrast, the recognition rates of MEI and DMM are lower in every test because they cannot express temporal information: the MEI and DMM of positive- and reverse-order actions are very similar and cannot be distinguished. MHI expresses part of the temporal information through brightness attenuation, so its recognition rate is higher than that of MEI and DMM, but far lower than that of MSTM.
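Why temporal stitching separates reversed actions while order-free accumulation cannot can be shown with a toy example. The 1D "projections" below are illustrative stand-ins, not real MSTM or DMM data:

```python
import numpy as np

# Toy per-frame 1D projections: rows = time, columns = spatial bins.
seq = np.array([[1., 0., 0.],
                [0., 1., 0.],
                [0., 0., 1.]])

def stitch(frames):
    """MSTM-style map: stack projections in temporal order."""
    return np.stack(frames)

def accumulate(frames):
    """DMM-style map: order-free accumulation of inter-frame energy."""
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)

fwd, rev = seq, seq[::-1]
# stitch(fwd) and stitch(rev) are mirror images along the time axis,
# so a classifier can tell them apart; accumulate(fwd) and
# accumulate(rev) are identical, so the order is lost.
```

This mirrors the observation that positive- and reverse-order MSTMs are time-axis symmetric but distinct, whereas the corresponding DMMs coincide.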

Evaluation of Feature Fusion
MCSTV accurately describes the integrity and cooperativity of human limbs, and MSTM completely records the spatial structure and temporal information of actions. To combine the advantages of MSTM and MCSTV, we use MTSL to fuse MCSTV and MSTM-HOG. To prove that the fused feature describes actions more accurately, we compare the recognition accuracy of the fusion algorithm with that of the single-feature algorithms. According to Setting 2, the results on MSR-Action3D are shown in Table 6; according to Setting 4, the results on UTD-MHAD are shown in Table 7.
Tables 6 and 7 show that the recognition accuracy of the features fused by the MTSL algorithm is higher than that of the single-feature algorithms. The reason is that MTSL projects the different features into a common subspace, where the advantages of each single feature complement one another.
The recognition accuracy of the features fused by MTSL is also higher than that of CCA and DCA. This is mainly because MTSL constructs multiple projection targets, making the subspace samples converge to hyperplanes near the multiple projection-target centers and increasing the distance between subspace samples. In contrast, CCA and DCA mainly describe the correlation between two features, while image and skeleton are two different modalities with little correlation.
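The common-subspace idea can be sketched with a simple ridge-regression stand-in, in which each modality is projected toward shared class targets. This is only a rough analogue: the paper's actual MTSL objective, with its multiple projection-target centers, is not reproduced, and `lam1`/`lam2` merely echo the regularization weights discussed earlier:

```python
import numpy as np

def mtsl_fit(X1, X2, T, lam1=15.0, lam2=0.05):
    """Project two modalities (image features X1, skeleton features X2)
    toward shared targets T via regularized least squares, so both land
    in a common subspace where they can be fused additively.
    (Sketch only; not the paper's MTSL optimization.)"""
    def ridge(X, lam):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ T)
    W1, W2 = ridge(X1, lam1), ridge(X2, lam2)
    fuse = lambda A, B: A @ W1 + B @ W2   # complementary fusion in subspace
    return W1, W2, fuse
```

Because both projections target the same T, the two weakly correlated modalities end up comparable in the shared space, which is the property CCA/DCA-style correlation analysis does not provide here.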

Comparison with Other Methods
We compare our method with other methods. According to Setting 2, the recognition accuracy comparison with other methods on MSR-Action3D is shown in Table 8; according to Setting 4, the recognition accuracy comparison with other methods on UTD-MHAD is shown in Table 9.

Table 8. The recognition accuracy comparison with other methods on MSR-Action3D.

Table 9. The recognition accuracy comparison with other methods on UTD-MHAD.

Method                    Recognition Rate (%)
Kinect [30]               66.1
Kinect + Inertial [30]    79.1
3DHoT-MBC [22]            84.4
Our Method                89.53

It can be seen from Tables 8 and 9 that the recognition accuracy of our method reaches 91.58% on MSR-Action3D and 89.53% on UTD-MHAD, both higher than the recognition accuracy of the other methods listed. These results indicate the superiority of our method.

Conclusions
In this paper, we propose action feature representations, MCSTV and MSTM, that consider the integrally cooperative movement features of human action. MCSTV weighted-accumulates the limbs' motion vectors to form a new vector and uses this vector to account for the movement features of actions. The MSTM algorithm projects the key motion energy, extracted by key information extraction based on inter-frame energy fluctuation, onto three orthogonal axes and stitches the projections in temporal series. To describe the action information more accurately, the MTSL algorithm is used to fuse MCSTV and MSTM-HOG. The experimental results on MSR-Action3D and UTD-MHAD show that MCSTV not only considers the integrity and cooperativity of the motion joints, but also highlights the main moving limbs of the body. Compared with MEI, MHI, and DMM, MSTM describes the spatial structure and temporal information better. The recognition accuracy of the features fused by the MTSL algorithm is higher than that of most existing algorithms.

Future Expectations
When we use key information extraction based on inter-frame energy fluctuation to extract the key information, in some cases the redundancy cannot be effectively removed because the habitual shaking of some joints is too strong. In future work, we will focus on how to remove such redundant information effectively.