Exploring 3D Human Action Recognition Using STACOG on Multi-View Depth Motion Maps Sequences

This paper proposes an action recognition framework for depth map sequences using the 3D Space-Time Auto-Correlation of Gradients (STACOG) algorithm. First, each depth map sequence is split into two sets of sub-sequences using two different frame lengths. Second, Depth Motion Maps (DMMs) sequences are generated from every set and fed into STACOG to compute an auto-correlation feature vector. The two auto-correlation feature vectors obtained from the two sets of sub-sequences are passed separately to an L2-regularized Collaborative Representation Classifier (L2-CRC) to compute two sets of residual values. Next, the Logarithmic Opinion Pool (LOGP) rule is used to combine the two outcomes of L2-CRC and to assign an action label to the depth map sequence. Finally, our proposed framework is evaluated on three benchmark datasets: the MSR-Action 3D dataset, the DHA dataset, and the UTD-MHAD dataset. We compare the experimental results of our framework with state-of-the-art approaches to demonstrate its effectiveness. The computational efficiency of the framework is also analyzed on all three datasets to check whether it is suitable for real-time operation.


Introduction
Human action recognition is one of the most challenging tasks in the area of artificial intelligence and has attracted attention due to widespread real-life applications, ranging from robotics to human-computer interfaces, automated surveillance systems, healthcare monitoring, etc. [1][2][3]. Human actions are composed of the concurrent behaviors of human body parts. The objective of human action recognition is to recognize actions automatically from an unlabeled video [4,5]. Devices for capturing human actions fall into two broad categories, based on wearable sensors and video sensors, and many prior research works in action recognition have used both. To recognize actions with wearable sensors, multiple sensors are attached to the human body; to obtain action information, most researchers have used sensors such as accelerometers, gyroscopes, and magnetometers [6][7][8]. These wearable sensors are used in healthcare systems, worker monitoring, interactive gaming, sports, etc. However, they are not acceptable in all domains of action recognition, for example in automatic surveillance systems. It is far from convenient for humans (especially patients) to wear the sensors for a long time, and such sensors are comparatively demanding in terms of energy cost. Wearable sensors can also pose health risks, and for those already carrying smartphones, laptops, and tablets, a wearable sensor adds yet another device to manage.

The main contributions of this work are summarized as follows:

• The depth map sequences of each action video are partitioned into a set of sub-sequences of equal size. Afterward, DMMs are created from each sub-sequence corresponding to three projection views (front, side, and top) of 3D Euclidean space. Then, three DMMs sequences are derived by organizing all the DMMs along the projection views. Each video is segmented twice, generating two sets of sub-sequences using two different frame lengths, so that two sets of three DMMs sequences are obtained.
• Our recognition framework mines the 3D auto-correlation gradient feature vectors from the three DMMs sequences using the STACOG feature extractor, instead of mining them from the depth map sequences as in [15].
• A decision fusion scheme is applied to combine the residual outcomes obtained for the two 3D action representation vectors.
• The proposed framework achieves the highest results compared to all other work applying the STACOG descriptor to depth video.
The remainder of this paper is organized as follows. Related action recognition frameworks are reviewed in Section 2. The proposed framework is described in Section 3. In Section 4, experimental results and discussion are reported. Finally, conclusions and future research directions are presented in Section 5.

Related Work
This section reviews current depth map-based action recognition frameworks, as well as skeleton-, RGB-, inertial-, and fusion-based frameworks. Working with depth data, Chen et al. [16] used local binary patterns (LBPs) to extract features; they presented two fusion levels and used a Kernel-based Extreme Learning Machine (KELM) for both. Ref. [17] introduced the DMM-CT-HOG feature extractor, which is based on Depth Motion Maps (DMMs), the Contourlet Transform (CT), and Histograms of Oriented Gradients (HOGs). To improve accuracy, [18] combined texture and dense shape information into DLE features that are fed to an L2-regularized Collaborative Representation Classifier (L2-CRC). Ref. [19] proposed a method that fuses the classification results of multiple KELM classifiers applied to three types of features. A Bag-of-Map-Words (BoMW) method was introduced in [20], where feature vectors are extracted from a Salient Depth Map (SDM) and a Binary Shape Map (BSM), respectively, and combined through the BoMW. Ref. [21] presented a method using the gradient local auto-correlations (GLAC) feature description algorithm, which is based on the spatial and orientational auto-correlations of local image gradients, and introduced a fusion method based on the Extreme Learning Machine (ELM) classifier. Ji et al. [1] proposed a Spatio-Temporal Cuboid Pyramid (STCP), which subdivides the depth motion sequence into spatial cuboids and temporal segments, and used Histograms of Oriented Gradients (HOG) features. Chen et al. [22] used the Local Binary Pattern (LBP) texture descriptor and the KELM classifier [19] to detect actions. Again, in [23], DMMs are used as the feature descriptor, and classification is accomplished by an L2-CRC employing a distance-weighted Tikhonov matrix. A new feature named Global Ternary Image (GTI) was introduced in [24]; through a bag-of-GTI model, the authors captured both motion regions and motion directions. Later, Liang et al. [25] used multiscale HOG descriptors, extracted local STACOG features, and recognized actions with an L2-CRC classifier. To improve accuracy, [15] fused 2D and 3D auto-correlation-of-gradients features, extracted by the GLAC and STACOG descriptors respectively, and classified actions by KELM with an RBF kernel. Liu et al. [26] presented a method using Adaptive Hierarchical Depth Motion Maps (AH-DMMs) and Gabor filters; their method extracts motion and shape cues without losing temporal information and adopts the Gabor filter to encode the texture of the AH-DMMs. Jin et al. [27] split depth maps into a set of sub-sequences to create a vague boundary sequence (VB-sequence) and obtained dynamic features by combining all DMMs of the VB-sequences. Zhang et al. [28] presented low-cost 3D histograms-of-texture feature descriptors to obtain discriminant features, and introduced a multi-class boosting classifier (MBC) to exploit different features for recognition. Furthermore, Chen et al. [29] introduced a multi-temporal DMMs descriptor in which a non-linear weighting function is used to assemble depth frames; they used a patch-based LBP feature descriptor to obtain texture information, Fisher kernel representation, and the KELM classifier [19] for action classification. Li et al.
[30] extracted texture features with the discriminative completed LBP (disCLBP) descriptor and used a hybrid classifier combining an Extreme Learning Machine (ELM) and a collaborative representation classifier (CRC). The authors in [31] used Histogram of Oriented Gradients (HOG) and Pyramid Histogram of Oriented Gradients (PHOG) as shape feature descriptors together with an L2-CRC classifier. Azad et al. [32] introduced a multilevel temporal sampling (MTS) scheme based on the motion energy of depth maps and extracted histograms of gradients and local binary patterns from a weighted depth motion map (WDMM). In [33], an action recognition scheme based on two types of depth images, generated using the 3D Motion Trail Model (3DMTM), was introduced; two feature sets were obtained from the images by the GLAC algorithm and fused into a single vector. In the same year, Weiyao et al. [34] presented a Multilevel Frame Select Sampling (MFSS) model to obtain temporal samples from depth maps; they also proposed motion and static maps (MSM), extracted texture features with a block-based LBP feature extraction scheme, fused the obtained features through Fisher kernel representation, and used the KELM classifier to detect actions. Later, Shekar et al. [35] introduced Stridden DMMs, from which effective action information can be obtained quickly; they applied the Undecimated Dual-Tree Complex Wavelet Transform (UDTCWT) algorithm to extract wavelet features from the proposed DMMs and used a sequential Extreme Learning Machine classifier. To improve results, [36] used two types of images obtained via the 3D Motion Trail Model (3DMTM); in their method, feature vectors are extracted from MHIs and SHIs by the GLAC feature descriptor. Al-Faris et al. [37] presented the construction of a multi-view region-adaptive multi-resolution-in-time depth motion map (MV-RAMDMM); they trained multi-stream 3D convolutional neural networks (CNNs) on several views and time resolutions of the region-adaptive depth motion maps (RA-DMMs) and used a multi-class SVM classifier to recognize human actions.
Additionally, in [38], depth and inertial sensor-based features were extracted and fused into a single feature set that was passed to a collaborative representation classifier. Based on skeleton information, Youssef et al. [39] extracted normalized angles of local joints and used modified spherical harmonics (MSHs) to model the angular skeleton, employing the MSH coefficients of the joints as the discriminative descriptor of the depth maps. Hou et al. [40] proposed a framework to convert the spatio-temporal information of a skeleton sequence into color texture images and used convolutional neural networks to obtain discriminative features. The authors in [41] created a deep convolutional neural network (3D2CNN) to acquire spatio-temporal features from depth maps and calculated JointVectors from the depth maps; the spatio-temporal features and JointVectors were passed individually to SVM classifiers and the outputs were combined into a single result. To improve accuracy, [42] introduced Spatially Structured Dynamic Depth Images (S²DDI) to represent an action video; they presented a non-scaling method to generate S²DDI and adopted a multiply-score fusion scheme to increase accuracy. Using RGB images, Al-Obaidi et al. [43] presented a method to anonymize action videos, extracting histograms of oriented gradients (HOG) features from the anonymized video images. A Generative Multi-View Action Recognition (GMVAR) method that handles three distinct scenarios at the same time was presented in [44]; the authors introduced a View Correlation Discovery Network (VCDN) to fuse multi-view data. Liu et al. introduced dynamic pose images (DPI) and attention-based dynamic texture images (att-DTIs) in [45] to obtain spatial and temporal cues, combining DPI and att-DTIs through multi-stream deep neural networks and a late fusion scheme. Inertial sensor-based low-level and high-level features are used in [46] to categorize human actions performed in real time. Haider et al. [47] introduced balanced, imbalanced, and super-bagging methods to recognize volleyball actions, evaluated with four wearable sensors. Using signals from an inertial measurement unit, [48] introduced a method based on a 1D-CNN architecture that considers the flexibility of features in time and duration. Bai et al. [49] presented a Collaborative Attention Mechanism (CAM) to improve multi-view action recognition (MVAR) performance and proposed a Mutual-Aid RNN (MAR) cell to capture multi-view sequential information. Ullah et al. [50] introduced a conflux long short-term memory (LSTM) network, using a CNN model to extract features and a softmax layer for classification. A fusion technique called View-Correlation Adaptation (VCA), operating in both feature and label space, was presented in [51]; the authors proposed a semi-supervised feature augmentation method (SeMix) and introduced a label-level fusion network. In [52], a lightweight CNN model was used to detect humans, the LiteFlowNet CNN was used to extract features, and a deep skip connection gated recurrent unit (DS-GRU) was used to recognize actions.

Proposed Recognition Framework
In this section, we introduce the proposed framework with a detailed discussion of the construction of DMMs sequences, 3D auto-correlation feature extraction, and action recognition. Algorithms 1 and 2 describe the mechanisms of feature extraction and action recognition, respectively; Algorithm 2 calculates P(ω|c) through Equation (12), decides class(c) through Equation (13), and outputs class(c).

Construction of DMMs Sequences
In our work, DMMs corresponding to three projection views (front, side, and top) are constructed for each sub-sequence of a depth map sequence. To obtain the DMMs, all depth frames of each sub-sequence are projected onto the three orthogonal planes of 3D Euclidean space, yielding projection frames for the three views. For each view, accumulating the absolute differences between consecutive projection frames forms the DMMs of the front, side, and top views.
To interpret the computation of a DMMs sequence [23], a depth video D of length L is first divided into a set {S_j}_{j=1}^{m} of sub-sequences of uniform size l_1 > 0, such that D = \cup_{j=1}^{m} S_j, where j is the sub-sequence index. Consider the depth frame sequence {p_1, p_2, p_3, \ldots, p_{l_1}} of a sub-sequence, where l_1 is the frame length of each sub-sequence, i.e., len(S_j) = l_1 for all j. The projection of the ith frame p_i onto 3D Euclidean space provides three projected frames p_i^v (referred to as PF_j^v in Figure 1), where v designates the front, side, and top projection views, v ∈ {f, s, t}. The DMMs corresponding to the projection views are defined by

DMM_v = \sum_{i=2}^{l_1} \big| p_i^v - p_{i-1}^v \big|, \quad v \in \{f, s, t\}.

For each S_j, the resulting maps are denoted DMM_j^f, DMM_j^s, and DMM_j^t, and three DMMs sequences are formed by organizing all the DMMs along the projection views. In action datasets, the same actions are performed by different individuals at different speeds. To cope with action speed variations, the depth map sequence D is further divided into another set {V_k}_{k=1}^{n} of sub-sequences whose frame length is l_2 > 0, i.e., len(V_k) = l_2 for all k (see Figure 1). As a result, three new sets of DMMs sequences, {DMM_1^f, \ldots, DMM_n^f}, {DMM_1^s, \ldots, DMM_n^s}, and {DMM_1^t, \ldots, DMM_n^t}, are obtained from {V_k}_{k=1}^{n}. In our DMMs-sequence construction mechanism, the frame lengths l_1 and l_2 are experimentally set to 5 and 10, respectively. The frame length of a sub-sequence may vary but must be less than the length of the whole depth video, i.e., l_1, l_2 < L. The DMMs-sequence construction scheme for frame length l_1 is displayed in Figure 2. The frame interval I in Figure 2, set to 1, is the number of frames from the first frame of one portion to the first frame of the neighboring portion; it determines how many frames of the two portions overlap (l − I frames for frame length l). Note that the frame interval must be less than the frame length of a sub-sequence, i.e., I < l_1 and I < l_2.
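To make this construction concrete, the following Python sketch outlines the DMMs-sequence computation. It is a minimal sketch under stated assumptions: the side and top projections are approximated by binary occupancy maps over an assumed 256-level depth quantization, and all function names (project_views, dmm, dmm_sequence) are illustrative rather than taken from our implementation.

```python
import numpy as np

def project_views(frame, depth_levels=256):
    """Project a depth frame onto front, side, and top views.

    The front view is the depth map itself; the side and top views are
    approximated here as binary occupancy maps over the quantized depth
    axis (a common simplification of the 3D Euclidean-space projection).
    """
    h, w = frame.shape
    front = frame.astype(np.float32)
    side = np.zeros((h, depth_levels), dtype=np.float32)
    top = np.zeros((depth_levels, w), dtype=np.float32)
    ys, xs = np.nonzero(frame)
    ds = np.clip(frame[ys, xs].astype(int), 0, depth_levels - 1)
    side[ys, ds] = 1.0
    top[ds, xs] = 1.0
    return front, side, top

def dmm(sub_sequence):
    """DMM_v = sum of |p_i^v - p_(i-1)^v| over consecutive projected frames."""
    views = [project_views(f) for f in sub_sequence]
    return tuple(
        sum(np.abs(views[i][v] - views[i - 1][v]) for i in range(1, len(views)))
        for v in range(3)  # 0: front, 1: side, 2: top
    )

def dmm_sequence(depth_video, frame_length, interval=1):
    """Slide a window of `frame_length` frames with step `interval`;
    interval = 1 yields frame_length - 1 overlapping frames."""
    return [dmm(depth_video[s:s + frame_length])
            for s in range(0, len(depth_video) - frame_length + 1, interval)]
```

With frame lengths 5 and 10 as above, dmm_sequence(depth_video, 5) and dmm_sequence(depth_video, 10) yield the two sets of three-view DMMs sequences used in the rest of the pipeline.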

Action Vector Formation
STACOG was introduced in [53] for RGB video sequences to extract local relationships within the space-time gradients of three-dimensional motion, by applying auto-correlation functions to the space-time orientations and magnitudes of the gradients. In our work, this method is applied to all the DMMs sequences (computed in the previous section) of a depth video D to extract 3D geometric features of human motion. At each space-time point of a space-time volume S(x, y, t) (in general, this volume stands for a DMMs sequence), the space-time gradient vector is computed through the derivatives S_x, S_y, and S_t. The space-time gradient can be described by the angles α = arctan(S_x, S_y) and β = arcsin(S_t / m_g), where the gradient magnitude is m_g = \sqrt{S_x^2 + S_y^2 + S_t^2}. Through the two angles, the space-time orientation of the gradient is coded into B orientation bins on a unit sphere by assigning weights to the nearest bins (see Figure 3). The orientation is thus represented by a B-dimensional vector called the space-time orientation coding (STOC) vector, denoted b. Using the magnitude m_g and the STOC vector b of the gradients, the Nth-order auto-correlation function of the space-time gradients is defined as

R_N(d_1, \ldots, d_N) = \int f\big(m_g(p), m_g(p + d_1), \ldots, m_g(p + d_N)\big)\, b(p) \otimes b(p + d_1) \otimes \cdots \otimes b(p + d_N)\, dp,

where d_1, \ldots, d_N are displacement vectors from the reference point p = (x, y, t), f is a weighting function, and ⊗ is the tensor product of vectors. In the tensor products, only a small number of components related to the gradient orientations of the neighboring vectors are non-zero. In the experiments, the parameters are restricted to N ∈ {0, 1}; d_{1x,y} ∈ {±Δs, 0}; d_{1t} ∈ {Δt, 0}; f(·) ≡ min(·), where Δs is the displacement interval along the spatial axes and Δt that along the temporal axis. To suppress the effect of isolated noise on the surrounding auto-correlations, min is chosen as the weighting function f. For N ∈ {0, 1}, the 0th-order and 1st-order STACOG features can be written as

S_0 = \sum_{p} m_g(p)\, b(p), \qquad S_1(d_1) = \sum_{p} \min\big(m_g(p), m_g(p + d_1)\big)\, b(p)\, b(p + d_1)^T,

where S_0 and S_1 give the 0th-order and 1st-order STACOG features, respectively, and T denotes the transpose.
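The gradient computation and the 0th-order feature S_0 can be illustrated with the short sketch below, assuming the DMMs sequence is stacked into a 3D array S[x, y, t]. The hard orientation binning used here is a simplification of the weighted nearest-bin coding described above, and the function name and bin counts are illustrative (the defaults match the parameter setting reported later).

```python
import numpy as np

def stacog_order0(S, n_alpha=9, n_beta=4):
    """0th-order STACOG feature: S_0 = sum_p m_g(p) * b(p), with one-hot b(p)."""
    # Space-time derivatives S_x, S_y, S_t via central differences.
    Sx, Sy, St = np.gradient(S.astype(np.float32))
    mg = np.sqrt(Sx**2 + Sy**2 + St**2) + 1e-8        # gradient magnitude m_g
    alpha = np.arctan2(Sy, Sx)                        # in-plane orientation
    beta = np.arcsin(np.clip(St / mg, -1.0, 1.0))     # temporal elevation

    # Quantize (alpha, beta) into B = n_alpha * n_beta orientation bins.
    a_bin = np.floor((alpha + np.pi) / (2 * np.pi) * n_alpha).astype(int) % n_alpha
    b_bin = np.clip(np.floor((beta + np.pi / 2) / np.pi * n_beta).astype(int),
                    0, n_beta - 1)
    bins = a_bin * n_beta + b_bin

    # Magnitude-weighted histogram over the STOC bins.
    return np.bincount(bins.ravel(), weights=mg.ravel(),
                       minlength=n_alpha * n_beta)
```

The 1st-order feature S_1 would additionally correlate b(p) with b(p + d_1) for the displacements d_1 defined by Δs and Δt, accumulating min(m_g(p), m_g(p + d_1)) into a B × B matrix per displacement.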

Action Recognition
By applying Algorithm 1, two auto-correlation feature vectors H_1 and H_2 are acquired corresponding to the two different sets of sub-sequences, {S_j}_{j=1}^{m} and {V_k}_{k=1}^{n}, of the depth video D (see Figure 1). The dimensions of H_1 and H_2 are reduced through Principal Component Analysis (PCA) [54]. The two vectors are then passed separately to the L2-regularized Collaborative Representation Classifier (L2-CRC) [13], and the two resulting outcomes are fused by the logarithmic opinion pool (LOGP) rule [14]. To explain L2-CRC, let K denote the number of classes. The set Y = [Y_1, Y_2, \ldots, Y_i, \ldots, Y_K] = [y_1, y_2, \ldots, y_j, \ldots, y_n] \in R^{d \times n} collects all training samples, where d is the dimensionality of the training samples, n is the total number of training samples from the K classes, Y_i \in R^{d \times n_i} is the subset of training samples from class i, and y_j \in R^d is any training sample of Y_i. Let c \in R^d be an unknown test sample, expressed as a linear combination of all training samples in Y:

c = Y\gamma,    (5)

where \gamma = [\gamma_1, \gamma_2, \ldots, \gamma_i, \ldots, \gamma_K] is an n × 1 coefficient vector and \gamma_i is the sub-vector associated with the training samples of class i. In practice, Equation (5) cannot be solved directly because it is underdetermined [55]. Instead, it is solved through the following norm-minimization problem:

\hat{\gamma} = \arg\min_{\gamma} \{ \| c - Y\gamma \|_2^2 + \lambda \| M\gamma \|_2^2 \},    (6)

where λ denotes the regularization parameter and M is the Tikhonov regularization matrix [56], configured as the diagonal matrix

M = \mathrm{diag}\big( \| c - y_1 \|_2, \ldots, \| c - y_n \|_2 \big).    (7)
The coefficient vector can be calculated as [57]

\hat{\gamma} = (Y^T Y + \lambda M^T M)^{-1} Y^T c = Zc,    (8)

where Z = (Y^T Y + \lambda M^T M)^{-1} Y^T. Since the training set Y is given and λ is determined from these samples, Z can be simply calculated, and thus Z is independent of c. Hence, once the test sample c is given, the corresponding vector \hat{\gamma} is easily computed from Equation (8). Considering all action classes, the coefficient vector is written as \hat{\gamma} = [\hat{\gamma}_1, \hat{\gamma}_2, \ldots, \hat{\gamma}_i, \ldots, \hat{\gamma}_K]. The class-specific residual error is then obtained by

e_i = \| c - Y_i \hat{\gamma}_i \|_2,    (9)

where Y_i is the dictionary of class i and \hat{\gamma}_i its coefficient sub-vector. From Equation (9), an error vector is obtained for an input feature vector. In our case, there are two error vectors, e^1 = [e_1^1, e_2^1, \ldots, e_K^1] and e^2 = [e_1^2, e_2^2, \ldots, e_K^2], since the two feature vectors H_1 and H_2 obtained by Algorithm 1 are input for the test sample c.
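A minimal sketch of the residual computation in Equations (7)-(9) is given below, assuming PCA-reduced feature vectors; NumPy, the function name, and the default λ = 0.0001 (the value used in the parameter setting) are illustrative choices, not our released code.

```python
import numpy as np

def l2_crc_residuals(Y, labels, c, lam=1e-4):
    """Y: (d, n) training matrix; labels: length-n class ids; c: (d,) test vector.
    Returns the per-class residuals e_i = ||c - Y_i * gamma_i||_2."""
    labels = np.asarray(labels)
    # Tikhonov matrix M: diagonal of distances between c and each column of Y.
    M = np.diag(np.linalg.norm(Y - c[:, None], axis=0))
    # gamma = (Y^T Y + lam * M^T M)^{-1} Y^T c   (Equation (8))
    gamma = np.linalg.solve(Y.T @ Y + lam * (M.T @ M), Y.T @ c)
    residuals = {}
    for k in np.unique(labels):
        idx = labels == k
        residuals[k] = np.linalg.norm(c - Y[:, idx] @ gamma[idx])  # Equation (9)
    return residuals
```

Note that with a distance-weighted M as in Equation (7), M depends on the test sample, so only Y^T Y can be cached in this sketch; with a fixed M, the full matrix Z = (Y^T Y + λM^T M)^{-1} Y^T can be precomputed as noted above.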
A decision fusion scheme, the logarithmic opinion pool (LOGP) rule [14], is used to combine the probabilities derived from these errors and to output the class label. In this scheme, a global membership function is calculated through the posterior probability p_q(ω|c) of each classifier:

\log P(\omega|c) = \sum_{q=1}^{Q} \alpha_q \log p_q(\omega|c),    (10)

where Q = 2 is the number of classifiers and the fusion weights are uniform, α_q = 1/Q. The posterior probability of classifier q is derived from its residual errors, with e^1 and e^2 normalized to [0, 1], as

p_q(\omega_i|c) = \frac{\exp(-e_i^q)}{\sum_{j=1}^{K} \exp(-e_j^q)}.    (11)

Equation (11) assigns a higher posterior probability p_q(ω|c) to a smaller residual error e_i. The combined probability from the two classifiers is therefore

P(\omega|c) = \prod_{q=1}^{2} p_q(\omega|c)^{1/2},    (12)

and the class label is decided as

class(c) = \arg\max_{\omega} P(\omega|c).    (13)
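The fusion in Equations (10)-(13) can be sketched as follows; mapping residuals to posteriors via a softmax over the negated, normalized errors is one plausible reading of Equation (11), and the names are illustrative.

```python
import numpy as np

def logp_fuse(error_vectors):
    """error_vectors: list of length-K residual arrays, one per classifier (Q = 2 here)."""
    Q = len(error_vectors)
    log_post = 0.0
    for e in error_vectors:
        e = (e - e.min()) / (e.max() - e.min() + 1e-12)  # normalize errors to [0, 1]
        p = np.exp(-e) / np.exp(-e).sum()                # Eq. (11): small error -> large p
        log_post = log_post + np.log(p + 1e-12) / Q      # Eqs. (10)/(12): average of logs
    return int(np.argmax(log_post))                      # Eq. (13): class label

# label = logp_fuse([e1, e2])
```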

Experimental Results and Discussion
This section discusses three sets of experiments on three datasets to evaluate the performance of the proposed framework. First, the datasets are introduced along with their challenges. Second, the setup of the STACOG parameters used to evaluate the framework is discussed. Finally, experimental results on the three datasets are described.

MSR-Action 3D Dataset
The MSR-Action 3D dataset is captured by a depth camera and provides action data as depth map sequences. The resolution of each map is 320 × 240. The dataset has 20 action categories. All actions are performed by 10 different subjects, and every subject performs each action 2 or 3 times. The dataset contains 557 depth map sequences [58]. It is challenging because of the similarity between some actions (e.g., "Draw x" and "Draw tick").

DHA Dataset
The DHA dataset was introduced in [59] and contains some actions extended from the Weizmann dataset [60], which is used in RGB-based action recognition. The DHA dataset involves 23 action types, of which actions 1 to 10 are adopted from the Weizmann dataset [61]. All actions are performed by 21 subjects (12 males and 9 females), and the total number of depth map sequences is 483. Because of the inter-class similarity between some actions (e.g., "rod-swing" and "golf swing"), the DHA dataset is challenging.

UTD-MHAD Dataset
In the UTD-MHAD dataset [38], RGB videos, depth videos, skeleton positions, and inertial signals are captured by a video sensor and a wearable inertial sensor. The dataset contains 27 actions, all performed by eight subjects (four females and four males), and each subject repeats each action four times. After eliminating three corrupted sequences, the dataset includes 861 depth action sequences.

Parameter Setting
The proposed framework is evaluated on the datasets discussed above and compared with other state-of-the-art approaches. For each dataset, some samples are used for training and the remaining samples for testing, and the reported results are obtained on the test samples. Each depth action video of every dataset is partitioned into sub-sequences using the same frame lengths. The frame interval between two consecutive sub-sequences is set to 1, which determines the number of overlapping frames; thus, for the two frame lengths 5 and 10, overlaps of 4 and 9 frames are obtained, respectively. Additionally, the same parameter values are used for all action datasets. The parameter values were first tuned on one dataset to find the values giving the highest recognition accuracy; these values were then applied to all other datasets to verify the robustness of the framework. To extract STACOG features, the number of orientation bins in the x-y plane and the number of orientation bin layers are set to 9 and 4, respectively. Following [15], the temporal interval is set to 1 and the spatial interval is fixed to 8. The L2-CRC parameter λ is set to 0.0001.
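For reference, the settings above can be gathered into a single configuration; the dictionary and key names below are illustrative, while the values are those reported in this section.

```python
# Illustrative configuration; values as reported in this section.
PARAMS = {
    "frame_lengths": (5, 10),   # sub-sequence lengths l1, l2
    "frame_interval": 1,        # -> overlaps of 4 and 9 frames, respectively
    "n_alpha_bins": 9,          # orientation bins in the x-y plane
    "n_beta_layers": 4,         # orientation bin layers
    "spatial_interval": 8,      # displacement interval (delta s)
    "temporal_interval": 1,     # displacement interval (delta t)
    "lambda_l2crc": 0.0001,     # L2-CRC regularization parameter
}
```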

Classification on MSR-Action 3D Dataset
In the experimental arrangement, we used all action categories of the MSR-Action 3D dataset instead of dividing them into different action subsets. The action samples performed by odd-numbered subjects are employed as training samples (284), and the samples of the remaining, even-numbered subjects are used as test samples (273). Our proposed framework gives 93.4% recognition accuracy, which is compared with other depth-based frameworks in Table 1. Among the 20 actions, the classification accuracy is 100% for 14 actions. The remaining 6 actions show some confusion with other actions because of inter-class similarities; for example, the confusion of the action "Side kick" with the action "Hand catch" is 9.1% (see Figure 4). The per-class accuracy, including confusion information, is further detailed in Table 2.

Table 1. Comparison of action recognition accuracy (%) with state-of-the-art frameworks on the MSR-Action 3D dataset.

Classification on DHA Dataset
In the DHA dataset, samples of the odd-numbered subjects are used as training samples and samples of the even-numbered subjects as test samples: 253 samples are used for training and 230 for testing. Our proposed framework achieves 95.2% accuracy, which shows the effectiveness of the recognition framework. From Table 4, we can observe that 15 out of 23 actions are recognized with 100% accuracy. The remaining 8 actions are confused with other actions, as shown in Figure 5; for example, the action "golf swing" shows 10% confusion with "rod-swing". The comparison of our recognition framework with other state-of-the-art methods is shown in Table 3, which makes clear that our proposed framework considerably outperforms existing frameworks. The class-wise classification accuracy (correct and incorrect classifications) is shown in Table 4.

Table 3. Comparison of action recognition accuracy (%) with state-of-the-art frameworks on the DHA dataset.

Classification on UTD-MHAD Dataset
In the UTD-MHAD dataset, samples of the odd-numbered subjects are used as training samples (431) and samples of the even-numbered subjects as test samples (430). On this dataset, our framework achieves 87.7% recognition accuracy (see Table 5), reflecting the wide variety of actions. Our recognition framework gives 100% accuracy for 11 actions, while the remaining 16 actions show confusion with other actions (see Figure 6). The per-class recognition performance is reported in Table 6.

Table 5. Comparison of action recognition accuracy (%) with state-of-the-art frameworks on the UTD-MHAD dataset.

Efficiency Evaluation
The execution time and the space complexity of the key components are analyzed to show the efficiency of our system.

Execution Time
The system is implemented in MATLAB on a CPU platform with an Intel i5-7500 quad-core processor at 3.41 GHz and 16 GB of RAM. There are seven major components in the proposed approach: DMMs-sequence construction for frame length 5, DMMs-sequence construction for frame length 10, H_1 feature vector generation, H_2 feature vector generation, PCA on H_1, PCA on H_2, and action labeling. The execution time of these components is measured to assess the time efficiency of the system on the three datasets, namely MSR-Action 3D, DHA, and UTD-MHAD. Table 7 shows the execution time (in milliseconds) of the seven components on those datasets and compares the total execution time across the datasets. For the MSR-Action 3D dataset, execution times are measured for each action sample with 40 frames on average. As can be seen from Table 7, 40 frames are processed in well under one second (252.6 ± 74.8 milliseconds); therefore, our proposed recognition framework can be used for real-time action recognition on the MSR-Action 3D dataset. The execution times on the DHA dataset are measured for each action sample with 29 frames on average; Table 7 shows that the 29 frames are processed in under one second (379.1 ± 90.7 milliseconds), so the framework can also run in real time on the DHA dataset. Table 7 also presents the execution times (in milliseconds) on the UTD-MHAD dataset for each action sample with 68 frames on average. To process 68 frames, the system requires less than one second (508.9 ± 100.9 milliseconds), which again demonstrates the real-time capability of our proposed framework.
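As a rough guide to reproducing such measurements, per-component timing can be as simple as the sketch below (our timings were taken in MATLAB; this Python analogue is only illustrative).

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, (time.perf_counter() - t0) * 1000.0
```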

Space Complexity
The components PCA and L2-CRC are the key contributors to the space complexity of the proposed system. PCA and L2-CRC are applied for both frame lengths 5 and 10. Therefore, the complexity of PCA is 2 · O(l³ + l²m) [23] and the complexity of L2-CRC is 2 · O(n_c × m) [62]. Thus, the total complexity of the system can be expressed as 2 · O(l³ + l²m) + 2 · O(n_c × m). Table 8 reports the computed complexity and compares it with the complexities of other existing frameworks; the table shows that our framework has lower complexity while recognizing actions better than the other frameworks.

Conclusions
In this paper, we present an effective action recognition framework based on 3D auto-correlation-of-gradients features. The Depth Motion Maps (DMMs) sequence representation is first introduced to obtain additional temporal motion information from depth map sequences, which helps distinguish similar actions. The space-time auto-correlation of gradients feature description algorithm is then used to extract motion cues from the sequences of DMMs for the different projection views. Finally, the collaborative representation classifier (CRC) and the decision fusion scheme are used to determine the action class. Experimental results on three benchmark datasets show that the proposed framework outperforms state-of-the-art methods, including other existing techniques based on space-time auto-correlation of gradients features. Furthermore, the space-time complexity analysis of the proposed framework indicates that it can be used for real-time human action recognition.