Keys for Action: An Efficient Keyframe-Based Approach for 3D Action Recognition Using a Deep Neural Network.

In this paper, we propose a novel and efficient framework for 3D action recognition using a deep learning architecture. First, we develop a 3D normalized pose space that consists of only 3D normalized poses, which are generated by discarding translation and orientation information. From these poses, we extract joint features and employ them further in a Deep Neural Network (DNN) in order to learn the action model. The architecture of our DNN consists of two hidden layers with the sigmoid activation function and an output layer with the softmax function. Furthermore, we propose a keyframe extraction methodology through which, from a motion sequence of 3D frames, we efficiently extract the keyframes that contribute substantially to the performance of the action. In this way, we eliminate redundant frames and reduce the length of the motion. More precisely, we ultimately summarize the motion sequence, while preserving the original motion semantics. We only consider the remaining essential informative frames in the process of action recognition, and the proposed pipeline is sufficiently fast and robust as a result. Finally, we evaluate our proposed framework intensively on publicly available benchmark Motion Capture (MoCap) datasets, namely HDM05 and CMU. From our experiments, we reveal that our proposed scheme significantly outperforms other state-of-the-art approaches.


Introduction
Human action recognition and behavior analysis has been an active research area in recent decades because of its wide range of potential applications, including human-computer interaction applications, e.g., sport annotations and fitness training, game and film industries, computer animations, robotics, content-based data retrieval, health monitoring, and medical rehabilitation. For this reason, the demand for capturing and rendering 3D human motion is continually increasing. In general, studio-like environments with well-developed motion capture systems are used to capture the dynamic movements of an actor or object. The motion capture data basically represent human motions in the form of the spatiotemporal trajectories of the specified human skeleton joints [1]. The motivation behind motion capturing is to record the motion and then re-utilize it to perform different kinds of tasks rather than generate the motion synthetically. In order to capture human motion, a variety of sources are utilized, such as mechanical systems in which potentiometer and fiber optics are used; magnetic-/acoustic-based systems [2]; camera-based systems [3][4][5] in which expensive, high-speed and high-definition cameras are used; high-speed optical motion capture systems using photosensors [6];

Figure 1. System overview. The first phase is the training phase, while the second phase represents the testing phase. In the training phase, 3D poses are normalized first by removing orientation and translational information from each pose, thus developing the normalized pose space. The joint features are extracted from these normalized poses and are given as input to the deep neural network in order to learn the model. The testing phase includes normalization, keyframe extraction and then the extraction of joint features, on the basis of which the action is recognized with the help of the learned model.

We evaluate our proposed approach thoroughly on popular benchmark MoCap datasets, namely HDM05 [32] and CMU [33]. We categorize these datasets further into four groups according to the motion classes, i.e., HDM05-65, HDM05-14, CMU-30, and CMU-14. We conducted several experiments in order to analyze our proposed scheme, and we describe them step by step. In particular, we analyze (i) the impact of essential keyframes on the overall process of action recognition in terms of accuracy and time complexity, (ii) the impact of the training MoCap datasets, and (iii) the impact of a variety of deep neural network architectures. We compare our approach with other existing state-of-the-art approaches and conclude that our proposed pipeline is not only computationally fast but also achieves comparatively better performance in terms of accuracy. This paper is organized as follows: we first discuss the related methods and techniques in Section 2. In Section 3, we describe our proposed methodology step by step, including the process of normalization, the proposed keyframe extraction algorithm, and the details about the proposed deep neural network architecture. The experiments and the discussion on the obtained results are presented in Section 4. A comparison of our approach with other techniques is also available in Section 4. Finally, we conclude our work in Section 5.

Related Work
In this section, we discuss the background research work in the field of action recognition using motion capture datasets. There exists a variety of techniques to recognize 3D human action from a motion capture dataset. We categorize these techniques into two major classes, i.e., conventional machine-learning-based approaches and deep-learning-based algorithms.

Conventional Learning-Based Approaches
An abundance of research has been done on human action recognition based on conventional machine learning techniques, such as K Nearest Neighbor (KNN) classifiers [34,35], Support Vector Machine (SVM) [18,35,36,37,38,39,40], Hidden Markov Models (HMM) [41,42], clustering strategies [42,43,44], and Bayesian learning [45,46]. Most of these techniques first extract hand-crafted features [34,36,38,39,46,47] and then apply a learning algorithm in order to classify the action. Wu et al., in [34], propose action descriptors with a sliding temporal window of size 5, which includes joint position, angular velocity, and angular acceleration. They exploit three different modified K Nearest Neighbor (KNN) classifiers for the confidence of each frame, for the prediction of frame-wise labels, and, ultimately, for the final action classifications. In [18], Cho et al. propose a method that classifies the action while simultaneously reconstructing the given input motion sequences. This approach utilizes two sets of features: the first feature set consists of the relative positions of joints (PO) with temporal differences (TD), represented as (PO+TD); and the second set includes the relative positions of joints (PO) with temporal differences (TD) and normalized trajectories (NT) of motion, represented as (PO+TD+NT). A Hybrid Multi-Layer Perceptron with a deep autoencoder, a symmetric feedforward neural network, is then trained on this extracted feature set to perform the classification and reconstruction tasks simultaneously. The experiments were performed on the HDM05 MoCap dataset. For the evaluation, they also employed other classification techniques, such as Multi-Layer Perceptron (MLP), SVM, Extreme Learning Machines (ELM), and Hybrid Multi-Layer Perceptron (Hybrid MLP), with different learning rates λ = {0, 0.1, 0.5, 0.9}.
Yang and Tian, in [46], propose a new action feature descriptor, e.g., eigenJoints, for action recognition, which is basically the combination of multiple action-related information types, such as the static posture of the actor, how the motion is performed, and the ultimate overall dynamics. For the selection of the most informative frame, they employ Accumulated Motion Energy (AME), which measures the dissimilarity between the frames. For the final action classification task, they deploy a non-parametric Naïve-Bayes-Nearest-Neighbor (NBNN) classifier. Vantigodi et al. [39] propose a method for action recognition that is based on two feature sets, the variance of skeleton joints and the time-weighted variance of the skeleton points, which incorporates the temporal information of the performed action. Both feature sets are embedded together for further model training. For action classification, a linear SVM and a correlation-based metric are employed. Liang et al. [42] introduce a local joint structure and 3D histogram-based local and global features in order to represent 3D actions. Linear Discriminant Analysis (LDA) is employed to reduce feature dimensionality, the k-means clustering algorithm is utilized to generate codewords, and Hidden Markov Models are deployed for action recognition on the basis of the codewords. Moussa et al. [38] propose a methodology that depends on high-level features that carry information about changes in human body dimensions during the performance of the action. Their proposed system comprises four stages: the extraction of skeleton details, parameter calculation, parameter encoding, and, finally, a classification module, in which a multi-class linear SVM classifier is employed.
Slama et al. [36] present a method in which 3D human skeleton motion is represented as geometric formulations, and an action is represented as a component of a Grassmann manifold. For classification, they employ a wrapped Gaussian model and a linear Support Vector Machine (SVM). Kovar and Gleicher, in their paper [47], propose a method to logically identify similar motion segments by employing a novel distance metric with which they numerically find similarity and closeness between motion segments from a large MoCap dataset. They further utilize these similar segments for automated motion registration, as well as for blending the continuous and parameterized space of 3D motions. Xiao and Song [45] propose a technique of Statistical Learning and Bayesian Fusion (SLBF) for motion clip similarity in which a motion feature database is developed by representative frames extracted through a fuzzy clustering strategy and gesture features. They combine category-based motion similarity distances and Canonical Correlation Analysis (CCA)-based distances through Bayesian estimation in order to find similar segments from a MoCap dataset. Kadu and Kuo [37] propose a multi-resolution string representation method using Tree-Structured Vector Quantization (TSVQ) in order to generate a codeword for action classification. For final action recognition and classification, they considered a number of methods, such as (i) Method A: Motion-String Similarity including Sim-parameter with Level-n (SLn) and Max-parameter with Level-n (MLn), where n = {10, 11, 12, 13}; (ii) Method B: a Pose-Histogram Classifier with Level-n (SLn), where n = {3, 4, 5, 6}; (iii) Method C: Two-Step Score Fusion with TSVQ; and (iv) Method D: Two-Step Support Vector Machine (SVM) Fusion with TSVQ. Overall, they achieved very good results with the Two-Step SVM Fusion with TSVQ method.
Ko et al., in [43], deploy Principal Component Analysis (PCA) for dimensionality reduction, motion saliency and the k-means clustering algorithm in order to first extract informative and significant keyframes from human motion input sequences; then, they reconstruct these motion sequences to compare them with the input motion clip. In [48], Wu et al. employ a Self Organizing Map (SOM) and the Smith-Waterman algorithm to achieve efficient retrieval and, ultimately, indexing of the human motion capture data. For indexing, the SOM is utilized, while the local similarities between motion clips are computed with the help of the Smith-Waterman algorithm. They propose an unsupervised method for the indexing and clustering of motion clips, which are deployed further to classify the actions, as well. They enhanced their strategy of the motion map, as described in [44], where they present a cluster-based scheme for the indexing and retrieval of motion clips. They partition the human skeleton model into three body parts: the torso, the arms and the legs. They then measure temporal similarity information for each body part by the SOM and Smith-Waterman algorithm. In the end, a hierarchical clustering method is implemented to cluster similar data, as well as to find relationships between them. The authors of [41] propose a novel frame-by-frame action recognition approach by considering the algebraic velocity generated by different body parts of the 3D skeleton. For the classification of different action categories, a real-time Hidden Markov Model algorithm with Gaussian Mixture Models (GMM) is deployed. Barnachon et al. [49] propose an action recognition technique for ongoing action sequences. They compute the histogram of the action and then the Hausdorff distances accordingly, which are further warped by Dynamic Time Warping (DTW). For that purpose, they deploy dynamic programming in order to compute the final recognition score. Baumann et al. [50] propose an action graph, for which a kd tree is developed. The neighborhood of a query is fetched, which is further utilized in the action graph. Finally, action recognition is transformed into the shortest-path-finding problem, where the target is to find the shortest path through the action graph. This shortest path represents the final action. Laraba et al. [35] first represent 3D human motions as 2D RGB images, and then they employ learning algorithms, including KNN, SVM, Random Forest, and a Convolutional Neural Network (CNN), for action classification. On the basis of their experiments, they claim that because the images of the motion sequences are represented in the RGB domain, the CNN outperforms all other competing techniques. Plenty of methods exist that employ key pose-based features or descriptors in order to classify actions [49,51,52,53].

Deep-Learning-Based Approaches
Hinton et al. [54] define a DNN as a neural network that consists of two or more hidden layers between the input layer and the output layer. The literature contains a number of techniques that employ deep-learning-based approaches; for example, [1,55,56,57] rely on a Convolutional Neural Network (CNN), [19,58,59,60] utilize a Recurrent Neural Network (RNN), as well as Long Short-Term Memory (LSTM), and [61] uses Deep Progress Reinforcement Learning (DPRL). Sedmidubsky et al. [1] propose a method for action recognition and segmentation in which the motions are mapped onto encoded RGB images. They first normalize the poses, and then the x-, y-, and z-coordinates of the poses are translated into red, blue, and green channels of the colors. They combine distance-based functions with a CNN classifier. In this way, they generate fixed-sized, highly descriptive feature vectors with 4096 dimensions. They learn motion characteristics by employing a CNN while performing indexing by a distance-based comparison. They further enhanced this approach towards the process of segmentation. Zhang et al. [59] extended an RNN model to the spatial domain by adding simple geometric relational features that are based on the distances between skeleton joints. They use a three-layer-deep LSTM model in which they drop the in-cell connections. The geometric features are given as input to the first layer of the LSTM network, and the output of the first layer is provided as the input to the upper layer. A softmax layer is ultimately used on top of the highest LSTM layer. Liu et al. [55] propose a skepxel in which they combine spatial and spatiotemporal information in order to represent the skeleton joint sequences. Furthermore, they also add relative joint velocities. In this way, the authors provide a more detailed hierarchical representation with micro-temporal relation and macro-temporal relation for learning through a CNN.
They extended the Inception-ResNet CNN with their proposed scheme and obtained outstanding results. Tang et al. [61] recognize action by proposing Deep Progress Reinforcement Learning (DPRL) with a graph-based CNN. They extract the most informative frames from the input action video sequences through DPRL and employ a graph-based CNN in order to exploit the extrinsic, as well as intrinsic, human joint dependencies. Pham et al. [57] propose an SPMF (Skeleton Posture-Motion Feature) based on necessary spatiotemporal information extracted from skeleton poses and their motions in order to represent unique patterns that exist in skeletal movements. It is further enhanced by exploiting the Adaptive Histogram Equalization (AHE) method to build the action map. In the end, Deep CNNs (DCNN) based on the already-proposed DenseNet architecture are utilized for the purpose of final learning and action classification. In [19], the authors propose different Recurrent Neural Network (RNN) architectures for action recognition. Rather than use the whole skeleton, this approach divides the skeleton into five subparts (two legs, two arms, and one trunk) according to the skeleton structure in order to feed different recurrent network architectures, such as a Hierarchically
Veeriah et al. [60] added a new gating strategy in LSTM to develop a differential RNN that depends on information obtained through the changes that occur in successive frames. Ijjina et al. [56] propose a fuzzy CNN to recognize action using human 3D skeleton data. They measure the temporal variation between the skeleton joints during action sequences and recognize local patterns.

Methodology
The detailed pipeline and framework of our proposed methodology can be seen in Figure 1. We discuss all the steps involved in our proposed methodology, one by one, as follows.

Normalization
The first step of the proposed pipeline is normalization. We normalize each 3D pose X with 31 joints J that exists in the motion M. Specifically, we eliminate the translational, as well as the orientation, information from the 3D pose so that we can avoid ambiguities and complexities that may arise because of such information. For translational normalization, we translate the 3D pose so that its center of mass (the root joint) lies at the coordinates (0,0,0): we subtract the root joint coordinates from all other joint coordinates in order to shift the pose to the position (0,0,0). For orientation normalization, we rotate the joint trajectories about the y-axis (facing upward) so that the subject is seen in the frontal view: the skeleton faces towards the x-axis, and the hip joints are aligned to the z-axis. We first estimate the angle and then rotate all the joints of a pose by this angle about the y-axis. As a result, each pose retains only the information about how the motion is performed, rather than where and from what viewpoint it is executed. An example of the normalization of different poses is shown in Figure 2.
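
The two normalization steps can be illustrated with a minimal sketch, assuming a pose is a list of (x, y, z) joint positions. The joint indices `root`, `left_hip`, and `right_hip` are hypothetical placeholders, and the choice to map the left-to-right hip vector onto the positive z-axis is an assumption; whether this corresponds to facing +x or -x depends on the handedness of the skeleton definition.

```python
import math

def normalize_pose(pose, root=0, left_hip=1, right_hip=2):
    """Return a translation- and orientation-normalized copy of `pose`.

    `pose` is a list of (x, y, z) joint positions. The joint indices are
    hypothetical defaults; a real 31-joint skeleton defines its own.
    """
    rx, ry, rz = pose[root]
    # Translational normalization: move the root joint to the origin.
    centered = [(x - rx, y - ry, z - rz) for x, y, z in pose]

    # Orientation normalization: estimate the angle about the y-axis that
    # maps the hip vector (projected onto the x-z plane) onto the +z axis.
    hx = centered[right_hip][0] - centered[left_hip][0]
    hz = centered[right_hip][2] - centered[left_hip][2]
    theta = math.atan2(-hx, hz)
    c, s = math.cos(theta), math.sin(theta)

    # Rotate every joint about the y-axis; the y coordinate is unchanged.
    return [(x * c + z * s, y, -x * s + z * c) for x, y, z in centered]

# Toy pose: root, left hip, right hip, and one joint above the root.
toy_pose = [(1.0, 2.0, 3.0), (0.0, 2.0, 3.0), (2.0, 2.0, 3.0), (1.0, 3.0, 3.0)]
normalized = normalize_pose(toy_pose)
```

After this step the root sits at the origin and the hips lie on the z-axis, so two poses that differ only in position or heading map to the same normalized pose.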

Keyframes
In this paper, we propose a keyframe extraction technique to extract the most informative frames and to remove pose redundancy. As a result, human motion is effectively compressed and summarized. Moreover, in this way, we increase efficiency in terms of time and accuracy for action recognition, as well. In fact, the extraction of keyframes may be considered an indispensable step in an online recognition system that demands short latency for a quick response.

Implementation Details
Our proposed keyframe extraction strategy is iterative in nature. In each iteration, a new suitable keyframe is selected from the remaining frames of the input motion according to the similarity measure. For example, in the first iteration, we find the k nearest neighbors of the first frame of the given input motion of size n within that input motion. In order to find the nearest neighbors in the J × 3 × n-dimensional space defined by the skeleton joints, we measure the average 3D Euclidean distance. We fix k = n/2 so that the number of nearest neighbors does not exceed the size of the motion. We then filter the nearest neighbors N to select suitable candidate keyframes Φ by means of a threshold; the threshold controls the compression ratio, i.e., the number of frames that are removed. We report the details about the selection of the threshold in Section 4. At this point, we have candidate keyframes Φ from which we have to extract the final keyframes Ψ. We sort these candidate keyframes and then find the median frame, which is finally considered to be the keyframe. All the candidate keyframes are discarded from the input motion so that these frames do not participate again in subsequent iterations. The complete process is repeated until there are no input frames left. The details about the algorithm are presented in Algorithm 1.

Algorithm 1. Inputs: M = {X_1, X_2, X_3, ..., X_n}, the given input sequence of frames. Persistent: n, the total number of frames; k, the total number of nearest neighbors.
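
The iterative procedure described above can be sketched as follows. This is an illustrative reading of the text, not a reproduction of Algorithm 1. It assumes that the threshold is an upper bound on the average joint distance, that k is fixed from the original motion length, that candidates are sorted by frame index, and that the upper median is taken when the number of candidates is even. The names `pose_distance` and `extract_keyframes` are hypothetical.

```python
import math

def pose_distance(a, b):
    """Average 3D Euclidean distance between corresponding joints."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def extract_keyframes(motion, threshold):
    """Return sorted indices of keyframes for `motion`, a list of poses.

    Each pose is a list of (x, y, z) joints. `threshold` is assumed to be
    a maximum pose distance for a neighbour to count as a candidate.
    """
    n = len(motion)
    k = max(1, n // 2)              # k = n/2 nearest neighbours
    remaining = list(range(n))      # frame indices not yet consumed
    keyframes = []
    while remaining:
        query = remaining[0]        # first remaining frame
        dist = {i: pose_distance(motion[query], motion[i]) for i in remaining}
        # k nearest neighbours of the query (the query itself has distance 0).
        neighbours = sorted(remaining, key=dist.get)[:k]
        # Keep only neighbours within the threshold: candidate keyframes.
        candidates = sorted(i for i in neighbours if dist[i] <= threshold)
        # The median candidate (by frame index) becomes the keyframe.
        keyframes.append(candidates[len(candidates) // 2])
        # Discard all candidates so they do not take part again.
        consumed = set(candidates)
        remaining = [i for i in remaining if i not in consumed]
    return sorted(keyframes)

# Toy motion with one joint per pose: two clusters of similar frames.
toy_motion = [[(0.0, 0, 0)], [(0.1, 0, 0)], [(0.2, 0, 0)],
              [(10.0, 0, 0)], [(10.1, 0, 0)], [(10.2, 0, 0)]]
toy_keyframes = extract_keyframes(toy_motion, threshold=1.0)
```

Because the query frame is always its own nearest neighbour, the candidate set is never empty and the loop is guaranteed to terminate. A larger threshold consumes more frames per iteration and therefore yields fewer keyframes.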

Deep Network
Our proposed deep neural network architecture consists of an input layer, two hidden layers (h and h′), and an output layer; all these layers are fully connected to each other. To design the deep neural network architecture, we conducted several experiments with varying numbers of hidden layers, as well as varying numbers of neuron units within each hidden layer. We empirically concluded that when we increase the number of hidden layers beyond two, the performance decreases. We report and discuss all these results in detail in Section 4.

Implementation Details
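We input the joint features extracted from motion sequences to our neural network. Each node of the first hidden layer h takes real-valued inputs x_j, computes the weighted sum, and applies a non-linear activation function (the sigmoid function) in order to produce the output as

$$h_p = \sigma(a_p) = \frac{1}{1 + e^{-a_p}}, \qquad a_p = \sum_{j} w_{j,p}\, x_j, \qquad p = 1, \dots, P,$$

where P is the total number of nodes for the first hidden layer. Similarly, at each unit of the second hidden layer h′, we compute the output as

$$h'_q = \sigma(a'_q), \qquad a'_q = \sum_{p=1}^{P} w_{p,q}\, h_p, \qquad q = 1, \dots, Q,$$

where Q is the number of units for the second hidden layer. Finally, for this multi-class classification problem, we employ the softmax function in the output layer in order to yield the probability of each class at each unit of the output layer:

$$o_r = \frac{e^{z_r}}{\sum_{r'=1}^{R} e^{z_{r'}}}, \qquad z_r = \sum_{q=1}^{Q} w_{q,r}\, h'_q, \qquad r = 1, \dots, R,$$

where R is the number of units for the output layer. The softmax function squashes a vector into the range 0-1, and the resulting elements sum to 1. We use a cross-entropy cost function with the predicted value o_r and the target value t_r for this multinomial classification problem; e.g., the one-hot encoded target vector t = [0, 0, 0, 1, 0, ..., 0] of length R contains just a single 1 at the 4th position. This cross-entropy cost,

$$E = -\sum_{r=1}^{R} t_r \log o_r,$$

is computed at the output layer, and the errors are back-propagated towards the hidden layers in order to update the weights w_{j,p}, w_{p,q}, and w_{q,r} with a gradient descent algorithm.
For the implementation of the gradient descent, the derivative of the error E is computed with respect to each weight w_{q,r} that connects the second hidden layer h′ to the output layer with the softmax function,

$$\frac{\partial E}{\partial w_{q,r}} = \frac{\partial E}{\partial z_r}\,\frac{\partial z_r}{\partial w_{q,r}}.$$

For each unit in the output layer o indexed by r, the gradients are

$$\delta_r = \frac{\partial E}{\partial z_r} = o_r - t_r,$$

and the gradient with respect to w_{q,r} becomes

$$\frac{\partial E}{\partial w_{q,r}} = (o_r - t_r)\, h'_q.$$

Similarly, the derivative of the error E with respect to each weight w_{p,q} that connects the first hidden layer h to the second hidden layer h′ with the sigmoid function is

$$\frac{\partial E}{\partial w_{p,q}} = \frac{\partial E}{\partial a'_q}\,\frac{\partial a'_q}{\partial w_{p,q}}.$$

For each unit in the hidden layer h′ indexed by q, the gradients of the loss function are

$$\delta_q = \frac{\partial E}{\partial a'_q} = h'_q\,(1 - h'_q) \sum_{r=1}^{R} \delta_r\, w_{q,r},$$

and, with respect to weight w_{p,q},

$$\frac{\partial E}{\partial w_{p,q}} = \delta_q\, h_p.$$

The derivative of the error E with respect to each weight w_{j,p} that connects the input layer to the first hidden layer h with the sigmoid function is

$$\frac{\partial E}{\partial w_{j,p}} = \frac{\partial E}{\partial a_p}\,\frac{\partial a_p}{\partial w_{j,p}}.$$

For each unit in the first hidden layer h indexed by p, the gradients are

$$\delta_p = \frac{\partial E}{\partial a_p} = h_p\,(1 - h_p) \sum_{q=1}^{Q} \delta_q\, w_{p,q},$$

and, with respect to weight w_{j,p},

$$\frac{\partial E}{\partial w_{j,p}} = \delta_p\, x_j.$$

Weight Updates: The weights w_{j,p} that establish the connection between the input and first hidden layers are updated as

$$w_{j,p} \leftarrow w_{j,p} - \eta\, \frac{\partial E}{\partial w_{j,p}},$$

and similarly, the weights w_{p,q} and w_{q,r} are updated as

$$w_{p,q} \leftarrow w_{p,q} - \eta\, \frac{\partial E}{\partial w_{p,q}}, \qquad w_{q,r} \leftarrow w_{q,r} - \eta\, \frac{\partial E}{\partial w_{q,r}},$$

where η is the learning rate, which is kept equal to 0.01. We fix the maximum number of epochs to 1000, and the minimum performance gradient is kept at 1e-6. The training process stops if the validation performance deteriorates continuously for 5 consecutive epochs.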
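
To make the forward pass and the weight updates concrete, the following is a minimal pure-Python sketch of one stochastic gradient descent step for a network with two sigmoid hidden layers and a softmax output trained with the cross-entropy cost. It is an illustration under stated assumptions rather than the implementation used in the experiments: bias terms are omitted, weights are initialized uniformly in [-0.5, 0.5], and the function names (`init_weights`, `forward`, `train_step`) are hypothetical. The toy layer sizes are for illustration only; the configuration described in this paper would use 93 inputs, 85 and 80 hidden units, and 65 outputs.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def init_weights(n_in, n_out, rng):
    # w[i][o] connects unit i of one layer to unit o of the next layer.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]

def weighted_sums(values, w):
    return [sum(v * row[o] for v, row in zip(values, w)) for o in range(len(w[0]))]

def forward(x, w1, w2, w3):
    h1 = [sigmoid(a) for a in weighted_sums(x, w1)]   # first hidden layer
    h2 = [sigmoid(a) for a in weighted_sums(h1, w2)]  # second hidden layer
    o = softmax(weighted_sums(h2, w3))                # class probabilities
    return h1, h2, o

def train_step(x, t, w1, w2, w3, eta=0.01):
    """One gradient descent step; returns the loss before the update."""
    h1, h2, o = forward(x, w1, w2, w3)
    loss = -sum(tr * math.log(pr) for tr, pr in zip(t, o) if tr > 0)
    # Output deltas for softmax + cross-entropy: o_r - t_r.
    d3 = [pr - tr for pr, tr in zip(o, t)]
    # Hidden deltas: sigmoid derivative times back-propagated error.
    d2 = [h * (1 - h) * sum(d * w for d, w in zip(d3, row)) for h, row in zip(h2, w3)]
    d1 = [h * (1 - h) * sum(d * w for d, w in zip(d2, row)) for h, row in zip(h1, w2)]
    # All deltas are computed before any weight changes.
    for inputs, deltas, w in ((h2, d3, w3), (h1, d2, w2), (x, d1, w1)):
        for i, value in enumerate(inputs):
            for j, delta in enumerate(deltas):
                w[i][j] -= eta * delta * value
    return loss

# Toy usage: 3 inputs, two hidden layers of 4 units, 2 classes.
rng = random.Random(0)
w1, w2, w3 = init_weights(3, 4, rng), init_weights(4, 4, rng), init_weights(4, 2, rng)
first_loss = train_step([0.2, -0.1, 0.5], [0.0, 1.0], w1, w2, w3)
```

The epoch limit, minimum gradient, and validation-based early stopping mentioned in the text would wrap around repeated calls to `train_step` and are not shown here.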

Action Score
Having defined the deep network architecture, we provide as input to the network the features extracted from the keyframe sequences of the motion, in the form of joint positions. Finally, in the last step, similar to [51,52], we calculate the action score frame by frame for the given keyframe sequence of the motion. On the basis of the probability determined through the deep network, we assign a vote to each keyframe involved in the given action sequence. We then apply a majority function, where the majority count of the votes leads to the prediction of the final action class.
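
The voting step can be sketched in a few lines, assuming that each keyframe contributes one vote for its most probable class and that ties are resolved in favor of the class that received a vote first. The tie-breaking rule and the function name `action_score` are assumptions made for illustration.

```python
from collections import Counter

def action_score(frame_probabilities):
    """Predict an action class by majority vote over keyframes.

    `frame_probabilities` holds one list of class probabilities per
    keyframe (e.g., the softmax outputs of the network).
    """
    # Each keyframe votes for its most probable class.
    votes = [max(range(len(p)), key=p.__getitem__) for p in frame_probabilities]
    # The class with the most votes is the predicted action.
    label, _ = Counter(votes).most_common(1)[0]
    return label

# Three keyframes, three classes: two keyframes vote for class 1.
predicted = action_score([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
```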

Experiments
We evaluated our proposed approach extensively on a pool of benchmark MoCap datasets, namely CMU [33] and HDM05 [32], both of which are publicly available. We further categorized these datasets into four different types of datasets on the basis of action categories. The details about these datasets can be found in Section 4.1. We adopted a 5-fold cross-validation method in order to evaluate the performance of our proposed pipeline for classification. We performed a number of experiments in this context on these datasets. We first tuned the parameters that we utilized in our keyframe extraction algorithm and the deep network architecture for action recognition. We then evaluated our proposed keyframe extraction algorithm, as given in Algorithm 1. Finally, we thoroughly evaluated the performance of our proposed framework by comparing it with other existing approaches.

Datasets

HDM05 Dataset
HDM05 [32] is a well-defined popular dataset that contains approximately 2337 sequences with 130 motion classes performed by 5 different actors. The Vicon MX system with 12 high-resolution cameras was used to capture the motions at a sampling rate of 120 Hz. The resulting 3D skeleton consists of 31 joints in total. We categorized the HDM05 dataset [32] into two groups according to the number of classes, as found in the literature, as follows.
HDM05-65: As stated in [18], many motion classes can be merged into a single major motion class; e.g., shuffle2StepsLStart, shuffle2StepsRStart, shuffle4StepsLStart, and shuffle4StepsRStart belong to the motion category shuffle; thus, they are represented as one motion class, i.e., shuffle. As a result, we obtained 65 motion classes, which are the same as those described in [18,19].

CMU Dataset
Our second dataset is CMU [33], which is also considered a very popular dataset in the research community. The Vicon motion capture system, consisting of 12 infrared MX-40 cameras, recorded motions with a sampling rate of 120 Hz [33]. In the CMU dataset, the 3D skeleton also consists of 31 joints in total. We again categorized this dataset into two groups according to the number of classes as follows.
CMU-30: The dataset CMU-30 consists of 30 distinctive motion classes. It contains 278 labeled motion clips belonging to 30 different motion categories. A total of 33 different subjects participated in the recording of these motion clips, as mentioned in [37].

Parameters
We evaluated the impact of the parameters involved in our proposed framework on the overall performance of our approach. We first tuned these parameters and ultimately fixed their values for the other experiments.

Threshold
For the performance assessment of our proposed keyframe extraction technique, we first adjusted the threshold value empirically. From the experiments, we observed that as we increased the threshold value, the error decreased up to a certain point and then started increasing again. We fixed the threshold value to t = 30 for all other experiments; at this threshold, the system obtained the best results, as shown in Figure 3a. We also conducted another experiment to see the impact of the threshold value on the selection of the keyframes, i.e., how many frames can be eliminated from the motion by the selection of the threshold. The compression ratio increased with the increase in the threshold. More precisely, the number of selected keyframes was reduced when the threshold had higher values, as shown in Figure 3b.

Deep Network
We developed and compared various deep neural network configurations in order to tune the number of hidden layers, as well as the number of neurons within a hidden layer. We performed experiments with one, two and three hidden layers and with varying numbers of neurons within a hidden layer in the deep neural network architecture. The overall impact of using a different number of layers with a different number of neurons can be seen in Table 1. Although the results obtained with just one hidden layer are significant, with an accuracy of 93.53%, the highest accuracy (95.14%) was achieved by employing two hidden layers in the deep network for the process of action recognition. From the experiments, we empirically concluded that the performance in terms of accuracy decreased when more than two hidden layers were used in the deep neural network. Similarly, moving the number of neurons away from 85 for the first hidden layer and 80 for the second hidden layer, in either direction, diminished the performance of the network, as well. As a result, we stopped going deeper and fixed the two hidden layers with 85 and 80 neurons, respectively, with the sigmoid activation function. For the output layer, we employed the softmax activation function in our proposed deep neural network architecture. The input provided to our deep network has low dimensionality (31 × 3), and the maximum number of action classes is only 65 (in the case of the HDM05 dataset); as a result, we obtained promising results with only two hidden layers.
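
As a rough indication of how compact the selected configuration is, the number of trainable parameters can be computed directly from the layer sizes stated above. The sketch below assumes fully connected layers; whether bias terms are used is not stated in the text, so both counts are given.

```python
# Layer sizes taken from the text: 31 joints x 3 coordinates as input,
# 85 and 80 hidden units, and 65 output classes (HDM05-65).
layer_sizes = [31 * 3, 85, 80, 65]

def count_parameters(sizes, with_bias=True):
    """Number of weights (and optionally biases) in a fully connected net."""
    weights = sum(a * b for a, b in zip(sizes, sizes[1:]))
    biases = sum(sizes[1:]) if with_bias else 0  # biases are an assumption
    return weights + biases

weights_only = count_parameters(layer_sizes, with_bias=False)  # 19905
with_biases = count_parameters(layer_sizes, with_bias=True)    # 20135
```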

Keyframes
We evaluated our proposed keyframe extraction approach by carrying out different types of experiments. We first examined how the keyframes affected the accuracy of action recognition. We performed a comparison between scenarios in which (i) we employed all the frames available in the motion category for the process of action recognition; (ii) only the keyframes extracted through our proposed Algorithm 1 were used in the process of action recognition; and (iii)-(iv) the keyframes were selected randomly with varying sizes: i.e., ξ_f, a number of keyframes equal to the number of keyframes extracted through Algorithm 1, and 2ξ_f, a number of keyframes that was double the number of keyframes extracted through Algorithm 1. Moreover, we adopted a 10-fold cross-validation procedure for the random selection of the keyframes. The results presented in Table 2 demonstrate that our proposed Algorithm 1 obtained the best results in terms of accuracy as compared with the other models mentioned above for different motion categories.

Table 2. A comparison between scenarios in which (i) all the original frames in the specified motion are utilized in the process of action recognition, (ii) only the keyframes extracted through Algorithm 1 are employed in the process of action recognition, (iii) ξ_f keyframes selected randomly are used, and (iv) 2ξ_f keyframes selected randomly are used.

Our next experiment determined how many keyframes were extracted for a variety of motion categories and how much processing time was required for the extraction of these frames. The overall results for different motion categories are shown in Figure 4. For example, for the walk motion class, the number of frames was reduced from 82 to 13; similarly, for the motion class sitDownFloor, the number of frames was reduced from 115 to 32. The processing time required for the extraction of each keyframe was approximately 0.062 seconds.
Further details about the processing time are available in Section 4.3.3. We also assessed our keyframe extraction approach qualitatively, and we show the results in Figure 5, where the actual frames in the motion class and the keyframes extracted through our proposed Algorithm 1 are visualized. For example, there are 31 original frames in the motion category jogOnPlace and 32 frames in the jumpingJack motion; for these two motion classes, 11 and 12 keyframes were extracted, respectively, which retain most of the information about the motion, as evident in Figure 5.
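
The frame counts quoted above can be turned into reduction figures with a short calculation. In this sketch, the compression ratio is taken to be the fraction of frames removed, which is an assumed definition, and the per-motion extraction time is a rough estimate obtained by multiplying the number of keyframes by the approximate per-keyframe time of 0.062 seconds.

```python
# (original frames, extracted keyframes) as reported in the text.
reported = {
    "walk": (82, 13),
    "sitDownFloor": (115, 32),
    "jogOnPlace": (31, 11),
    "jumpingJack": (32, 12),
}

SECONDS_PER_KEYFRAME = 0.062  # approximate figure quoted in the text

def compression_ratio(n_frames, n_keyframes):
    """Fraction of frames removed (an assumed definition)."""
    return 1.0 - n_keyframes / n_frames

for name, (n_frames, n_keyframes) in reported.items():
    ratio = compression_ratio(n_frames, n_keyframes)
    seconds = n_keyframes * SECONDS_PER_KEYFRAME  # rough estimate only
    print(f"{name}: {ratio:.1%} of frames removed, about {seconds:.2f} s")
```

Under this definition, the walk motion loses roughly 84% of its frames and the jumpingJack motion 62.5%.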

From these experimental results, we conclude that extracting a few informative frames through the proposed keyframe method, rather than using all the frames of a motion sequence, is sufficient to recognize the action accurately. Moreover, our keyframe extraction approach improves the accuracy of action recognition by using a few informative frames rather than exhausting all the existing frames of the motions.
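To make the amount of summarization concrete, the frame counts reported above can be turned into reduction ratios. The snippet below is a small arithmetic sketch using only the four examples quoted in the text; the helper name `fraction_removed` is hypothetical.

```python
# (original frames, keyframes) for the motion classes quoted in the text
reported = {
    "walk": (82, 13),
    "sitDownFloor": (115, 32),
    "jogOnPlace": (31, 11),
    "jumpingJack": (32, 12),
}


def fraction_removed(original, kept):
    """Fraction of the original frames discarded by keyframe extraction."""
    return 1.0 - kept / original


for name, (original, kept) in reported.items():
    removed = fraction_removed(original, kept)
    print(f"{name}: {original} -> {kept} frames, {removed:.1%} removed")
```

For these four examples, between roughly 62% and 84% of the frames are discarded.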

Action Recognition
We finally evaluated our proposed framework for action recognition and performed extensive experiments on the benchmark MoCap datasets CMU [33] and HDM05 [32], which were further categorized into four groups of datasets, HDM05-65, HDM05-14, CMU-30, and CMU-14, as mentioned in Section 4.1. We discuss all the results separately in detail as follows.

Evaluation on HDM05-65
In the case of the HDM05-65 dataset with 65 motion classes, we followed the same experimental protocol as that proposed in [18,19]. For this dataset, our approach achieved 95.14% accuracy. Although it did not achieve the highest accuracy, it produced competitive and very promising results in comparison with other techniques. Our approach with deep neural network architecture performed better than other popular classifiers on the same HDM05-65 dataset, including Support Vector Machine (SVM), Multi-layer perceptron (MLP), Extreme Learning Machine [18], Convolutional Neural Network (CNN) [1], Deep Bidirectional Recurrent Neural Network (DBRNN-T) [19] with the tanh activation function, and Deep Unidirectional Recurrent Neural Network (DURNN-T) [19] with the tanh activation function. The Hierarchical Bidirectional Recurrent Neural Network (HBRNN-L) [19] with Long-Short Term Memory (LSTM) had the highest accuracy. Other RNN variants, such as the Hierarchical Unidirectional Recurrent Neural Network (HURNN-L) [19] with LSTM, Deep Bidirectional Recurrent Neural Network (DBRNN-L) [19] with LSTM, and Deep Unidirectional Recurrent Neural Network (DURNN-L) [19] with LSTM, also produced high accuracies, but their margin over our proposed approach was very small, as observed in Table 3. Moreover, our proposed deep neural network is a comparatively simple architecture with less complexity. In contrast, all the recurrent neural network variants proposed in [19] have a large network structure with feedback connections, in which the output of a layer feeds back into itself to form a loop, which ultimately leads to high complexity.
The confusion matrix for HDM05-65, presented in Figure 6, shows that our proposed framework performed well on almost all types of motion. The motion categories on which our approach performed less well are depositLow and grabLow. Because these two categories are very similar to each other, our approach confused them: 33% of the grabLow motions were misclassified as depositLow, and 66% of the depositLow motions were misclassified as grabLow. As a result, the overall accuracy of our scheme decreased to 95.14%.

Evaluation on HDM05-14
On the HDM05-14 dataset with 14 motion classes, our proposed scheme outperformed the other existing state-of-the-art approaches [1,62] and achieved 98.6% accuracy. The addition of keyframes with Normalized Trajectories (NT) contributed substantially to the improved performance of our proposed framework, as shown in Table 3. The detailed results of our approach on the HDM05-14 dataset are presented as a confusion matrix in Figure 7. Although some of the actions were misclassified by our approach (sitDownFloor and standUpSitChair were misclassified; staircaseUp was misclassified as steak), our approach still performed well overall.
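Percentages such as "33% of grabLow misclassified as depositLow" are presumably row-normalized entries of a confusion matrix. The sketch below shows how such rates and the overall accuracy can be derived from raw counts. The counts are hypothetical and chosen only to reproduce rates of the same magnitude; they are not the paper's data, and the function names are illustrative.

```python
def row_normalize(counts):
    """Convert raw confusion counts into per-class rates (each row sums to 1)."""
    rates = []
    for row in counts:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates


def overall_accuracy(counts):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return correct / total


# Hypothetical counts (NOT the paper's data); rows are true classes,
# columns are predicted classes, in the order given by `labels`.
labels = ["grabLow", "depositLow", "walk"]
counts = [
    [2, 1, 0],   # true grabLow: 1 of 3 predicted as depositLow
    [2, 1, 0],   # true depositLow: 2 of 3 predicted as grabLow
    [0, 0, 10],  # true walk: all correct
]

rates = row_normalize(counts)
print(f"grabLow -> depositLow: {rates[0][1]:.0%}")
print(f"overall accuracy: {overall_accuracy(counts):.4f}")
```

Because the overall accuracy pools all classes, a pair of heavily confused classes with few test motions lowers it only slightly, which is consistent with a high overall figure despite large per-class error rates.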

Evaluation on CMU-30
On the CMU-30 dataset, the results of our approach compare favorably with other state-of-the-art methods (see Table 3). Although the method of two-step Support Vector Machine (SVM) fusion with Tree Structure Vector Quantized (TSVQ) [37] had the best results, our approach provided competitive results with an accuracy of 99.3% and outperformed other approaches, such as the Similarity Suffix Array, Pose-Histogram Classifier, and Two-Step Score Fusion [37]. These results are detailed in Table 4, which shows that our approach misclassified only the bouncyWalk motion and achieved 99.3% accuracy, while the method of two-step SVM fusion with TSVQ [37] misclassified the motion rhymeTeaPot and achieved 99.6% accuracy.
The confusion matrix in Figure 8 provides a detailed description of the results of our approach on the CMU-30 dataset. All the motion classes were correctly classified except bouncyWalk, which was misclassified as the motion boxing.

Table 4 (caption partially lost in extraction): ... [37]. Methods C and D refer to Two-Step Score Fusion and Two-Step Support Vector Machine (SVM) Fusion, respectively [37].

Evaluation on CMU-14
On the CMU-14 MoCap dataset, we compared our approach with [44,48]. These methods use the CMU MoCap dataset, as well as their own dataset with 14 classes, recorded by utilizing a Vicon (http://www.vicon.com/) motion capture device, while we conducted experiments on only the CMU-14 MoCap dataset. Our proposed scheme outperformed both approaches in [44,48] and obtained 98.5% accuracy on the CMU-14 dataset. In this case, we used 85% of the data for training. From the detailed analysis presented in Table 3 and in Figure 9, we observe that out of 14 motion classes, only one motion class, walkBackwardsOnToes, was misclassified as the walk motion category.

Processing Time
We conducted our experiments on a Core i7 machine with 8 GB RAM and a Windows operating system. The proposed keyframe extraction method took approximately 0.062 seconds to extract keyframes from a motion of 60 frames. Action recognition on the input query motion took roughly 0.0085 seconds per frame. More precisely, the overall time for feature extraction together with action recognition was ∼0.0095 seconds per frame.
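The reported timings can be converted into a throughput figure and a rough per-motion cost. The snippet below is a back-of-the-envelope sketch only: it assumes that the per-frame cost applies equally to every processed frame, and the keyframe count of 15 for a 60-frame motion is a hypothetical value used for illustration.

```python
PER_FRAME_S = 0.0095           # reported: feature extraction + recognition, per frame
KEYFRAME_EXTRACTION_S = 0.062  # reported: keyframe extraction for a 60-frame motion


def frames_per_second(per_frame_s):
    """Throughput implied by a fixed per-frame processing time."""
    return 1.0 / per_frame_s


def recognition_time(num_frames, per_frame_s=PER_FRAME_S):
    """Total time to process `num_frames` frames at a constant per-frame cost."""
    return num_frames * per_frame_s


print(round(frames_per_second(PER_FRAME_S), 1))  # 105.3

# 60-frame motion, all frames versus a HYPOTHETICAL 15 keyframes.
all_frames_s = recognition_time(60)
keyframes_s = KEYFRAME_EXTRACTION_S + recognition_time(15)
print(f"all frames: {all_frames_s:.3f} s, keyframes only: {keyframes_s:.4f} s")
```

Under these assumptions, the pipeline processes on the order of 100 frames per second, and the one-off cost of keyframe extraction is recovered whenever enough frames are discarded.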

Discussion
We evaluated our proposed method extensively on different types and sizes of MoCap datasets. The experimental results in Tables 3 and 4 show that our proposed algorithm performs comparatively well. We observe that our approach not only classifies distinct action classes very accurately, e.g., cartwheel, walk, jumpingJack, and clap, but also correctly classifies most of the similar action classes, e.g., turnLeft vs. turnRight, jogLeftCircle vs. jogRightCircle, jogOnPlace vs. runOnPlace, walkLeftCircle vs. walkRightCircle, kickLSide vs. kickRSide, and punchLFront vs. punchRFront. The few motion categories that are misclassified are very similar to each other: depositHighR is misclassified as grabHighR, and depositLowR is misclassified as grabLowR. In a few cases in which the action is a combination of a sequence of atomic movements that are shared among different classes, our keyframe-based approach may be misled. For example, sitDownTable and standUpSitTable are confused: the majority of the keyframes extracted from standUpSitTable belong to the shared sub-movement SitTable, and, as a result, the sitDownTable class is predicted. Short, single actions that span multiple classes, such as standUpSitTable, may therefore create ambiguity for our proposed method.
Our proposed approach is data-driven: it performs well when at least a few poses similar to those of the input action are available in the MoCap dataset, and its performance may deteriorate when no similar motion is available. Our proposed method is fast, with a short latency (∼0.0095 seconds per frame), which is a crucial requirement for real-time online systems. Because our approach is keyframe-dependent, it also has the capacity to compensate for missing information.

Conclusions
In this paper, we present a novel action recognition scheme that relies on keyframes extracted from action sequences. The extracted keyframes enhance the process by providing information that is free from redundancy but carries the most relevant details about the action in the motion. We conducted a variety of experiments to evaluate the keyframe extraction results, and we concluded that a few significant, information-carrying frames are sufficient, rather than all frames with redundant information. We started with the normalization of 3D poses and extracted joint features, on the basis of which we built our proposed deep neural network. We empirically configured the number of hidden layers, as well as the number of units within a hidden layer, by carrying out several experiments. We conclude from these experiments that, in our case, two hidden layers with 85 and 80 neurons are adequate to obtain substantial results. Finally, for thorough and detailed performance evaluations, we worked on four different datasets with varying numbers of motion categories, i.e., HDM05-65, HDM05-14, CMU-30, and CMU-14. On the HDM05 dataset, our proposed framework produced very competitive results, with 95.14% accuracy on HDM05-65 and 98.6% accuracy on HDM05-14. On the CMU dataset, our proposed approach outperformed most other techniques, with 98.5% accuracy on CMU-14 and 99.3% accuracy on CMU-30. Furthermore, our approach is efficient in terms of time: it takes roughly 0.0095 seconds per frame to recognize the action.
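The network shape summarized above (two sigmoid hidden layers with 85 and 80 units, followed by a softmax output) can be sketched as a plain forward pass. This is an illustrative sketch rather than the trained model: the input dimension of 60, the random weight initialization, and all function names are assumptions, and 65 output classes are used only to match HDM05-65.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(weights, biases, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def forward(layers, x):
    """Sigmoid on every hidden layer, softmax on the final layer."""
    *hidden, (out_w, out_b) = layers
    for weights, biases in hidden:
        x = [sigmoid(v) for v in affine(weights, biases, x)]
    return softmax(affine(out_w, out_b, x))


rng = random.Random(0)
INPUT_DIM = 60    # hypothetical joint-feature dimension (not from the paper)
NUM_CLASSES = 65  # e.g., HDM05-65
sizes = [INPUT_DIM, 85, 80, NUM_CLASSES]
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

features = [rng.random() for _ in range(INPUT_DIM)]  # stand-in for joint features
probs = forward(layers, features)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
print(len(probs), round(sum(probs), 6))
```

With untrained weights the class probabilities are close to uniform; the sketch only shows the data flow through the layers, not the learned behavior.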
In future work, the proposed framework may be extended to perform action recognition and motion segmentation simultaneously. Furthermore, the proposed method for action classification can be coupled with gait analysis and person identification. Another important direction might be the integration of action recognition and pose estimation together in 3D-3D, 2D-3D, and 3D-2D scenarios. To date, we have worked only with action classes that are performed by a single person, and this approach can be extended to multiple persons interacting with each other.