Deep Learning-Based Human Action Recognition with Key-Frames Sampling Using Ranking Methods

Abstract: Nowadays, the demand for human–machine or object interaction is growing tremendously owing to its diverse applications. The massive advancement in modern technology has greatly influenced researchers to adopt deep learning models in the fields of computer vision and image processing, particularly human action recognition. Many methods have been developed to recognize human activity, but they are limited in effectiveness, efficiency, and use of data modalities. Very few methods have used depth sequences, in which different encoding techniques were introduced to represent an action sequence in a spatial format called a dynamic image. Then, a 2D convolutional neural network (CNN) or traditional machine learning algorithms were used for action recognition. These methods are completely dependent on the effectiveness of the spatial representation. In this article, we propose a novel ranking-based approach to select key frames and adopt a 3D-CNN model for action classification. We directly use the raw sequence instead of generating the dynamic image. We investigate the recognition results with various levels of sampling to show the competency and robustness of the proposed system. We also examine the universality of the proposed method on three benchmark human action datasets: DHA (depth-included human action), MSR-Action3D (Microsoft Action 3D), and UTD-MHAD (University of Texas at Dallas Multimodal Human Action Dataset). The proposed method secures better performance than state-of-the-art techniques using depth sequences.


Introduction
The rapid development of electronic devices such as smartphones, televisions, notebooks, and personal computers plays an important role in our daily life. The ways of interacting with these devices have also improved dramatically over the past years. To provide easy, smart, and comfortable ways of communication, several devices and applications have been invented ranging from wired keyboards to wireless vision-based communication.
Hand gesture recognition [12][13][14] is one of the most popular forms of vision-based interaction. It is limited to the classification of actions that are performed by only hands. Thus, it is necessary to develop a system that can understand the actions accomplished by different parts of the human body. Human action recognition focuses on the discrimination of actions performed by the whole body. The main contributions of this article are as follows:
1. We introduce a novel ranking-based approach for human action recognition using a 3D-CNN with raw depth sequences.
2. First, we use the SSIM or CCM ranking metric to select k-ranked frames that contain more spatial and temporal changes. This allows us to discard the redundant frames having the same or very similar information.
3. Then, we use transfer learning to perform the recognition of human action. It helps retain the knowledge learned from previous hand gesture datasets and apply it to human action datasets.
4. We also adopt three different publicly available benchmark human action recognition datasets to emphasize the robustness of the proposed method.
The remaining parts of this article are organized as follows. We illustrate the related works in Section 2 by summarizing the key ideas. The detailed methodology of the proposed method is described in Section 3. Section 4 provides experimental setups, performance evaluation, and comparisons. We discuss the drawbacks, benefits, and future works of the prior and proposed methods in Section 5. Finally, we summarize the proposed method in Section 6.

Related Works
The drawbacks of the inertial, skeleton, and RGB sequences lead to the adoption of depth datasets to effectively recognize human actions. Depth sequences are introduced in the form of one-channel data, known as grayscale. Many researchers have converted the depth sequences into several other formats such as skeleton joints [24][25][26][27], point clouds [28][29][30], and spatial-temporal images [31][32][33][34][35][36][37][38][39][40][41][42][43][44] and then extracted discriminative features for classification. Figure 1 shows three examples of a depth sequence represented as skeleton, point cloud, and dynamic images, respectively. Very few methods have used depth sequences directly to discriminate human activities [45][46][47][48].
The skeleton data represent the 3D coordinate values of joints in the human body. Very few methods have been published in which skeleton joint information is extracted from depth videos for action classification. Yang et al. [24] extracted joint information from depth sequences and computed the Eigen-joints feature, which combined static posture, motion, and offset of action. They conducted a naïve Bayes nearest neighbor (NB-NN) classifier for recognizing actions. Xia et al. [25] also estimated the joint information from depth maps to discriminate human action. They determined the histogram features from skeleton joints and used linear discriminant analysis for re-projection to perform the classification. Ji et al. [26] partitioned the body parts embedded in skeleton information, called motion body partitioned. Then, they extracted local features for each part and aggregated them for classification. Zhang et al. [27] also conducted a similar approach in which they extracted the skeleton joint information from depth sequences. They applied a 1D temporal CNN model along with a 2D spatial CNN for depth motion image (DMI) and a 3D volumetric CNN for raw depth sequences to accomplish action recognition.
Li et al. [28] used an action graph to model the dynamics of actions and collected a bag of 3D points that corresponded to the nodes of the action graph for action discrimination. Rahmani et al. [29] proposed a point cloud-based human action recognition system in which they introduced a descriptor and a key point detection algorithm. The descriptor was calculated based on a histogram of oriented principal components. In [30], Li et al. extracted human behavioral features called historical point cloud track features, including depth and skeleton information for action recognition.
Megavannan et al. [31] calculated motion information from depth difference and average depth images by dividing the silhouette bounding box hierarchically and encoded them into motion history images (MHIs). They used translation, scale, and orientation invariant Hu moments to encode the MHIs. Xia et al. [32] applied a filtering method for determining the spatio-temporal interest points from the depth (DSTIP) sequence and used the depth cuboid similarity feature (DCSF) to describe the local depth cuboid around DSTIP. Eum et al. [33] designed a feature map called depth histogram of oriented gradient (MHI-HOG) and modeled actions through k-means clustering. They evaluated the proposed method using HMM model. Bulbul et al. [34] described a new feature descriptor based on depth motion map (DMM), contourlet transform (CT), and HOG. They applied the CT on DMM and computed HOG features for classification. Liu et al. [35] extracted salient depth map (SDM) and binary shape map (BSM) features from depth sequences and formed a bag of map words for action classification. Bulbul et al. [36] encoded the depth sequences into three DMMs along the front, side, and top projections. From the three DMMs, they computed contourlet-based HOG (CL-HOG) and local binary pattern (LBP) features. They investigated the performance using majority voting (MV) and logarithmic opinion pool (LOGP) fusion strategies. Likewise, Chen et al. [37] projected the depth maps into three orthogonal Cartesian planes and generated three DMMs along the front, side, and top views. They used an l2-regularized classifier with a distance-weighted Tikhonov matrix for action discrimination. Jin et al. [38] partitioned a whole depth sequence into sub-sequences with uniform length, called vague division (VD), and projected them onto three orthogonal planes. They also computed DMMs by determining the difference between adjacent frames of the projected views. Azad et al. 
[39] sampled the entire depth sequences based on the motion energy of key-frames and produced a weighted depth motion map (WDMM). From the WDMM, they extracted HOG and LBP features to perform the classification. Li et al. [40] extended the DMMs-based human action recognition in which they extracted local ternary pattern features to filter the DMMs and conducted a CNN model for classification. Liang et al. [41] segmented a depth sequence into sub-actions called energy-guided sub-actions (EGSA) and time-guided sub-actions (TGSA) to explore the sub-actions relationship for action recognition. Li et al. [42] used a multi-scale feature map called Laplacian pyramid depth motion images (LP-DMI) obtained from DMI. Finally, they conducted an extreme learning machine (ELM) for action recognition using the features extracted with HOG and visual geometry group (VGG) descriptors. Similarly, Bulbul et al. [43] split a depth sequence into two different sizes of sub-sequences and computed several DMMs. From the obtained DMMs, they extracted auto-correlation of gradient features (ACG) and classified them using an l2-regularized collaborative representation classifier (CRC). Pareek et al. [44] proposed a new algorithm called self-adaptive differential evolution with a knowledge-based control parameter-extreme learning machine. They calculated the LBP features from DMM and evaluated the recognition results using CRC, probability CRC, and kernel ELM with LBP features.
The performance of the above methods is vastly dependent on the effective representation of depth sequence into the skeleton, point cloud, and dynamic image. When the number of frames and position of the camera or human varies, the spatial and the temporal information of the same action can differ and degrade the overall performance. Moreover, it is very cumbersome and time-consuming to convert the depth sequence into the skeleton, point cloud, and dynamic image and extract discriminative features from them. Therefore, some methods have used raw depth sequence to recognize human action. Wang et al. [45] designed a view correlation discovery network (VCDN) to generate high-level information by fusing multi-modality information. Caballero et al. [46] proposed a 3D-CNN model to automatically extract spatial and temporal information from the raw depth sequence. In [47], Liu et al. introduced the view-correlation adaptation (VCA) technique to handle multi-modality datasets for action recognition. They integrated a view encoder which extracted the view representation features from the depth and RGB sequences. They evaluated the proposed method with different features representation, such as VCA-Entropy and semi-supervised feature augmentation (SeMix). Bai et al. [48] proposed a new method for human action recognition with raw depth sequences. They suggested a collaborative attention mechanism (CAM) to solve the multi-view problems. To obtain the multi-view collaborative process, they conducted an extended LSTM model.
As described earlier, the encoding of depth sequences into other formats such as skeleton, point cloud, and dynamic image requires additional effort and time. As a result, it is a good idea to use the raw data directly to recognize the action. The methods suggested in [12][13][14] considered raw depth sequences for hand gesture recognition. They used whole sequences to detect and classify hand gestures based on a fixed number of frames. The methods in [46,48] also used raw depth sequences for activity recognition using a 3D-FCNN and an LSTM. Even though the methods in [45][46][47][48] performed action classification with raw depth sequences, they achieved relatively low performance. Therefore, we introduce a new method to select k-ranked frames which retain almost the full spatial and temporal information of an action. The proposed method achieves higher recognition performance than the state-of-the-art methods. The details of the proposed method are illustrated in the next section.

Proposed Methodology
The explanation of the proposed method including motivation, key-frames sampling, and action classification is demonstrated in this section. First, we provide the inspiration for the research on human action recognition. Then, we illustrate the main concepts by providing the overall architectural workflow of the proposed method.

Motivation
The demand for easy, smart, and effective methods of interacting with electronic devices has reached its peak owing to the massive improvements in modern technology. Vision-based methods, particularly human action recognition, have become very popular nowadays. Most of the methods described in Section 2 encoded the depth sequences into dynamic images to capture spatial and temporal changes in action. However, there can be several neighboring frames that are very similar in terms of spatial and motion information. These frames have less significance in the overall structure and texture information of action. Thus, it is important to define a method that can effectively sample more significant frames by keeping the spatial and temporal information unaltered. At the same time, most of the papers used hand-crafted features to recognize the actions by using traditional machine learning techniques such as NB-NN, MV, CRC, and ELM. By considering the above drawbacks, we suggest a novel approach to sample key-frames using the ranking method. In addition, we directly use the sampled frames to recognize the action using the 3D-CNN model.

Architecture of the Proposed Method
The overall architecture of the proposed system includes two modules, as shown in Figure 2: (1) key-frames sampling using the ranking method and (2) recognition of action using the 3D-CNN model. Initially, the rank values (ψ1, ψ2, ..., ψn−1) are calculated between the neighboring frames of the raw depth sequence having n frames using the rank metrics (SSIM and CCM). The rank values are then sorted in ascending order, and k-ranked frames are selected from the n raw frames. Finally, the sampled frames are fed to a 3D-CNN model to produce class labels as outputs.
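The two-module workflow above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the `metric` callable, the mapping of each rank value to a frame index, and the restoration of temporal order after selection are our assumptions.

```python
import numpy as np

def rank_values(frames, metric):
    """Compute rank values between every pair of neighboring frames."""
    return np.array([metric(frames[i], frames[i + 1])
                     for i in range(len(frames) - 1)])

def select_k_ranked(frames, k, metric):
    """Keep the k frames whose similarity to their successor is lowest,
    i.e. the frames carrying the most spatial/temporal change.
    Assumption: rank value i is attributed to frame i, and the selected
    frames are restored to their original temporal order."""
    psi = rank_values(frames, metric)
    # ascending sort: smallest similarity (largest change) first
    keep = np.sort(np.argsort(psi)[:k])
    return [frames[i] for i in keep]
```

Here `metric` would be the SSIM or CCM function, where a larger value means two neighboring frames are more similar.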

Key-Frames Sampling Using Ranking Methods
Frame sampling is a procedure in which a fixed number of frames are chosen from a sequence of frames based on a criterion. In this research, we define the frame sampling criterion as the rank metric. The reason behind sampling a fixed number of frames is that there are several redundant frames in an action sequence. In most cases, the adjacent frames contain similar types of information in terms of spatial and temporal changes, as shown in Figure 3. Thus, it is very important to discard the unnecessary frames which carry the same motion information. Sometimes, it is crucial to determine the best set of frames having better motion information to effectively recognize actions. We use the rank metric to reorganize the frames based on the rank values and then sample k-ranked frames as the significant set (k = 16, 20, 24). The rank metric can take different forms based on the intended task. For frame sampling in an action recognition task, we choose the SSIM and CCM metrics.
Figure 2. k-ranked key-frames sampling using ranking methods (the background of each depth frame was removed).
The SSIM metric considers the structural properties between two adjacent frames to perform the comparisons. Let I_t and I_{t+1} be the two adjacent frames at times t and t + 1 with height and width H and W, respectively, in an action sequence. Then, the SSIM value (ψ) between I_t and I_{t+1} can be defined as follows:

ψ(I_t, I_{t+1}) = (2 var(I_t, I_{t+1}) + c_1) / (var(I_t) + var(I_{t+1}) + c_2), (1)

where var(I_t, I_{t+1}) indicates the inter-frame variance; var(I_t) and var(I_{t+1}) indicate the intra-frame variances. c_1 and c_2 are constants. Figure 3 illustrates how the 16-ranked frames sampling works. We have a total of 23 frames in the 'Bend' action of the DHA dataset. We compute the SSIM value between the current frame and its immediate next frame. As given in Figure 3, we can say that the neighboring frames contain almost the same information. Since larger SSIM values indicate more similarity, we sort the SSIM values in ascending order and sample the lowest 16 frames for the experiments. From Figure 3, we can assume that the sampled 16 frames also bear the same spatial and temporal information as the original sequence, with a little distortion. Similarly, we also perform the experiments and analysis with k = 20 and 24 to show the generality of the proposed ranking method.
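A minimal NumPy sketch of this variance-based SSIM rank value follows. It is our own illustrative implementation built from the inter- and intra-frame variance definitions above; the values of `c1` and `c2` are arbitrary small stabilizers, not the paper's constants.

```python
import numpy as np

def ssim_rank(I_t, I_t1, c1=1e-4, c2=1e-4):
    """Simplified SSIM-style rank value between two adjacent frames.
    The inter-frame variance is the covariance of the two frames;
    the intra-frame variances are each frame's own variance."""
    a = I_t.astype(np.float64).ravel()
    b = I_t1.astype(np.float64).ravel()
    inter = np.mean((a - a.mean()) * (b - b.mean()))  # inter-frame variance
    return (2 * inter + c1) / (a.var() + b.var() + c2)
```

Identical frames score close to 1, while frames with large structural change score much lower, which is why the lowest-scoring frames are kept as key frames.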
To show the robustness of the proposed system, we also use the CCM metric to sample k-ranked frames. The CCM value (ρ) between I_t and I_{t+1} can be calculated by using Equation (2) as follows:

ρ(I_t, I_{t+1}) = Σ_{x,y} (I_t(x, y) − Ī_t)(I_{t+1}(x, y) − Ī_{t+1}) / sqrt( Σ_{x,y} (I_t(x, y) − Ī_t)^2 · Σ_{x,y} (I_{t+1}(x, y) − Ī_{t+1})^2 ), (2)

where Ī indicates the mean value of I. The overall procedure is the same as the sampling of frames based on SSIM values. For sequences with fewer than k frames, we duplicate every nth frame to make them the same length as k. The overall scenario for balancing the samples in a short sequence is depicted in Figure 4. We use the 'kick' action from the DHA dataset, having 13 frames, to illustrate the balancing procedure. Since we need k frames (k = 16) for training the deep learning model, we duplicate the 4th, 8th, and 12th frames for balancing and insert them next to their neighboring frames.
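The CCM value can be sketched as a Pearson-style correlation between the two flattened frames. This is our illustrative reading of the formula above; the `eps` term is added only for numerical safety and is not part of the paper's definition.

```python
import numpy as np

def ccm_rank(I_t, I_t1, eps=1e-12):
    """Correlation coefficient between two adjacent frames, computed over
    mean-centered, flattened pixel values."""
    a = I_t.astype(np.float64).ravel()
    a -= a.mean()
    b = I_t1.astype(np.float64).ravel()
    b -= b.mean()
    return float((a @ b) / (np.sqrt((a @ a) * (b @ b)) + eps))
```

As with SSIM, a value near 1 marks a near-duplicate pair, so frames with the lowest ρ are the ones kept.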
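The balancing of short sequences can be sketched as follows. The exact insertion rule beyond the paper's 13-frame example (duplicating the 4th, 8th, and 12th frames to reach k = 16, each copy placed right after its original) is our assumption, and the sketch assumes the sequence is not drastically shorter than k.

```python
def pad_sequence(frames, k):
    """Duplicate every n-th frame of a short sequence until it has k frames.
    For 13 frames and k = 16, this duplicates the 4th, 8th, and 12th frames
    (1-indexed), inserting each copy next to its original frame."""
    deficit = k - len(frames)
    if deficit <= 0:
        return list(frames)
    n = len(frames) // deficit              # stride between duplicated frames
    dup_positions = {n * (i + 1) for i in range(deficit)}  # 1-indexed
    out = []
    for idx, f in enumerate(frames, start=1):
        out.append(f)
        if idx in dup_positions:
            out.append(f)
    return out
```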


Deep Learning for Human Action Classification
Over the past years, we have observed significant improvement in the field of machine learning, especially deep learning [49][50][51]. Nowadays, deep learning is being widely applied in every field, including detection, recognition, classification, and segmentation. Owing to the challenges of hand-crafted features in human action datasets, deep learning has also become very successful in human action recognition. However, there are very few methods that directly deal with raw human action sequences to recognize action classes. Even though some methods have been directly applied to raw sequences using RNN and LSTM, the performance was comparatively lower. Most of the methods have encoded the action sequences into spatial formats and then integrated a machine learning or deep learning model to discriminate the actions.
Instead of representing a whole sequence as a single image per action, we consider sampling k-ranked frames as the input to the 3D-CNN model. The 3D-CNN can directly extract both spatial and temporal features from raw input frames without encoding them into another domain. We investigate the recognition performance using different deep learning models such as residual network (ResNet) [12], convolution 3D (C3D) [12], I3D [52], R2P1D [53], X3D [53], and 3D-FCNN [46]. Figure 5 depicts the average CCM and SSIM results for the DHA, MSR-Action3D, and UTD-MHAD datasets. We experimentally find that a 3D ResNet with 101 layers, called ResNet101, works well for the proposed method. As a result, we use ResNet101 as the backbone deep learning model for the whole experiment. The large number of convolutional layers in ResNet101 can learn high-level features and useful functions to obtain the hierarchical representation of action information. The average pooling layer helps extract discriminative features along the spatial and temporal directions.
ResNet101 is one of the most effective 3D-CNN models, in which the residual information of the previous layer is connected again to the current layer. It extracts deeper features as we consider 101 layers. A residual block in ResNet101 consists of convolution, batch normalization, rectified linear unit, convolution, and batch normalization operations, as shown in Figure 6. The input, I, is passed through a residual function, F(I), and generates features, I', by combining the output from F(I) with the input.
Figure 6. A residual block in ResNet.
As we have trained the ResNet101 model for the multi-class classification, we have used the cross-entropy loss function Loss_CE, defined as follows:

Loss_CE = − Σ_{i=1}^{C} y_i log(p_i), (3)

where C is the number of classes and y_i and p_i indicate the original and predicted labels, respectively.
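The cross-entropy loss over C classes can be checked numerically with a small sketch. This is an illustrative helper, not the training code; the small `eps` term is our addition to avoid log(0).

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Loss_CE = -sum_i y_i * log(p_i) over the C classes,
    with y the one-hot true label and p the predicted probabilities."""
    return float(-np.sum(y * np.log(p + eps)))
```

For a one-hot label on class 2 and a prediction of 0.7 for that class, the loss reduces to −log(0.7).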

Experimental Results
This section provides the experimental results by illustrating the environmental settings, performance evaluation, performance comparisons, and complexity analysis.

Datasets
To show the effectiveness and establish the robustness of the proposed system, we study three publicly available benchmark depth datasets. They are DHA [54], MSR-Action3D [28], and UTD-MHAD [55] datasets.

DHA Dataset
The DHA dataset is an extended version of the Weizmann dataset. It was first introduced by the computer vision exchange lab. They combined 10 classes of actions from the Weizmann dataset along with 13 classes of actions of their own to make a total of 23 classes in the DHA dataset. Each class has 21 sequences, which are performed by 21 subjects.

Figure 7c shows an example of the 'BaseballSwing' action in the UTD-MHAD dataset.

Settings of the Training and Testing Dataset
We split each dataset into training and testing sets. For the DHA, MSR-Action3D, and UTD-MHAD datasets, we follow the training and testing configuration described in [51]. The numbers of training and testing samples in each dataset are as follows: DHA dataset (253:230), MSR-Action3D dataset (292:275), and UTD-MHAD dataset (431:430).

Environmental Setup and Evaluation Metrics
We accomplished the overall experiments in a Linux-20.04 environment. The hardware, including a CPU (Intel(R) Core(TM) i7) and a GPU (GeForce GTX 1080), was used to perform the experiments. We used Python-3.8 and Matlab-202a as the programming languages. We trained the deep learning model for 100 epochs. The batch size and learning rate were set to 16 and 0.001, respectively. The learning rate drops by 10% after every 20 epochs. For optimization, we used the stochastic gradient descent (SGD) optimizer with a momentum of 0.9. We reported the recognition results by calculating the accuracy, defined as follows:

Accuracy = (number of correctly classified samples / total number of test samples) × 100%.

Owing to the small number of video sequences in action recognition datasets, we have used transfer learning in which, first, we trained the model with the Jester dataset [12]. Then, the human action datasets are trained starting from the pre-trained weights of the ResNet101 model.
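The accuracy metric above can be computed with a one-line helper; this is an illustrative sketch, not the authors' evaluation script.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Recognition accuracy: correctly classified samples / total samples, in %."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return 100.0 * np.mean(y_true == y_pred)
```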

Performance Evaluations and Comparisons
We carefully analyze the datasets and choose k as 16, 20, and 24, in steps of 4, which provides better results. We evaluate the performance of the proposed method on the DHA, MSR-Action3D, and UTD-MHAD human action datasets. Table 1 lists the recognition performance for the CCM and SSIM metrics with 16-, 20-, and 24-ranked frames. The average recognition results using the CCM metric are approximately 92.2% for the DHA, 93.1% for the MSR-Action3D, and 93.4% for the UTD-MHAD dataset. For the SSIM metric, the proposed method achieves approximately 92.9% for the DHA, 94.1% for the MSR-Action3D, and 94.4% for the UTD-MHAD dataset. If the value of k is increased from 16 to 20, the average performance improvement is about 0.6% for both the CCM and SSIM metrics. On the other hand, if the value of k is increased from 20 to 24, the average performance improvement is about 0.1% for the CCM and 0.3% for the SSIM metric, which is comparatively much lower. This is because the temporal information remains almost the same even when we increase the number of sampled frames along the temporal direction from 16 to 20 and 24. However, if we compare the effectiveness of the sampling metrics, the SSIM metric samples more effectively, achieving an average accuracy of 93.8%, than the CCM metric with an average accuracy of 92.9%.
To show the effectiveness of the proposed system on different datasets, we compare the recognition results with several state-of-the-art methods, as provided in Table 2. The first, second, and third columns represent the recognition results of the prior works on the DHA, MSR-Action3D, and UTD-MHAD datasets, respectively. The proposed method achieves approximately 10%, 5%, and 9% higher average accuracy than the prior works on the DHA, MSR-Action3D, and UTD-MHAD datasets, respectively. This is because most of the prior methods encoded the entire sequence into dynamic images and used 2D-CNNs or traditional machine learning techniques to classify the human action. Encoding an action into a spatial format cannot capture the full temporal changes, which reduces the overall performance. Although a few methods employed 3D-CNNs and LSTMs, they did not provide better results because of how frames were selected from the whole sequence for training and testing. The network configuration also has a great effect on the performance of human action recognition systems. The proposed method ensures better performance on the three different datasets because the proposed ranking metric can effectively select k-ranked frames that contain meaningful temporal information, and the 3D-CNN can extract discriminative features from the selected frames.

Table 2. Performance comparisons of the DHA, MSR-Action3D, and UTD-MHAD datasets with state-of-the-art methods.

From the above results, it can be stated that the proposed method with the SSIM metric and k = 24 provides the best or comparable results. We show the confusion chart for the SSIM metric with k = 24 to describe the individual class results. Figure 8 depicts the confusion chart for the DHA dataset.
From Figure 8, it can be seen that the proposed method works well for most of the actions, except for two actions performed by the leg, 'LegCurl' (70%) and 'LegKick' (60%). The proposed method misclassifies 'LegCurl' as 'LegKick' at a rate of 30%, and vice versa. We achieve 100% accuracy for most of the actions, such as 'Bend', 'Jack', 'Jump', and 'Kick'.

Appl. Sci. 2022, 12, x FOR PEER REVIEW

As for the DHA dataset, we provide the confusion chart for the best results on the MSR-Action3D dataset, as shown in Figure 9. The lowest recognition result is about 78.6%, for the 'DrawCross' action, which is equally misclassified as the 'ForwardPunch', 'HandCatch', and 'HorizontalArmWave' actions at a rate of 7.1% each. On the other hand, most of the actions in the MSR-Action3D dataset are recognized correctly, with an accuracy of 100%. Figure 10 visualizes the confusion chart for the UTD-MHAD dataset, in which the minimum accuracies are observed for the 'Throw', 'ArmCurl', and 'Clap' actions, at approximately 68.8%, 81.3%, and 81.3%, respectively. The 'Throw' action is misrecognized as the 'Catch', 'DrawCircle (CLW)', 'Knock', and 'SwipeLeft' actions. The 'ArmCurl' action is misclassified as the 'ArmCross' action at approximately 18.8%, which is the highest misclassification rate.
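The per-class results discussed above are read off confusion charts. A generic sketch of how such a matrix is accumulated from predictions (this is standard practice, not the authors' plotting code; row-normalizing the counts gives the per-class percentages shown in the figures):

```python
import numpy as np

def confusion_matrix(labels, predictions, num_classes):
    """Rows = true class, columns = predicted class, values = counts."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for y, p in zip(labels, predictions):
        cm[y, p] += 1
    return cm

# Toy example with 3 classes: one sample of class 2 misclassified as class 0
cm = confusion_matrix([0, 1, 2, 2], [0, 1, 2, 0], 3)
print(cm[2, 0])  # 1
```

Dividing each row by its sum (`cm / cm.sum(axis=1, keepdims=True) * 100`) yields the per-class accuracy and misclassification percentages quoted in the text.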

Ablation Study
We summarize the performance of the proposed method for the different ranking metrics and datasets in Figure 11. For every dataset, the SSIM ranking metric works considerably better than the CCM metric. Likewise, the recognition results are higher for the UTD-MHAD dataset than for the DHA and MSR-Action3D datasets under both the SSIM and CCM ranking metrics, because the action sequences in the UTD-MHAD dataset are more accurately collected.
To show the effect of the number of sampled key frames on the recognition performance, we additionally perform experiments with 28, 32, and 36 frames using the UTD-MHAD dataset. Figure 12 depicts the recognition results for rank values ranging from 16 to 36 in steps of 4 frames. The recognition accuracy is 94.2% for 16-ranked frames, which improves slightly, by about 0.2%, when increasing the number of key frames to the next level, 20. From 20 to 36, the classification results remain almost the same or change only slightly. This is because only a certain segment of a sequence contains action information, and the remaining frames are static and have no effect on the overall recognition performance.

Figure 9. Confusion chart of the MSR-Action3D dataset (best results with SSIM, k = 24).

Complexity Analysis
We determine the network complexity in terms of parameters, floating-point operations (FLOPs), and testing time, as given in Table 3. We only consider the UTD-MHAD dataset with 16, 20, and 24 frames to report the results. The parameters, FLOPs, and testing time are given in millions (M), giga (G), and seconds (s), respectively. The testing time for action recognition is calculated by averaging over all testing sequences. The total number of parameters in ResNet101 for UTD-MHAD is 47.58 M, which changes to 47.57 M and 47.56 M for the DHA and MSR-Action3D datasets, respectively, as the number of classes varies. We also compare the time complexity of the proposed system with some of the state-of-the-art methods. The prior works take much more time to recognize an action on the DHA, MSR-Action3D, and UTD-MHAD datasets, since encoding an action requires a greater amount of time than directly extracting discriminative features from raw depth sequences using the 3D-CNN in the proposed method. We do not need to spend time encoding an action, which reduces the time complexity significantly.

Figure 11. Summarizations of recognition results for ranking metrics and three datasets.
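Parameter counts such as those in Table 3 follow from the standard formula for a 3D convolution layer, out_channels × (in_channels × k_t × k_h × k_w + 1) when a bias term is used. A small sketch (the layer shapes below are illustrative, not the exact ResNet101 configuration):

```python
def conv3d_params(in_ch, out_ch, kt, kh, kw, bias=True):
    """Learnable parameters of one 3D convolution layer."""
    return out_ch * (in_ch * kt * kh * kw + (1 if bias else 0))

# e.g., a 3x3x3 convolution from 64 to 128 channels:
n = conv3d_params(64, 128, 3, 3, 3)
print(n)        # 221312
print(n / 1e6)  # reported in millions (M), as in Table 3
```

Summing this quantity over all layers (plus batch-norm and fully connected parameters) gives the ~47.5 M totals reported; the last fully connected layer is what changes slightly with the number of classes per dataset.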

Discussion
The widespread applications of modern technology, such as human–machine or human–object interaction, are increasing significantly. In this article, we propose a new approach for human action recognition using a 3D-CNN with raw depth data. Several methods have addressed human action recognition using depth sequences by encoding the entire sequence into a spatial format called a dynamic image. Some methods have also focused on extracting skeleton or point cloud information from depth video. These methods depend completely on effective encoding, or on skeleton or point cloud extraction, to correctly recognize the human action. In general, it is very difficult, and sometimes impossible, to preserve all the temporal information during the encoding process. Very few methods have used raw depth sequences, which avoid the intermediate step of generating dynamic images, skeletons, or point clouds; even those that did achieved much lower recognition performance.
We analyze the datasets and find that each sequence has a different number of frames. When a sequence contains a large number of frames, most of the frames remain nearly identical, meaning there are no spatial or temporal changes. Only a few frames contain the action information that changes in the spatial and temporal directions. Accordingly, we propose a novel ranking-based approach for human action recognition using a 3D-CNN with raw depth sequences. We use the SSIM and CCM ranking metrics to rank the whole sequence and select the k-ranked frames that contain the most spatial and temporal information. Then, we train a 3D-CNN model with the k-ranked frames to recognize specific actions. We do not need to encode the sequence or extract dynamic images for classification. We consider different levels of ranking and investigate the recognition performance for better understanding. Beyond a certain value of k, the recognition performance does not improve further, because we must copy the same frame multiple times in most of the sequences.
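The frame copying mentioned in the last sentence, needed when a sequence has fewer frames than k, can be sketched as repeating the existing frames cyclically until the target length is reached. The indexing scheme below is our assumption of how such padding is typically done, not the authors' exact implementation:

```python
import numpy as np

def pad_to_k(frames, k):
    """If a sequence is shorter than k, repeat frames cyclically to length k."""
    n = len(frames)
    if n >= k:
        return frames[:k]
    idx = [i % n for i in range(k)]  # cycle through the existing frames
    return frames[idx]

short = np.arange(5 * 4).reshape(5, 2, 2)  # 5 frames of size 2x2
print(pad_to_k(short, 8).shape)  # (8, 2, 2)
```

Because the copied frames add no new temporal information, increasing k past the typical sequence length cannot improve accuracy, which matches the plateau observed from k = 20 to 36.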

Conclusions
The extraction of discriminative features from depth sequences is a cumbersome task in action recognition. Thus, there is constant demand for a system that can directly process raw sequences and extract discriminative features for classification. Several methods have been suggested for hand gesture recognition that apply 3D-CNN models directly to raw depth sequences. However, very few methods have considered raw depth videos for discriminating human activity using RNNs and LSTMs, and even those that do still have many limitations in terms of complexity, effectiveness, efficiency, and robustness. In this paper, we derived a novel approach to sample key frames, which are passed through a 3D-CNN model to perform the classification. Different levels of key-frame sampling were considered to evaluate the robustness of the proposed method. We also applied the proposed system to three different benchmark datasets to show its generalization power. We provided recognition accuracy, performance comparisons, and confusion charts as experimental results. The proposed method ensured better results than the state-of-the-art works.