Article

Employing FGP-3D, a Fully Gated and Anchored Methodology, to Identify Skeleton-Based Action Recognition

1 College of Computer Sciences & Information Technology, King Faisal University, Al Hofuf 31982, Saudi Arabia
2 State Key Laboratory of Fire Science, University of Science and Technology of China, Hefei 230026, China
3 Department of Computer Science, Air University, Islamabad 44000, Pakistan
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(9), 5437; https://doi.org/10.3390/app13095437
Submission received: 13 March 2023 / Revised: 11 April 2023 / Accepted: 13 April 2023 / Published: 27 April 2023

Abstract

Recent years have seen an explosion of interest in action recognition based on skeletal data. Contemporary methods using fully gated units can successfully extract characteristics from human skeletons by relying on a predefined human topology. Despite these advances, fully gated unit-based techniques have trouble generalizing to other domains, particularly when dealing with different human topological structures. In this context, we introduce FGP-3D, a novel skeleton-based action recognition technique that generalizes across datasets while effectively learning spatiotemporal features from human skeleton sequences. This is accomplished via a multi-head attention technique that learns an ideal dependence feature matrix initialized from a uniform distribution. We then re-evaluate state-of-the-art techniques as well as the proposed FGP-3D descriptor in order to examine the cross-domain generalizability of skeleton-based action recognition on real-world video skeleton data. Applied to commonly used action classification datasets, experimental results demonstrate that the proposed FGP-3D, with pre-training, generalizes well and outperforms the state of the art.

1. Introduction

Vision-based human action recognition is a subject of intense study because of the rewarding progress made over the past decade in artificial intelligence and computer vision. Identifying the human behavior present in each frame is a major goal, and the information gleaned is a boon for detecting dangerous or falling actions [1]. It is useful for a variety of applications, including (but not limited to) ambient assisted living [2], medical activities [3], and many more [4]. Due to its value and adaptability in the field of health care, human pose estimation in particular has grown increasingly popular with the availability of modern activity recognition sensors. Many studies have shown that accurate posture labeling in clinical settings may support research and improve the monitoring of patients' medical cycles. Diverse frames of reference, both local and global, based on human motion are needed to distinguish between human actions. Small but significant variations in human motion may cause recognition errors at a number of different sites. As a result, recognition performance may suffer when comparable characteristics are shared across many classes.
Due to their exclusive reliance on the positions of pivotal human joints, skeleton-based approaches to human action recognition are well placed to ignore irrelevant information and zero in on the action itself, regardless of distracting factors such as background clutter or fluctuating lighting [5]. For both spatial and temporal inference, a new method was proposed in [6], which represents human joints and their natural connections (i.e., bones) as spatio-temporal graphs. As a result, a number of successors have been proposed and have shown promising results, all of which make use of the spatial and temporal information contained within joints, with optimized anchored point construction strategies to extract multi-scale structural features on skeletons and long-range temporal dependencies among them. While spatial and temporal inference from joints has achieved preferable results, it remains restricted in comparison to RGB-frame-based approaches such as spatio-temporal convolutional neural networks [7], which are pre-trained to improve accuracy on downstream datasets and tasks [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39]. Based on our investigation of the datasets depicted in the system diagram, we conclude that the necessity for distinct adaptive adjacency matrices when using different topological human structures (e.g., joint number, joint order, bones) limits the generalization capacity of these techniques. Moreover, we note that such adaptive sparse adjacency matrices are transformed into fully dense matrices in deeper layers in order to capture long-range dependencies between joints; this new structure contradicts the original topological skeleton structure.
We believe that 3D skeletal joint action detection may be improved by a more optimal and general initialization technique, in consideration of these factors and of the fact that the human body's skeletal joint action representation is profoundly transformed throughout training. FGP-3D is a new unified framework for action recognition based on skeletons, and it helps to prove this premise. Each entry in the FGP-3D dependency matrix indicates the dependency weight between its associated pair of joints because, relative to the anchored point, the action feature matrix is initialized as a uniformly distributed dependency detector. After that, a fully gated, anchored aggregation is carried out to learn and aggregate many joint dependency matrices using various action descriptors. This mechanism effectively learns the spatio-temporal properties of skeletons by combining information from several representation sub-spaces located at various points along the dependency action features. The suggested FGP-3D does not depend solely on the skeletal structure, and it also facilitates more precise and accurate action data collection. Thus, by applying our suggested model to a difficult action dataset, we have a significant amount of flexibility to optimize its recognition performance. When applied to and fine-tuned on all assessed datasets [40,41,42,43], our experimental research shows that our suggested FGP-3D action descriptor improves on state-of-the-art skeleton-based action recognition algorithms. In conclusion, this study makes the following contributions:
1. By proposing FGP-3D, a novel design that employs a fully gated unit mechanism and an anchored point for skeleton-based action recognition, we go beyond conventional designs.
2. We explore skeleton-based real-world action recognition with a focus on temporal and spatial domain learning. We use anchored points and fully gated unit features to derive spatial information, and fully Popovicius features to derive temporal information.
3. We show that the suggested FGP-3D, tested on demanding datasets, exhibits a consistent level of accuracy, making it a useful and generic methodology for skeleton-based action classification.
We put the recommended strategy to the test across four different datasets and found that it greatly improves performance in challenging situations while outperforming state-of-the-art methods. Our proposed algorithm, FGP-3D, performs better than the state-of-the-art methods developed by various researchers. Since the proposed method is a drop-in component, we have also included it in our skeleton-based action recognition system.

2. Related Work

Skeleton-based human action recognition has lately gained popularity due to its compactness and its robustness to variations in appearance. Skeletal-joint-based methods are frequently used in the field of action recognition and are an appropriate approach for comprehending spatio-temporal features in videos. Because body-joint representations carry minimal information per frame, joint-based models require numerous videos to learn correctly. State-of-the-art approaches for human action recognition proposed in the literature rely on skeleton joints. Due to their great representational capacity, early skeleton-based methods [5,6,7,8,9,10,11,12,13,14,15,16,17], such as recurrent neural networks (RNNs) or temporal convolutional networks (TCNs) [9], were developed first. These methods, however, disregard the human body's spatial and semantic interconnectedness. The skeleton was then mapped to a pseudo-image using a 2D grid structure to represent spatial-temporal features based on manually created transformation rules, and 2D CNNs were used to process the spatio-temporal local dependencies within the skeleton sequence by taking partial human-intrinsic connectivity into account [10,11]. For skeleton-based action recognition, ST-GCN [6] combined spatial graph convolutions with interleaving temporal convolutions. Although the human skeleton's architecture was taken into account, the significant long-range relationships between the joints were not. A decoupled spatial-temporal attention network for skeleton-based action recognition was proposed in [12]. In contrast, recent AGCN-based techniques [13,14,15] have significantly improved performance thanks to their better handling of long-range dependencies for action recognition. In particular, 2s-AGCN [16] presented an adaptive graph convolutional network that adaptively learns the graph's topology with self-attention, which proved beneficial for action recognition. As an extension, MS-AAGCN [17] introduced a four-stream ensemble based on 2s-AGCN [16] and multi-stream adaptive graph convolutional networks that utilize attention modules. These methods focus mainly on spatial modeling. A unified method for capturing complicated joint correlations directly across space and time was subsequently given by MS-G3D Net [15]. To avoid transfer learning, the scale of the temporal segments must be precisely adjusted for each dataset, as it affects accuracy. For this reason, earlier methods [15,16,17] learned adaptive adjacency matrices from the initialized human topology, which is not ideal. Instead of being constrained by the human topology and a finite number of attention maps, our work proposes an optimized and unified dependency matrix that can be learned from a uniform distribution by a multi-head attention process for skeleton-based action recognition, in order to enhance both performance and generalization capacity.
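To make the contrast with topology-constrained graph convolutions concrete, the following minimal sketch (our illustration under stated assumptions, not code from any cited work) shows a joint-to-joint dependency matrix initialized from a uniform distribution and refined by a single attention head; all names, shapes, and the 0.5/0.5 mixing weight are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_dependency(joint_feats, dim=16):
    """Learn a soft joint-to-joint dependency matrix without a predefined skeleton graph.

    joint_feats : (num_joints, dim) per-joint feature vectors for one frame.
    Returns a (num_joints, num_joints) dependency matrix.
    """
    num_joints = joint_feats.shape[0]

    # Dependency prior initialized from a uniform distribution instead of the
    # fixed human-topology adjacency used by ST-GCN-style methods.
    A = rng.uniform(0.0, 1.0, size=(num_joints, num_joints))

    # A single attention head refines the prior from the data itself.
    Wq = rng.normal(scale=0.1, size=(dim, dim))
    Wk = rng.normal(scale=0.1, size=(dim, dim))
    scores = (joint_feats @ Wq) @ (joint_feats @ Wk).T / np.sqrt(dim)
    attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    # The resulting dependencies are not constrained to the original bone connectivity.
    return 0.5 * A + 0.5 * attn

D = attention_dependency(rng.normal(size=(20, 16)))   # (20, 20) soft dependency matrix
```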
Instead of taking advantage of fine-tuning action recognition models, previous techniques [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39] were only tested on target datasets. Recent research [31,32,33,38,39] proposed view-invariant 2D- or 3D-joint-based algorithms evaluated on datasets [25,31,32,33,34,35,38,39] that do not correspond to realistic settings, and thus these techniques struggle to improve action recognition performance on practical and widely used databases [40,41,42,43]. These algorithms were developed to investigate the transferability of action recognition using the human skeleton. To the best of our knowledge, we are the first to investigate completely anchored skeleton-based action recognition using authentic and challenging datasets. To our knowledge, skeleton-based action recognition rarely addresses topologies based on completely gated and anchored joints, and our study is the first to define dynamic joint-wise topologies. Our method belongs to the family of dynamic methods because the fully anchored features are inferred during inference.

3. Proposed Approach

The musculoskeletal system, which includes the bones, muscles, ligaments, and joints, works together under the direction of the nervous system to enable human movement [44]. The goal of action identification based on 3D skeleton joints is to detect human activities using data collected from a variety of sensors, the most common of which provide RGB pictures, depth maps, and the coordinates of the points that make up the human skeleton. Our proposed system is built on the skeletal system, which forms part of the musculoskeletal system [44]. Because color and depth data include a great deal of superfluous information, real-time and practical appearance motion information cannot easily be calculated from them; we have therefore used skeleton data in this study. Since information about skeletal joints is now readily available, the key idea is that we may use motion capture data to encode complicated human movements by tracking how those joints move from frame to frame. All of the body's joints are treated as potential kinematic sensors in our suggested system (for example, acceleration sensors or position sensors). The premise is that gathering time-series data from a variety of sensors attached to various parts of the human body (such as the torso, head, leg, neck, elbow, or shoulder) may aid in the identification of complicated human actions. When joints are considered as key points from which to extract robust characteristics in the post-processing stage, data from the body skeleton may serve the same function.
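To make the "joints as virtual kinematic sensors" picture concrete, the short sketch below (ours; the joint count, frame count, and array layout are assumptions, not a specification from the paper) stores one clip as a (frames, joints, 3) array of x, y, z coordinates and reads off per-joint motion between frames.

```python
import numpy as np

NUM_JOINTS = 20   # assumed Kinect-style skeleton
NUM_FRAMES = 60   # assumed clip length

# One skeleton clip: every tracked joint acts as a virtual position sensor
# reporting its 3D coordinates (i, j, k) at every frame.
clip = np.random.rand(NUM_FRAMES, NUM_JOINTS, 3)

# Per-joint displacement between consecutive frames, i.e., the raw time-series
# "sensor readings" from which the later spatial and temporal features are built.
motion = np.diff(clip, axis=0)            # (NUM_FRAMES - 1, NUM_JOINTS, 3)
speed = np.linalg.norm(motion, axis=-1)   # per-joint speed at every step
```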

3.1. Anchored Point

The human body needs both mobility and stability in order to carry out any task. Mobility is the freedom and ease of movement. Stability is the body's capacity to keep its joints and muscles in proper alignment while in motion. People can be detected, and their joint skeleton actions pinpointed, all at once using anchored point detection. In this context, an anchored point is synonymous with an interest point. Anchored points are regions in space, or in the body's joints, that determine what stands out or is of interest while an action is carried out. They remain unchanged under transformations such as stretching, bending, translating, and twisting. Using the locations of 3D body anchors, we estimate a human body mesh and represent it through a 3D anchor. To this end, we provide an analytical method to articulate a parametric body model, FGP-3D, by means of a series of simple geometric transformations, and we learn to predict the 3D locations of the remaining body joints and body keypoints. Compared to state-of-the-art methods, our method provides much better alignment by integrating more of the crucial body joints, which is especially important given that keypoint estimation depends directly on the arms and legs. Our suggested method uses simple 3D anchored point regression to provide state-of-the-art mesh fits without the need for paired mesh annotations.
Using an anchored point attention module and temporal feature extraction, we offer an action detection system that is human skeleton-based. It involves assigning a value to what are essentially anchored points on the human body; this section describes one such method designed to locate anchored points in sequences of human skeletons. To accommodate temporal detection, a mean point is derived from the first frame of the clip, with the anchored point defined as the midpoint between the opposing legs and arms. Figure 1 depicts the hierarchy of human joints, with the anchored point shown in red. A perpendicular bisector is a line that splits another line segment in half at its midpoint. Understanding perpendicular bisectors is essential for this construction: a perpendicular bisector of a segment (for example, the side of a polygon or of an angle) bisects the segment at its midpoint, creating two congruent, smaller segments. In order to obtain the angle between each pair of joints, we use the formula for the perpendicular bisector of a line segment. The temporal data are computed from the first frame.
The point of intersection of two lines in 3D (vector form), generated from the left and right acromioclavicular joints and the left and right intertarsal joints, denoted by a_r, a_l, t_r, and t_l, is the basis for determining the location of the anchored point [44]. In humans, joint positioning involves three dimensions, with the third (k) dimension perpendicular to the more common two (i, j). The unit vectors i, j, and k are used to create spatial models with length, breadth, and depth in three dimensions.
$$\gamma_1 = (a_1^l \hat{i} + a_2^l \hat{j} + a_3^l \hat{k}) + \alpha\,(t_1^r \hat{i} + t_2^r \hat{j} + t_3^r \hat{k}), \tag{1}$$
$$\gamma_2 = (a_1^r \hat{i} + a_2^r \hat{j} + a_3^r \hat{k}) + \beta\,(t_1^l \hat{i} + t_2^l \hat{j} + t_3^l \hat{k}), \tag{2}$$
If lines (1) and (2) intersect, then they have a common point:
$$(a_1^l \hat{i} + a_2^l \hat{j} + a_3^l \hat{k}) + \alpha\,(t_1^r \hat{i} + t_2^r \hat{j} + t_3^r \hat{k}) = (a_1^r \hat{i} + a_2^r \hat{j} + a_3^r \hat{k}) + \beta\,(t_1^l \hat{i} + t_2^l \hat{j} + t_3^l \hat{k}), \tag{3}$$
$$a_1^l + \alpha t_1^r = a_1^r + \beta t_1^l, \tag{4}$$
$$a_2^l + \alpha t_2^r = a_2^r + \beta t_2^l, \tag{5}$$
$$a_3^l + \alpha t_3^r = a_3^r + \beta t_3^l. \tag{6}$$
We determine where these lines meet by calculating their intersection. The values of α and β are found by solving Equations (4)–(6). Lines (1) and (2) meet if and only if the remaining equation also holds for these values of α and β, at which point the four joints intersect and form an anchored point. By substituting the value of α (or β) into Equation (1), we determine the position vector (the anchored point) created at the intersection of γ_1 and γ_2. The calculation of precise and accurate information about human action recognition requires temporal information about the motion of the target joints. By using joint sequences to build motion information across several time steps, motion rendering may incorporate and make sense of temporal information. A joint's temporal dynamics, such as wave propagation, may be efficiently investigated with such sequences. This is done by using the information in the first frame to verify the orientations and origins of the remaining joints. We have created an algorithm that not only cuts down on unnecessary data but also greatly speeds up the process of computing the flow of joints over time.
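As one possible way to carry out this computation, the sketch below (ours; solving Equations (4)–(6) by least squares is our choice, and the example coordinates are made up) recovers α and β and then substitutes α back into Equation (1) to obtain the anchored point.

```python
import numpy as np

def anchored_point(a_l, a_r, t_l, t_r):
    """Anchored point from the intersection of the lines in Eqs. (1)-(2).

    a_l, a_r : 3D positions of the left/right acromioclavicular joints.
    t_l, t_r : 3D positions of the left/right intertarsal joints.
    """
    a_l, a_r = np.asarray(a_l, float), np.asarray(a_r, float)
    t_l, t_r = np.asarray(t_l, float), np.asarray(t_r, float)

    # Eqs. (4)-(6): a_l + alpha * t_r = a_r + beta * t_l, rewritten as
    # [t_r, -t_l] @ [alpha, beta]^T = a_r - a_l and solved by least squares.
    A = np.stack([t_r, -t_l], axis=1)            # (3, 2)
    b = a_r - a_l                                # (3,)
    (alpha, beta), *_ = np.linalg.lstsq(A, b, rcond=None)

    # Substitute alpha back into Eq. (1) to obtain the anchored point.
    return a_l + alpha * t_r

anchor = anchored_point([0.20, 1.40, 0.10], [-0.20, 1.40, 0.10],
                        [0.15, 0.00, 0.05], [-0.15, 0.00, 0.05])
```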

3.2. Fully Gated Unit Features

In order to estimate a human body mesh using the locations of 3D body skeleton joint points, we introduce an action recognition system based on 3D joint points. To do this, we study how to make informed estimates of the 3D locations of all the body's joints, and we suggest an analytical approach to articulating a parametric body model, FGP-3D, using a series of simple geometric transformations. Our method outperforms the state-of-the-art methods in terms of alignment to video clips and the extraction of motion data, since it depends directly on the coordination of all joints' locations. Our proposed approach utilizes gated recurrent units and is able to achieve state-of-the-art mesh fittings through 3D skeletal joint regression alone, without paired mesh annotations.
For machine learning applications that need memory and grouping, such as voice recognition, the gated recurrent unit is a component of a specialized type of recurrent neural network. In order to address the vanishing gradient problem, which often arises in recurrent neural networks, gated recurrent units may be used to modulate the input weights of the network; they are an effective solution to this issue [45]. In machine learning, the vanishing gradient issue arises when the gradient becomes so small that the weights can no longer be updated. When depicted graphically, the operation of a gated recurrent unit network is not dissimilar to that of a basic recurrent neural network; the main difference between the two lies in the internal workings of each recurrent unit, as gated recurrent unit networks are comprised of gates that modulate the current input and the previous hidden state. In addition to speech recognition, neural network models using gated recurrent units may be used for research on the human genome, handwriting analysis, and much more. Some of these networks are used in stock market analysis and government work, and many of them leverage the simulated ability of machines to remember information. The update gate helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future. This is powerful because the model can decide to copy all the information from the past and eliminate the risk of vanishing gradients. We use the update gate in the next step, since each component of the p-th joint is required for the formation of z_(i,j,k).
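For reference, before the modified formulation in Equations (7)–(14) below, a standard GRU step looks as follows (a generic sketch of the cell described above, not the authors' exact formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One standard GRU step with update gate z, reset gate r, and candidate h_hat."""
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)       # how much new vs. old state to keep
    r = sigmoid(Wr @ x + Ur @ h_prev + br)       # how much of the past to expose
    h_hat = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)
    return z * h_hat + (1.0 - z) * h_prev        # blend candidate and previous state

dim = 3
rng = np.random.default_rng(1)
params = [rng.normal(scale=0.1, size=(dim, dim)) if i % 3 != 2 else np.zeros(dim)
          for i in range(9)]
h = gru_step(rng.normal(size=dim), np.zeros(dim), *params)
```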
$$W_{(z,r,h)} = \frac{e^{2(i_p, j_p, k_p)} + 1}{e^{2(i_p, j_p, k_p)} - 1}, \tag{7}$$
$$U_{(z,r,h)} = \frac{e^{2(i_p, j_p, k_p)}}{e^{2(i_p, j_p, k_p)} + 1}, \tag{8}$$
$$b_{(z,r,h)} = \frac{e^{2(i_p, j_p, k_p)}}{e^{2(i_p, j_p, k_p)} - 1}, \tag{9}$$
$$r_t = \sqrt{i_p^2 + j_p^2 + k_p^2}, \tag{10}$$
Each component (z, r, h) of W, U, and b is calculated using Equations (7)–(9); the components of each Cartesian coordinate (i, j, k) are used to calculate each component of (z, r, h). W, U, and b are then used as factors to evaluate Equations (11) and (12). These equations, together with Equation (13), are the crucial elements for calculating Equation (14), which is in turn an important input to Equation (18), the descriptor that includes the fully gated unit features.
$$z_{(i,j,k)} = \gamma_1 \left( W_z (i_p, l_p, k_p) + U_z\, h_{(i,j,k)-1} + b_z \right), \tag{11}$$
$$r_{(i,j,k)} = \gamma_1 \left( W_r (i_p, l_p, k_p) + U_r\, h_{(i,j,k)-1} + b_r \right), \tag{12}$$
$$\hat{h}_{(i,j,k)} = \gamma_2 \left( W_h (i_p, l_p, k_p) + U_h \left( z_{(i,j,k)} \odot h_{(i,j,k)-1} \right) + b_h \right), \tag{13}$$
$$h_{(i,j,k)} = z_{(i,j,k)} \odot \hat{h}_{(i,j,k)} + \left( 1 - z_{(i,j,k)} \right) \odot h_{(i,j,k)-1}. \tag{14}$$
Equation (14) produces the fully gated unit features, which are represented for each coordinate by the notation h_(i,j,k). The phrase "fully gated unit used effectively to extract motion details" refers to a condensed version of the fully gated unit in which gating is performed using the prior hidden state and the bias in various combinations. We calculate three different pieces of information by using the following variables:
  • (i_p, l_p, k_p): the three coordinates of a skeletal joint located in the third dimension (3D).
  • h_(i,j,k): output motion vector generated by the fully gated unit (3D).
  • ĥ_(i,j,k): 3D activation vector that includes information about the anchored point, i.e., ĥ_i, ĥ_j, and ĥ_k.
  • z_(i,j,k): 3D update gate vector, containing the γ_1 details, i.e., z_i, z_j, and z_k.
  • r_t: reset gate vector; in order to accurately capture motion details, it contains the mutual information shared by the 3D spatial information of the skeletal joints.
  • W_(z,r,h): parameter vectors calculated from any particular joint (i_p, j_p, k_p) located in the 3D domain. W_z contains the coordinate information for i_p, W_r for l_p, and W_h for k_p.
  • U_(z,r,h): parameter vectors constructed from any particular joint (i_p, j_p, k_p) situated in the 3D domain.
  • b_(z,r,h): parameter vectors computed by employing any specific joint (i_p, j_p, k_p) located in the 3D domain.
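Putting Equations (7)–(14) together, the sketch below is one possible reading of how the fully gated unit features could be evaluated for a single joint and frame. How γ1 and γ2 enter the gates, the scalar form of W, U, and b, and the use of the coordinate sum inside the exponential are all our assumptions, so this should be read as an illustration rather than the authors' implementation.

```python
import numpy as np

def gate_parameters(joint_xyz):
    """Scalar W, U, b from Eqs. (7)-(9); the sum of the coordinates is assumed
    as the argument of the exponential, which the text leaves implicit."""
    e2 = np.exp(2.0 * float(np.sum(joint_xyz)))   # joints summing to exactly 0 are avoided
    W = (e2 + 1.0) / (e2 - 1.0)
    U = e2 / (e2 + 1.0)
    b = e2 / (e2 - 1.0)
    return W, U, b

def fully_gated_features(joint_xyz, h_prev, gamma1, gamma2):
    """Per-joint gated update of Eqs. (11)-(14) for one frame (illustrative only)."""
    joint_xyz = np.asarray(joint_xyz, float)
    W, U, b = gate_parameters(joint_xyz)
    z = gamma1 * (W * joint_xyz + U * h_prev + b)            # Eq. (11)
    r = gamma1 * (W * joint_xyz + U * h_prev + b)            # Eq. (12), defined but, as
                                                             # written, not used in Eq. (13)
    h_hat = gamma2 * (W * joint_xyz + U * (z * h_prev) + b)  # Eq. (13)
    return z * h_hat + (1.0 - z) * h_prev                    # Eq. (14)

h = fully_gated_features([0.3, 1.1, 0.4], np.zeros(3),
                         gamma1=np.ones(3), gamma2=np.ones(3))
```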

3.3. Fully Popovicius Features

We introduce an approach for joint motion based on anchored correspondences between 3D skeletal joint points. An anchored point correspondence encodes the change between other joints, and the anchored point represents an automatically determined, distinctive joint position. We use these correspondences to compute fully Popovicius feature label maps over the entire set of joints across frames. We present the results of displacement segmentation for the abdominal body skeletal joints in the full-body, contrast-enhanced fully Popovicius feature and whole-body spatial information extraction. In comparison to conventional multi-dimensional segmentation, our method achieves a speedup of nearly three orders of magnitude while maintaining favorable accuracy. Three steps make up the fully Popovicius feature detection algorithm: (i) anchored point detection; (ii) computation of any specific joint's orientation in 3D space; and (iii) anchored point-based probabilistic displacement between any specific joint and the anchored point. In convex analysis, Popoviciu's inequality is an inequality for convex functions. It is comparable to Jensen's inequality and was found in 1965 by Romanian mathematician Tiberiu Popoviciu [46]. Let f be a function from an interval I ⊆ R to R. Then f is convex if, for any three points i, j, and k in I, Equation (15) holds.
$$\frac{f(i) + f(j) + f(k)}{3} + f\!\left( \frac{i + j + k}{3} \right) \;\geq\; \frac{2}{3} \left[ f\!\left( \frac{i + j}{2} \right) + f\!\left( \frac{j + k}{2} \right) + f\!\left( \frac{k + i}{2} \right) \right], \tag{15}$$
If f is continuous, it is convex if and only if the above inequality holds for all i, j, and k in I. When f is strictly convex, the inequality is strict except for i = j = k. Because we are interested in extracting action features using graph-based approaches and in obtaining feature vectors, we modify this inequality so that features can be extracted from it.
$$f(i, j, k) = \frac{f(i) + f(j) + f(k)}{3} + f\!\left( \frac{i + j + k}{3} \right), \tag{16}$$
We use Popoviciu's inequality for action feature detection (f_pf). Joints are a collection of non-homogeneous places that define boundary points between the anchored point and any particular joint. We calculate the joint spatial information between the anchored joint (i_a, j_a, k_a) and any individual p-th joint (i_p, j_p, k_p). Equation (16) is modified as follows:
$$f_{pf} = \frac{f(i_a - i_p) + f(j_a - j_p) + f(k_a - k_p)}{3} + f\!\left( \frac{i_p + j_p + k_p}{3} \right), \tag{17}$$
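A minimal sketch of Equation (17) follows (ours; the convex function f is not specified in the text, so a simple square is used purely as a placeholder):

```python
import numpy as np

def popovicius_feature(anchor, joint, f=np.square):
    """Fully Popovicius feature f_pf between the anchored joint and one joint, Eq. (17).

    anchor : (i_a, j_a, k_a) anchored-point coordinates.
    joint  : (i_p, j_p, k_p) coordinates of the p-th joint.
    f      : convex function applied to the displacements (placeholder choice).
    """
    i_a, j_a, k_a = anchor
    i_p, j_p, k_p = joint
    displacement_term = (f(i_a - i_p) + f(j_a - j_p) + f(k_a - k_p)) / 3.0
    centroid_term = f((i_p + j_p + k_p) / 3.0)
    return displacement_term + centroid_term

f_pf = popovicius_feature((0.0, 0.9, 0.1), (0.3, 1.1, 0.4))
```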

3.4. Fully Gated and Anchored Approach, FGP-3D

In this part, we present FGP-3D, a unifying solution for skeleton-based action recognition with fully gated, anchored spatiotemporal dependencies. Figure 2 depicts the architecture of the complete model. The connectivity between joints in the original human topology does not include all of the dependencies needed for many human movements, which means that it is not always the most appropriate structure from which to extract features on skeletons. We introduce the anchored point, which can express the ideal dependencies between joints for efficient feature extraction, to address this issue. The model consists of a stack of spatial-temporal FGP-3D blocks (anchored point extraction, fully gated unit features, and fully Popovicius features) that extract characteristics from skeletal sequences. By utilizing several dependency matrices and our suggested finalization technique, we can automatically collect the features focused on various body joints by repeating the fully gated anchored mechanism. We can treat the number of sequences s, from 1 to n, as a flexible hyper-parameter to enhance the model, as it is no longer constrained by the human topology. We use Equation (14) to compute each coordinate component, h_i, h_j, and h_k, at each frame. We then use Equation (17) to obtain f_pf, which contains the dependency information between the anchored joint (i_a, j_a, k_a) and any specific p-th joint. We propose the FGP-3D skeleton-joint-based action recognition descriptor as a blend of independent as well as dependent joints. Our theory is supported by the ablation investigation (see Equation (18)). Overall, our design approach increases the architecture's adaptability, efficiency, and genericity, which makes it easier to research cross-domain spatial and temporal feature learning in this field for datasets utilizing various joint distributions.
$$\text{FGP-3D} = \left[ h_i \;\; h_j \;\; h_k \;\; f_{pf} \right]_{s=1}^{s=n}, \tag{18}$$
Having computed the FGP-3D action descriptor as a feature vector that includes the dependent and independent motion of all joints, we can say that it captures both. The independent motion is determined using Equation (14), with each component capturing progressively finer motion details, while Equation (17) extracts the primary direction of a joint with respect to the anchored joints. This spatial convolution relies on a joint representation during the early learning stage; during the later learning stage, the relationships coded within the other features become sparse and contain the joint connections required to represent a complete joint orientation, which is equivalent to an anchored point. The dependencies converge to a sparse representation in the subsequent stage of creating action descriptors with fully gated unit features, which is locally optimal but fundamentally different from the original Euclidean connectivity of the human body because it calculates the orientation of mutual and independent joint vectors. The dependencies eventually converge to a temporal representation once again, containing the orientation based on fully Popovicius features. This motivates us to formulate FGP-3D in this study so that it contains all aspects and domains of spatial and temporal features and is initialized with a joint uniform distribution (Equation (18)), in order to arrive more effectively at the globally optimal representation for action detection.
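As a concrete reading of Equation (18), the sketch below (ours; treating the descriptor as a simple per-frame concatenation over s = 1..n is an assumption) assembles the FGP-3D feature vector from the fully gated unit features and the fully Popovicius feature of every frame.

```python
import numpy as np

def fgp3d_descriptor(h_seq, fpf_seq):
    """Assemble the FGP-3D descriptor of Eq. (18) over frames s = 1..n.

    h_seq   : (n, 3) fully gated unit features (h_i, h_j, h_k) per frame.
    fpf_seq : (n,)   fully Popovicius feature per frame.
    """
    per_frame = np.concatenate([h_seq, fpf_seq[:, None]], axis=1)  # (n, 4)
    return per_frame.reshape(-1)                                   # flat (4 * n,) vector

n = 60
descriptor = fgp3d_descriptor(np.random.rand(n, 3), np.random.rand(n))
# The descriptor is then passed to a classifier for action recognition.
```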

4. Evaluation of FGP-3D

This section includes an evaluation of the proposed FGP-3D action selection framework. In the qualitative analysis, we discuss the ablation experiments that led to the network's specific design, provide justification for each component, and assess the effectiveness of those components. We use three-dimensional skeletal joints based on a single frame and global selection to examine the behavior of each frame selector component separately. On several datasets, we demonstrate the suggested method's generalizability. Finally, we compare the proposed method to existing cutting-edge frame sampling techniques in an untrimmed environment, demonstrating that it still yields better precision. We contrast our comprehensive model with the state of the art in Table 1, Table 2, Table 3 and Table 4 (Figure 2). Table 1 reports results on UTD-MHAD against [18,19,20,21,22,23,24]. Cutting-edge methods [25,26,27,28,29] are contrasted in Table 2, and state-of-the-art approaches [28,30,31,32,33] are contrasted in Table 3. The state-of-the-art approaches [34,35,36,37,38,39] are contrasted in Table 4. Our technique surpasses all other methods in every evaluation scenario on all four challenging datasets. Notably, our technique is the first to use a joint multi-aspect model architecture to extract complicated regional spatial-temporal correlations and long-range spatial and temporal dependencies from three-dimensional skeleton sequences. The outcomes support the efficacy of our method.
Table 1 shows that the FGP-3D network architecture improves accuracy on the UTD-MHAD dataset [40]. Testing and training on the dataset effectively increased the overall classification accuracy from 99.2% to 99.8% when FGP-3D is used as the backbone network. This shows the value of pre-training on a sizable dataset for joint recognition. Performance is improved by 0.6% because of the local, global, spatial, and temporal combination classifier that mixes the representations from the two methods. The findings demonstrate that FGP-3D has a better capacity for learning spatiotemporal representations for action recognition.
Our tests are conducted on the challenging MSR action recognition dataset [41] in order to compare existing results against FGP-3D. To confirm the usefulness of the suggested model components based on their recognition performance, we conduct thorough ablation studies on them. The complete model is then assessed against cutting-edge techniques for the skeleton-based action recognition challenge. Finally, our suggested strategy combines the four main skeletal motion aspects to attain better accuracy on the dataset. Table 2 displays a comparison of joint accuracy on the dataset.
Numerous tests are run on the KARD action identification dataset [42], which is composed of eighteen challenging action classes. First, we conduct thorough ablation research on KARD to evaluate the efficacy of the proposed dependency feature matrix and the fully gated attention. On this dataset (as shown in Table 3), we compare our model to state-of-the-art models [28,30,33] before analyzing how much efficiency is gained on target datasets. We show that pre-training has a major impact on how effectively our model generalizes. On all datasets, the final proposed FGP-3D models are assessed in order to be compared to other cutting-edge methods for action recognition.
We examine the different elements and how they are arranged in the structural framework. Performance is reported as the classification accuracy of FGP-3D on the SBU dataset using only the combined data, unless otherwise stated [43]. As shown in Table 4, we compare our model to state-of-the-art models [34,35,36,37,38,39] on this dataset before analyzing how much performance increases on target datasets. On all datasets, the final proposed FGP-3D models are assessed in comparison to other cutting-edge methods for action recognition. Our tests are run on four substantial, widely applied, and representative action recognition datasets in order to compare their performance to FGP-3D.

5. Experiments and Analysis

This section describes the experiments we perform to evaluate the performance of the suggested FGP-3D action selection framework. Here, we evaluate the adaptive FGP-3D via confusion matrices, and we show the fully anchored method before and after learning to confirm our analysis. In Figure 3, it can be seen which classes are misclassified, with their associations indicated by weights from several frames that are evenly spaced over the observed features. Our approach, which is based on a feature matrix with the suggested descriptor and explores longer-range relationships, is able to find dependencies in the skeleton for both correctly and incorrectly classified classes (see Figure 3, misclassified classes). The results in Table 1 and Figure 3 quantitatively demonstrate the efficiency of FGP-3D. Overall, we conclude that our FGP-3D-based technique is better suited for joint-based motion detection, since it contains both fully gated and anchored motion recognition, uses different initialization procedures, and includes motion mechanisms in the spatial dimension. It becomes clear that fully gated, anchored dependency matrix learning combined with temporal convolutions can be a more general and successful way to learn spatio-temporal dependencies than cutting-edge models for skeleton-based tasks, where the number of nodes (i.e., spatial body joints) is not large.
In this part, we analyze the features of the skeletons recovered by FGP-3D on the MSR dataset in terms of confusion matrices to determine the performance within each class. We analyze the feature matrix derived in Section 3.4. After classification using the fully gated anchored technique, FGP-3D, we freeze the fully anchored skeletal joint approach and retrain the classifiers on the MSR dataset. Figure 4 demonstrates that our proposed method, with fewer parameters, is more effective than classification from scratch. The misclassified classes likewise obtained consistent and reliable results: 99.3, 95.8, 97.8, 98.3, and 98.2 percent accuracy. From the confusion matrix, we conclude that FGP-3D has a significant advantage for all classes of the evaluated dataset at both the classification and misclassification steps. The features trained and tested on MSR are generic and can be utilized for extracting features from skeleton sequences, indicating that the model's features are well trained on MSR and have high transferability.
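As a small illustration of how per-class accuracies such as those above can be read off a confusion matrix (assuming scikit-learn is available; the labels here are toy values, not the MSR results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_accuracy(y_true, y_pred):
    """Per-class accuracy taken from the diagonal of the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    return cm.diagonal() / cm.sum(axis=1)

y_true = [0, 0, 1, 1, 2, 2, 2]   # toy ground-truth class labels
y_pred = [0, 1, 1, 1, 2, 2, 0]   # toy predictions
print(per_class_accuracy(y_true, y_pred))   # -> [0.5, 1.0, 0.667]
```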
We first demonstrate the viability of our proposed disentangled multi-scale action feature scheme against a variety of state-of-the-art methods. The various pathways of the FGP-3D action descriptor, which indicate adjacency powering and disentangled aggregation of motion information matrices, are used to do this in Table 3. Here we describe the per-class performance on the KARD dataset. We incrementally add model elements to test the effectiveness of the FGP-3D modules in capturing complicated spatial-temporal aspects. The results are displayed in Figure 5. If we simply replicate the factorized pathway to learn from multiple feature subspaces and imitate the fully gated anchored design, or simply scale the factorized pathway up to a larger capacity (deeper and wider), we observe only modest gains. Class number 10 is accurately categorized, while two other classes reach an accuracy of 99.3 with only a few misclassified frames. In contrast, when the FGP-3D pathway is included, we regularly see better results with the same number of parameters or fewer, demonstrating FGP-3D's capacity to detect intricate regional spatial-temporal correlations that were previously missed by factorizing spatial and temporal dependencies. These findings show the effectiveness of the proposed disentangled aggregation technique for joint-based action learning; it enhances the proposed FGP-3D module's performance in the spatial-temporal domain and across a wide range of scales.
The features of the skeletons recovered by FGP-3D and analyzed on the SBU dataset are shown in Figure 6. The findings in Table 4 show that transfer learning is more successful, with fewer parameters, than the classifiers of previous research. The accuracy and fine-tuning plots are displayed as a confusion matrix (see Figure 6). According to the figures, which show ten stages of correct and incorrect classification, we infer that when the proposed action descriptor is employed, there is only a 0.5% loss and a significant improvement in accuracy for the two-person mutual action dataset. This indicates that the descriptor weights of our suggested model are well pre-trained on the SBU dataset, offer a strong transfer ability, and can be utilized to extract features from skeletal sequences.
In this part, we discussed our design method, which goes beyond conventional formulations by modeling the relationships between joints in our unified formulation utilizing a general dependency matrix, FGP-3D (see Equation (18) and Figure 7), and the fully gated anchored approach to identify skeleton-based action recognition.

6. Conclusions

In this research, we offer FGP-3D, a unified framework for skeleton-based action recognition in the real world. Our experimental study demonstrates that FGP-3D is effective and has a robust capacity for generalization in human action recognition. In addition, we have introduced the anchored point and retrieved spatial information with fully gated unit features and all required temporal information with fully Popovicius features for high-dimension skeletal annotations. Our experimental results reveal that the suggested system for action recognition outperforms existing action recognition methods. Future studies will incorporate an analysis of our system for more 2D and 3D skeletal sequence-based tasks.

Author Contributions

Conceptualization, M.S.I.; methodology, M.S.I.; validation, M.S.I., A.A. and W.R.; formal analysis, M.S.I., A.A. and K.B.; investigation, M.S.I., A.A., W.R. and K.B.; writing—original draft preparation, M.S.I. and A.A.; writing—review and editing, M.S.I., A.A., W.R. and K.B.; visualization, M.S.I. and A.A.; funding acquisition, M.S.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deputyship for Research and Innovation, Ministry of Education in Saudi Arabia, grant number “INST029”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors extend their appreciation to the Deputyship for Research and Innovation, Ministry of Education in Saudi Arabia, for funding this research work (Project number INST029).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shi, J.; Zhang, Y.; Wang, W.; Xing, B.; Hu, D.; Chen, L. A Novel Two-Stream Transformer-Based Framework for Multi-Modality Human Action Recognition. Appl. Sci. 2023, 13, 2058. [Google Scholar] [CrossRef]
  2. Kim, M.; Jiang, X.; Lauter, K.; Ismayilzada, E.; Shams, S. Secure human action recognition by encrypted neural network inference. Nat. Commun. 2022, 13, 4799. [Google Scholar] [CrossRef]
  3. Islam, M.S.; Bakhat, K.; Iqbal, M.; Khan, R.; Ye, Z.; Islam, M.M. Representation for action recognition with motion vector termed as: SDQIO. Expert Syst. Appl. 2023, 212, 118406. [Google Scholar] [CrossRef]
  4. Luvizon, D.C.; Picard, D.; Tabia, H. Multi-task deep learning for real-time 3D human pose estimation and action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2752–2764. [Google Scholar] [CrossRef]
  5. Yang, D.; Wang, Y.; Dantcheva, A.; Garattoni, L.; Francesca, G.; Bremond, F. Unik: A unified framework for real-world skeleton-based action recognition. arXiv 2021, arXiv:2107.08580. [Google Scholar]
  6. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New York, NY, USA, 9–11 February 2018. [Google Scholar]
  7. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  8. Zheng, W.; Li, L.; Zhang, Z.; Huang, Y.; Wang, L. Relational network for skeleton-based action recognition. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 826–831. [Google Scholar]
  9. Kim, T.S.; Reiter, A. Interpretable 3D human action analysis with temporal convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 20–28. [Google Scholar]
  10. Caetano, C.; Brémond, F.; Schwartz, W.R. Skeleton image representation for 3D action recognition based on tree structure and reference joints. In Proceedings of the 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Rio de Janeiro, Brazil, 28–30 October 2019; pp. 16–23. [Google Scholar]
  11. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055. [Google Scholar]
  12. Lei, S.; Zhang, Y.; Cheng, J.; Lu, H. Decoupled spatial-temporal attention network for skeleton-based action recognition. arXiv 2020, arXiv:2007.03263. [Google Scholar]
  13. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
  14. Gao, X.; Hu, W.; Tang, J.; Liu, J.; Guo, Z. Optimized skeleton-based action recognition via sparsified graph regression. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 601–610. [Google Scholar]
  15. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152. [Google Scholar]
  16. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
  17. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Trans. Image Process. 2020, 29, 9532–9545. [Google Scholar] [CrossRef]
  18. Elmadany, N.E.D.H.; He, Y.; Guan, L. Multimodal learning for human action recognition via bimodal/multimodal hybrid centroid canonical correlation analysis. IEEE Trans. Multimed. 2018, 21, 1317–1331. [Google Scholar] [CrossRef]
  19. Kamel, A.; Sheng, B.; Yang, P.; Li, P.; Shen, R.; Feng, D.D. Deep convolutional neural networks for human action recognition using depth maps and postures. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 1806–1819. [Google Scholar] [CrossRef]
  20. Yang, T.; Hou, Z.; Liang, J.; Gu, Y.; Chao, X. Depth Sequential Information Entropy Maps and Multi-Label Subspace Learning for Human Action Recognition. IEEE Access 2020, 8, 135118–135130. [Google Scholar] [CrossRef]
  21. Dawar, N.; Ostadabbas, S.; Kehtarnavaz, N. Data Augmentation in Deep Learning-Based Fusion of Depth and Inertial Sensing for Action Recognition. IEEE Sens. Lett. 2018, 3, 1–4. [Google Scholar] [CrossRef]
  22. Dawar, N.; Kehtarnavaz, N. Action Detection and Recognition in Continuous Action Streams by Deep Learning-Based Sensing Fusion. IEEE Sens. J. 2018, 18, 9660–9668. [Google Scholar] [CrossRef]
  23. Ben Mahjoub, A.; Atri, M. An efficient end-to-end deep learning architecture for activity classification. Analog Integr. Circuits Signal Process. 2019, 99, 23–32. [Google Scholar] [CrossRef]
  24. Ahmad, Z.; Khan, N. Human action recognition using deep multilevel multimodal (M2) fusion of depth and inertial sensors. IEEE Sens. J. 2019, 20, 1445–1455. [Google Scholar] [CrossRef]
  25. Núñez, J.C.; Cabido, R.; Pantrigo, J.J.; Montemayor, A.S.; Vélez, J.F. Convolutional Neural Networks and Long Short-Term Memory for skeleton-based human activity and hand gesture recognition. Pattern Recognit. 2018, 76, 80–94. [Google Scholar] [CrossRef]
  26. Basly, H.; Ouarda, W.; Sayadi, F.E.; Ouni, B.; Alimi, A.M. CNN-SVM learning approach based human activity recognition. In International Conference on Image and Signal Processing; Springer: Cham, Switzerland, 2020; pp. 271–281. [Google Scholar]
  27. Sial, H.A.; Yousaf, M.H.; Hussain, F. Spatio-temporal RGBD cuboids feature for human activity recognition. Nucleus 2018, 55, 139–149. [Google Scholar]
  28. Dhiman, C.; Vishwakarma, D.K. A Robust Framework for Abnormal Human Action Recognition Using R-Transform and Zernike Moments in Depth Videos. IEEE Sens. J. 2019, 19, 5195–5203. [Google Scholar] [CrossRef]
  29. Jin, K.; Jiang, M.; Kong, J.; Huo, H.; Wang, X. Action recognition using vague division DMMs. J. Eng. 2017, 2017, 77–84. [Google Scholar] [CrossRef]
  30. Ashwini, K.; Amutha, R. Compressive sensing based recognition of human upper limb motions with kinect skeletal data. Multimed. Tools Appl. 2021, 80, 10839–10857. [Google Scholar] [CrossRef]
  31. Islam, M.S.; Bakhat, K.; Khan, R.; Iqbal, M.; Ye, Z. Action recognition using interrelationships of 3D joints and frames based on angle sine relation and distance features using interrelationships. Appl. Intell. 2021, 51, 6001–6013. [Google Scholar] [CrossRef]
  32. Islam, M.S.; Bakhat, K.; Khan, R.; Naqvi, N.; Ye, Z. Applied Human Action Recognition Network Based on SNSP Features. Neural Process. Lett. 2022, 54, 1481–1494. [Google Scholar] [CrossRef]
  33. Bakhat, K.; Kifayat, K.; Islam, M.S. Human activity recognition based on an amalgamation of CEV & SGM features. J. Intell. Fuzzy Syst. 2022, 43, 7351–7362. [Google Scholar]
  34. Manzi, A.; Fiorini, L.; Limosani, R.; Dario, P.; Cavallo, F. Two-person activity recognition using skeleton data. IET Comput. Vis. 2017, 12, 27–35. [Google Scholar] [CrossRef]
  35. Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2016; Volume 30. [Google Scholar]
  36. Jalal, A.; Khalid, N.; Kim, K. Automatic Recognition of Human Interaction via Hybrid Descriptors and Maximum Entropy Markov Model Using Depth Sensors. Entropy 2020, 22, 817. [Google Scholar] [CrossRef] [PubMed]
  37. Waheed, M.; Jalal, A.; Alarfaj, M.; Ghadi, Y.Y.; Al Shloul, T.; Kamal, S.; Kim, D.-S. An LSTM-Based Approach for Understanding Human Interactions Using Hybrid Feature Descriptors Over Depth Sensors. IEEE Access 2021, 9, 167434–167446. [Google Scholar] [CrossRef]
  38. Bakhat, K.; Kifayat, K.; Islam, M.S. Katz centrality based approach to perform human action recognition by using OMKZ. Signal Image Video Process. 2022, 1–9. [Google Scholar] [CrossRef]
  39. Islam, M.S.; Iqbal, M.; Naqvi, N.; Bakhat, K.; Islam, M.M.; Kanwal, S.; Ye, Z. CAD: Concatenated action descriptor for one and two person (s), using silhouette and silhouette’s skeleton. IET Image Process. 2020, 14, 417–422. [Google Scholar] [CrossRef]
  40. Chen, C.; Jafari, R.; Kehtarnavaz, N. UTD-MHAD: A multimodal dataset for human action recognition uti-lizing a depth camera and a wearable inertial sensor. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 168–172. [Google Scholar]
  41. Li, W.; Zhang, Z.; Liu, Z. Action recognition based on a bag of 3D points. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 9–14. [Google Scholar]
  42. Gaglio, S.; Re, G.L.; Morana, M. Human Activity Recognition Process Using 3-D Posture Data. IEEE Trans. Hum. Mach. Syst. 2014, 45, 586–597. [Google Scholar] [CrossRef]
  43. Yun, K.; Honorio, J.; Chattopadhyay, D.; Berg, T.L.; Samaras, D. Two-person interaction detection using body-pose features and multiple instance learning. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 28–35. [Google Scholar]
  44. Lu, T.-W.; Chang, C.-F. Biomechanics of human movement and its clinical applications. Kaohsiung J. Med Sci. 2012, 28, S13–S25. [Google Scholar] [CrossRef] [PubMed]
  45. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  46. Popoviciu, T. Sur certaines inégalités qui caractérisent les fonctions convexes. Analele Stiintifice Univ.“Al. I. Cuza”. Iasi Sect. Mat. 1968, 11, 155–164. [Google Scholar]
Figure 1. Skeletal joint relationships are detected, as is the anchored point.
Figure 2. Schematic representation of the proposed system, FGP-3D. To identify the most informative representations for skeleton-based action identification, we offer a fully gated anchoring method. The anchored point infers a point's inherent topology, which gives contextual information beyond normal physical connectivity. Using fully gated unit features and fully Popovicius features, spatial and temporal dependencies are derived from each joint block of the input skeleton. After classification with the suggested model, the FGP-3D module represents actions more accurately by capturing context-dependent intrinsic joint topology.
Figure 3. Confusion matrix (error matrix) visualizing the performance of the proposed algorithm on the UTD-MHAD dataset. Each row of the matrix represents instances of an actual class and each column represents instances of a predicted class (or conversely). The results span ten levels of accuracy, and it is evident from the graph that our method has a low error rate due to the small number of misclassified frames.
Figure 4. Confusion matrix (error matrix) visualizing the performance of the proposed algorithm on the MSR dataset. Each row of the matrix represents instances of an actual class and each column represents instances of a predicted class (or conversely). The results span ten levels of accuracy, and it is evident from the graph that our method has a low error rate due to the small number of misclassified frames.
Figure 5. Confusion matrix (error matrix) visualizing the performance of the proposed algorithm on the KARD dataset. Each row of the matrix represents instances of an actual class and each column represents instances of a predicted class (or conversely). The results span ten levels of accuracy, and it is evident from the graph that our method has a low error rate due to the small number of misclassified frames.
Figure 6. Confusion matrix (error matrix) visualizing the performance of the proposed algorithm on the SBU dataset. Each row of the matrix represents instances of an actual class and each column represents instances of a predicted class (or conversely). The results span ten levels of accuracy, and it is evident from the graph that our method has a low error rate due to the small number of misclassified frames.
Figure 7. Predictive ability of the proposed FGP-3D models, evaluated as per-class accuracy. Most metrics are built around comparing the percentage of predictions made by the model with the actual results.
Table 1. Comparison of classification accuracy against state-of-the-art approaches on UTD-MHAD between the proposed recognition framework and several existing methods.

Method                          Accuracy (%)
BHCCCA [18]                     84.6
DCNN [19]                       88.14
MLSL [20]                       88.37
Fusion of Depth [21]            89.2
Sensing Fusion [22]             92.8
End-to-end CNN-LSTM [23]        98.5
M2 [24]                         99.2
FGP-3D                          99.8
Table 2. Comparison of classification accuracy against state-of-the-art approaches on MSR between the proposed recognition framework and several existing methods.

Method                          Accuracy (%)
CNN + LSTM [25]                 63.10
DTR-HAR [26]                    91.56
D-STIP + D-DESC [27]            92.00
R-Transform + Zernike [28]      94.88
DMM [29]                        96.50
FGP-3D                          99.4
Table 3. Comparison of classification accuracy against state-of-the-art approaches on KARD between the proposed recognition framework and several existing methods.

Method                          Accuracy (%)
R-Transform + Zernike [28]      96.64
Sensing [30]                    97.22
ASD-R [31]                      97.6
SNSP [32]                       97.8
CEV and SGM [33]                99.4
FGP-3D                          99.8
Table 4. Comparison of classification accuracy against state-of-the-art approaches on SBU between the proposed recognition framework and several existing methods.

Method                          Accuracy (%)
Skeletal data [34]              88
Deep LSTM [35]                  90.41
Hybrid feature [36]             91.25
Hybrid descriptors [37]         91.63
OMKZ [38]                       93.80
CAD [39]                        95.6
FGP-3D                          99.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
