Human Action Recognition of Spatiotemporal Parameters for Skeleton Sequences Using MTLN Feature Learning Framework

Human action recognition (HAR) from skeleton data is considered a promising research area in computer vision. Three-dimensional HAR with skeleton data has been widely used because of its effective and efficient results. Several models have been developed for learning spatiotemporal parameters from skeleton sequences. However, two critical problems exist: (1) previous skeleton sequences were created by connecting different joints in a static order; (2) earlier methods were not efficient enough to focus on valuable joints. Specifically, this study aimed to (1) demonstrate the ability of convolutional neural networks to learn spatiotemporal parameters of skeleton sequences from different frames of human action, and (2) combine the processing of all frames created from different human actions and incorporate the spatial structure information necessary for action recognition, using multi-task learning networks (MTLNs). Executing the proposed model on the NTU RGB+D dataset, the SYSU dataset, and the SBU Kinect Interaction dataset yielded significantly improved results compared with existing models. We further applied our model to noisy estimated poses from subsets of the Kinetics dataset and the UCF101 dataset. These experimental results also showed significant improvement using our proposed model.


Introduction
At present, human action recognition (HAR) is an active area of research in computer vision. Three-dimensional skeleton data, which describe the trajectories of human skeleton joints, are less susceptible to illumination changes and are invariant to camera views [1]. Action recognition (AR) based on 3D skeleton sequences is gaining increased interest due to the availability of more accurate and cheaper devices [2][3][4]. The main focus of our research is skeleton-based 3D action recognition.
Success has been achieved in image classification [5,6]. However, video action recognition additionally requires modeling the temporal information of the pose sequence [7][8][9][10][11], and the 3D shape of the human skeleton is also a crucial cue for action recognition [12]. A skeleton sequence records the positions of the human skeleton joints over time. To investigate the spatial and temporal characteristics of human interaction, studies have applied recurrent neural networks (RNNs) with long short-term memory (LSTM) [13,14] neurons to the joints of the skeleton sequence [2,15,16,17]. To address long-term temporal dependency, LSTM networks are used to remember information from entire sequences across various periods; however, they ...
1. We generate frames from the skeleton points of various human actions in the datasets and use them as input.
2. We develop a new deep CNN model to transform each skeleton sequence. By learning from the hierarchical structure of deep CNNs, we enable long-term temporal modeling of the skeleton structure for frame images.
3. We create an MTLN that learns the spatial configuration and temporal dynamics of the skeleton structure from the CNN features of the frames.
4. Our experimental results show that the MTLN improves on concatenating or pooling the features of the frames.
5. For better and more efficient results, we implement AlexNet in the proposed network model. The final results also show the significance of our methodology.
6. Standard datasets are used to validate the proposed model. Some classic methods are implemented for comparison.
The remainder of the article is arranged as follows: Section 2 introduces related work; Section 3 describes our network model in detail; Section 4 presents the datasets used in our approach, their implementation, and a discussion of the results; lastly, Section 5 presents the conclusion.

Related Work
Several areas of computer vision focus on skeleton-based sequence data. Three-dimensional skeleton-based data sequences are more compact and robust than traditional image-based data in capturing the essential 3D elements and removing noise. Many recent studies have focused on skeleton-based action recognition approaches using either hand-crafted or deep learning techniques. These algorithms extract important features and sequences from raw data [23][24][25].
Deep learning techniques learn features end to end from data represented as arrays. There are two aspects to the learned representation: intra-frame and inter-frame. Intra-frame features capture joint co-occurrences, whereas inter-frame features capture the skeleton's temporal evolution.
Ke et al. [26] presented a technique that uses skeleton data and prior knowledge to separate the 3D coordinates and arrange them into grayscale images according to their dimensions. As mentioned, all implemented methods rearrange the joint order of the data to create a collective representation. This technique extracts global co-occurrences from skeleton data, which enables a better evaluation of local co-occurrences. A ConvNet can be used to extract frame-level features, with an LSTM learning from the temporal data; these approaches are practical and well proven [27,28]. Optical flow represents another approach for expressing temporal evolution [29,30]. Moreover, C3D [31] extracts features from spatial and temporal data jointly using 3D convolutional layers.
Recurrent neural networks (RNNs) [32] were commonly implemented in previous studies on skeleton-based human action recognition for the temporal arrangement of skeleton sequences. Hierarchical construction [2] signifies spatial associations between body parts. On the other hand, the authors of [16,33,34] recommended a system for simultaneously learning spatial and temporal information, i.e., 2D LSTM. A two-stream RNN structure for spatiotemporal evolutions was proposed in [35]. Neural networks enable the user to determine the outlook position and obtain significant outcomes. Other approaches include neighbor searches [36] and Lie groups [37]; graphical neural networks [38] were also found to perform well with skeleton-based recognition data. Table 1 summarizes the performance of the above methods on two commonly utilized datasets: NTU RGB+D [38] and SBU Kinect Interaction [39].
Table 1 (surviving rows only; columns: method, type, NTU RGB+D accuracy, SBU Kinect Interaction accuracy).
[37], Graphic NN, 72.7%, -
ST-GCN [38], Graphic NN, 81.5%, -
Ke et al. [26] suggested converting human skeleton joints into grayscale images. In the proposed research, we further develop skeleton images using the MTLN. Table 1 shows that CNN-based methods have made excellent progress in learning from skeleton images compared with RNN techniques.
In skeleton-based action recognition, features based on the movements of joints are primarily used. Joint features are divided into two types: spatial displacements and temporal displacements [1]. The spatial displacements, which encode the structural information of the human body, are computed between all pairs of joints [1,36] or between some reference joints and the other joints [3,41]. The temporal displacements, which describe the motion of joints, are computed between frames [42]. Joint coordinates are another commonly used feature to describe the spatial position. The authors of [35] proposed a two-stream RNN to capture both temporal dynamics and spatial structures. The RNN model is appropriate for temporal data; however, it lacks the ability to extract features from spatial data.
No specific model exists to address spatial and temporal dependency problems; however, scholars have extracted features from skeleton information using the CNN model and obtained excellent results [43]. CNN is successfully used for both images and videos [44]. Previous studies on skeleton-based action recognition have shown that CNNs can extract features from the temporal relationships of the combined sequences. Moreover, CNN-based models perform better in terms of intra-frame relationships than RNN-based models. However, CNNs cannot extract features from all joints. Accordingly, we proposed determining the spatial connection through significant action recognition based on skeleton points.

Proposed Methodology
This study had three aims: (1) to arrange the joints in every frame to interconnect body parts; (2) to create parallel vectors from joints and skeleton sequences; (3) to use all these data as input for training the neural network. MTLN was used for action recognition in the proposed model architecture.

Implementation Details
We used an eighth-generation quad-core Core i7 processor with 8 GB RAM, an 8 GB NVIDIA GPU, and a 520 GB SSD for our research. Clips were created from all frames of the original skeleton sequence without preprocessing (i.e., normalization, temporal downsampling, or noise filtering) for all datasets. The first layer had 512 hidden units, and the second layer (i.e., the output layer) had 256 units. The network was trained using the RMSprop algorithm with a momentum value of 0.92. The learning rate was set to 0.001 for the first 1000 steps and to 0.0001 thereafter, with a batch size of 16. Training stopped after 32 epochs, for a total of 32,000 steps. The performance of the suggested method on each dataset was compared to existing approaches using the same testing procedure. Training took 12 h.
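The two-stage learning rate described above can be written as a small step function. The following is only a sketch: the function name `learning_rate` is hypothetical, and treating step 1000 as the first step at the lower rate is an assumption, since the text does not say which side of the boundary it falls on.

```python
def learning_rate(step: int) -> float:
    """Piecewise-constant schedule: 0.001 for the first 1000 steps, 0.0001 afterwards.

    Assumption: steps are counted from 0, so step 1000 is the first step
    that uses the lower rate.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return 1e-3 if step < 1000 else 1e-4


TOTAL_STEPS = 32_000  # total number of training steps reported in the text

# Learning rate used at every training step.
schedule = [learning_rate(step) for step in range(TOTAL_STEPS)]
```

In a deep learning framework, the same values would typically be passed to the optimizer through its scheduler interface rather than precomputed as a list.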

Frame Process
As shown in Figure 1, every frame in the skeleton sequence is related to all other frames. The first step was to establish a chain of joints in every frame so that the joints were interconnected. Two-dimensional arrays were then created by arranging the joint chains of all frames of the skeleton sequence. In this article, we fed the resulting four two-dimensional arrays into a CNN-based model, generating four frames based on four reference positions of the skeleton.
In the initial stage, a deep CNN extracts robust features from each frame, capturing the long-term temporal information of the skeleton sequence together with one particular spatial relationship. Using multi-task learning, the CNN features of all frames are then processed jointly, allowing the network to exploit 3D spatiotemporal information for action recognition.
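One plausible reading of the frame-generation step is that each of the four arrays holds the positions of all joints relative to one reference joint, with one row per time step. The sketch below illustrates that reading only; the function name, the data layout, and the choice of reference joint indices are all assumptions, not details given in the text.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) coordinates of one joint
Skeleton = Sequence[Joint]           # all joints at one time step


def relative_position_arrays(
    sequence: Sequence[Skeleton],
    reference_joints: Sequence[int],
) -> List[List[List[Joint]]]:
    """Build one 2D array per reference joint.

    Each array has one row per time step and one column per non-reference
    joint; each cell is the joint position relative to the reference joint.
    """
    arrays = []
    for ref in reference_joints:
        rows = []
        for skeleton in sequence:
            rx, ry, rz = skeleton[ref]
            row = [
                (x - rx, y - ry, z - rz)
                for j, (x, y, z) in enumerate(skeleton)
                if j != ref
            ]
            rows.append(row)
        arrays.append(rows)
    return arrays


# Toy sequence: 2 time steps, 5 joints (values are illustrative only).
toy_sequence = [
    [(float(j + t), float(2 * j), 0.0) for j in range(5)]
    for t in range(2)
]
# Hypothetical reference joint indices; the text does not name them.
arrays = relative_position_arrays(toy_sequence, reference_joints=(0, 1, 2, 3))
```

Each of the four arrays could then be rescaled to pixel intensities and resized before being fed to the CNN.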

CNN Training for Feature Extraction
Each frame describes the temporal dynamics of the skeleton sequence, and the spatially invariant CNN features of each frame provide robust information about the skeleton sequence. Pretrained CNN models are useful feature extractors because they are trained on ImageNet [19]. The CNN features of each frame can therefore be extracted using a pretrained AlexNet [45]. Such pretrained features have proven helpful in several cross-domain applications [46,47]. The frame creation and processing pipeline for a skeleton sequence is presented in Figure 2.
Moreover, recent skeleton datasets were too small or too noisy to allow effective training of a deep network. Even though the frames were not natural images, they were fed into the pretrained AlexNet model [45]. Natural images and the generated frames share the same structure because both are matrices with patterns. CNN models trained on enormous image datasets extract feature information efficiently, which enables them to recognize the patterns in such matrices. In the pretrained AlexNet [45] used here, there were eight groups of convolutional layers, from conv1 to conv8. Every group contained a stack of four layers with the same kernel size. In total, the model had 32 convolutional layers and two fully connected (FC) layers. In this way, deep neural networks can extract powerful features from different frames and use them in another domain. Each layer extracts features in its own way. Features from later layers depend mainly on the classes of the training task, whereas early layers have greater potential to be transferred to another domain [48]. Accordingly, this study used the activations of a convolutional layer to represent the temporal dynamics of skeleton sequences.
Convolutional layer feature maps have been used for both action recognition and image recognition [49,50]. In this study, we discarded the FC layers and the last three convolutional layers of the network. Each of the four frame images was resized to 224 × 224 and fed into the model. The CNN outputs were then passed to the temporal mean pooling (TMP) layer. The output dimensions were 28 × 28 × 512, i.e., 512 feature maps of size 28 × 28; the rows of each generated frame corresponded to different frames of the skeleton sequence. The dynamics of the row features of the resulting image therefore represented the temporal evolution of the skeleton sequence. Meanwhile, the activations of each feature map on the conv8_1 layer corresponded to local regions in the original input image [49].
The temporal information of the skeleton sequence was extracted from the row features of the feature maps. In detail, TMP was applied to the feature maps with a kernel size of 28 × 1, i.e., pooling was performed over the temporal (row) dimension. Let x^k_{i,j} denote the activation at the i-th row and j-th column of the k-th feature map.
Then, the output of the k-th feature map after TMP is as follows:
y^k = [y^k_1, \ldots, y^k_j, \ldots, y^k_{28}], \quad y^k_j = \frac{1}{28} \sum_{i=1}^{28} x^k_{i,j}. (1)
The outputs of all 512 feature maps are concatenated to form a 14,336-D (28 × 512 = 14,336) feature vector, which represents the temporal dynamics of the skeleton sequence.
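The pooling step can be illustrated with a short sketch. It assumes the 512 feature maps are stored as nested lists indexed `[k][i][j]` (map, row, column) and that TMP is a plain mean over the 28 rows of each map, as the 28 × 1 kernel suggests; the function name `temporal_mean_pooling` is hypothetical.

```python
from typing import List, Sequence


def temporal_mean_pooling(
    feature_maps: Sequence[Sequence[Sequence[float]]],
) -> List[float]:
    """Average every feature map over its rows and concatenate the results.

    feature_maps[k][i][j] is the activation at row i, column j of map k.
    The output has (number of maps) * (number of columns) entries.
    """
    pooled = []
    for fmap in feature_maps:
        n_rows = len(fmap)
        n_cols = len(fmap[0])
        for j in range(n_cols):
            column_sum = sum(fmap[i][j] for i in range(n_rows))
            pooled.append(column_sum / n_rows)
    return pooled


# Dummy activations: map k is filled with the constant value k.
K, H, W = 512, 28, 28
maps = [[[float(k) for _ in range(W)] for _ in range(H)] for k in range(K)]
vector = temporal_mean_pooling(maps)
print(len(vector))  # 28 * 512 entries
```

With real activations, `maps` would come from the last retained convolutional layer; the constant dummy maps are used here only to make the shape arithmetic easy to check.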

Multi-Task Learning Network
Each feature vector extracted from the skeleton sequence contains temporal dynamic information and one specific spatial relationship between the joints. The four feature vectors are intrinsically related to one another. The information obtained from the skeleton sequence is first encoded into frames that carry temporal information, and features are then extracted from the frames to capture the long-term temporal structure of the skeleton sequence. Each generated frame is fed into a deep CNN to obtain its CNN features. In Figure 3e, the three 14,336-D features of the four frames at the same time step are concatenated to produce a feature vector; overall, four feature vectors were generated (see Figure 3). Each feature vector thus represents the temporal information of the skeleton sequence and one specific spatial relationship between the joints, and the feature vectors describe different spatial relationships that are intrinsically connected. This research exploits the intrinsic connections among the feature vectors for action recognition with an MTLN. Multi-task learning trains several related tasks jointly by exploiting their intrinsic connections.
Using the MTLN, the classification of each feature vector was treated as a separate task. The MTLN was trained with each feature vector as one input and produced multiple outputs, which were combined into the final prediction. The MTLN learns the tasks jointly with shared weights, which improves the performance of the individual tasks [22]. It thus provided a standard process through which the four feature vectors could be used for action recognition by exploiting their intrinsic associations. The classification of each feature vector was handled as a separate task with the same class label as the skeleton sequence.

Architecture of Network
The structure of our network is presented in Figure 3f. The network took four frames, generated from different reference joints of the skeleton, as inputs, followed by one max-pooling layer, two fully connected (FC) layers with a rectified linear unit (ReLU) [44] between them to add nonlinearity, and a Softmax output layer. Using the four features as inputs, the MTLN produces four frame-level predictions, one per task.
In detail, for each skeleton sequence, we created four parallel frames for the CNN layers. The frames were derived from the positions of the joints relative to four reference points. Each frame contained the temporal dynamics of the entire skeleton sequence and one specific spatial relationship between the joints. Each skeleton sequence was therefore associated with multiple frames with different spatial relationships, providing complementary information from different spatial perspectives.

On the one hand, skeleton sequences are informative because sources of noise such as the background are absent and the 3D joint position data are naturally structured. On the other hand, image-based datasets are larger than skeleton-based ones, allowing deeper examination of information, particularly for training deep learning models. Furthermore, because of the development of Kinect and robust depth sensors, as well as the increasing number of methods to estimate joint positions, skeleton-based data are becoming increasingly accessible.
The four loss values were summed to obtain the network's loss value, which was used to update the network parameters. During training, the class scores of each task were used to calculate the loss value of that task, and the losses of all tasks were summed to form the final loss of the network. During testing, the class scores of the four tasks were averaged to produce the final prediction of the action class. Equation (2) gives the loss of the k-th task (k = 1, . . . , 4).
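\ell_k(y, x_k) = \sum_{i=1}^{n} y_i \left( -\log \frac{\exp(x_{ki})}{\sum_{j=1}^{n} \exp(x_{kj})} \right) = \sum_{i=1}^{n} y_i \left( \log \sum_{j=1}^{n} \exp(x_{kj}) - x_{ki} \right), (2)
where x_k is the vector fed into the Softmax layer generated by the k-th input feature, n is the number of action classes, and y_i is the ground-truth label for class i. Equation (3) gives the network's final loss value as the sum of the four individual losses:
L(y, X) = \sum_{k=1}^{4} \ell_k(y, x_k), \quad X = [x_1, \ldots, x_4]. (3)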
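The training loss and the test-time averaging can be sketched in a few lines. This is a minimal illustration under stated assumptions: the label is one-hot (so the per-task loss reduces to a single softmax cross-entropy term), and the "scores" averaged at test time are taken to be softmax probabilities, which the text does not state explicitly. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one task's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def task_loss(scores: Sequence[float], label: int) -> float:
    """Cross-entropy of one task: log(sum_j exp(x_kj)) - x_k,label."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]


def network_loss(task_scores: Sequence[Sequence[float]], label: int) -> float:
    """Final loss: the sum of the per-task losses."""
    return sum(task_loss(scores, label) for scores in task_scores)


def predict(task_scores: Sequence[Sequence[float]]) -> int:
    """Average the per-task softmax outputs and return the best class index."""
    n_classes = len(task_scores[0])
    probs = [softmax(scores) for scores in task_scores]
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Toy example: four tasks, three classes (scores are illustrative only).
toy_scores = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 0.0], [2.0, 1.0, 0.0]]
loss = network_loss(toy_scores, label=0)
predicted_class = predict(toy_scores)
```

Summing the losses lets every task contribute gradients to the shared weights, while averaging at test time combines the four views into a single decision.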

1. The NTU RGB+D dataset [27] is currently the largest dataset for action recognition, with more than 56,000 sequences and four million frames. The dataset has 50 different human poses and 70 action classes, with 40 regular action classes. Cross-subject (CS) and cross-view (CV) are the two suggested evaluation protocols. We followed the settings of [27]. In the CS evaluation, 40,500 samples from 40 subjects were used for training, and the other 18,540 samples were used for testing. In the CV evaluation, 38,700 samples from cameras 2 and 3 were used for training, and the other 18,600 samples from camera 1 were used for testing.
2. The SBU Kinect Interaction dataset [39] contains 280 skeleton sequences and 6810 frames in eight classes. We followed the standard protocol of fivefold cross-validation with the provided splits. Each frame contains two people, and 15 joints are labeled for each person. During training, the two skeletons of a sequence were treated as two samples. During testing, their prediction scores were averaged. Random cropping was used to augment the data during training. During testing, the prediction scores of five crops (the center and the four corners) were averaged.
3. Kinetics-Motion is derived from Kinetics [45], the largest RGB action recognition dataset, which has 400 action classes and about 300,000 video clips. The videos were collected from YouTube, and each clip lasts 10 s. Yan et al. [38] provided estimated poses for joint-based action recognition. In the first step, videos were resized to a resolution of 340 × 256 pixels at 30 frames per second. The OpenPose toolbox (Hu et al., 2016) was then used to estimate the joints. Skeleton-based methods cannot distinguish certain classes of RGB action recognition datasets because skeletons carry no background context or image appearance. Yan et al. [38] therefore proposed the Kinetics-Motion dataset, a 30-class subset of Kinetics with action labels associated with body movement. We followed the suggested process on the Kinetics-Motion dataset. The 30 classes were skateboarding, tai chi, hopscotch, pull-ups, capoeira, punching bag, squat, deadlifting, clean and jerk, push up, belly dancing, country line dancing, surfing crowd, swimming backstroke, front raises, crawling baby, windsurfing, skipping rope, throwing discus, snatch weight lifting, tobogganing, hitting a baseball, roller skating, arm wrestling, riding a mechanical bull, salsa dancing, hurling (sport), lunge, hammer throw, and juggling balls.
4. The SYSU-3D dataset [51] contains 480 sequences of 12 distinct actions performed by 40 people. Each frame of a sequence provides the 3D coordinates of 20 joints. We used a ratio of 0.2 for the random split into training and validation data, the same ratio used by existing studies [52,53].
5. UCF101-Motion [24] is derived from UCF101, which contains 13,300 videos from 100 action classes at a fixed resolution of 320 × 240 pixels and 25 FPS. From the input RGB videos, approximately 16 joints were estimated using the AlphaPose toolbox. As with Kinetics, some predefined actions in UCF101, such as "cutting in the kitchen", are more closely related to objects and scenes. We therefore followed the method in ST-GCN [38] and established a subset of UCF-101 named "UCF-Motion". UCF-Motion has 24 classes associated with poses, with a total of 3170 videos: jump rope, playing the piano, crawling baby, playing the flute, playing the cello, punch, tai chi, boxing speed bag, pushups, juggling balls, golf swing, clean and jerk, playing the guitar, bowling, ice dancing, playing soccer juggling, playing dhol, playing the tabla, boxing punching bag, salsa spins, hammer throw, rafting, and writing on board.

1. All frames were created from the original skeleton sequence with preprocessing steps such as temporal downsampling, noise filtering, and normalization.
2. The first FC layer had 512 hidden units. In the second FC layer (i.e., the output layer), the number of units equaled the number of action classes in each dataset. The network was trained using stochastic gradient descent with a learning rate of 0.002, a batch size of 200, and 50 epochs.
3. The performance of the developed model on each dataset was compared with existing methods.

Classification Report
The experiments in this article covered five human action datasets: NTU RGB+D, SYSU, SBU Kinect Interaction, Kinetics, and UCF101. We applied the MTLN to these datasets and computed the accuracy, recall, and precision, each of which averaged 90%.

Confusion Matrix
We established a confusion matrix as shown in Figure 4. We trained our model with 100 different videos of running, waving, walking, and jumping. It is clear from the matrix that the running class had 21 accurate predictions for running, no prediction for waving, two wrong predictions for walking because the model considered running and walking identical, and no wrong prediction for jumping. The waving class had 22 accurate predictions with one wrong prediction each for running and walking and two wrong predictions for jumping. The walking class had 21 accurate predictions, with one and two wrong predictions for waving and running, respectively, and no wrong prediction for jumping. The jumping class was the most accurate with 26 correct predictions and only one wrong prediction for waving. It is clear from the confusion matrix that our model was very accurate for the most part. Table 3 shows the accuracy rate of our compared models, showing gradual improvement in final accuracy for our model. We can conclude that the proposed technique better determines the correlation information between different joints in HAR.
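The counts quoted above are enough to recompute the headline metrics. The sketch below rebuilds the 4 × 4 matrix from the text; reading each described class as a row of true labels with predicted labels as columns is an assumption, since the text does not state the orientation of Figure 4. Overall accuracy does not depend on that choice, but per-class precision and recall swap if the orientation is reversed.

```python
from typing import List

CLASSES = ["running", "waving", "walking", "jumping"]

# Assumed layout: rows are true classes, columns are predicted classes,
# both in the order given in CLASSES. Counts are taken from the text.
CONFUSION: List[List[int]] = [
    [21, 0, 2, 0],   # running
    [1, 22, 1, 2],   # waving
    [2, 1, 21, 0],   # walking
    [0, 1, 0, 26],   # jumping
]


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of samples on the diagonal."""
    correct = sum(matrix[c][c] for c in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix: List[List[int]], c: int) -> float:
    """Correct predictions for class c divided by the row total."""
    return matrix[c][c] / sum(matrix[c])


def precision(matrix: List[List[int]], c: int) -> float:
    """Correct predictions for class c divided by the column total."""
    return matrix[c][c] / sum(row[c] for row in matrix)


overall = accuracy(CONFUSION)
for index, name in enumerate(CLASSES):
    print(name, round(precision(CONFUSION, index), 3), round(recall(CONFUSION, index), 3))
```

The matrix entries sum to 100 samples, and the diagonal sums to 90, which is consistent with the 90% figure reported in the classification report.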

Comparison with Different Models for Action Recognition
The accuracy of cross-view (CV) and cross-subject (CS) evaluation is clearly shown in Figure 5.
The accuracy of cross-subject (CS) evaluation for models proposed in the last 5 years is shown in Figure 6.
Figure 6. Accuracy comparisons for cross-subject (CS) evaluation for different methods using the SYSU dataset.

Comparison with State-of-the-Art Methods
We conducted a systematic evaluation of the various models using the datasets mentioned above. The proposed approach showed improved accuracy (Tables 3-7).

1. The NTU RGB+D dataset (Table 3) is currently the most extensive dataset; the proposed model achieved greater recognition accuracy than previous approaches, showing a 0.3% improvement over previous CNN-based models for cross-view evaluation. Compared with the latest DPI + att-DTIs-based model, our model showed a 1.1% improvement (Table 3).

2. Using the SYSU-3D dataset (Table 4), our proposed model improved the recognition accuracy compared to previous models by a considerable margin, e.g., 1.6% compared with the well-known MANs [55] model. This demonstrates that our suggested model can thoroughly explore the characteristics of skeleton-based sequences and perform action recognition at a high level.

3. For the SBU Kinect Interaction dataset, the MTLN approach exceeded the recognition accuracy of other studies in the literature, as it did for the NTU RGB+D dataset. Our model achieved 95.4% ± 1.7% accuracy across the five splits, as presented in Table 5 and Figure 7.

4. For Kinetics-Motion (Table 6), the previously developed models performed worse than our proposed model using MTLN [60]. The recognition accuracy was also comparable to that of systems using other modalities such as RGB and optical flow. The proposed model was robust to noise, whereas failures were commonly caused by missing or incorrect pose estimates (Figure 8).

5. For UCF101-Motion (Table 7), the proposed approach outperformed existing algorithms that used only one modality [28,31] or both appearance features and optical flow [61]. This experiment demonstrates that joints are a natural modality for identifying motion associated with actions, but that joints alone cannot distinguish all action classes (Figure 9). Recognizing certain categories requires object and part appearances, and incorrect pose estimation lowers the achievable recognition accuracy (Figure 9). Performance can be improved using our model.

Table 5. Recognition accuracy (%) achieved with previous methods using the SBU Kinect Interaction dataset.

Method           Accuracy (%)
RGB CNN [45]     70.4
Flow CNN [45]    72.8
ST-GCN [38]      72.4
Our model        74.5

Table 4. Accuracy (%) achieved with previous models for cross-subject (CS) evaluation using SYSU dataset.

Conclusions
This article developed a CNN-based network for feature learning and action recognition that takes a skeleton sequence as input. The experiments showed that joints are a valid modality for detecting motion associated with actions, but that joints alone cannot distinguish all action classes. Recognizing certain classes requires knowledge of objects and their parts. Four-frame CNNs were interconnected into a single matrix, while the skeleton sequence was described temporally using a particular spatial connection between joints. Furthermore, we implemented an MTLN to jointly learn the feature vectors at the same stage in parallel, which improved HAR performance via these critical connections. We analyzed our model using five different datasets: NTU RGB+D, SBU Kinect Interaction, Kinetics-Motion, SYSU-3D, and UCF101-Motion. The experimental results showed the effectiveness of our newly developed model and the feature learning method.
Furthermore, we contribute to the field by improving the generalizability of our model with respect to previous studies. This generalizability needs to be tested on a larger number of sequences to ensure its robustness. The model currently recognizes only one action per sequence. We intend to address this limitation in the next phase of our implementation and expand our system to identify more complex activities and recognize sequences involving combinations of actions.
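The multi-task idea summarized in these conclusions can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: each per-frame feature vector is treated as one task, all tasks share a single fully connected classifier, the per-task cross-entropy losses are summed for training, and the per-task class probabilities are averaged at inference. None of these details are specified in this section, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, features):
    """Shared fully connected layer producing one logit per class."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def multitask_loss(weights, bias, frame_features, label):
    """Sum of per-task cross-entropy losses; every task shares the same layer."""
    loss = 0.0
    for features in frame_features:  # assumption: one task per feature vector
        probs = softmax(linear(weights, bias, features))
        loss += -math.log(probs[label])
    return loss


def predict(weights, bias, frame_features):
    """Average per-task class probabilities; return (best class, mean probs)."""
    num_classes = len(bias)
    mean_probs = [0.0] * num_classes
    for features in frame_features:
        probs = softmax(linear(weights, bias, features))
        for c in range(num_classes):
            mean_probs[c] += probs[c] / len(frame_features)
    best = max(range(num_classes), key=lambda c: mean_probs[c])
    return best, mean_probs


# Toy example: 2 classes, 3-dimensional features, 4 tasks (placeholder values).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [0.0, 0.0]
FRAMES = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.3], [1.8, 0.4, 0.0], [2.2, 0.1, 0.2]]

predicted_class, class_probs = predict(W, B, FRAMES)
print("predicted class:", predicted_class)
print("summed loss for label 0:", round(multitask_loss(W, B, FRAMES, 0), 4))
```

Sharing one classifier across tasks keeps the parameter count independent of the number of feature vectors, and averaging at inference lets every task contribute to the final decision.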