Applied Sciences
  • Article
  • Open Access

26 July 2023

A Prior Knowledge-Guided Graph Convolutional Neural Network for Human Action Recognition in Solar Panel Installation Process

1
School of Mechanical Engineering, Tianjin University, No. 92 Weijin Road, Nankai District, Tianjin 300072, China
2
School of Mechanical Engineering, Tianjin Sino-German University of Applied Sciences, No. 2 Ya Shen Road, Haihe Education Park, Tianjin 300350, China
3
School of Aviation and Aerospace, Tianjin Sino-German University of Applied Sciences, No. 2 Ya Shen Road, Haihe Education Park, Tianjin 300350, China
*
Author to whom correspondence should be addressed.

Abstract

Human action recognition algorithms have garnered significant research interest due to their vast application potential. Existing human behavior recognition algorithms primarily focus on recognizing general behaviors using large-scale datasets. However, industrial applications typically impose constraints such as limited sample sizes and high accuracy requirements, necessitating algorithmic improvements. This article proposes a graph convolutional neural network model that combines prior knowledge supervision and attention mechanisms, designed to fulfill the specific action recognition requirements for workers installing solar panels. The model extracts prior knowledge from training data, improving the training effectiveness of action recognition models and enhancing the recognition reliability of special actions. The experimental results demonstrate that the proposed method surpasses traditional models in the accuracy of recognizing solar panel installation actions. The proposed method satisfies the need for highly accurate recognition of specified worker behaviors in industrial applications, showing promising application prospects.

2. Method

2.1. System Overview

Figure 2 shows the system overview of the method proposed in this paper. For the sequence of worker motion images captured using the camera, firstly, the skeleton of the human body is extracted using the pose detection approach in [30]. Secondly, statistical methods are used to extract the skeleton prior knowledge of different actions. Furthermore, combining this prior knowledge, a GCN network guided by prior knowledge is designed and trained. Finally, the network is used to identify the actions of workers corresponding to the captured image sequences online, providing guidance for robot and human collaborative work.
Figure 2. System overview.

2.2. Joint Modeling

To address the particular characteristics of industrial operation tasks, this paper proposes a method for obtaining prior knowledge based on the importance of the joint points involved in each action. In this method, the joint points of the human body are simplified to 14 points, as shown in Figure 3. As input, the human joint points in each RGB image of the video sequences are obtained through the method in [30].
Figure 3. Joint modeling.
These joints are divided into three categories: the end joints, Joint_PE; body joints, Joint_PC; and link joints, Joint_PL. According to this, the human body can be modeled as directed links from Joint_PC to Joint_PL to Joint_PE.
Let $F(\cdot)$ denote the basic pose feature of a joint. For the different kinds of joints, it can be expressed as

$$F(\mathrm{Joint\_PE}) = \{X, \alpha\}, \quad F(\mathrm{Joint\_PC}) = \{X\}, \quad F(\mathrm{Joint\_PL}) = \{X, \alpha, \beta\}$$

where $X$ denotes the 2D/3D spatial coordinates of the joint point, and $\alpha$, $\beta$ correspond to the angles of the link edges at that joint point. For example, in Figure 3, joint 4 has an angle $\alpha$ determined by link edges 2–3 and 3–4, while joint 3 has an angle $\alpha$ determined by link edges 0–2 and 2–3 and an angle $\beta$ determined by link edges 11–2 and 2–3.
For the corresponding body joints on two skeletons, the distance between their features can be defined as

$$|F(J_1) - F(J_2)| = \begin{cases} |X_1 - X_2|, & J_1, J_2 \in \{\mathrm{Joint\_PC}\} \\ |X_1 - X_2| + \left|\tfrac{L}{2}\left(\sin(|\alpha_1 - \alpha_2|) + \sin(|\beta_1 - \beta_2|)\right)\right|, & J_1, J_2 \in \{\mathrm{Joint\_PL}\} \\ |X_1 - X_2| + \left|L \sin(|\alpha_1 - \alpha_2|)\right|, & J_1, J_2 \in \{\mathrm{Joint\_PE}\} \end{cases}$$
where $L$ denotes a length factor determined by the size of the body. Furthermore, we also define the average feature of the corresponding body joints on different skeletons:

$$F(\bar{J}) = \frac{1}{n}\left(F(J_1) + F(J_2) + \cdots + F(J_n)\right) = \{\bar{X}, \bar{\alpha}, \bar{\beta}\}$$
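For illustration, the following minimal Python sketch computes the basic pose feature $F(J)$ for one skeleton frame. The joint-to-category assignment and the link edges per joint are hypothetical placeholders (they follow Figure 3 only loosely), not the exact mapping used in the paper.

```python
import numpy as np

# Hypothetical joint categories for a 14-joint skeleton (indices are illustrative).
JOINT_PC = [0, 1, 2, 11]            # body joints: feature {X}
JOINT_PL = [3, 5, 8, 12]            # link joints: feature {X, alpha, beta}
JOINT_PE = [4, 6, 7, 9, 10, 13]     # end joints: feature {X, alpha}

# For each joint that carries angles, the link edges (pairs of joints) whose
# directions define alpha (and beta for link joints). Only two entries shown.
ANGLE_EDGES = {
    4: [((2, 3), (3, 4))],                       # alpha only
    3: [((0, 2), (2, 3)), ((11, 2), (2, 3))],    # alpha and beta
}

def edge_angle(X, edge_a, edge_b):
    """Angle between two link edges, each given as (joint_i, joint_j)."""
    va = X[edge_a[1]] - X[edge_a[0]]
    vb = X[edge_b[1]] - X[edge_b[0]]
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def basic_pose_feature(X, j):
    """F(J) for joint j of one skeleton frame X with shape (14, 2)."""
    feat = {"X": X[j]}
    if j in JOINT_PC:
        return feat                          # body joint: position only
    edges = ANGLE_EDGES.get(j, [])
    if edges:
        feat["alpha"] = edge_angle(X, *edges[0])
    if j in JOINT_PL and len(edges) > 1:
        feat["beta"] = edge_angle(X, *edges[1])
    return feat
```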

2.3. Prior Knowledge Extraction

Let an action have K samples, and let each sample yield N skeleton frames. Based on the above definition of joint features, we can calculate the irrelevance and significance factors of all the joint points of the action.
The irrelevance factor: This is a measurement of how much a joint J should not be used as a characteristic criterion of an action. It is represented by the deviation of the corresponding joint of different samples.
For multiple samples of an action, $\tau(J_t)$ denotes the average deviation of human joint J at time t:

$$\tau(J_t) = \frac{1}{K} \sum_{k=1}^{K} \left| F(J_t^k) - F(\bar{J_t}) \right|$$
In the actual calculation process, considering that different samples may not be exactly aligned at time t, the statistics can be taken over multiple adjacent time frames:

$$\hat{\tau}(J_t) = \frac{1}{K} \sum_{k=1}^{K} \min_{t' \in [t - \Delta t,\, t + \Delta t]} \left| F(J_{t'}^k) - F(\bar{J_t}) \right|$$
Furthermore, we can obtain a more intuitive metric for joint J, namely the irrelevance factor:

$$\sigma(J) = \frac{1}{N} \sum_{t=1}^{N} \hat{\tau}(J_t)$$
The significance factor: This represents the saliency of joint J when performing the action, represented by the deviation of the corresponding joints of the same action sample at different times. It can also be regarded as an expression of the joint feature 'velocity'.
For multiple samples of an action and human joint J at time t, $\hat{s}(J_t)$ denotes the average 'velocity' of the basic pose feature:

$$\hat{s}(J_t) = \frac{1}{K} \sum_{k=1}^{K} \left| F(J_t^k) - F(J_{t+1}^k) \right|$$
Similarly, the significance factor of joint J is

$$s(J) = \frac{1}{N} \sum_{t=1}^{N} \hat{s}(J_t)$$
It should be noted that, in the field of general motion recognition, a large significance value does not necessarily mean that the action is related to that joint. For example, the action of making a phone call is identified by the long-term, relatively static relationship between the hand, elbow, and head. In the field of industrial production operations, however, the actions are often directly related to the motion of the relevant joints, which is one of the reasons this paper uses the significance factor as a prominent feature.
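As a concrete illustration of Section 2.3, the sketch below estimates the irrelevance factor $\sigma(J)$ and the significance factor $s(J)$ from K samples of one action, assuming each joint feature has already been flattened to a fixed-length numeric vector so that $|F(J_1) - F(J_2)|$ can be taken as a Euclidean distance; the window half-width `delta_t` is an assumption.

```python
import numpy as np

def irrelevance_and_significance(features, delta_t=2):
    """
    features: array of shape (K, N, J, D) -- K samples, N frames,
              J joints, D-dimensional flattened joint feature F(J).
    Returns (sigma, s): per-joint irrelevance and significance factors, shape (J,).
    """
    K, N, J, D = features.shape
    mean_feat = features.mean(axis=0)                 # F(J_t_bar), shape (N, J, D)

    # Relaxed deviation: for each sample, take the minimum distance to the
    # mean feature over a small window of adjacent frames around t.
    tau_hat = np.zeros((N, J))
    for t in range(N):
        lo, hi = max(0, t - delta_t), min(N, t + delta_t + 1)
        d = np.linalg.norm(features[:, lo:hi] - mean_feat[t], axis=-1)  # (K, win, J)
        tau_hat[t] = d.min(axis=1).mean(axis=0)       # min over window, mean over samples

    sigma = tau_hat.mean(axis=0)                      # irrelevance factor per joint

    # Significance: average frame-to-frame feature 'velocity' per joint.
    vel = np.linalg.norm(features[:, 1:] - features[:, :-1], axis=-1)   # (K, N-1, J)
    s = vel.mean(axis=(0, 1))
    return sigma, s
```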

2.4. Graph Convolution Networks

In this section, we introduce how to use prior knowledge to guide GCN networks. Inspired by ST-GCN [43], the network in this paper uses temporal and spatial skeleton data for modeling.
Firstly, based on a skeleton sequence with N joints and T frames featuring both intra-body and inter-frame connections, a graph model is constructed:

$$G = \{V, E\}$$

where $V$ denotes the node set, which includes all the joint points of a skeleton sequence:

$$V = \{ v_{ti} \mid t = 1, \ldots, T;\ i = 1, \ldots, N \}$$

and $E$ denotes the edge set:

$$E = \{E_S, E_T\} = \left\{ \{ v_{ti} v_{tj} \mid (i, j) \in H \},\ \{ v_{ti} v_{(t+1)i} \} \right\}$$

where $E_S$ denotes the intra-skeleton joint connections of a body and $E_T$ denotes the connection of the same body joint at adjacent times t and t + 1. In this paper, we only consider the connections between physically adjacent joints and the connections between the same, temporally adjacent body joints.
Based on the above basic graph structure, we can construct a multi-layer graph convolutional neural network. Specifically, for a node at time t in layer m + 1, its graph convolution feature is calculated as

$$f_{m+1}(v_{tj}) = \sum_{v_{ti} \in B(v_{tj})} \frac{1}{Z_{ij}} f_m(v_{ti}) \, w(v_{ti}, v_{tj})$$

where $B(v_{tj})$ represents the set of nodes in the graph that are connected to $v_{tj}$ (including $v_{tj}$ itself); $Z_{ij}$ is a normalizing term, added to balance the contributions of different subsets to the output; and $w$ is the weight to be learned in the network.
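A minimal PyTorch sketch of this spatial graph convolution is given below, assuming a single spatial partition so that the $1/Z_{ij}$ normalization reduces to symmetric degree normalization of the adjacency matrix with self-loops, as is common in ST-GCN-style implementations; the tensor layout is an assumption, and this is an illustration rather than the exact layer used in the paper.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One spatial graph convolution layer: f_{m+1} = A_norm @ f_m @ W."""

    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))       # include self-connections
        deg = A.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        self.register_buffer("A_norm", d_inv_sqrt @ A @ d_inv_sqrt)  # plays the role of 1/Z_ij
        self.weight = nn.Linear(in_channels, out_channels, bias=False)  # learnable w

    def forward(self, x):
        # x: (batch, T, J, C) -- features f_m(v_ti) for all frames and joints
        x = self.weight(x)                                   # apply w(v_ti, v_tj)
        return torch.einsum("ij,btjc->btic", self.A_norm, x)  # aggregate over B(v_tj)
```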
In the first layer of the network, the features of each joint and the connections in the skeleton sequence obtained from the image sequence are used as the initial input, $f_0(v_{tj})$. In this step, the most commonly used joint features are the spatial position coordinates and the motion speed.
In this paper, prior knowledge is applied to the GCN in the feature setup procedure at the beginning. For the feature setup step, we encode the following features as the feature vector of a given joint (a minimal sketch of this feature assembly follows the list):
  • The basic pose feature $F(J)$ introduced in Section 2.2;
  • The features $\hat{\tau}(J_t)$ and $\hat{s}(J_t)$ introduced in Section 2.3;
  • The Euclidean distance from the body's center of gravity to the joint.
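The sketch below shows one way these three parts might be concatenated into a single feature vector per joint and frame; the argument names and dimensions are assumptions, not the exact encoding used in the paper.

```python
import numpy as np

def joint_feature_vector(pose_feat, tau_hat_tj, s_hat_tj, X_t, center_of_gravity):
    """
    pose_feat:            flattened basic pose feature F(J) of one joint, e.g. [x, y, alpha, beta]
                          (missing angles padded with 0 for body/end joints)
    tau_hat_tj, s_hat_tj: the prior-knowledge statistics of this joint at time t
    X_t:                  2D/3D coordinates of this joint at time t
    center_of_gravity:    coordinates of the body's center of gravity at time t
    """
    dist_to_cog = np.linalg.norm(X_t - center_of_gravity)
    return np.concatenate([pose_feat, [tau_hat_tj, s_hat_tj, dist_to_cog]])
```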
In the training step, an attention mechanism based on prior knowledge is also introduced:
Before the input reaches the network, joints are divided into two categories based on prior knowledge: one type, marked as 1, for joint points that significantly reflect the action, and the other type, marked as 0, for joints unrelated to the action. For a certain action, this classification is based on the following method:
Firstly, divide the human skeleton into ordered connections from body joints to limb joints. One example of this is listed in Table 1.
Table 1. Body parts of joints.
Then, for each body part P, based on the irrelevance and significance factors of each joint, the significance of the body part is calculated as

$$S(P) = \sum_{J_i \in P} \rho_i \, \frac{s(J_i)}{\sigma(J_i)}$$

where $\rho_i$ is a weight factor that varies depending on the type of joint.
For a certain action, the joints on the body parts with larger $S(P)$ are marked as 1, while the remaining joints are marked as 0. For the input of the network, only the features of joints marked 1 are used for training, while joints marked 0 do not participate in training. This approach improves efficiency by introducing prior knowledge and setting the inputs in a form similar to supervision.
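The following sketch shows one way to compute $S(P)$ per body part and derive the binary joint mask used as supervision-like input. The body-part table, the weight factors $\rho_i$, and the selection rule (keeping the top-scoring parts) are assumptions based on the description above, not the paper's exact thresholds.

```python
import numpy as np

def significant_joint_mask(body_parts, s, sigma, rho, top_k=2):
    """
    body_parts: dict mapping part name -> list of joint indices (as in Table 1)
    s, sigma:   per-joint significance and irrelevance factors, shape (J,)
    rho:        per-joint weight factors rho_i, shape (J,)
    top_k:      how many highest-scoring body parts are marked significant (assumption)
    Returns a 0/1 mask over the J joints.
    """
    scores = {
        name: sum(rho[j] * s[j] / (sigma[j] + 1e-8) for j in joints)
        for name, joints in body_parts.items()
    }
    keep = sorted(scores, key=scores.get, reverse=True)[:top_k]

    mask = np.zeros(len(s), dtype=int)
    for name in keep:
        mask[body_parts[name]] = 1        # joints on significant parts -> 1
    return mask
```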

3. Experiment Results

3.1. Datasets of the Action of Solar Panel Installation

In order to verify the effectiveness of the algorithm in this paper, we collected 2460 videos of workers' operations during the installation of solar panels. These videos show standard actions performed by different workers and captured from different camera perspectives.
There are four main types of actions: 'pendulum' (Figure 4), 'stay wire' (Figure 5), 'threading' (Figure 6), and 'tie' (Figure 7). They correspond to four different steps of the installation process. The purpose of the algorithms studied in this paper is to correctly distinguish these four types of actions from video, providing guidance for the automatic operation and control of robotic arms that cooperate with humans.
Figure 4. Example of ‘pendulum’ step.
Figure 5. Example of ‘stay wire’ step.
Figure 6. Example of ‘threading’ step.
Figure 7. Example of ‘tie’ step.
Before training, in order to prevent over-fitting, data augmentation was used to pre-process all video data: the video frames were resized to 256 × 340, then cropped to 224 × 224, and horizontal flipping, corner cropping, and multi-angle cropping were performed. It should be noted that, because many image-based action recognition methods are limited by GPU performance and have difficulty supporting high-resolution video data, we unified the videos at a lower resolution to facilitate comparison with related methods.
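A minimal pre-processing sketch reflecting the augmentation described above (resize to 256 × 340, 224 × 224 cropping at random or corner positions, random horizontal flip); the use of OpenCV and the crop/flip probabilities are assumptions about the tooling, not the original pipeline.

```python
import random

import cv2
import numpy as np

def augment_frame(frame):
    """Pre-process one RGB video frame as described in Section 3.1."""
    frame = cv2.resize(frame, (340, 256))            # 256 x 340 frame (cv2 uses width, height)

    # Pick a 224 x 224 crop: either a random position or one of the corners.
    h, w = frame.shape[:2]
    if random.random() < 0.5:
        y, x = random.randint(0, h - 224), random.randint(0, w - 224)
    else:
        y, x = random.choice([(0, 0), (0, w - 224), (h - 224, 0), (h - 224, w - 224)])
    crop = frame[y:y + 224, x:x + 224]

    # Random horizontal flip.
    if random.random() < 0.5:
        crop = cv2.flip(crop, 1)
    return crop
```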

3.2. Experiment Setup

The hardware for the experiments in this article is an x86 computer with an Intel Core i7-12700 CPU, 32 GB of RAM, and an Nvidia GeForce RTX 3090 graphics card. The software environment is Ubuntu 18.04 Linux with a Python deep learning framework and CUDA 10.0.
In the implementation process, firstly, for videos containing humans, a human skeleton recognition algorithm is used to extract the human skeleton. Subsequently, the skeleton data of all actions are extracted according to the method in this article, including the prior knowledge, features, and joint regions labeled as action-significant; the skeleton data are then input into a deep neural network for training. The deep neural network used in this article is similar to ST-GCN [43] and consists of nine layers of spatio-temporal graph convolution operators. The first three layers have 64 output channels, the middle three layers have 128 output channels, and the last three layers have 256 output channels. The applied structure of the graph convolution layers is shown in Figure 8. The model adopts a batch size of 64, stochastic gradient descent as the optimization strategy, and cross-entropy as the loss function for gradient back-propagation, with the weight decay initialized to 0.0005. The algorithms compared in this article are the original ST-GCN [43] and SlowFast [41], both with default parameters.
Figure 8. The applied structure of the graph convolution layers.
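A training-setup sketch consistent with the configuration above (nine layers with 64/128/256 output channels, batch size 64, SGD, cross-entropy loss, weight decay 0.0005). The input channel count, momentum, and learning rate are assumptions not stated in the paper, and `model` stands for any ST-GCN-style network built from spatio-temporal graph convolution blocks.

```python
import torch
import torch.nn as nn

# Channel progression of the nine spatio-temporal graph convolution layers
# (the input dimension of 3 is an assumption).
CHANNELS = [(3, 64), (64, 64), (64, 64),
            (64, 128), (128, 128), (128, 128),
            (128, 256), (256, 256), (256, 256)]

def build_training(model, lr=0.1):
    """Optimizer and loss as described in Section 3.2 (lr and momentum are assumptions)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=0.0005)
    criterion = nn.CrossEntropyLoss()
    return optimizer, criterion

# Usage (batch size 64, as in the paper):
# loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
# for skeletons, labels in loader:
#     loss = criterion(model(skeletons), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```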
Detected significant joints for each action: For the four actions in the datasets collected in this paper, the significant joints identified for each action are shown in Table 2. The second column shows the significant joints (body parts) obtained through the prior detection and statistics method in Section 2. The third column shows manually designed significant joints for the specific action. It can be seen that all movements are closely related to the hand joints, which is consistent with the standard operation design of each process: in production, effective human operations are performed mainly with the hands, while the extension or bending of the legs mainly serves to move the worker into position. Meanwhile, the prior detection and statistics method relies on diverse data and can also discover hidden relationships with other joints that were not found manually. In order to better compare the statistics-based joints with the manually specified ones, we also tested the recognition results of our method when the significant joints were specified manually.
Table 2. The significant joints.
Pre-detection of human skeleton and body area: Figure 9 shows some intermediate results of human skeleton detection and human body area detection from the raw image sequences using methods provided by Detectron2 [30], referred to as the pre-detection step. The human skeleton data are then used as inputs for the GCN in this paper and in [36], and the human body area data are used as inputs for the method in [34].
Figure 9. The intermediate results. (a) Human skeleton detection; (b) human body area detection.
It should be noted that, due to the limitations of perspective, image quality, and the algorithms used for human area detection and skeleton detection, there are certain errors in extracting human skeletons and mask regions from the original video. On the datasets collected in this paper, the accuracy of the skeleton recognition algorithm is about 93.6% (defined as all skeleton joints being correctly recognized), and the accuracy of human body recognition is 98.7% (defined as the human body being correctly extracted, with non-human or missing areas of less than 5%). The main interference comes from the reflective area of the solar panels. Since this paper does not consider errors caused by failed pre-detection, we first removed all erroneous pre-detection data manually.
The experiment randomly extracts 70% of the dataset, which is then expanded and augmented and used to train the model, while the remaining 30% is used for validation. The top-1 and top-2 recognition accuracies were calculated, and the final experimental results are shown in Table 3. As mentioned earlier, our method is evaluated in two ways: one directly uses manually defined significant joints for training the different actions, and the other uses statistically detected significant joints. Overall, the method proposed in this paper outperforms the comparison methods in the accuracy of identifying the specified actions, especially for the difficult 'tie' action, where recognition accuracy is significantly improved.
Table 3. The experiment results.
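For reference, a small sketch of how the top-1/top-2 accuracies reported in Table 3 can be computed from model logits; the tensor shapes are assumptions.

```python
import torch

def topk_accuracy(logits, labels, k=1):
    """
    logits: (num_samples, num_classes) prediction scores
    labels: (num_samples,) ground-truth class indices
    Returns the fraction of samples whose true class is among the k highest scores.
    """
    topk = logits.topk(k, dim=1).indices                  # (num_samples, k)
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

# top1 = topk_accuracy(logits, labels, k=1)
# top2 = topk_accuracy(logits, labels, k=2)
```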
Time consumption: Because most behavior recognition algorithms are designed for fixed-length image sequences, they cannot be directly applied to online action recognition. To verify the feasibility of the proposed algorithm in engineering applications, we also designed experiments to test its execution time. Specifically, for continuously collected image sequences, firstly, a human pose recognition algorithm is run to detect and extract skeleton poses. Then, a sliding window is used to package the N most recent skeleton poses (N can be taken as 120 frames, about 5 s) and extract features as input to the proposed network. Finally, the proposed GCN prediction is run to obtain the output result. We calculated the average execution time of each step, and the results are shown in Table 4.
Table 4. The time consumption of each step.
It should be noted that, even without detailed optimization of the algorithm, after a buffering time of about 5 s it can continuously output action recognition results more than twice per second on average, which essentially meets the needs of online applications.
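A sketch of the online sliding-window pipeline described above: skeletons are buffered as frames arrive and, once 120 frames (about 5 s) are available, the most recent window is packaged and passed to the trained network. The pose-detector and model interfaces, as well as the prediction stride, are placeholders.

```python
from collections import deque

import numpy as np
import torch

WINDOW = 120   # frames, roughly 5 s of video

def run_online(frame_stream, pose_detector, model, stride=10):
    """Continuously recognize actions from a stream of frames (stride is an assumption)."""
    buffer = deque(maxlen=WINDOW)
    for i, frame in enumerate(frame_stream):
        skeleton = pose_detector(frame)          # placeholder: returns (J, D) joint features
        buffer.append(skeleton)
        if len(buffer) == WINDOW and i % stride == 0:
            window = np.stack(buffer)            # (T, J, D)
            x = torch.from_numpy(window).float().unsqueeze(0)   # add batch dimension
            with torch.no_grad():
                action = model(x).argmax(dim=1).item()
            yield i, action                      # frame index and predicted action class
```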

4. Discussion

For the application of recognizing workers' operation steps, the algorithm that directly uses RGB image input [34] without a skeleton performs relatively poorly, especially on the 'tie' and 'pendulum' actions. This is mainly due to the lack of diversity in the training data. In many cases, we can only find a few people to produce training data, and their body shapes and clothing are relatively uniform, which leads to over-fitting. When different people appear in the test set, recognition methods that directly use the entire RGB image of the human body easily fail. From this point of view, approaches based on skeleton data can significantly improve performance.
However, it must be noted that, in the pre-detection stage, the accuracy of human skeleton detection is several percentage points lower than that of human mask detection in RGB images because it is a more difficult task. In other words, when using action recognition algorithms based on the human skeleton, the accuracy of the pre-detection must also be taken into account. In addition, our current work does not consider the problem of occlusion [50], which occurs in many real scenes.
Overall, compared to representative general action recognition schemes, the accuracy of the proposed method is higher. This is mainly because, in solar panel installation, some movements involve only a small overall range of body motion, and the differences are reflected only in small hand movements, which are difficult to distinguish with general motion recognition methods. The prior knowledge-based attention enhancement mechanism proposed in this article alleviates this problem by amplifying small changes. The comparison between manually designed significant joints and automatically detected significant joints shows that the automatic detection method achieves better recognition results. This may be because, although a person performs a certain action locally, other body parts still participate in coordination, and automatic detection can discover key information that is not directly captured by manual definition.

5. Conclusions and Future Work

In some special industrial application scenarios, human action recognition is needed for the collaboration of humans and robots. This can be difficult because of high accuracy requirements, a lack of training data, or low action discriminability. In this paper, we propose a new method for extracting prior knowledge and integrating it into a GCN to improve recognition reliability. The experiments show that the proposed method achieves better action recognition results on a solar panel installation action dataset.
Our work still has room for improvement. In follow-up work, we plan to improve the following aspects: (1) using multi-view image input to mitigate the inaccurate motion recognition caused by inaccurate skeleton recognition under partial occlusion, and (2) thoroughly combining close-range hand motion recognition with overall human motion recognition to enhance motion discrimination and improve accuracy.

Author Contributions

Software, X.Z.; formal analysis, C.W.; resources, Y.Z. and J.L.; writing—original draft preparation, J.W.; writing—review and editing, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2021 Tianjin Applied Basic Research Multiple Investment Fund (Grant No. 21JCQNJC00830, 21JCZDJC00820).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, J.; Wang, P.; Gao, R.X. Hybrid machine learning for human action recognition and prediction in assembly. Robot. Comput. Integr. Manuf. 2021, 72, 102184. [Google Scholar] [CrossRef]
  2. Inkulu, A.K.; Raju Bahubalendruni, M.V.A.; Dara, A.; SankaranarayanaSamy, K. Challenges and opportunities in human robot collaboration context of Industry 4.0—A state of the art review. Ind. Robot 2022, 49, 226–239. [Google Scholar] [CrossRef]
  3. Garcia, P.P.; Santos, T.G.; Machado, M.A.; Mendes, N. Deep Learning Framework for Controlling Work Sequence in Collaborative Human–Robot Assembly Processes. Sensors 2023, 23, 553. [Google Scholar] [CrossRef]
  4. Chen, C.; Wang, T.; Li, D.; Hong, J. Repetitive assembly action recognition based on object detection and pose estimation. J. Manuf. Syst. 2020, 55, 325–333. [Google Scholar] [CrossRef]
  5. Ramanathan, M.; Yau, W.-Y.; Teoh, E.K. Human action recognition with video data: Research and evaluation challenges. IEEE Trans. Hum. Mach. Syst. 2014, 44, 650–663. [Google Scholar] [CrossRef]
  6. Gupta, N.; Gupta, S.K.; Pathak, R.K.; Jain, V.; Rashidi, P.; Suri, J.S. Human activity recognition in artificial intelligence framework: A narrative review. Artif. Intell. Rev. 2022, 55, 4755–4808. [Google Scholar] [CrossRef]
  7. Zhou, X.; Liang, W.; Wang, K.I.-K.; Wang, H.; Yang, L.T.; Jin, Q. Deep-Learning-Enhanced Human Activity Recognition for Internet of Healthcare Things. IEEE Internet Things J. 2020, 7, 6429–6438. [Google Scholar] [CrossRef]
  8. Mendes, N. Surface Electromyography Signal Recognition Based on Deep Learning for Human-Robot Interaction and Collaboration. J. Intell. Robot. Syst. 2022, 105, 42. [Google Scholar] [CrossRef]
  9. Yao, L.; Sheng, Q.Z.; Li, X.; Gu, T.; Tan, M.; Wang, X.; Wang, S.; Ruan, W. Compressive Representation for Device-Free Activity Recognition with Passive RFID Signal Strength. IEEE Trans. Mob. Comput. 2017, 17, 293–306. [Google Scholar] [CrossRef]
  10. Zhang, H.; Parker, L.E. CoDe4D: Color-Depth Local Spatio-Temporal Features for Human Activity Recognition From RGB-D Videos. IEEE Trans. Circuits Syst. Video Technol. 2014, 26, 541–555. [Google Scholar] [CrossRef]
  11. Chen, C.; Jafari, R.; Kehtarnavaz, N. A survey of depth and inertial sensor fusion for human action recognition. Multimedia Tools Appl. 2017, 76, 4405–4425. [Google Scholar] [CrossRef]
  12. Bobick, A.; Davis, J. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 257–267. [Google Scholar] [CrossRef]
  13. Weinland, D.; Ronfard, R.; Boyer, E. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 2006, 104, 249–257. [Google Scholar] [CrossRef]
  14. Laptev, I. On space-time interest points. Int. J. Comput. Vis. 2005, 64, 107–123. [Google Scholar] [CrossRef]
  15. Dollár, P.; Rabaud, V.; Cottrell, G.; Belongie, S. Behavior recognition via sparse spatio-temporal features. In Proceedings of the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, 15–16 October 2005; pp. 65–72. [Google Scholar] [CrossRef]
  16. Laptev, I.; Marszalek, M.; Schmid, C.; Rozenfeld, B. Learning realistic human actions from movies. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar] [CrossRef]
  17. Klaser, A.; Marszałek, M.; Schmid, C. A spatio-temporal descriptor based on 3d-gradients. In Proceedings of the 19th British Machine Vision Conference, Leeds, UK, 1–4 September 2008; pp. 275:1–275:10. [Google Scholar]
  18. Wang, H.; Kläser, A.; Schmid, C.; Liu, C.-L. Action Recognition by Dense Trajectories. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3169–3176. [Google Scholar] [CrossRef]
  19. Sun, J.; Wu, X.; Yan, S.; Cheong, L.-F.; Chua, T.-S.; Li, J. Hierarchical spatio-temporal context modeling for action recognition. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2004–2011. [Google Scholar] [CrossRef]
  20. Zelnik-Manor, L.; Irani, M. Statistical analysis of dynamic actions. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1530–1535. [Google Scholar] [CrossRef] [PubMed]
  21. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; MIT Press: Cambridge, MA, USA, 2014; Volume 27, pp. 1345–1367. [Google Scholar]
  22. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Li, F.-F. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. [Google Scholar] [CrossRef]
  23. Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2718–2726. [Google Scholar] [CrossRef]
  24. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar] [CrossRef]
  25. Sharma, S.; Kiros, R.; Salakhutdinov, R. Action recognition using visual attention. arXiv 2015, arXiv:1511.04119. [Google Scholar]
  26. Wu, Z.; Wang, X.; Jiang, Y.-G.; Ye, H.; Xue, X. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 461–470. [Google Scholar] [CrossRef]
  27. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  28. Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517. [Google Scholar] [CrossRef]
  29. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4489–4497. [Google Scholar] [CrossRef]
  30. Zhou, Y.; Sun, X.; Luo, C.; Zha, Z.; Zeng, W. Spatiotemporal fusion in 3d CNNs: A probabilistic view. In Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1725–1732. [Google Scholar] [CrossRef]
  31. Kim, J.; Cha, S.; Wee, D.; Bae, S.; Kim, J. Regularization on spatio-temporally smoothed feature for action recognition. In Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12103–12112. [Google Scholar] [CrossRef]
  32. Wang, L.; Huynh, D.Q.; Koniusz, P. A comparative review of recent kinect-based action recognition algorithms. IEEE Trans. Image Process. 2019, 29, 15–28. [Google Scholar] [CrossRef]
  33. Mehta, D.; Sridhar, S.; Sotnychenko, O.; Rhodin, H.; Shafiei, M.; Seidel, H.-P.; Xu, W.; Casas, D.; Theobalt, C. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Trans. Graph. 2017, 36, 1–14. [Google Scholar] [CrossRef]
  34. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.-L.; Chen, S.-C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. 2018, 51, 1–36. [Google Scholar] [CrossRef]
  35. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar] [CrossRef]
  36. Girshick, R.; Radosavovic, I.; Gkioxari, G.; Dollár, P.; He, K. Detectron. 2018. Available online: https://github.com/facebookresearch/detectron (accessed on 20 June 2023).
  37. Yuxin, W.; Alexander, K.; Francisco, M.; Wan-Yen, L.; Ross, G. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 20 June 2023).
  38. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar] [CrossRef]
  39. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the 2019 IEEE International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 1227–1236. [Google Scholar] [CrossRef]
  40. Hou, Y.; Li, Z.; Wang, P.; Li, W. Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 807–811. [Google Scholar] [CrossRef]
  41. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6201–6210. [Google Scholar] [CrossRef]
  42. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7912–7921. [Google Scholar] [CrossRef]
  43. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  44. Zhang, Y.; Wu, B.; Li, W.; Duan, L.; Gan, C. STST: Spatial-temporal specialized transformer for skeleton-based action recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 3229–3237. [Google Scholar] [CrossRef]
  45. Ahmad, T.; Mao, H.; Lin, L.; Tang, G. Action recognition using attention-joints graph convolutional neural networks. IEEE Access 2019, 8, 305–313. [Google Scholar] [CrossRef]
  46. Chen, Y.; Ma, G.; Yuan, C.; Li, B.; Zhang, H.; Wang, F.; Hu, W. Graph convolutional network with structure pooling and joint-wise channel attention for action recognition. Pattern Recognit. 2020, 103, 107321. [Google Scholar] [CrossRef]
  47. Tunga, A.; Nuthalapati, S.V.; Wachs, J. Pose-based Sign Language Recognition using GCN and BERT. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikola, HI, USA, 5–9 January 2021; pp. 31–40. [Google Scholar] [CrossRef]
  48. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar] [CrossRef]
  49. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  50. Sahoo, S.P.; Modalavalasa, S.; Ari, S. A sequential learning framework to handle occlusion in human action recognition with video acquisition sensors. Digit. Signal Process. 2022, 131, 103763. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
