Editorial

Special Issue on Deep Learning-Based Action Recognition

Division of Computer Science and Engineering, CAIIT, Jeonbuk National University, Jeonju 54896, Korea
Appl. Sci. 2022, 12(15), 7834; https://doi.org/10.3390/app12157834
Submission received: 1 August 2022 / Accepted: 3 August 2022 / Published: 4 August 2022
(This article belongs to the Special Issue Deep Learning-Based Action Recognition)

1. Introduction

Human action recognition (HAR) has gained popularity because of its wide range of applications, such as human–object interaction [1], intelligent surveillance [2], virtual reality [3], and autonomous driving [4]. The demand for HAR, as well as for gesture and pose estimation, is growing rapidly, and various methods have been introduced in response. Features can be extracted from images or videos by descriptors such as the local binary pattern, scale-invariant feature transform, histogram of oriented gradients, and histogram of optical flow, and then used to identify action types. Recently, deep learning networks have been deployed in many challenging areas, such as image classification and object detection, and action recognition is likewise an ideal area for their application. One of the primary advantages of deep learning is its ability to automatically learn representative features from large-scale data. As long as sufficient data are available, action recognition coupled with a deep learning network can perform more efficiently than traditional image processing methods.
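As a toy illustration of the hand-crafted descriptors mentioned above, the following sketch computes a HOG-style histogram of gradient orientations in plain NumPy. The patch, bin count, and normalization are our own simplifications for illustration, not taken from any cited paper or library implementation.

```python
import numpy as np

def orientation_histogram(patch: np.ndarray, bins: int = 9) -> np.ndarray:
    """Histogram of gradient orientations for one image patch (HOG-style)."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    # Orientations folded into [0, 180) degrees, as in classic HOG.
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 180.0),
                           weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A patch with a purely horizontal intensity ramp: the gradient points along
# x, so nearly all energy falls into the first (0 degree) orientation bin.
patch = np.tile(np.arange(8.0), (8, 1))
hist = orientation_histogram(patch)
```

A full HOG pipeline would additionally tile the image into cells and blocks; this sketch only shows the per-patch histogram that such descriptors are built from.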

2. Scope of Action Recognition

The research results on deep learning-based HAR collected here are interpreted primarily in light of the above understanding. However, given the challenging nature of HAR, further research from various perspectives is still needed.
Recognizing a subject’s posture must precede action recognition. Pose estimation is usually based on a skeleton model, which consists of joint points and their connections. A specific action can then be predicted by estimating a person’s pose from the joint and skeletal information.
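The skeleton model described above can be sketched as a small graph of joints and their connections; the adjacency matrix below is the structure a graph CNN would consume. The joint names and edges are illustrative, not those of any particular dataset or cited paper.

```python
import numpy as np

# A toy skeleton model: joint points plus their connections.
JOINTS = ["head", "neck", "l_shoulder", "r_shoulder", "l_hand", "r_hand",
          "hip", "l_knee", "r_knee", "l_foot", "r_foot"]
EDGES = [("head", "neck"), ("neck", "l_shoulder"), ("neck", "r_shoulder"),
         ("l_shoulder", "l_hand"), ("r_shoulder", "r_hand"), ("neck", "hip"),
         ("hip", "l_knee"), ("hip", "r_knee"),
         ("l_knee", "l_foot"), ("r_knee", "r_foot")]

def adjacency(joints, edges) -> np.ndarray:
    """Symmetric adjacency matrix with self-loops, the typical graph-CNN input."""
    idx = {name: i for i, name in enumerate(joints)}
    a = np.eye(len(joints))
    for u, v in edges:
        a[idx[u], idx[v]] = a[idx[v], idx[u]] = 1.0
    return a

A = adjacency(JOINTS, EDGES)
```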
The backbone network for action recognition may be either a regular convolutional neural network (CNN) or a graph CNN. Unlike pose estimation, which operates at a fixed point in time, action recognition can be made more effective by adding temporal information to the spatial information of an object. In some cases the subject of action recognition is a single person, but when multiple people appear in the same scene, the actions of all of them must be processed. Including temporal information about an object’s movement greatly helps in recognizing specific actions, because it captures the minute movements that cumulatively constitute those actions, and the same technique can be applied when multiple people are present in a scene. If static action recognition is supplied with sufficient temporal data, it becomes possible to analyze actions captured in videos.
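One minimal way to picture "spatial plus temporal information" is to pair a clip with its frame-to-frame differences. This NumPy sketch is our own simplification of the idea, not a method from any paper in the special issue.

```python
import numpy as np

def spatio_temporal_clip(frames: np.ndarray):
    """Pair a clip with its frame-to-frame differences.

    frames: (T, H, W) grayscale clip. Returns the spatial stack unchanged
    plus a (T-1, H, W) temporal-difference stack, a minimal stand-in for
    the motion cues that spatio-temporal networks learn.
    """
    spatial = frames
    temporal = np.diff(frames.astype(float), axis=0)
    return spatial, temporal

clip = np.zeros((4, 2, 2))
clip[2:] = 1.0   # "movement" happens between frame 1 and frame 2
spatial, temporal = spatio_temporal_clip(clip)
```

A real network would feed both stacks (or a fused tensor) into convolutional layers; here the difference stack simply makes the moment of movement explicit.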
Gestures can convey intentions through various local movements of the arms or fingers, in a confined space and with a limited range of motion. Gesture recognition can therefore serve as an important component of action recognition, and this special issue has also published research papers focused on it.

3. Deep Learning-Based Action Recognition

Many researchers are interested in and conducting deep learning-based action recognition research. Of the approximately 35 papers submitted to this special issue, 12 were accepted (a 34.2% acceptance rate). The special issue mainly covers training data, pose estimation of objects, action recognition, and gesture recognition. Rey et al. [5] present an approach to the data shortage problem in deep learning, extracting synthesized accelerometer and gyroscope norm data from video for human activity recognition scenarios.
Two papers focus on pose estimation. The first, by S. Kim and H. Lee, introduces the Lightweight Stacked Hourglass Network [6], which expands the convolutional receptive field while reducing the computational load and providing scale invariance. The second, authored by J. Wu and H. Lee [7], proposes a Partition Pose Representation, which integrates a person instance and their body joints based on joint offsets. They also propose a Partitioned Center Pose Network, which detects people and their body joints simultaneously and then groups all the body joints.
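Joint-offset grouping can be illustrated with a simplified sketch: each detected joint carries a predicted offset pointing toward its person's center, and the joint is assigned to the nearest detected center. This is our hedged reading of the general idea, not the exact procedure of the cited paper.

```python
import numpy as np

def group_joints(centers: np.ndarray, joints: np.ndarray,
                 offsets: np.ndarray) -> np.ndarray:
    """Assign each joint to the person center its offset points closest to.

    centers: (P, 2) detected person centers; joints: (J, 2) joint positions;
    offsets: (J, 2) predicted offsets toward the owning person's center.
    Returns a person index per joint.
    """
    pointed = joints + offsets                       # where each joint "votes"
    d = np.linalg.norm(pointed[:, None, :] - centers[None, :, :], axis=-1)
    return d.argmin(axis=1)

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
joints = np.array([[1.0, 1.0], [9.0, 11.0]])
offsets = np.array([[-1.0, -1.0], [1.0, -1.0]])
owner = group_joints(centers, joints, offsets)      # joint 0 -> person 0, joint 1 -> person 1
```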
Four papers deal with action recognition directly using convolutional networks. The first, authored by Dong et al. [8], introduces high-order spatial and temporal features of skeleton data, such as velocity, acceleration, and relative distance, to construct graph convolutional networks. The other three papers adapt the spatio-temporal concept to extract better features. Tasnim et al. [9] suggest a spatio-temporal image formation technique for 3D skeleton joints that captures spatial information and temporal changes for action discrimination. J. Kim and J. Cho [10] propose a low-cost embedded model that extracts spatial feature maps by applying a CNN to the images that compose the video and uses the frame change rate of sequential images as temporal information. Low complexity is achieved by transforming the weighted spatial feature maps into spatio-temporal features and feeding them into multilayer perceptrons. K. Hu et al. [11] propose an improved Long Short-Term Memory (LSTM) network that can extract temporal information; an input differential feature module and a spatial memory state differential module strengthen the features of actions. A. Stergiou et al. [12] introduce the concept of class regularization, which regularizes feature map activations based on the classes of the training examples, essentially amplifying or suppressing activations based on an educated guess of the given class.
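The weighting idea in [10] can be caricatured as follows: per-frame spatial feature vectors are weighted by the frame change rate and collapsed into a single spatio-temporal vector for a multilayer perceptron. The shapes, the change-rate estimate, and the weighting rule here are illustrative assumptions of ours, not the published model.

```python
import numpy as np

def spatio_temporal_vector(feature_maps: np.ndarray,
                           frames: np.ndarray) -> np.ndarray:
    """Collapse per-frame spatial features into one vector for an MLP.

    feature_maps: (T, D) spatial features, one row per frame (e.g. from a
    CNN). frames: (T, H, W) raw frames used to estimate the frame change
    rate; frames with more change receive more weight.
    """
    change = np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2))
    weights = np.concatenate([[0.0], change])    # first frame has no change
    total = weights.sum()
    if total == 0:
        weights = np.full(len(frames), 1.0 / len(frames))
    else:
        weights = weights / total
    return weights @ feature_maps                # (D,) input to the MLP

feats = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
frames = np.zeros((3, 2, 2))
frames[2] = 1.0                                  # change only before frame 2
vec = spatio_temporal_vector(feats, frames)      # dominated by frame 2's features
```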
Four papers focus on gesture recognition. A gesture generally consists of a series of continuous actions, so past actions must be memorized, and each of the four papers proposes a distinct method for doing so. N. Nguyen et al. [13] present a dynamic gesture recognition approach using multiple features extracted from RGB frames and 3D skeleton joint information. N. Do et al. [14] exploit depth and skeletal data for dynamic hand gesture recognition and explore a multi-level feature LSTM with a pyramid and LSTM block to handle the diversity of hand features. Y. Chu et al. [15] present a neural network for sensor-based hand gesture recognition extended from the PairNet. N. Nguyen et al. [16] present another dynamic hand gesture recognition approach with two modules, gesture spotting and gesture classification, which use a bidirectional LSTM and a single LSTM, respectively.
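Gesture spotting, i.e., deciding which frames of a stream contain a gesture at all, can be illustrated with a deliberately simple motion-energy heuristic standing in for the learned LSTM spotters discussed above. The threshold and segmentation rule are our own assumptions.

```python
import numpy as np

def spot_gestures(joint_seq: np.ndarray, threshold: float = 0.5):
    """Split a joint sequence into candidate gesture segments.

    joint_seq: (T, J, 2) joint coordinates per frame. Frames whose total
    joint motion exceeds `threshold` are treated as "gesturing"; maximal
    runs of such frames become half-open segments (start, end).
    """
    motion = np.linalg.norm(np.diff(joint_seq, axis=0), axis=-1).sum(axis=-1)
    active = np.concatenate([[False], motion > threshold])
    segments, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments

seq = np.zeros((6, 2, 2))
seq[2], seq[3], seq[4], seq[5] = 1.0, 2.0, 3.0, 3.0   # move in frames 2-4, then hold
segs = spot_gestures(seq)                             # one segment: frames 2-4
```

A classifier (such as the single LSTM in [16]) would then label each spotted segment.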

4. Future Action Recognition

Traditionally, action recognition has been performed directly on videos or images in a single-layered manner, with spatio-temporal features extracted as 2D feature descriptors and rather simple action classes, such as walking, jumping, or raising a hand. However, as computing power improves and deep learning techniques are naturally applied to action recognition, many researchers are optimistic about its potential. New data on human actions are accumulated every day, learning techniques keep improving, and the need for applications related to action recognition is rapidly increasing. Recognition is therefore being attempted by extracting 3D feature values for each intrinsic action, and various modifications of deep networks are being explored to reduce computational complexity. Ultimately, a deep learning method that can recognize complex actions occurring in the real world is expected to be developed in the future.

Funding

This research was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (GR2019R1D1A3A03103736), and in part by the project for Joint Demand Technology R&D of Regional SMEs, funded by the Korean Ministry of SMEs and Startups in 2021 (S3035805).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Dawar, N.; Kehtarnavaz, N. Continuous Detection and Recognition of Actions of Interest among Actions of Non-Interest Using a Depth Camera. In Proceedings of the IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017.
  2. Wei, H.; Laszewski, M.; Kehtarnavaz, N. Deep Learning-Based Person Detection and Classification for Far Field Video Surveillance. In Proceedings of the 13th IEEE Dallas Circuits and Systems Conference, Dallas, TX, USA, 2–12 November 2018.
  3. Fangbemi, A.; Liu, B.; Yu, N.; Zhang, Y. Efficient Human Action Recognition Interface for Augmented and Virtual Reality Applications Based on Binary Descriptor. In Proceedings of the 5th International Conference, AVR 2018, Otranto, Italy, 24–27 June 2018.
  4. Chen, L.; Ma, N.; Wang, P.; Li, J.; Wang, P.; Pang, G.; Shi, X. Survey of Pedestrian Action Recognition Techniques for Autonomous Driving. Tsinghua Sci. Technol. 2020, 25, 458–470.
  5. Fortes Rey, V.; Garewal, K.K.; Lukowicz, P. Translating Videos into Synthetic Training Data for Wearable Sensor-Based Activity Recognition Systems Using Residual Deep Convolutional Networks. Appl. Sci. 2021, 11, 3094.
  6. Kim, S.-T.; Lee, H.J. Lightweight Stacked Hourglass Network for Human Pose Estimation. Appl. Sci. 2020, 10, 6497.
  7. Wu, J.; Lee, H.-J. A New Multi-Person Pose Estimation Method Using the Partitioned CenterPose Network. Appl. Sci. 2021, 11, 4241.
  8. Dong, J.; Gao, Y.; Lee, H.J.; Zhou, H.; Yao, Y.; Fang, Z.; Huang, B. Action Recognition Based on the Fusion of Graph Convolutional Networks with High Order Features. Appl. Sci. 2020, 10, 1482.
  9. Tasnim, N.; Islam, M.K.; Baek, J.-H. Deep Learning Based Human Activity Recognition Using Spatio-Temporal Image Formation of Skeleton Joints. Appl. Sci. 2021, 11, 2675.
  10. Kim, J.; Cho, J. Low-Cost Embedded System Using Convolutional Neural Networks-Based Spatiotemporal Feature Map for Real-Time Human Action Recognition. Appl. Sci. 2021, 11, 4940.
  11. Hu, K.; Zheng, F.; Weng, L.; Ding, Y.; Jin, J. Action Recognition Algorithm of Spatio–Temporal Differential LSTM Based on Feature Enhancement. Appl. Sci. 2021, 11, 7876.
  12. Stergiou, A.; Poppe, R.; Veltkamp, R.C. Learning Class-Specific Features with Class Regularization for Videos. Appl. Sci. 2020, 10, 6241.
  13. Nguyen, N.-H.; Phan, T.-D.-T.; Lee, G.-S.; Kim, S.-H.; Yang, H.-J. Gesture Recognition Based on 3D Human Pose Estimation and Body Part Segmentation for RGB Data Input. Appl. Sci. 2020, 10, 6188.
  14. Do, N.-T.; Kim, S.-H.; Yang, H.-J.; Lee, G.-S. Robust Hand Shape Features for Dynamic Hand Gesture Recognition Using Multi-Level Feature LSTM. Appl. Sci. 2020, 10, 6293.
  15. Chu, Y.-C.; Jhang, Y.-J.; Tai, T.-M.; Hwang, W.-J. Recognition of Hand Gesture Sequences by Accelerometers and Gyroscopes. Appl. Sci. 2020, 10, 6507.
  16. Nguyen, N.-H.; Phan, T.-D.-T.; Kim, S.-H.; Yang, H.-J.; Lee, G.-S. 3D Skeletal Joints-Based Hand Gesture Spotting and Classification. Appl. Sci. 2021, 11, 4689.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

Lee, H.J. Special Issue on Deep Learning-Based Action Recognition. Appl. Sci. 2022, 12, 7834. https://doi.org/10.3390/app12157834
