1. Introduction
The research on Human Action Recognition (HAR) has attracted widespread attention from the computer vision community during the last decade. Indeed, the vast spectrum of applications of HAR in daily life has encouraged researchers to devote significant effort to the problem. Advances in automated HAR systems have extended machine intelligence into applications such as human–machine and human–object interaction, content-based video summarization, education and learning, healthcare systems, entertainment systems, safety and surveillance systems, and sports video analysis [1,2,3,4,5,6]. The earlier attempts to recognize actions mostly relied on RGB videos [7,8,9,10]. These methods can achieve promising HAR performance in a limited number of cases; however, RGB data-based recognition approaches have serious limitations, as they are susceptible to illumination variation, occlusions, and cluttered backgrounds.
To address the limitations of RGB data-based methods, the imaging technology community has developed depth sensors (e.g., the Kinect sensor). A depth sensor works as a multi-modal sensor and thus simultaneously delivers the depth and RGB videos of a scene. Unlike RGB-based approaches, depth video-based approaches are invariant to illumination, color, and texture [11]. Moreover, depth video preserves the 3D structure of an object accurately, which helps the system alleviate intra-class variation and cluttered-background noise [11]. Thus, computer vision researchers have shown an increasing interest in approaching the task of action recognition by employing depth data features. Furthermore, skeleton action sequences can be easily obtained from depth action sequences. Hence, skeletal action features have also been utilized in building recognition systems, such as [12,13].
Previous Work: In recent years, deep learning models and convolutional neural networks (CNNs) [14] have been used extensively for recognizing image contents. CNNs extract dominant and discriminating object characteristics automatically, and hence they have become more popular for feature extraction than handcrafted descriptors. Inspired by the performance of CNNs in image classification tasks, many researchers have applied them to action video classification challenges. However, those action classification works were mostly developed on RGB and skeleton data. For example, the deep models reported in [4,6,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28] are developed on RGB and skeleton action data. Only a small number of deep models are based on depth video streams alone, such as those illustrated in [29,30,31,32,33,34,35,36,37]. However, the existing depth databases, excluding the recent NTU-RGB-D databases [38,39], are not large enough for training deep models. In a few studies, the depth data have been complemented with other data modalities such as RGB and skeleton data to develop multi-modal/hybrid deep models [40,41,42].
Many hand-designed methods [43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62] were proposed by researchers before the work on deep learning methods for depth action recognition. These methods usually involve many operations that require researchers to carry out careful feature engineering and tuning [63]. In addition, hand-crafted features and methods are typically shallow and dataset dependent [32]. On the other hand, deep learning methods reduce the need for feature engineering. As a result, researchers have attempted to apply deep learning to action recognition from depth videos. For example, in [64], 2D CNNs and 3D CNNs were proposed for depth action recognition. To preserve the temporal information of depth action sequences in DMM-based action representation, a DMM pyramid was constructed and fed into the 2D CNN as input, while a DMM cube was used as input to the 3D CNN. The 2D CNN model on the DMM pyramid provided competitive results. Wang et al. [37] proposed a deep model to address the action recognition task on a small-scale training dataset. They utilized three weighted hierarchical depth motion maps (WHDMMs) and a three-stream convolutional neural network to build their architecture. The introduction of weights in WHDMMs helps to preserve the temporal order of motion segments and thus reduces the inter-class similarity problem. The three WHDMMs were constructed by projecting the depth videos onto three orthogonal views. They were converted to pseudo-color versions and fed into three individual CNNs (pre-trained on ImageNet) for training the deep model. The fusion of the classification outcomes of the three deep networks was treated as the final classification outcome.
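To make the depth motion map idea above concrete, the following minimal NumPy sketch accumulates thresholded frame-to-frame differences of one projection view into a single motion map. The function name and the noise threshold are illustrative choices, not values taken from the cited works.

```python
import numpy as np

def depth_motion_map(frames, threshold=10):
    """Accumulate thresholded frame-to-frame motion energy.

    frames: array of shape (T, H, W) holding one projection view
    (e.g., the front view) of a depth sequence.
    Returns a single (H, W) motion map.
    """
    frames = frames.astype(np.float32)
    diffs = np.abs(np.diff(frames, axis=0))  # (T-1, H, W) motion energy
    diffs[diffs < threshold] = 0             # suppress small sensor noise
    return diffs.sum(axis=0)                 # accumulate motion over time
```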
A four-channel CNN pipeline was proposed in [34], where three channels adopted the three types of depth motion maps obtained from the depth data, and the fourth channel received RGB data-based motion history images as input. In the method discussed in [32], an action was described through dynamic depth images, dynamic depth normal images, and dynamic depth motion normal images. The three descriptions of the action were treated as the input of a three-stream CNN architecture for action classification. As a different approach, the depth action representation was derived directly from RGB data features by domain adaptation in [33]. Wu et al. [11] constructed hierarchical dynamic depth projected difference images for three projection views and fed them into three identical CNNs. In [65], depth videos were projected onto 3D space with multiple viewpoints, and multi-view dynamic images were constructed. These dynamic images were fed into a novel CNN for feature learning, in which the fully connected layers differed across dynamic images. Finally, with the deep features, the actions were classified using a linear SVM after dimension reduction with PCA.
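As a rough illustration of the dynamic image concept used in [65] and related works, the sketch below collapses a video into a single image using the closed-form approximate rank pooling coefficients popularized by Bilen et al.; this is a generic sketch, not the multi-view construction of [65].

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a video into one image via approximate rank pooling.

    frames: array of shape (T, H, W) or (T, H, W, C). Later frames
    receive larger weights, so the output summarizes the temporal
    evolution of the sequence in a single image.
    """
    T = frames.shape[0]
    # Harmonic numbers H_0 .. H_T, with H_0 = 0
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    # One common closed form: alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1})
    alpha = 2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
    return np.tensordot(alpha, frames.astype(np.float32), axes=(0, 0))
```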
Keceli et al. [66] fused spatial and temporal deep features obtained from a pre-trained 2D CNN and a 3D CNN, respectively. The 2D and 3D representations of depth action videos were prepared prior to passing them to the 2D and 3D deep CNN architectures. The ReliefF algorithm [67] was then applied to select the most promising features from the fused version. Finally, an SVM classifier was used for action classification with the selected features. Li et al. [68] derived a set of three motion images for each input video and then employed local ternary pattern encoded images to represent actions with rich texture information and less noise. The encoded images were passed to a CNN for action classification. However, choosing the threshold value of the local ternary pattern is difficult. Wu et al. [69] represented a depth action video through dynamic image sequences. A channel attention model was proposed to highlight the most dominant channels in the CNNs, and a spatial–temporal interest points (STIPs) attention model was proposed to extract the discriminating motion regions from the dynamic images. In their work, an LSTM model was utilized to capture the temporal dependencies and to accomplish the classification task. Recently, unlike methods extracting features from dynamic images, Tasnim et al. [29] proposed a method that extracts features from raw depth images. They used a 3D CNN model for the key-frame-based feature extraction and classification tasks. The key frames were selected using the structural similarity index measure (SSIM) and correlation coefficient measure (CCM) metrics to remove redundant frames while preserving the more informative ones. In [30], spatiotemporal action features were extracted from raw depth images using a 3D fully convolutional neural network; the same network also performs action classification. The method was evaluated on a large-scale dataset, the NTU RGB+D dataset [38]. Statistical features and 1D CNN features were fused to develop an action recognition model from depth action sequences in [31]. A multi-channel CNN and a classifier ensemble were utilized in [35]. The method described in [36] employed a 2D CNN and a 1D CNN consecutively as pre-processing tools to extract statistical features from depth frames. Those features were fused with Dynamic Time Warping (DTW) algorithm-based statistical features. For the feature classification task, a classifier ensemble was determined from 1000 sets of classifiers. This method seems very complicated since, in the pre-processing stage, it trained a separate CNN model for each action class.
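For intuition on the key-frame selection strategy of [29], the following sketch keeps a depth frame only when its SSIM with the last kept frame drops below a threshold; the threshold value and the omission of the CCM metric are simplifications for illustration.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def select_key_frames(frames, ssim_max=0.95):
    """Drop near-duplicate frames from a (T, H, W) depth sequence.

    A frame is kept only when its structural similarity to the last
    kept frame falls below `ssim_max`, i.e., when it adds new
    information; the threshold here is an illustrative choice.
    """
    keep = [0]
    rng = float(frames.max() - frames.min())
    for i in range(1, len(frames)):
        if ssim(frames[keep[-1]], frames[i], data_range=rng) < ssim_max:
            keep.append(i)
    return frames[keep]
```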
Since developing deep models based on depth action data alone is hard due to limited training data, researchers have been motivated to incorporate other data modalities with depth data. For example, deep learning-based action recognition was presented in [70] using depth sequences and skeleton joint information combined. A 3D CNN structure was used to learn the spatiotemporal features from depth sequences, and then joint-vector features were computed for each sequence. Finally, the SVM classification results of the two types of features were fused for action recognition. In [71], fuzzy weighted multi-resolution DMMs (FWMDMMs) were constructed by applying fuzzy weight functions to depth videos. The FWMDMMs were fed into a convolutional neural network for a compact representation of actions. In addition to the motion features, appearance features were also extracted from the RGB and depth data through the pre-trained AlexNet network. Multiple feature fusion techniques were used to obtain the most discriminating features, and a multi-class SVM was implemented to classify actions. In [72], the authors used RGB data features together with depth data features to propose a deep framework. The framework takes four input streams: a dynamic image, DMM-front, DMM-side, and DMM-top. The first was obtained from the RGB data, and the remaining three streams were generated from the depth data. These four streams were passed to four pre-trained VGG networks for feature extraction and training. The four classification scores obtained from the classification layers of the four networks were fused using a weighted product model. In [40], the authors proposed a two-stream 3D deep model using depth and RGB action data. The depth residual dynamic image sequence and the pose estimation map sequence were calculated simultaneously from the depth and RGB modalities of an action. To describe the action and obtain its classification score for the two modalities, a 3D CNN was employed on the two individual data streams. The action class was determined by fusing the classification scores provided by the 3D CNNs on the two data streams. In [42], an action classification algorithm was developed using RGB, depth, and skeleton data modalities. On one hand, the RGB and depth videos were passed to a 3D CNN for feature extraction. On the other hand, a 3D CNN and an LSTM were employed to capture action features from the skeleton data. The three sets of extracted features were fed into three SVMs to obtain probability scores. Two evolutionary algorithms were used to fuse those scores and output the class label of the input video.
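The weighted product fusion mentioned above (as used in [72]) can be summarized in a few lines; the stream weights and the renormalization step here are illustrative assumptions rather than that paper's exact procedure.

```python
import numpy as np

def weighted_product_fusion(scores, weights):
    """Fuse per-stream class-probability vectors multiplicatively.

    scores: (S, C) array with one probability vector per stream;
    weights: length-S array of stream weights. Each stream's scores
    are raised to the stream's weight and multiplied across streams,
    so a class must score well on every stream to rank highly.
    """
    scores = np.asarray(scores, dtype=np.float64)
    fused = np.prod(scores ** np.asarray(weights)[:, None], axis=0)
    return fused / fused.sum()   # renormalize to a probability distribution
```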
Research Motivation and Key Contribution: The aforementioned depth data-based deep models (except the model in [69]) are not able to classify a depth frame sequence directly using sequence classification models such as LSTM, bi-directional LSTM (BLSTM), GRU, bi-directional GRU, or attention models. In contrast, many approaches based on RGB and skeleton data are capable of classifying a frame sequence automatically using those models [73,74,75,76,77]. The size of the available depth training datasets is the key barrier to developing a depth video-dependent sequence learning deep model. Currently, only two large-scale datasets, NTU RGB+D [38] and NTU RGB+D120 [39], provide enough depth training samples for sequence learning framework development; the other existing depth action video datasets have insufficient depth training videos for the task. To date, depth data-based deep models (except the model in [69] and 3D CNN-based models) mainly predict an action class for an input action video based on an image classification strategy instead of a direct sequence classification strategy [29,30,31,32,33,34,35,36,37]. A large number of video sequences is needed in the training stage to develop a promising sequence classification framework using deep sequence modeling algorithms, and such data are not available in depth datasets except the two datasets above. Nevertheless, there is a need to make progress on deep models trained on small-scale depth datasets. In this work, we propose a deep model for small-scale depth datasets that directly classifies a depth frame sequence. Inspired by the excellent performance of CNNs in automatic feature extraction and representation in depth, RGB, and skeleton action recognition methods [13,17,20,21,24,25,26,27,28,30,31,32,73], we utilize a pre-trained 2D CNN, DenseNet121 [78], trained on the ImageNet dataset [79], to capture dominant features that represent individual action frames. With the extracted features, a combination of BLSTM [80] and multi-head self-attention (MHSA) [81] mechanisms is used to build a sequence classification model. To the best of our knowledge, no previous work has utilized BLSTM and MHSA, individually or jointly, with deep features to propose such a sequence classification model for the depth video classification problem. We evaluate our method on two public depth action datasets. The performance evaluation shows that our method achieves superiority over many state-of-the-art methods.
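A minimal Keras sketch of this pipeline is given below, assuming per-frame DenseNet121 features have already been extracted; the layer widths, head count, and dropout rate are placeholders rather than the tuned values reported later in the paper.

```python
from tensorflow.keras import layers, models

def build_sequence_classifier(num_frames, num_classes, feat_dim=1024):
    """Classify a sequence of per-frame CNN features.

    feat_dim = 1024 matches DenseNet121's globally pooled output size;
    the remaining hyperparameters are illustrative.
    """
    inputs = layers.Input(shape=(num_frames, feat_dim))
    # Bi-directional LSTM over the frame-feature sequence
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(inputs)
    # Multi-head self-attention: query, key, and value all come from x
    x = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.5)(x)   # feature dropout for generalization
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```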
Our research contributions are highlighted as follows:
Extracting learned patterns using deep models with a small-scale dataset is very challenging. To address this issue, we employ a unified framework of BLSTM and MHSA to achieve better sequence-based action recognition in depth videos.
We propose a single depth video representation through four data streams to boost the depth action representation: the original depth frame sequence and three temporal motion frame sequences derived from it. The three motion sequences preserve the spatiotemporal motion cues of the performer from the front, side, and top views (see the sketch following this list).
Frame-level feature extraction is an essential step for sequence-based decisions in action recognition. We employ a pre-trained 2D CNN model with a transfer learning strategy for robust depth feature representations.
The sequence classification model is developed with the one-to-one integration of BLSTM and MHSA layers. A set of optimal parameters for the BLSTM-MHSA combination is determined, providing the key support for the performance improvement of the proposed method.
BLSTM-MHSA correlation features are encoded with fully connected layers using a feature dropout strategy to achieve model generalization on the unseen test set.
An ablation study is also provided for different 2D CNN models and the number of data streams for robust action classification.
The proposed method is assessed on two public datasets, MSRAction3D [82] and DHA [83], and our results are compared with other state-of-the-art methods. In summary, our method exceeds the recent (published on 20 April 2022) state-of-the-art 3D CNN-based recognition method [29] by 1.9% on MSRAction3D and by 2.3% on DHA. In contrast to the 3D CNN model, our approach involves fewer video frames in each sequence and fewer trainable parameters.
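The four-stream construction referenced in the contribution list can be sketched as follows; the binary occupancy projection used here for the side and top views is a simplified stand-in for the projection detailed in Section 2.

```python
import numpy as np

def project_views(depth, depth_bins=256):
    """Project one depth frame onto the front, side, and top planes.

    depth: (H, W) array with values scaled to [0, depth_bins). The
    side and top views are binary occupancy maps obtained by
    scattering each pixel along the depth axis.
    """
    h, w = depth.shape
    d = np.clip(depth.astype(int), 0, depth_bins - 1)
    side = np.zeros((h, depth_bins), np.float32)
    top = np.zeros((depth_bins, w), np.float32)
    rows, cols = np.indices(depth.shape)
    side[rows.ravel(), d.ravel()] = 1.0
    top[d.ravel(), cols.ravel()] = 1.0
    return depth.astype(np.float32), side, top

def motion_streams(frames):
    """Derive the three motion frame sequences from a depth sequence;
    the original sequence `frames` itself serves as the fourth stream."""
    views = [project_views(f) for f in frames]   # per-frame projections
    motion = []
    for v in range(3):                           # front, side, top views
        seq = np.stack([np.abs(views[t + 1][v] - views[t][v])
                        for t in range(len(views) - 1)])
        motion.append(seq)
    return motion
```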
The rest of this paper is organized as follows: The proposed framework is illustrated in detail in Section 2. Experimental evaluation is discussed in Section 3. Finally, Section 4 concludes the paper.