Recent research has shown that hand-crafted feature based methods are not suitable for all types of datasets. Consequently, a relatively new and important class of machine learning techniques, referred to as deep learning, has been established. These techniques can learn multiple levels of feature representation that make sense of different data such as speech, images and text. They are capable of automatically processing raw image and video data for feature extraction, description, and classification. Trainable filters and multi-layer models are often employed in these methods for action representation and recognition.
This section presents descriptions of some important deep learning models that have been used for human action recognition. However, it is very difficult to train a deep learning model from scratch with limited data, so models are often limited to appearance based data or some pre-computed representation. Deep learning based models can be classified into three categories: generative models, e.g., Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs), Restricted Boltzmann Machines (RBMs), and regularised auto-encoders; supervised models, e.g., Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs); and hybrid models. Hybrid models are not discussed in this work.
5.1. Unsupervised (Generative) Models
The key idea of deep learning based generative models is that they do not need target labels for the learning process. Such models are appropriate when labelled data are scarce or unavailable. The evolution of deep learning models can be traced back to [158], where a Deep Belief Network (DBN) was presented with a training algorithm based on Restricted Boltzmann Machines (RBMs) [159]. This was followed by a dimensionality reduction technique in [160]. The parameters were learnt with an unsupervised training process and then fine-tuned in a supervised manner using back-propagation.
This inspired great interest in deep learning models across applications such as human action recognition, image classification, object recognition, and speech recognition. Unsupervised learning based methods have been proposed, e.g., by [161], to automatically learn features from video data for action recognition. An independent subspace analysis algorithm was used to learn space-time features, combined with convolution and stacking based deep learning techniques for action representation.
In [162], the researchers proposed to train DBNs with RBMs for human action recognition. The experimental results on two public datasets demonstrated the impressive performance of the proposed method over hand-crafted feature based approaches.
An unsupervised deep learning based model was proposed by [163] to continuously learn from unlabelled video streams. In addition, DBN based methods were used by [164] to learn features from an unconstrained video stream for human action recognition.
Generative or unsupervised learning based models have played a substantial role in inspiring researchers’ interest in the deep learning field. Nevertheless, the great success of Convolutional Neural Network (CNN) based supervised learning methods for object recognition has somewhat overshadowed unsupervised learning based approaches; see, e.g., [165].
5.2. Supervised (Discriminative) Models
In line with the recent literature on human action recognition, the most common technique used in supervised learning based models is the Convolutional Neural Network (CNN), first proposed by [166]. CNNs are a type of deep learning model that has shown great performance in various recognition tasks such as pattern recognition, digit classification, image classification, and human action recognition; see, e.g., [165]. The efficient utilisation of CNNs in image classification [165] opened a new era for employing deep learning based methods in human action recognition. The key advantage of CNNs is their ability to learn directly from raw data such as RGB or depth map data. Consequently, it is possible to obtain discriminative features which effectively describe the data and thus make the recognition process easier. Since this approach is susceptible to overfitting, care is needed during training; regularisation and large amounts of labelled data help to prevent overfitting. Recently, it was shown that deep learning based methods outperform many state-of-the-art hand-crafted features for image classification; see, e.g., [27].
Convolutional Neural Networks have a hierarchical structure with multiple hidden layers that help translate a data sample into a set of categories. Such models consist of several types of layers, such as convolutional layers, pooling layers and fully connected layers. In the case of videos, the temporal domain is introduced as an additional dimension. Since CNNs were originally designed for static image processing, it was not initially clear how to incorporate motion information. Therefore, most research at that time applied CNNs to still images to model appearance information for action recognition [165]. Thereafter, different ways were proposed to utilise motion information for action recognition. An extension was presented by [169], where stacked video frames were used as input to a CNN for action recognition from video. However, the experimental results were worse than hand-crafted feature based approaches. This issue was investigated by [32], who developed the idea of having separate spatial and temporal CNN streams for action recognition.
Figure 1 illustrates the spatio-temporal CNN streams, similar to [32], where the two streams are implemented as independent CNNs. One stream is the spatial stream, which recognises actions from static images. The other is the temporal stream, which recognises actions from stacked video frames based on motion information from dense optical flow. The outputs of the two streams are combined using a late fusion technique. The experiments showed improved performance for this method compared to hand-crafted feature based approaches. However, this type of architecture imposes additional hardware requirements, which can limit its suitability for some applications.
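The late fusion step described above can be sketched as follows. This is a minimal illustration, not the implementation of [32]: the class scores are invented, and the fusion weight `w` is an assumed parameter; the idea is simply that each stream produces its own class posteriors, which are then averaged.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over raw class scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def late_fusion(spatial_scores, temporal_scores, w=0.5):
    """Average the class posteriors of the two streams (late fusion)."""
    return w * softmax(spatial_scores) + (1 - w) * softmax(temporal_scores)

# Hypothetical raw scores for 4 action classes from each stream.
spatial = np.array([2.0, 0.5, 0.1, -1.0])   # appearance stream (static RGB frame)
temporal = np.array([1.5, 2.5, 0.0, -0.5])  # motion stream (stacked optical flow)

fused = late_fusion(spatial, temporal)
predicted_class = int(np.argmax(fused))     # here the appearance evidence wins
```

An SVM applied to the concatenated stream outputs, as also explored in [32], is an alternative to this simple averaging.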
Much research on action recognition builds on works that previously achieved good performance in image classification. Recent works extended 2D implementations to 3D to include the temporal domain. Most CNN models proposed for action recognition have been limited to 2D input data. Nonetheless, some applications include 3D data that requires a specialised deep learning model. To this end, 3D Convolutional Neural Network (3D-CNN) based models were presented by [40] for surveillance tasks at airports. Spatio-temporal features were extracted automatically by employing 3D convolutions in the convolutional layers across both the spatial and temporal dimensions. The experimental results demonstrated superior performance for this method in comparison to other state-of-the-art methods.
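The key difference from a 2D convolution is that the kernel also slides along the time axis, so the output itself is a spatio-temporal volume. A naive single-channel sketch (loop-based for clarity, not efficiency; the clip and kernel sizes are arbitrary):

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid 3D convolution (cross-correlation, as in CNNs) of a
    (T, H, W) spatio-temporal volume with a (t, h, w) kernel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# A toy clip: 8 frames of 16x16 data, convolved with a 3x3x3 kernel.
clip = np.random.rand(8, 16, 16)
kernel = np.random.rand(3, 3, 3)
features = conv3d(clip, kernel)   # shape (6, 14, 14): time is convolved too
```

Because the temporal extent shrinks just like the spatial ones, stacking such layers aggregates motion information over progressively longer frame spans.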
In general, there has been much success with 2D and 3D CNNs in, e.g., image classification, object recognition, speech recognition and action recognition. Nonetheless, some issues still need to be considered, such as the immense amount of image or video data needed for training. Collecting and annotating large amounts of image or video data is laborious and requires a substantial amount of time. Fortunately, the availability of rich and relatively large action recognition datasets has provided great support for designing such models in terms of their training and evaluation. A factorised 3D-CNN was proposed by [170] for human action recognition. The 3D-CNN was factorised into a standard 2D-CNN for spatial information at the lower layers and a 1D-CNN for temporal information at the higher layers. This factorisation reduces the number of learnable parameters and consequently the computational complexity. Two benchmark datasets were used to evaluate the proposed method: UCF101 and HMDB51. The results showed comparable performance with state-of-the-art methods. Another spatio-temporal 3D-CNN approach was proposed by [171] for human action recognition. The authors used four public datasets to evaluate the proposed method. The 3D-CNN achieved improved performance with spatio-temporal features compared to a 2D-CNN. The authors also found that a small filter size, such as the one used in their method, was the best choice for spatio-temporal features. Overall, the experimental results demonstrated competitive performance for the proposed method with a linear classifier.
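The parameter saving behind such a factorisation can be illustrated with simple arithmetic (per input/output channel pair, biases omitted; the kernel sizes are illustrative, not necessarily those of [170]):

```python
# One full spatio-temporal kernel vs. a spatial kernel followed by a
# temporal kernel, for kernel size k_s in space and k_t in time.
k_s, k_t = 3, 3

full_3d = k_s * k_s * k_t      # one 3x3x3 kernel: 27 weights
factored = k_s * k_s + k_t     # a 3x3 spatial kernel + a 3-tap temporal kernel: 12

savings = 1 - factored / full_3d   # over half the weights are removed
```

The gap widens with larger kernels and multiplies across every channel pair in a layer, which is why factorised models are cheaper to train on limited data.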
Some research works have combined supervised and unsupervised learning models for action recognition. A Slow Feature Analysis (SFA) based method was used by [172] to extract slowly varying features from an input in an unsupervised manner. These were combined with a 3D-CNN for action recognition, achieving competitive performance compared to state-of-the-art approaches. Three standard action recognition datasets were used: KTH [98], UCF sports [85] and Hollywood2 [99].
In [173], a hierarchical framework combining a 3D CNN and a hidden Markov model (HMM) was proposed to recognise and segment continuous actions simultaneously. The 3D CNN was used to learn powerful high-level features directly from raw data and to extract effective and robust action features. The statistical dependencies over adjacent sub-actions were then modelled by the HMM to infer action sequences. The KTH and Weizmann datasets were used to evaluate the proposed method. The experimental results showed improved performance of the proposed method over some state-of-the-art approaches.
For efficient learning of spatio-temporal features in video action recognition, a hybrid CNN with a fused convolutional architecture was introduced in [174]. 2D and 3D CNNs were fused to provide temporal encoding with fewer parameters. Three models were used to build the proposed model (semi-CNN): VGG-16, ResNets and DenseNets. The UCF-101 dataset was used in the evaluation to compare the performance of each model with its corresponding 3D model. Figure 2 shows the performance of these models over 50 epochs.
Another way to model motion information in video was proposed by [39] for action recognition using Recurrent Neural Networks (RNNs). Discriminative CNN features were computed for each video frame and then fed into an RNN model. The key advantage of an RNN architecture is its ability to deal with sequential inputs: the network is unrolled into one replica per element of the sequence, the hidden-layer connections link successive replicas, and all replicas share the same weights. The authors highlighted that local motion information can be obtained from video by passing optical flow through CNNs, while global motion information can be modelled by the RNN. RNN based supervised learning was used by [175] across five parts (right arm, left arm, right leg, left leg, trunk) of skeleton information. These were used as inputs to five separate sub-nets for action recognition. The outputs of these sub-nets were then hierarchically fused to form the inputs to the higher layers. Thereafter, the final representation was fed into a single-layer perceptron to obtain the final decision. Three datasets were used to evaluate the proposed method: MSR Action3D [74], Berkeley Multimodal Human Action (Berkeley MHAD) [176], and Motion Capture HDM05 [177]. The results demonstrated state-of-the-art performance. However, RNNs are not capable of processing very long sequences and cannot be stacked into very deep models. In addition, they lack the capability to keep track of long-term dependencies, which makes training an RNN difficult.
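The weight-sharing recurrence described above can be sketched as a vanilla RNN consuming per-frame CNN feature vectors. This is a generic sketch, not the architecture of [39] or [175]; the dimensions and random weights are placeholders.

```python
import numpy as np

def rnn_forward(features, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence of per-frame feature vectors.
    The same weights are reused (shared) at every time step."""
    h = np.zeros(W_hh.shape[0])
    for x in features:                      # one step per video frame
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h                                # final hidden state summarises the clip

rng = np.random.default_rng(0)
T, d_in, d_h = 10, 8, 4                     # 10 frames, 8-dim features, 4-dim state
frames = rng.standard_normal((T, d_in))     # stand-ins for per-frame CNN features
W_xh = rng.standard_normal((d_h, d_in)) * 0.1
W_hh = rng.standard_normal((d_h, d_h)) * 0.1
b_h = np.zeros(d_h)

h_T = rnn_forward(frames, W_xh, W_hh, b_h)
```

The repeated multiplication by `W_hh` is also what causes the vanishing-gradient problem over long sequences, motivating the LSTM discussed next.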
A new recurrent module that improved long-range learning, Long Short-Term Memory (LSTM), was first proposed by [178]. LSTM units have a hidden state augmented with nonlinear mechanisms, in which simple learned gating functions enable state propagation with either no modification, an update, or a reset. LSTMs have had a significant impact on vision problems as these models are straightforward to fine-tune end-to-end. Moreover, LSTMs can deal with sequential data and are not limited to fixed-length inputs or outputs. This makes it simple to model sequential data of varying lengths, such as text or video [179]. LSTMs have recently been shown to be effective for large-scale learning in speech recognition [180] and language translation [181]. LSTM was also proposed for action recognition by [179], where a hybrid deep learning architecture using a long-term recurrent CNN (LRCN) was presented. Raw data and optical flow information were used as input to this system. The proposed methods were evaluated using the UCF101 dataset and showed an improvement in performance in comparison with the baseline architecture.
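The gating functions mentioned above can be made concrete with a single LSTM cell step in the standard formulation. This is a textbook sketch with toy dimensions, not the LRCN of [179]: the input, forget and output gates decide whether the cell state is updated, kept, or exposed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step on input x with hidden state h and cell state c."""
    z = W @ np.concatenate([x, h]) + b
    d = h.size
    i = sigmoid(z[:d])          # input gate: how much new content to write
    f = sigmoid(z[d:2*d])       # forget gate: how much old state to keep
    o = sigmoid(z[2*d:3*d])     # output gate: how much state to expose
    g = np.tanh(z[3*d:])        # candidate cell content
    c = f * c + i * g           # gated state propagation: keep, update, or reset
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 6, 3
W = rng.standard_normal((4 * d_h, d_in + d_h)) * 0.1
b = np.zeros(4 * d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):    # a 5-frame toy sequence
    h, c = lstm_step(x, h, c, W, b)
```

The additive update `c = f * c + i * g` is what lets gradients flow over long ranges, in contrast to the purely multiplicative recurrence of a vanilla RNN.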
Deep learning based approaches have achieved relatively high recognition performance, on the same level as or better than hand-crafted feature based methods. Some researchers have also proposed using multiple deep learning models alongside hand-crafted features to achieve even better results, such as [32].
5.3. Multiple Modality Based Methods
Deep learning methods provide new insight into human action recognition by extracting action features from RGB, depth, and/or skeleton information. Different features can be learnt from deep networks [117], such as appearance, optical flow, depth and/or skeleton sequences. Very often, the same dataset provides several modalities, such as RGB, depth, and skeleton information, or at least two of them. Therefore, much research has proposed utilising combinations of different modalities, or their hand-crafted features, and merging them using fusion based strategies. A separate network architecture is often employed for each modality, and classification scores are then obtained for each one.
Some research has highlighted that significant improvements in the performance of an action recognition system can be achieved by utilising hand-crafted features within CNN based deep learning models. A CNN model based on multiple sources of information was proposed by [184] to process spatially varying soft-gating. A fusion technique was then used to combine the multiple CNN models that were trained on the various sources. A Stratified Pooling based CNN (SPCNN) was proposed by [185] to handle the issue of different feature levels in each frame of video data. To obtain video-level features, the authors fine-tuned a pre-trained CNN model on the target datasets. Frame-level features were extracted, and principal component analysis was used for dimensionality reduction. Stratified pooling then converted the frame-level features into video-level features, which were finally fed into an SVM classifier for classification. The method was evaluated on the HMDB51 [27] and UCF101 [186] datasets. The experiments showed that the proposed method outperformed the state-of-the-art.
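The frame-to-video pipeline described above (frame features → PCA → pooling → classifier input) can be sketched as follows. This is a simplified stand-in, not the SPCNN of [185]: the frame features are random placeholders for CNN activations, and plain mean/max pooling substitutes for the stratified pooling step.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project frame-level features onto their top principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def pool_to_video_feature(frame_features):
    """Pool frame-level features into one fixed-length video-level vector
    (mean and max pooling concatenated, as a simple pooling stand-in)."""
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.max(axis=0)])

rng = np.random.default_rng(2)
frames = rng.standard_normal((120, 64))        # 120 frames of 64-dim features
reduced = pca_reduce(frames, 16)               # (120, 16) after PCA
video_feature = pool_to_video_feature(reduced) # (32,), ready for an SVM
```

Whatever the pooling scheme, the point is that videos of any length map to one fixed-length vector, which is what a standard SVM requires.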
An extension of the two stream network approach was proposed in [117] using dense trajectories for more effective learning of motion information.
A general residual network architecture for human activity recognition was presented in [187] using cross-stream residual connections in the form of multiplicative interactions between the appearance and motion streams. The motion information was exploited using stacked inputs of horizontal and vertical optical flow.
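The stacked flow input used by such motion streams is simply the horizontal and vertical flow components of L consecutive frames laid out channel-wise, giving a 2L-channel image. A shape-level sketch (the flow fields are random placeholders; real flow would come from an optical flow algorithm):

```python
import numpy as np

L, H, W = 10, 224, 224            # 10 flow fields at 224x224 resolution
flow_u = np.random.rand(L, H, W)  # horizontal components
flow_v = np.random.rand(L, H, W)  # vertical components

# Interleave u and v per frame: channel order [u1, v1, u2, v2, ...].
stacked = np.empty((2 * L, H, W))
stacked[0::2] = flow_u
stacked[1::2] = flow_v
# 'stacked' is now a single 20-channel input for the temporal CNN stream.
```

The first convolutional layer of the temporal stream then has 2L input channels instead of the usual 3 (RGB), which is the only architectural change needed.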
A fusion study was presented in [182] for human activity recognition using two streams of the pre-trained Visual Geometry Group (VGG) network to compute spatio-temporal information, combining RGB and stacked optical flow data. Various fusion mechanisms at different positions in the two streams were evaluated to determine the best possible recognition performance.
Some research studies have paid particular attention to auxiliary information which can improve the performance of action recognition. In some studies, audio has been combined with video to detect actions, such as [188], where a combination of Hidden Markov Models (HMMs) with audio was used to determine the actions. The main disadvantage of using audio recordings is that surrounding noise can affect the results.
All of the above approaches suffer from a shortage of long-term temporal information. For example, the number of frames used in optical flow stacking ranged between 7 and 15, such as the 7, 10, and 15 frames used in [40], respectively. People often perform the same action over different periods of time depending on many factors, and this varies particularly between individuals. Consequently, multi-resolution hand-crafted features computed over different periods of time were used by [189] to avoid this problem. Furthermore, different weighting phases were applied using a time-variant approach in the computation of the DMMs to enable adaptation to the different important regions of an action. Different fusion techniques were employed to merge spatial and motion information for the best action recognition performance. Figure 3 illustrates the impact of different window frame lengths on the performance of action recognition systems.
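A common way to build a depth motion map (DMM) is to accumulate thresholded frame-to-frame differences of a depth sequence; the multi-resolution idea then amounts to computing such maps over windows of different lengths. This is a generic sketch of that idea, with invented data and an assumed threshold, not the weighted time-variant formulation of [189]:

```python
import numpy as np

def depth_motion_map(depth_frames, threshold=0.05):
    """Accumulate thresholded absolute differences between consecutive
    depth frames into one motion map for a single projection view."""
    diffs = np.abs(np.diff(depth_frames, axis=0))      # (T-1, H, W)
    return np.where(diffs > threshold, diffs, 0.0).sum(axis=0)

rng = np.random.default_rng(3)
clip = rng.random((30, 64, 64))          # 30 synthetic depth frames

# Multi-resolution variant: maps over different temporal window lengths.
dmm_short = depth_motion_map(clip[:10])  # short window, fast actions
dmm_full = depth_motion_map(clip)        # full clip, slow actions
```

Combining maps from several window lengths makes the representation less sensitive to how quickly an individual performs the action.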
5.4. Pose Estimation and Multi-View Action Recognition
Another considerable challenge in human action recognition is view variance: the same action viewed from different angles can look very different. This issue was taken into account by [190]. Training data were generated by fitting a synthetic 3D human model to real motion information, and poses were then extracted from different viewpoints. A CNN based model was found to outperform a hand-crafted feature based approach for multi-view action recognition.
Dynamic image information was extracted by [191] from synthesised multi-view depth videos. Multi-view dynamic images were constructed from the synthesised data, and a CNN model was then proposed to perform feature learning from these multi-view dynamic images. Multiple batches of motion history images (MB-MHIs) were constructed by [192]. This information was then used to compute two descriptors using a deep residual network (ResNet) and histograms of oriented gradients (HOG). Later, an orthogonal matching pursuit approach was used to obtain sparse codes of the feature descriptions. A final view-invariant feature representation was formed and used to train an SVM classifier for action recognition. The MuHAVi-MAS [193] and MuHAVi-uncut [194] datasets were used to evaluate the proposed approach. Figure 4 illustrates the accuracy variations of the recognition model over its different components.
A CNN model obtained from ImageNet was used by [195] to learn from multi-view DMM features for action recognition, where video was projected onto different viewpoints within the 3D space. Different temporal scales were then used from the synthesised data to constitute a range of spatio-temporal patterns for each action. Finally, three fine-tuned models were employed independently for each DMM map. However, some actions, including object interactions, can be very difficult to recognise from raw depth data alone. This helps to justify the inclusion of RGB data for the recognition of such actions.
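The idea of projecting depth data onto different viewpoints can be sketched as rotating the back-projected 3D point cloud about the vertical axis before re-projecting it to an image plane. This is only an illustration of the geometry, with a random point cloud and an assumed orthographic projection, not the specific projection pipeline of [195]:

```python
import numpy as np

def rotate_y(points, angle_deg):
    """Rotate an (N, 3) point cloud about the vertical (y) axis to
    synthesise a new virtual camera viewpoint."""
    a = np.radians(angle_deg)
    R = np.array([[ np.cos(a), 0.0, np.sin(a)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(a), 0.0, np.cos(a)]])
    return points @ R.T

rng = np.random.default_rng(4)
cloud = rng.standard_normal((500, 3))   # stand-in for points from a depth map

views = []
for angle in (-30, 0, 30):              # three virtual viewpoints
    view = rotate_y(cloud, angle)
    views.append(view[:, :2])           # orthographic projection to image plane
```

Each projected view can then be turned into its own DMM, giving view-invariant training data from a single real camera.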
In [196], Multi-View Regional Adaptive Multi-temporal-resolution DMMs (MV-RAMDMM) and multi-temporal-resolution RGB information are learnt with multiple 3D-CNN streams for action recognition. The Adaptive Multi-resolution DMM is applied across multiple views to extract view and time invariant action information. It is adapted based on human movement before being used in the deep learning model for action recognition. In addition, multi-temporal raw appearance information is used to exploit various spatio-temporal features of the RGB scenes. This helps to capture more specific information which might be difficult to obtain from depth sequences alone. For instance, object-interaction information is more apparent in RGB space.
Alternatively, semantic features based on pose can be very important cues for describing the category of an action. Human joint information was utilised by [197] to compute the temporal variation between joints during actions. Time-variant functions were used to capture the pose associated with each action and considered for feature extraction. The feature representation for action recognition was constructed using the temporal variation of the values associated with these time functions. CNNs were then trained to recognise human actions from the local patterns in the feature representation. The Berkeley MHAD dataset [176] was used to evaluate the proposed method, and the results demonstrated the effectiveness of this approach. Similar to [197], a Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition was proposed by [198]. The descriptor aggregated motion and appearance information along tracks of human body parts, utilising skeleton information along with raw RGB data. The JHMDB [199] and MPII [200] cooking datasets were used to evaluate the proposed method. However, it can be difficult to accurately capture skeleton information of a person in different environmental conditions. This might be due to the need for accurate body-part detection to precisely estimate skeleton information.
Some common datasets for human action recognition are introduced in Table 2. In addition, an extensive comparison between deep learning based models and hand-crafted feature based models for human action recognition is presented in Table 3. Furthermore, some recent works based on deep learning models for human action recognition are included in Table 4.