1. Introduction
During the last decade, human action recognition (HAR) has become a rapidly developing research area and has received substantial attention. It has been applied in a wide range of applications such as sports video analysis, content-based video retrieval, video surveillance, scene recognition, video annotation and human-computer interaction [1]. In particular, there is strong interest in the use of video and depth sensors to monitor the activity of elderly or disabled people in domestic environments to support their independence in a safe environment. The use of dedicated intelligent sensors in such cases could also contribute to maintaining privacy by avoiding the need to stream potentially intrusive video feeds to external human carers. The main aim of action recognition is to classify the action present in a test video irrespective of the actors' locations. Action recognition in realistic scenarios is clearly much more challenging than in constrained environments because background objects can interfere and be mistaken for features of interest.
State-of-the-art approaches have shown notable increases in performance in realistic and complex scenarios. However, there are still some unsolved issues such as background clutter, viewpoint variation, illumination change and class variability [2]. Recently, significant progress has been demonstrated with spatio-temporal feature representations along with variations of the most popular and widely used bag of visual words (BoVW) approaches [3], which offer a degree of viewpoint independence, robustness to occlusion and scale invariance [4,5]. Therefore, there has been growing interest in exploring the potential of possible variants of the classical BoVW approach, which characterizes actions using a histogram of feature occurrences after clustering [6].
However, since BoVW approaches only consider the frequency of each visual word and ignore the spatio-temporal relations among visual words, such a representation is not able to exploit the contextual relationship between visual words. To overcome this drawback, we propose an extension of our previously proposed Bag of Expressions (BoE) model [7], which incorporates the contextual relationships of visual words while preserving the inherent qualities of the classical BoVW approach. The main idea of visual expression formation is to retain the spatio-temporal contextual information that is usually lost during the formation of visual words by encoding information from neighboring spatio-temporal interest points.
Both Kovashka et al. [8] and Gilbert et al. [9] have shown that performance can be improved by including neighborhood information amongst visual words. Although Reference [8] managed to preserve scale invariance, their solution led to a decline in the capability to handle different viewpoints and occlusions. Alternatively, the inclusion of neighborhoods by Gilbert et al. [9] affected the ability to handle scale, and they had to integrate a hierarchy to preserve some scale invariance.
Constructing neighborhoods with a fixed number of neighbors, as in Reference [7], can reduce performance in scenarios with occlusion and changing background. For example, under partial occlusion, neighborhoods would contain irrelevant neighbors, which would add noise and make them less discriminative. We propose to overcome this by dynamically estimating the number of suitable neighbors, as described in Section 3.4.1, according to the local density of neighboring Spatio-Temporal Interest Points (STIPs) in a Spatio-Temporal Cube (ST-Cube).
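As a concrete illustration, the following minimal sketch (in Python) shows how the number of neighbors can follow the local STIP density: every detected interest point that falls inside the ST-Cube centred on an anchor point is accepted, so dense regions yield more neighbors than sparse or partially occluded ones. The function name `st_cube_neighbors` and the `cube_size` values are hypothetical and only illustrate the idea; the actual parameterization is described in Section 3.4.1.

```python
import numpy as np

def st_cube_neighbors(stips, anchor, cube_size=(32, 32, 10)):
    """Indices of STIPs inside the spatio-temporal cube centred on `anchor`.

    stips  : (N, 3) array of interest point coordinates (x, y, t)
    anchor : (3,) coordinates of the STIP whose neighborhood is built
    The number of neighbors is driven by local STIP density, not fixed.
    """
    half = np.asarray(cube_size, dtype=float) / 2.0
    diff = np.abs(stips - np.asarray(anchor, dtype=float))
    inside = np.all(diff <= half, axis=1)   # within the cube on every axis
    inside &= ~np.all(diff == 0, axis=1)    # exclude the anchor point itself
    return np.flatnonzero(inside)
```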
Figure 1 illustrates the difference between a neighborhood-based approach that uses a fixed number of neighbors and one that uses the density of a spatio-temporal cube. In the case of viewpoint variation, the likelihood of obtaining the same visual expression representation is higher with an ST-Cube based neighborhood.
Kovashka et al. [8] were unable to address viewpoint changes because of the restrictions they placed on the formation of neighborhoods. To overcome this, we propose to form a visual expression as an independent pair of a visual word and a neighboring STIP. Using the density of a spatio-temporal cube of size n, each neighboring STIP in the ST-Cube is paired with a visual word to obtain n different visual expressions. This provides a local representation of visual expressions and enables some tolerance to viewpoint variation and occlusion.
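Continuing the sketch above, the pairing step might look as follows. Whether the neighbor is represented by its own visual word (as here) or by its raw descriptor is a detail of Section 3; the snippet only illustrates the pairing structure, and the function names are again hypothetical.

```python
def visual_expressions(word_ids, stips, anchor_idx, cube_size=(32, 32, 10)):
    """Form one visual expression per neighboring STIP in the ST-Cube.

    word_ids : (N,) visual word assignment of every STIP
    With n neighbors in the cube, this yields n expressions for the anchor.
    Uses st_cube_neighbors from the previous sketch.
    """
    neighbor_idx = st_cube_neighbors(stips, stips[anchor_idx], cube_size)
    anchor_word = word_ids[anchor_idx]
    return [(anchor_word, word_ids[j]) for j in neighbor_idx]
```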
To support view independence, we represent each action with visual expressions and, as in Reference [7], only the frequencies of these visual expressions are stored. They are formed in such a way that they represent the distribution of visual expressions in the spatio-temporal domain. In contrast to Reference [7], visual expressions are constructed to incorporate the spatio-temporal contextual information of the visual words by combining the visual words with the neighboring STIPs present in each spatio-temporal cube. Similarly, these visual expression representations discard all information related to other visual words and only consider the relationship between a visual word and its neighboring STIPs in the ST-Cube. This focuses on the individual contribution of each visual expression and is expected to enhance discriminative power under occlusion. As will be shown here, the proposed model outperforms state-of-the-art results for the KTH, UCF Sports, UCF11 and UCF50 datasets.
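Putting the two previous sketches together, the final per-video representation is a normalised histogram of expression frequencies. How the expression vocabulary is built (for example, from the expressions observed in training, possibly per class) is a modelling choice covered in Section 3; the snippet below, with `expr_vocab` as an assumed mapping from expression to histogram bin, only illustrates the counting step.

```python
import numpy as np

def bag_of_expressions(word_ids, stips, expr_vocab, cube_size=(32, 32, 10)):
    """Histogram of visual-expression frequencies for one video."""
    hist = np.zeros(len(expr_vocab), dtype=float)
    for i in range(len(word_ids)):
        for expr in visual_expressions(word_ids, stips, i, cube_size):
            if expr in expr_vocab:          # expressions unseen in training are ignored
                hist[expr_vocab[expr]] += 1.0
    return hist / max(hist.sum(), 1.0)      # L1-normalise so video length does not dominate
```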
The rest of the paper is organized as follows. We discuss related work in Section 2. Section 3 describes in detail the proposed dynamic spatio-temporal Bag of Expressions (D-STBoE) model for human action recognition. Evaluation of parameters, experimental results and discussion are provided in Section 4, followed by conclusions and suggestions for future work in Section 5.
2. Related Work
Bag of visual words (BoVW), along with its variations, has become a popular approach for many visual interpretation tasks, including human action recognition. In BoVW, a video representation is obtained in three steps: feature representation, dictionary learning and feature encoding. With such a representation, performance is highly dependent on the discriminative power of each step. As work on action recognition using BoVW is too broad to be covered here, we focus on related approaches for feature representation and neighborhood-based features.
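For reference, a minimal sketch of the classical pipeline is shown below (using scikit-learn's k-means purely for illustration): a dictionary is learned from training descriptors and each video is then encoded as a normalised histogram of visual-word occurrences. The dictionary size k and the helper name `bovw_histograms` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def bovw_histograms(train_descs, video_descs_list, k=1000):
    """Classical BoVW: k-means dictionary learning + histogram encoding."""
    kmeans = KMeans(n_clusters=k, n_init=10).fit(np.vstack(train_descs))
    hists = []
    for descs in video_descs_list:
        words = kmeans.predict(descs)                      # nearest visual word per descriptor
        h = np.bincount(words, minlength=k).astype(float)
        hists.append(h / max(h.sum(), 1.0))                # frequency of each visual word
    return kmeans, np.array(hists)
```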
Feature representation for video, including holistic and local representation approaches, has been reviewed in detail in survey papers such as Reference [1]. Holistic representations use a global representation of the human body, structure, shape and movement [10,11,12], while local representations are based on the extraction of local features [13]. Local representations are favored nowadays because holistic ones are too rigid to capture viewpoint variation, illumination changes and occlusions, which are typical of realistic scenarios [14,15]. Local feature representation emerged from the spatio-temporal interest point detector proposed by Laptev et al. [16]. Other traditional feature detectors such as the Harris corner detector [17], Gabor filtering [18] and the scale invariant feature transform (SIFT) detector [19] are also popular interest point detectors. Based on the popular Harris corner detector, the 3D Harris spatio-temporal interest point detector was proposed by Laptev et al. [16] to capture significant variations in both space and time, which has made it particularly applicable to action recognition. An example of such interest points is shown in Figure 2.
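To make the idea concrete, a simplified single-scale sketch of the 3D Harris response is given below; the actual detector of [16] additionally performs scale selection and non-maximum suppression, and the smoothing scales `sigma`, `tau` and the constant `k` used here are only illustrative values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(video, sigma=2.0, tau=1.5, k=0.005):
    """Simplified 3D Harris (Laptev-style) response for a grey-level
    video of shape (T, H, W). High responses mark points with strong
    variation in both space and time."""
    v = gaussian_filter(video.astype(float), (tau, sigma, sigma))
    gt, gy, gx = np.gradient(v)                       # temporal and spatial derivatives
    smooth = lambda a: gaussian_filter(a, (2 * tau, 2 * sigma, 2 * sigma))
    xx, yy, tt = smooth(gx * gx), smooth(gy * gy), smooth(gt * gt)
    xy, xt, yt = smooth(gx * gy), smooth(gx * gt), smooth(gy * gt)
    # Determinant and trace of the 3x3 spatio-temporal structure tensor
    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace ** 3
```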
Local descriptors are then computed for these detected interest points. Local interest point descriptors can rely on 3D cuboids or trajectories. 3D HoG (Histogram of Oriented Gradients) [31], inspired by HoG [32], is used to integrate motion information in the spatio-temporal domain. The Histogram of Optical Flow (HoF) [33] encodes the optical flow of pixel motion in local spatio-temporal regions. Another popular 3D interest point descriptor is 3D SIFT [34].
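As an illustration of the core idea behind HoF, the snippet below bins flow orientations weighted by magnitude for a single pair of frames. The descriptor of [33] is actually computed over a grid of cells in the spatio-temporal volume around each interest point; the Farneback parameters used here are just common defaults.

```python
import cv2
import numpy as np

def hof_descriptor(prev_gray, next_gray, bins=9):
    """Minimal HoF: histogram of optical-flow orientations, weighted by
    flow magnitude, over one local region spanning two frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])      # angles in radians
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / max(hist.sum(), 1e-12)                        # normalise the descriptor
```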
After computing features, these are further transformed into visual words in a BoVW approach. Compared to text, which is usually semi-structured, visual appearance is less structured and exhibits large variations when the same underlying visual pattern appears under varying illumination, different viewpoints and occlusion. A possible solution to resolve the ambiguity of visual pattern representation is to consider its spatio-temporal context [7].
Yuan et al. [35] introduced the concept of the visual phrase to overcome the underlying over-representation and under-representation challenges in visual word representation. Visual phrases are used to represent the spatial co-occurrence of visual words. Meng et al. [36] used spatio-temporal cues for object search in video, finding the top k trajectories that are the most likely candidates to contain the object.
For video representation, Zhao et al. [37] used a deep spatial net (VGGNet [38]) and a temporal net (Two-Stream ConvNets [39]). They extracted features from the proposed frame-diff layer and from convolutional layers and pooled them with trajectory temporal pooling and line temporal pooling strategies. These descriptors were further pooled with VLAD (Vector of Locally Aggregated Descriptors) for video representation. Xu et al. [40] combined a trainable VLAD encoding process with a recurrent convolutional network (RCN) architecture and developed a novel SeqVLAD (Sequential VLAD). Murtaza et al. [41] also proposed a novel encoding algorithm, DA-VLAD, for video representation by exploiting the code-word representation of each action.
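Since several of these works build on VLAD, the following self-contained sketch recalls the basic encoding: each descriptor is assigned to its nearest dictionary centre, the residuals per centre are summed, and the concatenated vector is power- and L2-normalised. The function name is illustrative.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """VLAD encoding of one video.

    descriptors : (N, D) local descriptors
    centers     : (K, D) visual dictionary (e.g., from k-means)
    """
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                         # nearest centre per descriptor
    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        sel = descriptors[assign == k]
        if len(sel):
            v[k] = (sel - centers[k]).sum(axis=0)      # residual aggregation per centre
    v = np.sign(v) * np.sqrt(np.abs(v))                # power normalisation
    v = v.ravel()
    return v / max(np.linalg.norm(v), 1e-12)           # L2 normalisation
```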
Rahmani et al. [42] proposed a novel deep fully-connected neural network for the recognition of human actions from novel viewpoints, which they called the non-linear knowledge transfer model (R-NKTM). Using dense trajectories for feature representation, their proposed model performed better than traditional methods. In addition, they stated that performance could be further improved by combining multiple feature representation approaches such as dense trajectories, HOG, HOF and motion boundary histograms.
Wang et al. [43] trained a deep neural network using spatio-temporal features. They used a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) units for the extraction of spatial and temporal motion information, respectively. Local and semantic characteristics were used to distinguish an object from its background, and its spatial information was extracted using a CNN. Attention over different spatial and temporal positions in the video was learned using a temporal-wise attention model.
Khan et al. [44] improved the performance of human action recognition methods using a hybrid feature extraction model. For action classification, they used a multilayer neural network on selected fused features and achieved good performance on the KTH dataset.
Many researchers have extended the traditional 2D convolutional neural network to integrate temporal information. A 3D CNN was proposed by Hara et al. [45] to include temporal information, using a 3D kernel to extract information from both the spatial and the temporal domains. Limiting the spatial extent seems reasonable, whereas limiting the temporal extent to a few frames appears insufficient for extracting motion information. Ng et al. [46] proposed that, instead of learning temporal information from a few frames, temporal pooling can be applied to integrate temporal information within the network layers.
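As an illustration of the 3D convolution idea, the following PyTorch snippet (with layer sizes chosen arbitrarily) applies a kernel that spans frames as well as pixels, so the layer responds to short motion patterns rather than to individual frames.

```python
import torch
import torch.nn as nn

# A single 3D convolution: the kernel covers time as well as space.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7),     # (time, height, width)
                   stride=(1, 2, 2), padding=(1, 3, 3))

clip = torch.randn(1, 3, 16, 112, 112)        # (batch, channels, frames, H, W)
features = conv3d(clip)                        # -> (1, 64, 16, 56, 56)
```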
Simonyan et al. [39] extended the convolutional neural network by proposing a two-stream architecture. Similar to the way the human visual cortex learns appearance and motion information through the ventral (object recognition) and dorsal (motion recognition) streams, they used two pathways to learn spatial and temporal information for appearance and motion, respectively. The spatial stream used a model pretrained on ImageNet, while the temporal stream was trained on an action recognition dataset to address the unavailability of models pretrained on flow information.
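A minimal sketch of the two-stream idea follows. The backbone (ResNet-18 here), the number of classes and the flow stack length L are illustrative assumptions rather than the original architecture, and pretraining is omitted to keep the snippet self-contained; the key point is the late fusion of class scores from an RGB stream and a stacked optical-flow stream.

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes, L = 101, 10                      # illustrative values only

# Spatial stream: operates on a single RGB frame.
spatial = models.resnet18(weights=None)
spatial.fc = nn.Linear(spatial.fc.in_features, num_classes)

# Temporal stream: operates on a stack of L flow fields (2L channels).
temporal = models.resnet18(weights=None)
temporal.conv1 = nn.Conv2d(2 * L, 64, kernel_size=7, stride=2, padding=3, bias=False)
temporal.fc = nn.Linear(temporal.fc.in_features, num_classes)

rgb = torch.randn(1, 3, 224, 224)
flow_stack = torch.randn(1, 2 * L, 224, 224)
# Late fusion: average the class-score distributions of the two streams.
scores = (spatial(rgb).softmax(-1) + temporal(flow_stack).softmax(-1)) / 2
```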
Following the success of the two-stream architecture, Feichtenhofer et al. [47] extended the two-stream CNN and proposed a spatio-temporal residual network. Motivated by the success of ResNet in ILSVRC 2015, they pretrained ResNet on ImageNet for action recognition. They embedded residual connections between the appearance and motion streams to allow hierarchical learning of spatio-temporal features across both streams.
Hara et al. [45] also utilized the ResNet model for human action classification. They made use of 3D convolutional kernels to integrate temporal information in a residual network. They trained and evaluated their proposed 3D ResNet model on relatively large action recognition datasets (Kinetics and ActivityNet). Temporal ResNet [48] also integrated temporal information in a spatial residual network by modifying the basic residual function with an additional temporal convolution block.
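To illustrate how temporal information enters a residual block, the sketch below keeps the identity shortcut of ResNet while replacing the 2D convolutions with 3D ones; channel counts and input sizes are arbitrary and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """Residual block with 3D convolutions, in the spirit of 3D ResNet:
    the identity shortcut is kept while the kernels also span time."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)             # residual (shortcut) connection

block = BasicBlock3D(64)
y = block(torch.randn(1, 64, 16, 28, 28))     # (batch, C, frames, H, W)
```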
5. Conclusions and Future Work
In this paper, a dynamic spatio-temporal Bag of Expressions model for human action recognition has been introduced. Not only does it preserve the inherent qualities of the classical BoVW approach, but it also acquires contextual information that would be lost with a conventional formation of the visual words representation. We proposed a new Bag of Expressions generation method based on the density of a spatio-temporal cube of visual words. The new model addresses the inter-class variation challenge observed in realistic scenarios by using a class-specific representation for visual expression generation. Viewpoint variation and occlusion challenges are also partially addressed by considering the spatio-temporal relations of the visual words representation. The model was tested on four publicly available datasets, namely KTH, UCF Sports, UCF11 and UCF50. D-STBoE consistently outperformed all state-of-the-art 'traditional' methods while achieving competitive results when compared to deep learning approaches, which demonstrated the best performance on two of the four datasets. In terms of practical deployment, D-STBoE could be a competitive solution compared to deep learning based systems since, unlike them, it does not require a large dataset for training; that is, D-STBoE is better suited for practical applications where the availability of training data is usually limited.
Future work will explore the potential advantage of using spatio-temporal cube density based neighborhood information with deep learning features and methods, and will also investigate exploiting spatio-temporal neighborhood information for action detection in even more realistic and complex scenarios.