A Comprehensive Review on Handcrafted and Learning-Based Action Representation Approaches for Human Activity Recognition

Human activity recognition (HAR) is an important research area in the fields of human perception and computer vision due to its wide range of applications. These applications include intelligent video surveillance, ambient assisted living, human-computer interaction, human-robot interaction, entertainment, and intelligent driving. Recently, with the emergence and successful deployment of deep learning techniques for image classification, researchers have migrated from traditional handcrafted representations to deep learning techniques for HAR. Handcrafted representation-based approaches are still widely used, however, because of bottlenecks such as the computational complexity of deep learning techniques for activity recognition. At the same time, handcrafted representations cannot handle complex scenarios due to their inherent limitations, so resorting to deep learning-based techniques is a natural option. This review paper presents a comprehensive survey of both handcrafted and learning-based action representations, offering comparison, analysis, and discussion of these approaches. In addition, the well-known public datasets available for experimentation and the important applications of HAR are presented to provide further insight into the field. This is the first review paper of its kind that presents all these aspects of HAR in a single article with comprehensive coverage of each part. Finally, the paper concludes with important discussions and research directions in the domain of HAR.


Introduction
In recent years, automatic human activity recognition (HAR) based on computer vision has drawn much attention from researchers around the globe due to its promising results. The major applications of HAR include Human-Computer Interaction (HCI), intelligent video surveillance, ambient assisted living, human-robot interaction, entertainment, and content-based video search. In HCI, an activity recognition system observes the task carried out by the user and guides him/her to complete it by providing feedback. In video surveillance, an activity recognition system can automatically detect a suspicious activity and report it to the authorities for immediate action. Similarly, in entertainment, these systems can recognize the activities of different players in a game. Depending on their complexity and duration, activities fall into four categories, i.e., gestures, actions, interactions, and group activities [1], as shown in Figure 1.

Figure 1. Categories of human activities: gesture, action, human-object interaction, human-human interaction, and group activity.
Gesture: A gesture is defined as a basic movement of the human body parts that carries some meaning. 'Head shaking', 'hand waving', and 'facial expressions' are good examples of gestures. Usually, a gesture takes a very short amount of time, and its complexity is the lowest among the four mentioned categories.
Action: An action is a type of activity performed by a single person. In fact, it is a combination of multiple gestures (atomic actions). 'Walking', 'running', 'jogging', and 'punching' are good examples of human actions.
Interaction: An interaction is a type of activity performed by two actors. One actor must be a human, and the other may be a human or an object. Thus, it can be a human-human interaction or a human-object interaction. 'Fighting between two persons', 'shaking hands', and 'hugging each other' are examples of human-human interaction, while 'a person using an ATM', 'a person using a computer', and 'a person stealing a bag' are examples of human-object interaction.
Group Activity: This is the most complex type of activity; it is a combination of gestures, actions, and interactions. It involves more than two humans and one or more objects. A 'group of people protesting', 'two teams playing a game', and a 'group meeting' are good examples of group activities.
Since the 1980s, researchers have been working on human action recognition from images and videos. One of the important directions that researchers have followed for action recognition mirrors the working of the human vision system. At a low level, the human vision system receives a series of observations regarding the movement and shape of the human body in a short span of time. These observations are then passed to the intermediate human perception system, which recognizes the class of the observed behaviour, such as walking, jogging, or running. The human vision and perception system is robust and very accurate in recognizing observed movements and human activities. To achieve a similar level of performance with a computer-based recognition system, researchers have made considerable efforts over the past few decades. Unfortunately, due to the many challenges involved in HAR, such as environmental complexity, intra-class variations, viewpoint variations, occlusions, and the non-rigid shapes of humans and objects, we are still very far from the level of the human vision system. What we have achieved so far may be a fraction of what a mature human vision system can do.
Based on a comprehensive investigation of the literature, vision-based human activity recognition approaches can be divided into two major categories. (1) The traditional handcrafted representation-based approach, which is built on expert-designed feature detectors and descriptors such as Hessian3D, Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), Enhanced Speeded-Up Robust Features (ESURF), and Local Binary Patterns (LBP), followed by a generic trainable classifier for action recognition, as shown in Figure 2. (2) The learning-based representation approach, a recently emerged approach capable of learning features automatically from raw data. This eliminates the need for the handcrafted feature detectors and descriptors required for action representation in the traditional approach. Unlike the traditional handcrafted approach, it uses a trainable feature extractor followed by a trainable classifier, introducing the concept of end-to-end learning, as shown in Figure 3.
The handcrafted representation-based approach mainly follows a bottom-up strategy for HAR. Generally, it consists of three major phases (foreground detection, handcrafted feature extraction and representation, and classification), as shown in Figure 4. A good number of survey papers have been published on the different phases of handcrafted representation-based HAR, using different taxonomies to organize the approaches. The survey presented in [1] divides activity recognition approaches into two major categories: single-layered approaches and hierarchical approaches. Single-layered approaches recognize simple activities from a video sequence, while hierarchical approaches recognize more complex activities by decomposing them into simple activities (sub-events). These are further sub-categorized, for example into space-time volumes and trajectories, based on the feature representation and classification methods used for recognition. A detailed survey on object segmentation techniques is presented in [2], discussing the challenges, resources, libraries, and public datasets available for object segmentation. Another study, presented in [3], discussed three levels of HAR: core technology, HAR systems, and applications. Activity recognition systems are significantly affected by challenges such as occlusion, anthropometry, execution rate, background clutter, and camera motion, as discussed in [4]. That survey categorized existing methods by their ability to handle these challenges and, on this basis, identified potential research areas. In [5], human action recognition methods based on feature representation and classification were discussed. Similarly, Weinland et al. [6] surveyed human activity recognition methods by categorizing them into segmentation, feature representation, and classification. A review article on semantic-based human action recognition methods is presented in [7]; it covers the state-of-the-art activity recognition methods that use semantic-based features, and also defines and discusses the semantic space and semantic features such as pose, poselets, related objects, attributes, and scene context. Various handcrafted feature extraction and representation methods have been proposed for human action recognition [8][9][10][11][12].
On the other hand, the learning-based representation approach, specifically deep learning, uses computational models with multiple processing layers based on representation learning with multiple levels of abstraction. This learning encompasses a set of methods that enable the machine to process data in raw form and automatically transform it into a representation suitable for classification; this is what we call trainable feature extractors. The transformation is handled across layers: an image arrives as an array of pixels; the first layer transforms it into edges at particular locations and orientations; the second layer represents it as a collection of motifs by recognising particular arrangements of edges; the third layer may combine the motifs into parts; and the following layers turn the parts into recognizable objects. These layers are learned from the raw data using a general-purpose learning procedure and do not need to be designed manually by experts [13]. Such learning-based methods have also been applied in various computer-based fields such as 3D games and animation systems [14,15], physical sciences and health-related problems [16][17][18], and natural sciences and industrial-academic systems [19,20].
One of the important components of a vision-based activity recognition system is the camera/sensor used for capturing the activity. The choice of camera has a great impact on the overall functionality of the recognition system; indeed, these cameras have been instrumental to the progression of research in the field of computer vision [21][22][23][24][25]. According to the nature and dimensionality of the images they capture, cameras are broadly divided into two categories, i.e., 2D and 3D/depth cameras. Objects in the real world exist in 3D form: when they are captured with a 2D camera, one dimension is lost, and with it some important information. To avoid this loss of information, researchers are motivated to use 3D cameras for capturing activities. For the same reason, 3D-based approaches provide higher accuracy than 2D-based approaches, but at a higher computational cost. Recently, some efficient 3D cameras have been introduced for capturing images in 3D form; among these, 3D Time-of-Flight (ToF) cameras and the Microsoft Kinect have become very popular for 3D imaging. However, these sensors also have several limitations: they capture only the frontal surfaces of humans and other objects in the scene, their range is limited to about 6-7 m, and their data can be distorted by light scattered from reflective surfaces [26]. There is no universal rule for selecting the appropriate camera; the choice mainly depends on the nature of the problem and its requirements.
A good number of survey and review papers have been published on HAR and related processes. However, given the large amount of ongoing work in the area, published reviews quickly become out-of-date; for the same reason, writing a review paper on human activity recognition is a hard and challenging task. In this paper, we provide a discussion, comparison, and analysis of state-of-the-art human activity recognition methods based on both handcrafted and learning-based action representations, along with the well-known datasets and important applications. This is the first review article of its kind that covers all these aspects of HAR in a single article with more recent publications. However, this review focuses on human gesture and action recognition techniques and provides only limited coverage of complex activities such as interactions and group activities. The rest of the paper is organized as follows. Handcrafted representation and recognition-based approaches are covered in Section 2; learning-based representation approaches are discussed in Section 3; Section 4 presents the well-known public datasets; important applications of HAR are presented in Section 5; and discussions and conclusions are given in Section 6.

Handcrafted Representation-Based Approach
The traditional approach for action recognition is based on handcrafted action representation. This approach has been popular in the HAR community and has achieved remarkable results on different well-known public datasets. In this approach, important features are extracted from the sequence of image frames and a feature descriptor is built using expert-designed feature detectors and descriptors. Classification is then performed by training a generic classifier such as a Support Vector Machine (SVM) [27]. This approach includes space-time, appearance-based, local binary pattern, and fuzzy logic-based techniques, as shown in Figure 4.

Space-Time-Based Approaches
Space-time-based approaches have four major components: a space-time interest point (STIP) detector, a feature descriptor, a vocabulary builder, and a classifier [28]. STIP detectors are further categorized into dense and sparse detectors. Dense detectors, such as V-FAST, the Hessian detector, and dense sampling, densely cover all the video content when detecting interest points, while sparse detectors, such as the cuboid detector, Harris3D [29], and the Spatial-Temporal Implicit Shape Model (STISM), use a sparse (local) subset of this content. Various STIP detectors have been developed by different researchers, such as [30,31]. Feature descriptors are likewise divided into local and global descriptors. Local descriptors, such as the cuboid descriptor, Enhanced Speeded-Up Robust Features (ESURF), and N-jets, are based on local information such as texture, colour, and posture, while global descriptors use global information such as illumination changes, phase changes, and speed variation across a video. The vocabulary builders or aggregating methods are based on the bag-of-words (BOW) or state-space model. Finally, for classification, a supervised or unsupervised classifier is used, as shown in Figure 5.
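As a concrete illustration of a sparse STIP detector, the following is a minimal numpy-only sketch in the spirit of Harris3D: it builds the 3x3 spatio-temporal structure tensor from finite-difference gradients and scores each voxel with a cornerness measure. The box-filter smoothing, the response det(M) - k*trace(M)^3, and all parameter values are simplifications of ours, not the published detector.

```python
import numpy as np

def harris3d_response(volume, kappa=0.005, radius=1):
    """Simplified Harris3D-style response for a (T, H, W) video volume.

    Entries of the spatio-temporal structure tensor are locally averaged
    with a small box filter (a stand-in for Gaussian smoothing), and each
    voxel is scored with det(M) - kappa * trace(M)^3.
    """
    gt, gy, gx = np.gradient(volume.astype(np.float64))
    prods = {"xx": gx * gx, "yy": gy * gy, "tt": gt * gt,
             "xy": gx * gy, "xt": gx * gt, "yt": gy * gt}

    def box(a, r=radius):
        # Crude separable box filter along each of the three axes.
        k = np.ones(2 * r + 1) / (2 * r + 1)
        for ax in range(3):
            a = np.apply_along_axis(lambda v: np.convolve(v, k, "same"), ax, a)
        return a

    S = {key: box(val) for key, val in prods.items()}
    det = (S["xx"] * (S["yy"] * S["tt"] - S["yt"] ** 2)
           - S["xy"] * (S["xy"] * S["tt"] - S["yt"] * S["xt"])
           + S["xt"] * (S["xy"] * S["yt"] - S["yy"] * S["xt"]))
    trace = S["xx"] + S["yy"] + S["tt"]
    return det - kappa * trace ** 3

# A bright square translating to the right produces spatio-temporal corners.
video = np.zeros((8, 32, 32))
for t in range(8):
    video[t, 10:16, 4 + 2 * t:10 + 2 * t] = 1.0
R = harris3d_response(video)
peak = np.unravel_index(np.argmax(R), R.shape)  # strongest interest point
```

In a full pipeline, local maxima of `R` above a threshold would become the interest points around which a descriptor (e.g., HOG/HOF over a cuboid) is computed.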


Space-Time Volumes (STVs)
The features in the space-time domain are represented as 3D spatio-temporal cuboids, called space-time volumes (STVs). The core of STV-based methods is a similarity measure between two volumes for action recognition. In [32], an action recognition system was proposed using template matching; instead of space-time volumes, it used templates composed of a 2D binary motion-energy image (MEI) and a motion-history image (MHI) for action representation, followed by a simple template matching technique for action recognition. This work was extended in [33], where the MHI and two appearance-based features, namely the foreground image and the histogram of oriented gradients (HOG), were combined for action representation, followed by a simulated annealing multiple instance learning support vector machine (SMILE-SVM) for action classification. The method proposed in [34] also extended [32] from 2D to 3D space for view-independent human action recognition using a volume motion template. Experimental results for these techniques are presented in Table 1.
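The MEI/MHI idea can be sketched in a few lines of numpy. The sketch below is ours, in the spirit of [32] rather than a reproduction of it: motion is approximated by thresholded frame differencing instead of proper silhouette extraction, and the threshold and decay parameter tau are illustrative choices.

```python
import numpy as np

def mei_mhi(frames, tau=None, thresh=0.1):
    """Compute a motion-energy image (MEI) and motion-history image (MHI).

    frames : (T, H, W) grayscale video with values in [0, 1].
    MEI : binary union of all locations where motion occurred in the clip.
    MHI : recency-weighted motion; newer motion has higher intensity.
    """
    T = len(frames)
    tau = tau or T
    H, W = frames[0].shape
    mhi = np.zeros((H, W))
    for t in range(1, T):
        motion = np.abs(frames[t] - frames[t - 1]) > thresh  # crude motion mask
        # Decay old history by one step, then stamp current motion at full value.
        mhi = np.where(motion, tau, np.maximum(mhi - 1, 0))
    mei = (mhi > 0).astype(np.uint8)   # energy image = any surviving motion
    return mei, mhi / tau              # normalize history to [0, 1]

# A square translating right leaves a fading motion "trail" in the MHI.
frames = np.zeros((5, 16, 16))
for t in range(5):
    frames[t, 4:8, 2 + 2 * t:6 + 2 * t] = 1.0
mei, mhi = mei_mhi(frames)
```

The resulting MEI/MHI pair can then be compared against stored action templates with any simple distance, which is what makes this representation attractive for cheap template matching.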


Space-Time Trajectory
Trajectory-based approaches interpret an activity as a set of space-time trajectories. In these approaches, a person is represented by 2-dimensional (XY) or 3-dimensional (XYZ) points corresponding to the joint positions of the body. As a person performs an action, his/her joint positions change according to the nature of the action. These changes are recorded as space-time trajectories, which construct a 3D (XYZ) or 4D (XYZT) representation of the action. Space-time trajectory methods work by tracking the joint positions of the body to distinguish different types of actions. Following this idea, many trajectory-based approaches have been proposed for action recognition [46,65,66].
Inspired by dense sampling in image classification, the concept of dense trajectories for action recognition in videos was introduced in [42]. The authors sampled dense points from each image frame and tracked them using displacement information from a dense optical flow field. These trajectories capture the motion information, are robust to irregular motion changes, and achieved state-of-the-art results on challenging datasets. In [48], an extension to [42] was proposed to improve performance in the presence of camera motion: the authors estimated camera motion using the Speeded-Up Robust Features (SURF) descriptor and dense optical flow, which significantly improved the performance of motion-based descriptors such as the histogram of optical flow (HOF) and the motion boundary histogram (MBH). However, incorporating high-density trajectories within a video increases the computational cost, and many attempts have been made to reduce the computational cost of dense trajectory-based methods. For this purpose, a saliency map was used in [57] to extract the salient regions within each image frame; based on the saliency map, a significant number of dense trajectories can be discarded without compromising the performance of the trajectory-based method.
Recently, a human action recognition method for depth movies captured by the Kinect sensor was proposed in [62]. This method represents dynamic skeleton shapes of the human body as trajectories on Kendall's shape manifold; it is invariant to the execution rate of the activity and uses transported square-root vector fields (TSRVFs) of trajectories and the standard Euclidean norm to achieve computational efficiency. Another method, for recognizing the actions of construction workers using dense trajectories, was proposed in [67]. In this method, different descriptors such as HOG, HOF, and the motion boundary histogram (MBH) were computed over the trajectories; among these, the authors reported the highest accuracy with the MBH descriptor and a codebook of size 500. Human action recognition in unconstrained videos is a challenging problem, and few methods have been proposed to this end. For this purpose, a human action recognition method using explicit motion modelling was proposed in [68]; it used visual code words generated from dense trajectories for action representation, without any foreground-background separation step.
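The core tracking step of the dense-trajectory idea can be sketched as follows: sample a dense grid of points and displace each one through per-frame flow fields, in the spirit of [42]. This is a deliberately simplified sketch; a synthetic constant flow stands in for real dense optical flow, and the median filtering of the flow field and the trajectory pruning used in the original method are omitted.

```python
import numpy as np

def track_dense_points(flows, grid_step=4):
    """Track a dense grid of points through per-frame displacement fields.

    flows : list of (H, W, 2) arrays giving (dx, dy) at each pixel, one per
            frame transition -- a stand-in for a dense optical flow field.
    Returns trajectories of shape (num_points, len(flows) + 1, 2) as (x, y).
    """
    H, W, _ = flows[0].shape
    ys, xs = np.mgrid[0:H:grid_step, 0:W:grid_step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    traj = [pts.copy()]
    for flow in flows:
        # Sample the flow at each point's nearest pixel and displace the point.
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, W - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, H - 1)
        pts = pts + flow[yi, xi]
        traj.append(pts.copy())
    return np.stack(traj, axis=1)

# Constant rightward flow of 1 px/frame over 5 frame transitions.
flow = np.zeros((16, 16, 2))
flow[..., 0] = 1.0
trajs = track_dense_points([flow] * 5)
```

In the full method, each trajectory's sequence of displacement vectors (normalized by its total length) becomes the trajectory-shape descriptor, with HOG, HOF, and MBH computed in a tube around it.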

Space-Time Features
The space-time features-based approaches extract features from space-time volumes or space-time trajectories for human action recognition. Generally, these features are local in nature and contain discriminative characteristics of an action. According to the nature of space-time volumes and trajectories, these features can be divided into two categories: sparse and dense. Feature detectors based on interest point detectors such as Harris3D [30] and Dollar [69] are considered sparse, while feature detectors based on optical flow are considered dense. These interest point detectors provide the base for most of the recently proposed algorithms. In [70], interest points were detected using Harris3D [30]; based on these points, the authors built a feature descriptor and used PCA (principal component analysis)-SVM for classification. In [59], the authors proposed a novel local polynomial space-time descriptor based on optical flow for action representation.
The most popular action representation methods in this category are based on the Bag-of-Visual-Words (BoVW) model [71,72] or its variants [73,74]. The BoVW model consists of four steps: feature extraction, codebook generation, encoding and pooling, and normalization. Local features are extracted from the video; a visual dictionary is learned on the training set by a Gaussian Mixture Model (GMM) or K-means clustering; the features are encoded and pooled; and finally the video is represented as a normalized pooled vector, followed by a generic classifier for action recognition. The high performance of the BoVW model is due to effective low-level features such as dense trajectory features [48,75], encoding methods such as the Fisher Vector [74], and space-time co-occurrence descriptors [39]. The improved dense trajectory (iDT) [48] provides the best performance among the space-time features on several public datasets.
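The four BoVW steps above can be made concrete with a toy sketch: a small codebook is learned by K-means on training descriptors, and a video is then represented as an L1-normalized histogram of nearest-code-word assignments. All descriptors below are made-up 2D toy vectors purely for illustration.

```python
# Illustrative BoVW pipeline: K-means codebook (GMM is the alternative named
# above), hard assignment of each descriptor to its nearest code word, and an
# L1-normalized histogram as the final video representation.
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(descriptors, k, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(descriptors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[min(range(k), key=lambda i: dist2(d, centers[i]))].append(d)
        for i, cl in enumerate(clusters):
            if cl:  # recompute each centre as the mean of its cluster
                centers[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return centers

def bovw_histogram(descriptors, centers):
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[min(range(len(centers)), key=lambda i: dist2(d, centers[i]))] += 1.0
    total = sum(hist)
    return [h / total for h in hist]  # L1 normalization

# Toy training descriptors from two kinds of local motion patterns.
train = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9]]
codebook = kmeans(train, k=2)
video_descriptors = [[0.15, 0.05], [0.95, 0.95], [1.0, 1.0]]
print(bovw_histogram(video_descriptors, codebook))
```

The resulting normalized histogram is what a generic classifier such as an SVM would consume.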
The coding methods have played an important role in boosting the performance of these approaches. Recently, a new encoding method named Stacked Fisher Vector (SFV) [52] was developed as an extension of the traditional single-layer Fisher Vector (FV) [74]. Unlike the traditional FV, which encodes all local descriptors at once, SFV first performs encoding in dense sub-volumes, then compresses these sub-volumes into FVs, and finally applies another FV encoding based on the compressed sub-volumes. For a detailed comparison of the single-layer FV and stacked FV, readers are referred to [52].
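To make the single-layer FV concrete, the sketch below computes only the first-order (mean-gradient) terms of a Fisher Vector for a given diagonal GMM: each descriptor is softly assigned to the Gaussians, and the normalized mean deviations are accumulated. This is a hedged simplification; the full FV of [74] also includes second-order (variance) terms and power/L2 normalization, and the GMM parameters here are hand-set rather than learned.

```python
# Simplified first-order Fisher Vector encoding for a diagonal GMM.
# Hand-set toy GMM parameters; the full FV also has second-order terms.
import math

def fisher_vector_means(descriptors, weights, means, sigmas):
    K, D, N = len(weights), len(means[0]), len(descriptors)
    fv = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # Soft assignment (posterior) of x to each Gaussian, in log space.
        logs = []
        for k in range(K):
            ll = math.log(weights[k])
            for d in range(D):
                z = (x[d] - means[k][d]) / sigmas[k][d]
                ll += -0.5 * z * z - math.log(sigmas[k][d]) - 0.5 * math.log(2 * math.pi)
            logs.append(ll)
        m = max(logs)
        post = [math.exp(l - m) for l in logs]
        s = sum(post)
        post = [p / s for p in post]
        # Accumulate normalized deviations from each Gaussian mean.
        for k in range(K):
            for d in range(D):
                fv[k][d] += post[k] * (x[d] - means[k][d]) / sigmas[k][d]
    # Normalize by N * sqrt(weight) and flatten to a single vector.
    return [fv[k][d] / (N * math.sqrt(weights[k])) for k in range(K) for d in range(D)]

# Toy 2-component GMM in 1D; descriptors sit near each component mean.
enc = fisher_vector_means([[0.2], [1.1]], [0.5, 0.5], [[0.0], [1.0]], [[0.5], [0.5]])
print(enc)
```

SFV would apply this same encoding twice: first inside dense sub-volumes, then again over the compressed sub-volume FVs.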

Discussion
Space-time-based approaches have been evaluated by many researchers on different well-known datasets, including simple and complex activities, as recorded in Table 1. Some merits of these approaches are as follows: (1) STVs-based approaches are suitable for recognition of gestures and simple actions. However, these approaches have also produced comparable results on complex datasets such as the Human Motion database (HMDB-51), Hollywood2, and University of Central Florida (UCF-101); (2) The space-time trajectory-based approaches are especially useful for recognition of complex activities. With the introduction of dense trajectories, these approaches have become popular due to their high accuracy on challenging datasets. In recent years, trajectory-based approaches have received a lot of attention due to their reliability under noise and illumination changes; (3) Space-time feature-based approaches have achieved state-of-the-art results on many challenging datasets. It has been observed that descriptors such as HOG3D, HOG/HOF, and MBH are more suitable for handling intra-class variations and motion challenges in complex datasets as compared to local descriptors such as N-jet.
However, these approaches have some limitations: (1) STVs-based approaches are not effective in recognizing multiple persons in a scene; these methods use a sliding window for this purpose, which is neither very effective nor efficient; (2) Trajectory-based approaches are good at analyzing the movement of a person in a view-invariant manner, but correctly localizing the 3D XYZ joint positions of a person is still a challenging task; (3) Space-time features are more suitable for simple datasets; for effective results on complex datasets, a combination of different features is required, which raises the computational complexity. These limitations can hinder real-time applications.

Appearance-Based Approaches
In this section we discuss the 2D (XY) and 3D (XYZ) depth image-based approaches, which use effective shape, motion, or combined shape and motion features for action recognition. The 2D shape-based approaches [76,77] use shape and contour-based features for action representation, and motion-based approaches [78,79] use optical flow or its variants for action representation. Some approaches use both shape and motion features for action representation and recognition [80]. In 3D-based approaches, a model of the human body is constructed for action representation; this model can be based on cylinders, ellipsoids, visual hulls generated from silhouettes, or a surface mesh. Some examples of these methods are 3D optical flow [81], shape histogram [82], motion history volume [83], and 3D body skeleton [84].

Shape-Based Approaches
The shape-based approaches capture local shape features from the human image/silhouette [85]. These methods first obtain the foreground silhouette from an image frame using foreground segmentation techniques. Then, they extract features from the silhouette itself (positive space) or from the surrounding regions of the silhouette (negative space) between the canvas and the human body [86]. Some of the important features that can be extracted from the silhouette are contour points, region-based features, and geometric features. A region-based human action recognition method was proposed in [87]. This method divides the human silhouette into a fixed number of grids and cells for action representation and uses a hybrid Support Vector Machine and Nearest Neighbour (SVM-NN) classifier for action recognition. For practical applications, a human action recognition method should be computationally lean. In this direction, an action recognition method was proposed using Symbolic Aggregate approximation (SAX) shapes [88]. In this method, a silhouette is transformed into time series, and these time series are converted into a SAX vector for action representation, followed by a random forest algorithm for action recognition.
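The SAX conversion mentioned above can be sketched in a few lines: the time series is z-normalized, reduced by piecewise aggregate approximation (PAA), and quantized into symbols using approximate Gaussian breakpoints. This is a generic SAX sketch on a made-up series, not the silhouette-specific pipeline of [88].

```python
# Minimal SAX (Symbolic Aggregate approximation) sketch: z-normalize,
# PAA-reduce, then quantize against Gaussian breakpoints (values here are
# the approximate breakpoints for a 3-symbol alphabet).
import math

def sax(series, n_segments, breakpoints=(-0.43, 0.43), alphabet="abc"):
    # z-normalization
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series)) or 1.0
    z = [(v - mean) / std for v in series]
    # PAA: mean of each equal-length segment
    seg = len(z) // n_segments
    paa = [sum(z[i * seg:(i + 1) * seg]) / seg for i in range(n_segments)]
    # quantize each segment mean against the breakpoints
    word = ""
    for v in paa:
        idx = sum(1 for b in breakpoints if v > b)
        word += alphabet[idx]
    return word

print(sax([1, 1, 2, 2, 8, 8, 9, 9], n_segments=4))  # -> "aacc"
```

In [88] such symbolic words, derived from silhouette time series, feed a random forest classifier.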
In [89], a pose-based view-invariant human action recognition method was proposed based on contour points with a sequence of multi-view key poses for action representation. An extension of this method was proposed in [90]. This method uses the contour points of the human silhouette and a radial scheme for action representation, and a support vector machine as the classifier. In [86] and [91], a region-based descriptor for human action representation was developed by extracting features from the surrounding regions (negative space) of the human silhouette. Another method used pose information for action recognition [92]. In this method, first, scale-invariant features were extracted from the silhouette, and then these features were clustered to build the key poses. Finally, classification was performed using a weighted voting scheme.

Motion-Based Approaches
Motion-based action recognition approaches use motion features for action representation, followed by a generic classifier for action recognition. A novel motion descriptor was proposed in [93] for multi-view action representation. This motion descriptor is based on motion direction and a histogram of motion intensity, followed by a support vector machine for classification. Another method, based on 2D motion templates using motion history images and histograms of oriented gradients, was proposed in [94]. In [50], an action recognition method was proposed based on the key elements of motion encoding and local changes in motion direction encoded with the bag-of-words technique.

Hybrid Approaches
These approaches combine shape-based and motion-based features for action representation. Optical flow and silhouette-based shape features were used for view-invariant action recognition in [95], followed by principal component analysis (PCA) for reducing the dimensionality of the data. Some other methods based on shape and motion information were proposed for action recognition in [96,97]. Coarse silhouette features, radial grid-based features, and motion features were used for multi-view action recognition in [97]. Meanwhile, [80] used shape-motion prototype trees for human action recognition. The authors represented an action as a sequence of prototypes in shape-motion space and used a distance measure for sequence matching. This method was tested on five public datasets and achieved state-of-the-art results. In [98], the authors proposed a method based on action key poses as a variant of Motion Energy Images (MEI) and Motion History Images (MHI) for action representation, followed by a simple nearest-neighbour classifier for action recognition.
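The MEI/MHI temporal templates referred to above have a simple update rule, sketched below: pixels where frame differencing detects motion are set to a maximum duration tau, older motion decays linearly, and the MEI is the binarized MHI. This is a generic illustration of the classic templates, not the key-pose variant of [98]; the motion masks are hand-made.

```python
# Illustrative Motion History Image (MHI) update and Motion Energy Image
# (MEI) binarization; recent motion appears "brighter" in the MHI.

def update_mhi(mhi, motion_mask, tau):
    """motion_mask[y][x] is 1 where frame differencing detects motion."""
    return [[tau if motion_mask[y][x] else max(0, mhi[y][x] - 1)
             for x in range(len(mhi[0]))] for y in range(len(mhi))]

def mei(mhi):
    # MEI: where motion occurred at all within the temporal window.
    return [[1 if v > 0 else 0 for v in row] for row in mhi]

h = [[0, 0, 0]]
h = update_mhi(h, [[1, 0, 0]], tau=3)   # motion at pixel 0
h = update_mhi(h, [[0, 1, 0]], tau=3)   # motion moves to pixel 1
print(h)        # [[2, 3, 0]] -- the more recent motion has the higher value
print(mei(h))   # [[1, 1, 0]]
```

A template-matching classifier then compares Hu moments (or, in [98], key poses) of these images across actions.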

Other Approaches
In this section we discuss two important approaches that do not fit under the above-mentioned categories. These include Local Binary Pattern (LBP)-based and fuzzy logic-based methods.

Local Binary Pattern (LBP)-Based Approaches
The local binary pattern (LBP) [99] is a type of visual descriptor for texture classification. Since its inception, several modified versions of this descriptor, such as [100][101][102], have been proposed for different classification-related tasks in computer vision. A human action recognition method was proposed in [103] based on LBP combined with appearance invariance and a patch matching method. This method was tested on different public datasets and proved to be efficient for action recognition. Another method for activity recognition was proposed using the LBP-TOP descriptor [104]. In this method, the action volume was partitioned into sub-volumes and a feature histogram was generated by concatenating the histograms of the sub-volumes. Using this representation, the authors encoded the motion at three different levels: pixel level (single bin in the histogram), region level (sub-volume histogram), and global level (concatenation of sub-volume histograms). LBP-based methods have also been employed for multi-view human action recognition. In [105], a multi-view human action recognition method was proposed based on contour-based pose features and uniform rotation-invariant LBP, followed by SVM for classification. Recently, another motion descriptor, named Motion Binary Pattern (MBP), was introduced for multi-view action recognition [106]. This descriptor is a combination of the Volume Local Binary Pattern (VLBP) and optical flow. The method was evaluated on the multi-view INRIA Xmas Motion Acquisition Sequences (IXMAS) dataset and achieved 80.55% recognition accuracy.
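The basic LBP operator of [99] is easy to state in code: each pixel's eight neighbours are thresholded against the centre value to form an 8-bit code, and a region is described by the histogram of codes. This is the plain 8-neighbour operator on a toy image; variants such as LBP-TOP and VLBP extend the same idea to spatio-temporal volumes.

```python
# Minimal 8-neighbour LBP: threshold neighbours against the centre pixel to
# form an 8-bit code, then histogram the codes over the region.

def lbp_code(img, y, x):
    c = img[y][x]
    # neighbours in a fixed clockwise order starting top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if img[y + dy][x + dx] >= c:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist

img = [[9, 9, 9],
       [1, 5, 9],
       [1, 1, 1]]
print(lbp_code(img, 1, 1))  # bright upper-right arc -> code 15 (0b00001111)
```

The histogram (concatenated over sub-regions or sub-volumes, as in LBP-TOP) is the feature fed to the classifier.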

Fuzzy Logic-Based Approaches
Traditional vision-based human action recognition approaches employ spatial or temporal features, followed by a generic classifier, for action representation and classification. However, it is difficult to scale up these approaches to handle the uncertainty and complexity involved in real-world applications. For handling these difficulties, fuzzy-based approaches are considered a better choice. In [107], a fuzzy framework was proposed for human action recognition based on fuzzy log-polar histograms and temporal self-similarities for action representation, followed by SVM for action classification. The evaluation of the proposed method on two public datasets confirmed its high accuracy and suitability for real-world applications. Another method based on fuzzy logic was proposed in [108]. This method utilized silhouette slices and movement speed features as input to the fuzzy system, and employed the fuzzy c-means clustering technique to acquire the membership functions for the proposed system. The results confirmed the better accuracy of the proposed fuzzy system as compared to non-fuzzy systems on the same public dataset.
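The fuzzy c-means step used in [108] differs from hard k-means in that every sample receives a graded membership in every cluster, which is what makes it suitable for deriving fuzzy membership functions. Below is a hedged 1D sketch of standard FCM (alternating membership and centre updates) on made-up data, not the silhouette-slice features of the paper.

```python
# Hedged sketch of fuzzy c-means (FCM) in 1D: each sample gets a graded
# membership in every cluster; centres are membership-weighted means.

def fcm(data, centers, m=2.0, iters=10):
    u = []
    for _ in range(iters):
        # membership of each sample in each cluster (standard FCM formula)
        u = []
        for x in data:
            dists = [abs(x - c) or 1e-12 for c in centers]  # guard zero distance
            u.append([1.0 / sum((dists[i] / dj) ** (2 / (m - 1)) for dj in dists)
                      for i in range(len(centers))])
        # recompute centres from membership-weighted samples
        centers = [sum(u[k][i] ** m * data[k] for k in range(len(data))) /
                   sum(u[k][i] ** m for k in range(len(data)))
                   for i in range(len(centers))]
    return centers, u

# Two obvious groups near 0.05 and 0.95; centres settle onto them.
centers, u = fcm([0.0, 0.1, 0.9, 1.0], centers=[0.2, 0.8])
print([round(c, 2) for c in centers])
```

The converged memberships (rather than hard assignments) can then serve directly as the membership functions of a fuzzy inference system.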
Most human action recognition methods are view dependent and can recognize actions from a fixed view only. However, a real-time human action recognition method must be able to recognize the action from any viewpoint. To achieve this objective, many state-of-the-art methods use a multi-camera setup in their processing. However, this is not a practical solution because calibration of multiple cameras in real-world scenarios is quite difficult. The use of a single camera should be the ultimate solution for view-invariant action recognition. Along these lines, a fuzzy logic-based method was proposed in [109] for view-invariant action recognition using a single camera. This method extracted the human contour from a fuzzy qualitative Poisson human model for view estimation, followed by clustering algorithms for view classification, as shown in Figure 6. The results indicate that the proposed method is quite efficient for view-independent action recognition. Some methods based on neuro-fuzzy systems (NFS) have also been proposed for human gesture and action recognition [110,111]. In addition to this, evolving systems [112,113] have also been very successful in behaviour recognition.


Discussion
In this section we compare the appearance-, LBP-, and fuzzy logic-based methods. These approaches are simple and have produced state-of-the-art results on the Weizmann, KTH, and multi-view IXMAS datasets, as recorded in Table 2. There are two major approaches for multi-view human action recognition based on shape and motion features: 3D approaches and 2D approaches [26]. As indicated in Table 2, 3D approaches provide higher accuracy than 2D approaches, but at a higher computational cost, which makes them less applicable for real-time applications. In addition to this, it is difficult to reconstruct a good-quality 3D model because it depends on the quality of the extracted features or silhouettes of the different views. Hence, the model is exposed to deficiencies which might have occurred due to segmentation errors in each viewpoint. Moreover, a good 3D model of different views can only be constructed when the views overlap. Therefore, a sufficient number of viewpoints has to be available to reconstruct a 3D model. (Selected accuracies from Table 2: [120], motion features (3D), 100; Weinland et al. 2006 [83], motion features (3D), 93.33; Turaga et al. 2008 [121], shape-motion (3D), 98.78; Pehlivan and Duygulu 2011 [122], shape features (3D), 90.91; Baumann et al. 2016 [106], LBP, 80.55.)

Learning-Based Action Representation Approach
The performance of human action recognition methods mainly depends on appropriate and efficient representation of the data. Unlike handcrafted representation-based approaches, where the action is represented by handcrafted feature detectors and descriptors, learning-based representation approaches have the capability to learn features automatically from the raw data, thus introducing the concept of end-to-end learning, i.e., transformation from the pixel level to action classification. Some of these approaches are based on evolutionary techniques (genetic programming) and dictionary learning, while others employ deep learning-based models for action representation. We have divided these approaches into two categories: non-deep learning-based approaches and deep learning-based approaches, as shown in Figure 7.

Non-Deep Learning-Based Approaches
These approaches are based on genetic programming and dictionary learning, as discussed in the following sections.

Dictionary Learning-Based Approaches
Dictionary learning is a type of representation learning which is generally based on the sparse representation of the input data. The sparse representation is suitable for categorization tasks in images and videos. Dictionary learning-based approaches have been employed in a wide range of computer vision applications such as image classification and action recognition [123]. The concept of dictionary learning is similar to the BoVW model because both are based on representative vectors learned from a large number of samples. These representative vectors are called code words, forming a codebook, in the BoVW model, and dictionary atoms in the context of dictionary learning. One way to get a sparse representation of the input data is to learn an over-complete basis (dictionary). In [124], three over-complete dictionary learning frameworks were investigated for human action recognition. An over-complete dictionary was constructed from a set of spatio-temporal descriptors, where each descriptor was represented by a linear combination of a small number of dictionary elements for compact representation. A supervised dictionary learning-based method was proposed for human action recognition in [125] based on a hierarchical descriptor. The cross-view action recognition problem was addressed by using a transferable dictionary pair in [126]. In this approach, the authors learned view-specific dictionaries where each dictionary corresponds to one camera view. Moreover, the authors extended this work with a common dictionary which shares information across different views [127]. The proposed approach outperforms state-of-the-art methods on similar datasets. A weakly supervised cross-domain dictionary learning-based method was proposed for visual recognition in [128]. This method learns a discriminative, domain-adaptive, and reconstructive dictionary pair and the corresponding classifier parameters without any prior information.
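The "sparse combination of a small number of dictionary elements" can be illustrated with greedy matching pursuit over a fixed over-complete dictionary. This sketch shows only the coding half of dictionary learning (the cited works additionally learn the atoms themselves, and [130] uses the different locality-constrained LLC objective); the atoms below are hand-picked unit vectors for readability.

```python
# Illustrative sparse coding by greedy matching pursuit over a fixed,
# over-complete dictionary of unit-norm atoms (coding step only; the atoms
# are hand-picked here rather than learned).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matching_pursuit(signal, atoms, n_nonzero=2):
    """Approximate `signal` as a sparse combination of unit-norm `atoms`."""
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        best = max(range(len(atoms)), key=lambda i: abs(dot(residual, atoms[i])))
        c = dot(residual, atoms[best])
        coeffs[best] += c
        residual = [r - c * a for r, a in zip(residual, atoms[best])]
    return coeffs, residual

atoms = [[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]]  # over-complete for R^2
coeffs, residual = matching_pursuit([3.0, 3.0], atoms, n_nonzero=1)
print(coeffs)  # only the diagonal atom is needed for this signal
```

A single well-matched atom reconstructs the signal almost exactly, which is precisely the compactness argument made for over-complete dictionaries above.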
Dictionary learning-based methods also use unsupervised learning; for example, Zhu, F. et al. [129] proposed an unsupervised approach for cross-view human action recognition. This method does not require target view label information or correspondence annotations for action recognition. The set of low-level trajectory features is coded using locality-constrained linear coding (LLC) [130] to form the coding descriptors; then peak values are pooled to form a histogram that captures the local structure of each action.

Genetic Programming
Genetic programming is a powerful evolutionary technique inspired by the process of natural evolution. It can be used to solve problems without prior knowledge of the solutions. In human activity recognition, genetic programming can be employed to identify the sequence of unknown primitive operations that maximizes the performance of the recognition task. Recently, a genetic programming-based approach was introduced for action recognition in [131]. In this method, instead of using handcrafted features, the authors automatically learned spatio-temporal motion features for action recognition. The motion feature descriptor was evolved on a population of 3D operators such as the 3D Gabor filter and wavelet. In this way, effective features were learnt for action recognition. This method was evaluated on three challenging datasets and outperformed handcrafted as well as other learning-based representations.

Deep Learning-Based Approaches
Recent studies show that there is no universally best handcrafted feature descriptor for all datasets; therefore, learning features directly from the raw data may be more advantageous. Deep learning is an important area of machine learning which is aimed at learning multiple levels of representation and abstraction that can make sense of data such as speech, images, and text. These approaches have the ability to process images/videos in their raw forms and automate the process of feature extraction, representation, and classification. They use trainable feature extractors and computational models with multiple processing layers for action representation and recognition. Based on the research study on deep learning presented in [132], we have classified the deep learning models into three categories: (1) generative/unsupervised models (e.g., Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs), Restricted Boltzmann Machines (RBMs), and regularized auto-encoders); (2) discriminative/supervised models (e.g., Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs)); (3) hybrid models, which use the characteristics of both: for example, a discrimination goal may be assisted by the outcome of a generative model. However, we do not discuss hybrid models separately; rather, we discuss them within either the supervised or the unsupervised category.

Generative/Unsupervised Models
Generative/unsupervised deep learning models do not require target class labels during the learning process. These models are specifically useful when labelled data are relatively scarce or unavailable. Deep learning models have been investigated since the 1960s [133], but researchers had paid little attention to them. This was mainly due to the success of shallow models such as SVMs [134], and the unavailability of the huge amounts of data required for training deep models.
A remarkable surge in the history of deep models was triggered by the work of [135], where a highly efficient DBN and training algorithm were introduced, followed by a feature reduction technique [136]. The DBN was trained layer by layer using RBMs [137]; the parameters learned during this unsupervised pre-training phase were fine-tuned in a supervised manner using backpropagation. Since the introduction of this efficient model, there has been a lot of interest in applying deep learning models to different applications such as speech recognition, image classification, object recognition, and human action recognition.
A method using unsupervised feature learning from video data was proposed in [138] for action recognition. The authors used the independent subspace analysis algorithm for learning spatio-temporal features, combined with deep learning techniques such as convolution and stacking for action representation and recognition. Deep Belief Networks (DBNs) trained with RBMs were used for human action recognition in [139]. The proposed approach outperforms the handcrafted learning-based approaches on two standard datasets. Learning continuously from streaming video without any labels is an important but challenging task. This issue was addressed in [140] by using an unsupervised deep learning model. Most action datasets have been recorded under controlled environments; action recognition from unconstrained videos is a challenging task. A method for human action recognition from unconstrained video sequences was proposed in [141] using DBNs.
Unsupervised learning played a pivotal role in reviving the interest of researchers in deep learning. However, it has been overshadowed by purely supervised learning since the major breakthrough in deep learning used CNNs for object recognition [142]. Nevertheless, an important study by the pioneers of the latest deep learning models suggests that unsupervised learning is going to be far more important than its supervised counterpart in the long run [13], since we discover the world by observing it rather than being told the name of every object; human and animal learning is mostly unsupervised.

Discriminative/Supervised Models
According to the literature on human action recognition, the most frequently used model in the supervised category is the Convolutional Neural Network (CNN or ConvNet). The CNN [143] is a type of deep learning model which has shown excellent performance at tasks such as pattern recognition, hand-written digit classification, image classification, and human action recognition [142,144]. This is a hierarchical learning model with multiple hidden layers that transforms the input volume into an output volume. Its architecture consists of three main types of layers: the convolutional layer (Convolution and Rectified Linear Unit, CONV + ReLU), the pooling layer, and the fully-connected layer, as shown in Figure 8. Understanding the operation of the different layers of a CNN requires mapping their activities back into pixel space; this is done with the help of Deconvolutional Networks (Deconvnets) [145]. The Deconvnets use the same process as the CNN but in reverse order, mapping from feature space to pixel space.
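The data flow through the layer types just named can be made concrete with a toy forward pass: a valid 2D convolution with ReLU folded in, followed by 2x2 max pooling. This is a bare illustration on a hand-made image with a hand-set edge kernel; real CNNs stack many such layers, learn the kernels, and finish with fully-connected layers.

```python
# Toy CONV + ReLU and max-pooling forward pass in plain Python.

def conv2d_valid(img, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(len(img) - kh + 1):
        row = []
        for x in range(len(img[0]) - kw + 1):
            s = sum(img[y + i][x + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(max(0.0, s))  # ReLU folded into the conv layer
        out.append(row)
    return out

def max_pool2(fm):
    # non-overlapping 2x2 max pooling
    return [[max(fm[y][x], fm[y][x + 1], fm[y + 1][x], fm[y + 1][x + 1])
             for x in range(0, len(fm[0]) - 1, 2)]
            for y in range(0, len(fm) - 1, 2)]

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = [[-1.0, 1.0]]            # 1x2 kernel responding to 0->1 transitions
fm = conv2d_valid(img, edge)    # fires on the vertical boundary
print(max_pool2(fm))
```

Pooling keeps the strong edge response while shrinking the feature map, which is exactly the spatial down-sampling role of the pooling layer described above.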
Initially, the deep CNN [143] was used for representation and recognition of objects from still images [142]. This was extended to action recognition from videos in [147] using stacked video frames as input to the network, but the results were worse than even the handcrafted shallow representations [48,72]. This issue was investigated in [148], where the idea of a two-stream (spatial and temporal) CNN for action recognition was introduced. An example of a two-stream convolutional neural network is shown in Figure 9. Both streams were implemented as ConvNets: the spatial stream recognizes the action from still video frames, and the temporal stream performs action recognition from the motion in the form of dense optical flow. Afterwards, the two streams were combined using late fusion for action recognition. This method achieved results superior to one of the best shallow handcrafted representation-based methods [48]. However, the two-stream architecture may not be suitable for real-time applications due to its computational complexity.
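The late-fusion step of a two-stream network amounts to combining the per-class scores of the two ConvNets; the sketch below uses a simple weighted average of made-up scores (the original work also evaluated an SVM on the stacked scores).

```python
# Sketch of two-stream late fusion: average the per-class scores of the
# spatial (RGB) and temporal (optical-flow) streams. Scores are made up.

def fuse(spatial_scores, temporal_scores, w=0.5):
    return [w * s + (1 - w) * t for s, t in zip(spatial_scores, temporal_scores)]

classes = ["walk", "run", "jump"]
spatial = [0.5, 0.3, 0.2]    # appearance stream is unsure
temporal = [0.1, 0.8, 0.1]   # motion stream strongly says "run"
fused = fuse(spatial, temporal)
print(classes[max(range(3), key=lambda i: fused[i])])  # prints "run"
```

The point of the example: the motion stream can override an uncertain appearance stream, which is why the temporal stream was the key to beating frame-stacking approaches.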
Most of the deep CNN models for action recognition are limited to handling inputs in 2D form. However, some applications have data in 3D form, which requires a 3D CNN model. This problem was addressed in [149] by introducing a 3D convolutional neural network model for airport surveillance. This model uses features from both spatial and temporal dimensions by performing 3D convolutions at the convolutional layer, and achieved state-of-the-art results on airport video surveillance datasets.
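The core operation, a 3D convolution over a stack of frames, can be sketched naively in NumPy. The single-channel "valid" cross-correlation below is only illustrative; real 3D CNN layers add channels, strides, padding, and learned kernels.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution (really cross-correlation, as in CNNs)
    of a single-channel spatio-temporal volume with one kernel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out

# A toy clip: 16 frames of 32x32 pixels, filtered with a 3x3x3 averaging kernel.
clip = np.random.default_rng(1).normal(size=(16, 32, 32))
feat = conv3d_valid(clip, np.ones((3, 3, 3)) / 27.0)
```

Because the kernel spans the temporal axis, each output value mixes information from three consecutive frames, which is exactly what lets the model capture motion as well as appearance.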
A supervised CNN model, whether 2D or 3D, can also be accompanied by unsupervised components. One such unsupervised technique is slow feature analysis (SFA) [150], which extracts slowly varying features from the input signal in an unsupervised manner. Besides other recognition problems, it has proved effective for human action recognition as well [151]. In [152], two-layered SFA learning was combined with a 3D CNN for automated action representation and recognition. This method achieved state-of-the-art results on three public datasets: KTH, UCF Sports, and Hollywood2. Another type of supervised model is the Recurrent Neural Network (RNN). A method using an RNN was proposed in [153] for skeleton-based action recognition. The human skeleton was divided into five parts, which were separately fed into five subnets. The outputs of these subnets were fused in the higher layers, and the final representation was fed into a single-layer classifier. For further details regarding this model, the reader may refer to [153].
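The idea behind SFA, finding the output directions that vary most slowly over time, can be sketched in its simplest linear form: whiten the signal, then take the eigenvectors of the temporal-derivative covariance with the smallest eigenvalues. The toy two-channel signal below is an assumption for demonstration, not data from [150-152].

```python
import numpy as np

def linear_sfa(X, n_components=1):
    """Minimal linear slow feature analysis: whiten the signal, then pick
    the whitened directions along which the temporal derivative varies least."""
    X = X - X.mean(axis=0)
    # Whitening via eigendecomposition of the covariance matrix.
    d, E = np.linalg.eigh(np.cov(X, rowvar=False))
    Z = X @ (E / np.sqrt(d))          # unit-variance, decorrelated signal
    # Slowness: eigenvectors of the derivative covariance (eigh sorts ascending,
    # so the first columns are the slowest directions).
    dZ = np.diff(Z, axis=0)
    _, E2 = np.linalg.eigh(np.cov(dZ, rowvar=False))
    return Z @ E2[:, :n_components]

# Toy signal: a slow sine mixed with a fast sine in two channels.
t = np.linspace(0, 4 * np.pi, 500)
slow, fast = np.sin(t), np.sin(40 * t)
X = np.column_stack([slow + 0.1 * fast, 0.1 * slow + fast])
y = linear_sfa(X, 1).ravel()
corr = abs(np.corrcoef(y, slow)[0, 1])  # recovered feature tracks the slow source
```

The methods in [150,152] extend this linear core with quadratic expansion and layering; the sketch only conveys the slowness objective itself.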
Deep learning-based models for human action recognition require huge amounts of video data for training. However, collecting and annotating such data is immensely laborious and requires substantial computational resources. Remarkable success has been achieved in the domains of image classification, object recognition, speech recognition, and human action recognition using standard 2D and 3D CNN models. However, issues remain, such as the high computational complexity of training CNN kernels and the large data requirements. To curtail these issues, researchers have been working on variations and adaptations of these models. In this direction, the factorized spatio-temporal convolutional network (FSTCN) was proposed in [154] for human action recognition. This network factorizes the standard 3D CNN model into 2D spatial kernels in the lower layers (spatial convolutional layers), learned sequentially, and 1D temporal kernels in the upper layers (temporal convolutional layers). This reduces the number of parameters to be learned by the network and thus the computational complexity of training the CNN kernels. The detailed architecture is presented in [154].
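The parameter saving from this factorization is easy to see with back-of-the-envelope arithmetic. The kernel extents and channel counts below are illustrative assumptions, not the exact FSTCN layer sizes from [154].

```python
# One layer of full t x k x k spatio-temporal kernels vs. the factorized
# form: k x k spatial kernels followed by length-t temporal kernels.
k, t = 3, 3                 # spatial and temporal kernel extents
c_in, c_out = 64, 64        # hypothetical channel counts

params_3d = c_in * c_out * t * k * k                           # full 3D kernels
params_factorized = c_in * c_out * k * k + c_out * c_out * t   # spatial + temporal

print(params_3d, params_factorized)  # the factorized layer is much smaller
```

With these sizes the factorized layer needs fewer than half the parameters of the full 3D layer, and the gap widens as the temporal extent grows.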
Another approach using spatio-temporal features with a 3D convolutional network was proposed in [155] for human action recognition. The evaluation of this method on four public datasets confirmed three important findings: (1) a 3D CNN is more suitable for spatio-temporal features than a 2D CNN; (2) a CNN architecture with small 3 × 3 × 3 kernels is the best choice for spatio-temporal features; (3) the proposed method with a linear classifier outperforms the state-of-the-art methods. Some studies have reported that incorporating handcrafted features into the CNN model can improve the performance of action recognition. Along this direction, combining information from multiple sources with a CNN was proposed in [156]. The authors used handcrafted features to perform spatially varying soft-gating and used a fusion method for combining multiple CNNs trained on different sources. Recently, another variation of the CNN was proposed in [157], called the stratified pooling-based CNN (SP-CNN). Since each video has a different number of frame-level features, combining them into a single video-level feature is a challenging task. The SP-CNN method addresses this issue with the following variations of the CNN model: (a) adjustment of a pre-trained CNN on the target dataset; (b) extraction of features at the frame level; (c) principal component analysis (PCA) for dimensionality reduction; (d) stratified pooling of frame-level features into video-level features; (e) an SVM for multiclass classification. This architecture is shown in Figure 10.
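Steps (b)-(e) of this pipeline can be sketched with scikit-learn. The random arrays below stand in for real frame-level CNN features, and plain mean pooling stands in for the stratified pooling of [157]; the sketch only shows how variable-length frame features become fixed-length video vectors for an SVM.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)

def video_level_feature(frame_feats, pca):
    """Reduce frame-level features with PCA, then pool them into one
    fixed-length video-level vector (mean pooling stands in for the
    stratified pooling of SP-CNN)."""
    return pca.transform(frame_feats).mean(axis=0)

# Hypothetical frame-level CNN features: each 'video' has a different
# number of frames, each frame a 128-d descriptor; two toy classes with
# a class-dependent offset so the classifier has something to learn.
videos, labels = [], []
for label in (0, 1):
    for _ in range(20):
        n_frames = int(rng.integers(20, 60))
        videos.append(rng.normal(loc=label, size=(n_frames, 128)))
        labels.append(label)

pca = PCA(n_components=16).fit(np.vstack(videos))       # step (c)
X = np.array([video_level_feature(v, pca) for v in videos])  # steps (b)+(d)
clf = LinearSVC().fit(X, labels)                        # step (e)
acc = clf.score(X, labels)
```

The key property, also exploited by SP-CNN, is that pooling makes the video-level vector independent of the frame count, so a standard multiclass SVM can be trained on top.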
Semantic features such as pose and poselets are important cues for describing the category of an action being performed. In this direction, methods based on fuzzy CNNs using local pose-based features were proposed in [158,159]. These descriptors are based on motion and appearance information acquired from tracking human body parts. Evaluated on the Human Motion Database (HMDB), these methods produced results superior to other state-of-the-art methods. It has also been observed that the context or scene in which an action is carried out provides important cues regarding its category. In [160], this contextual information was exploited for human action recognition by adapting the Region-based Convolutional Neural Network (RCNN) [161] to use more than one region for classification, with the actor as the primary region and contextual cues as a secondary region.
One of the major challenges in human action recognition is view variance: the same action viewed from different angles looks quite different. This issue was addressed in [162] using a CNN. The method generates training data by fitting a synthetic 3D human model to real motion and rendering human poses from different viewpoints. The CNN model has shown better performance than handcrafted representation-based methods for multi-view human action recognition. Table 3 compares non-deep learning and deep learning-based methods on different public datasets.

Discussion
In this section, we summarize and discuss the learning-based action representation approaches. These approaches have been divided into genetic programming, dictionary learning, and supervised and unsupervised deep learning-based approaches according to the learning representation used in each category. However, this division is not strict, and approaches may overlap.
Dictionary learning-based approaches have attracted increasing interest from researchers in computer vision, specifically in human activity recognition. These approaches introduced the concept of unified learning of a dictionary and its corresponding classifier in a single procedure, which leads to the concept of end-to-end learning. On the other hand, genetic programming (GP) is a powerful evolutionary method inspired by natural selection, used to solve problems without prior domain knowledge. In human action recognition, GP is used to design holistic descriptors that are adaptive and robust. These methods have achieved state-of-the-art results on challenging action recognition datasets.
Deep learning has emerged as a highly popular direction within machine learning that has outperformed traditional approaches in many applications of computer vision. A highly advantageous property of deep learning algorithms is their ability to learn features from raw data, which eliminates the need for handcrafted feature detectors and descriptors. There are two categories of deep learning models, i.e., unsupervised/generative and supervised/discriminative models. The DBN is a popular generative model which has been used for human action recognition and has achieved high performance on challenging datasets compared to its traditional handcrafted counterparts [139]. On the other hand, the CNN is one of the most popular deep learning models in the supervised category. Most of the existing learning-based representations either apply a CNN directly to video frames or use variations of the CNN for spatio-temporal features. These models have also achieved excellent results on challenging human activity recognition datasets, as recorded in Table 3. So far, supervised deep learning models have achieved better performance, but some studies suggest that unsupervised learning will be far more important in the long run: since we discover the world by observing it rather than being told the name of every object, human and animal learning is mostly unsupervised [13].
Deep learning models also have some limitations: they require huge amounts of data for training. Most action recognition datasets, such as KTH [35], IXMAS [118], HMDB-51 [47], and UCF Sports [43,44], are comparatively small for training these models. However, the large-scale ActivityNet dataset [176] was recently proposed, with 200 action categories and 849 hours of video in total. This dataset is suitable for training deep learning-based algorithms, and we can expect a major breakthrough with the development of algorithms that produce remarkable results on it.

Datasets
In this section, well-known public datasets for human activity recognition are discussed. We focus on recently developed datasets that have been frequently used for experimentation.

Weizmann Human Action Dataset
This dataset [114] was introduced by the Weizmann Institute of Science in 2005. It consists of 10 simple actions with a static background: walk, run, skip, jack, jump forward (jump), jump in place (pjump), gallop-sideways (side), bend, wave1, and wave2. It is considered a good benchmark for the evaluation of algorithms proposed for the recognition of simple actions; some methods, such as [86,87], have reported 100% accuracy on it. The background is simple and only one person performs the action in each frame, as shown in Figure 11.

KTH Human Action Dataset
The KTH dataset [35] was created by the Royal Institute of Technology, Sweden in 2004. It consists of six types of human actions (walking, jogging, running, boxing, hand clapping, and hand waving) performed by 25 actors in 4 different scenarios. Thus, it contains 25 × 6 × 4 = 600 video sequences. These videos were recorded with a static camera and background; therefore, this dataset is also considered relatively simple for the evaluation of human activity recognition algorithms. The method proposed in [36] achieved 98.2% accuracy on this dataset, the highest reported so far. One example frame of each action in the four different scenarios is shown in Figure 12.


IXMAS Dataset
INRIA Xmas Motion Acquisition Sequences (IXMAS) [118], a multi-view dataset, was developed in 2006 for the evaluation of view-invariant human action recognition algorithms. It consists of 13 daily life actions performed by 11 actors 3 times each: crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing. These actions were recorded with five calibrated cameras, including 4 side cameras and a top camera. The extracted silhouettes of the video sequences are also provided for experimentation. Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches. The 3D approaches have reported higher accuracies than the 2D approaches on this dataset, but at a higher computational cost. The highest accuracy reported on this dataset is 100% in [120], using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)). Example frames for each action from the five camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition, developed by the Serre Lab, Brown University, USA in 2011. It consists of 51 types of daily life actions comprising 6849 video clips collected from different sources such as movies, YouTube, and Google videos.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47]  of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47]  of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47]  of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The wave Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition developed by Serre lab, Brown University, USA in 2011.It consists of 51 types of daily life actions comprised of 6849 video clips collected from different sources such as movies, YouTube, and Google videos.The Appl.Sci.2017, 7, 110 19 of 37 of 13 daily life actions performed by 11 actors 3 times each.These actions include crossing arms, stretching head, sitting down, checking watch, getting up, walking, turning around, punching, kicking, waving, picking, pointing, and throwing.These actions were recoded with five calibrated cameras including 4 side cameras and a top camera.The extracted silhouettes of the video sequences are also provided for experimentation.Basically, two types of approaches have been proposed for multi-view action recognition, i.e., 2D and 3D-based approaches.The 3D approaches have reported higher accuracies than the 2D approaches on this dataset but at higher computational cost.The highest accuracy reported on this dataset is 100% in [120] using 3D motion descriptors (HOF3D descriptors and 3D spatial pyramids (SP)).The example frames for each action from five different camera views are shown in Figure 13.

HMDB-51
The HMDB-51 [47] is one of the largest datasets available for activity recognition, developed by the Serre Lab, Brown University, USA, in 2011. It consists of 51 types of daily life actions comprising 6849 video clips collected from different sources such as movies, YouTube, and Google videos. The highest accuracy reported so far on this dataset is 74.7% in [157], using SP-CNN (as shown in Table 4). One example frame for each action is shown in Figure 14a,b.
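Accuracies such as the 74.7% above are conventionally reported on HMDB-51 as mean class accuracy, averaged over the dataset's three official train/test splits. A minimal sketch of that metric, using hypothetical labels and predictions for illustration:

```python
from collections import defaultdict

def mean_class_accuracy(y_true, y_pred):
    """Average the per-class accuracies so that every action class
    contributes equally, regardless of how many clips it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical predictions over two action classes:
y_true = ["run", "run", "kick", "kick"]
y_pred = ["run", "kick", "kick", "kick"]
print(mean_class_accuracy(y_true, y_pred))  # 0.75
```

Here the "run" class scores 0.5 and "kick" scores 1.0, so the mean class accuracy is 0.75 even though 3 of 4 clips are correct.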

Hollywood2
The Hollywood2 action dataset [56] was created by INRIA (Institut National de Recherche en Informatique et en Automatique), France, in 2009. It consists of 12 actions (get out of car, answer phone, kiss, hug, handshake, sit down, stand up, sit up, run, eat, fight, and drive car) with dynamic backgrounds. The dataset is very challenging: it consists of short unconstrained movie clips with multiple persons, cluttered backgrounds, camera motion, and large intra-class variations, and it is meant for evaluating HAR algorithms in real-life scenarios. Many researchers have evaluated their algorithms on this dataset; the best accuracy achieved so far is 75.2% in [169], using rank pooling and CNN. Some example frames from the Hollywood2 dataset are shown in Figure 15.

UCF-101 Action Recognition Dataset
The UCF-101 action recognition dataset [173] was created by the Centre for Research in Computer Vision, University of Central Florida, USA, in 2012. It is one of the largest action datasets, containing 101 action categories collected from YouTube, and is an extension of the UCF-50 dataset [177] with 50 action categories. UCF-101 contains 13,320 videos in total, and is aimed at encouraging researchers to develop algorithms for human action recognition in realistic scenarios. Example frames for each action are shown in Figure 16a,b.
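UCF-101 is distributed with three official train/test split files (e.g., trainlist01.txt), whose lines pair a relative clip path with a numeric class label; the test lists omit the label column. A small sketch of parsing such lines (the sample lines below are illustrative, not quoted from the files):

```python
def parse_ucf101_split(lines):
    """Parse split-file lines of the form 'Class/video.avi label'.
    Returns (path, action_name, label) tuples; label is None when the
    line has no label column, as in the official test lists."""
    samples = []
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        path = parts[0]
        label = int(parts[1]) if len(parts) > 1 else None
        action = path.split("/")[0]  # class name is the directory
        samples.append((path, action, label))
    return samples

# Illustrative lines in the official format:
lines = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1",
         "Archery/v_Archery_g01_c02.avi 2"]
for path, action, label in parse_ucf101_split(lines):
    print(action, label)
```

Published results on UCF-101 are typically averaged over the three splits, so a loader like this would be run once per split file.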

UCF Sports Action Dataset
UCF sports action dataset was created by the Centre for Research in Computer Vision, University of Central Florida, USA, in 2008 [43,44]. It consists of 11 sports action categories (walking, swing-side, swing-bench, skateboarding, running, lifting, kicking, golf swing, riding, and diving) broadcast on television channels. The dataset includes a total of 150 video sequences of realistic scenarios. The best accuracy achieved on this dataset so far is 95.0% in [36] using STVs, as shown in Table 4. Example frames for each action are shown in Figure 17.

YouTube Action Dataset
YouTube action dataset [64] was developed in 2009. This is a challenging dataset due to camera motion, viewpoint variations, illumination conditions, and cluttered backgrounds. It contains 11 action categories: biking, diving, basketball shooting, horse riding, swinging, soccer juggling, trampoline jumping, volleyball spiking, golf swinging, tennis swinging, and walking with a dog. The highest accuracy achieved so far on this dataset is 93.38% in [52] using FV and SFV. Example frames for each action are shown in Figure 18.

ActivityNet Dataset
ActivityNet [176] was created in 2015. This is a large-scale video dataset covering a wide range of complex human activities. It provides 203 action categories in a total of 849 hours of video data. This dataset is particularly helpful for training classifiers that require a huge amount of training data, such as deep neural networks. According to the results reported in [176], the authors achieved 42.2% accuracy on untrimmed video classification and 50.2% on trimmed video classification. They used deep features (DF), motion features (MF), and static features (SF), as shown in Table 3. Some example frames from this dataset are shown in Figure 19.

Applications
There are numerous applications of human activity recognition methods. These include but are not limited to intelligent video surveillance, entertainment, ambient assisted living, human-robot interaction, and intelligent driving. These applications, along with the state-of-the-art methods and techniques developed for them, are discussed in the subsequent sections.

Intelligent Video Surveillance
Traditional security surveillance systems use many cameras and require laborious human monitoring for video content analysis. On the other hand, intelligent video surveillance systems are aimed at automatically tracking individuals or a crowd and recognizing their activities. Such systems can also detect suspicious or criminal activities and report them to the authorities for immediate action. In this way, the workload of security personnel can be reduced and security events can be alerted, which can help prevent dangerous situations.
Different techniques have been proposed for video surveillance systems. In [178], a visual surveillance system was proposed for tracking and detection of moving objects in real time. A novel method for tracking pedestrians in crowded video scenes using the local spatio-temporal patterns exhibited by the pedestrians was proposed in [179]. It used a collection of Hidden Markov Models (HMMs) trained on local spatio-temporal motion patterns. By employing this representation, the authors were able to predict the next local spatio-temporal patterns of the tracked pedestrians based on the observed frames of the videos. In [180], a framework was developed for anomaly detection and automatic behaviour profiling without manual labelling of the training dataset. This framework consists of the different components required to develop an effective behaviour representation for event detection.
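The HMM-based scheme of [179] rests on scoring observed motion-pattern sequences against trained models. As a hedged illustration (toy parameters, not those of [179]), the standard scaled forward algorithm computes the log-likelihood of a discrete observation sequence under an HMM:

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm for a discrete-observation HMM.

    obs: sequence of observation symbol indices
    pi:  (S,) initial state probabilities
    A:   (S, S) transition matrix, A[i, j] = P(next state j | state i)
    B:   (S, V) emission matrix, B[i, v] = P(symbol v | state i)
    Returns log P(obs | model).
    """
    alpha = pi * B[:, obs[0]]           # joint prob. of state and first symbol
    c = alpha.sum()
    log_lik = np.log(c)
    alpha /= c                          # rescale to avoid numerical underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate states, absorb next symbol
        c = alpha.sum()
        log_lik += np.log(c)
        alpha /= c
    return log_lik
```

An observed pattern is then matched to (or predicted from) whichever trained HMM in the collection yields the highest log-likelihood.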
A real-time approach for novelty detection and anomaly recognition in video surveillance systems was proposed in [181]. It is based on an on-line clustering technique for anomaly detection in videos using a two-step process. In the first step, recursive density estimation (RDE) with a frame-wise Cauchy-type kernel was used for novelty detection. In the second step, multi-feature on-line clustering of trajectories was used for the identification of anomalies in the video stream. Detailed surveys on video surveillance can be found in [182,183].
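The RDE idea can be sketched as follows: the density of each incoming frame-feature vector relative to all previous frames is updated recursively with a Cauchy-type kernel, and frames of low density are flagged as novel. The feature vectors and the threshold below are illustrative assumptions, not the exact formulation of [181]:

```python
import numpy as np

class RecursiveDensityEstimator:
    """Sketch of recursive density estimation with a Cauchy-type kernel."""

    def __init__(self, threshold=0.5):
        self.k = 0
        self.mean = None      # running mean of feature vectors
        self.sq = 0.0         # running mean of squared vector norms
        self.threshold = threshold

    def update(self, x):
        """Absorb one frame-feature vector; return (density, is_novel)."""
        x = np.asarray(x, dtype=float)
        self.k += 1
        if self.k == 1:
            self.mean = x.copy()
            self.sq = float(x @ x)
            return 1.0, False               # first frame defines the model
        w = (self.k - 1) / self.k
        self.mean = w * self.mean + x / self.k
        self.sq = w * self.sq + (x @ x) / self.k
        diff = x - self.mean
        # Cauchy-type kernel: distance to mean plus accumulated scatter
        density = 1.0 / (1.0 + diff @ diff + self.sq - self.mean @ self.mean)
        return float(density), bool(density < self.threshold)
```

Feeding the estimator a stream of similar frames keeps the density near 1; a frame far from everything seen so far drives the density down and triggers the novelty flag.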

Ambient Assisted Living
HAR-based systems have important applications in ambient assisted living (AAL). These systems are used as healthcare systems to understand and analyse patients' activities, facilitating health workers in treatment, diagnosis, and general healthcare of patients. In addition, these systems are also useful for smart homes, for example to monitor the daily life activities of elderly people, providing them with a safe, independent, and comfortable stay. Usually, these systems capture the continuous movement of elderly people, automatically recognize their activities, and detect any abnormality as it happens, such as falling down, having a stroke, or respiration issues.
Among these abnormal activities, a 'fall' is a major cause of fatal injury, especially for elderly people, and is considered a major hindrance to independent living. Different methods have been proposed in the literature for daily life monitoring of elderly people [184]. Fall detection approaches are divided into three types: ambience-device-based, wearable-device-based, and vision-based. Among these, vision-based approaches are more popular and offer multiple advantages compared to ambience-device-based and wearable-device-based approaches. In [184], a method for fall detection was proposed using a combination of the integrated time motion image (ITMI) and the Eigen space method. The ITMI is a spatio-temporal database which contains motion information and its time stamps, and the Eigen space technique was used for feature reduction. The reduced feature vector was then passed to a neural network for activity recognition. A classification method for fall detection using the deformation of the human shape was proposed in [185]. In this method, edge points were extracted from the silhouettes using the Canny edge operator for matching two shapes. Procrustes analysis and the mean matching cost were applied for shape analysis. Finally, a fall is characterised by a peak in the smoothed mean matching curve or Procrustes curve. A detailed survey of fall detection approaches can be found in [186].
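The Eigen space step in [184] is essentially a principal component projection of high-dimensional motion features. A hedged sketch using plain SVD-based PCA (the original may differ in details such as how many components are retained):

```python
import numpy as np

def eigenspace_reduce(X, n_components):
    """Project row-vector features X (n_samples, n_features) onto the
    top principal components (the 'Eigen space')."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # centre the data
    # SVD of the centred data: rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]             # top-k eigenvectors
    return Xc @ components.T, components, mean
```

The reduced feature vectors would then be passed to the neural network classifier, as in [184].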
Monitoring patients' activities remotely is very important for assessing their well-being when they are shifted from hospital to their homes. Homes equipped with such monitoring facilities are known as smart homes. Different techniques have been proposed in the literature for smart homes [187]. Besides vision-based techniques, a number of sensor-based techniques have also been presented in the literature [188]. In [189], a genetic programming-based classifier was proposed for activity recognition in a smart home environment. It combined the measurement-level decisions of different classifiers, including an Artificial Neural Network (ANN), a Hidden Markov Model (HMM), and a Support Vector Machine (SVM), according to the weights assigned to each activity class. The weights are optimized using genetic programming for each classifier, where they are represented as chromosomes in the form of strings of real values. The results indicate better performance of the ensemble classifier compared to a single classifier.
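The fusion step of such an ensemble can be sketched as a per-class weighted sum of the base classifiers' scores. In [189] the weights are evolved by genetic programming; here they are simply supplied as toy values to illustrate the combination rule:

```python
import numpy as np

def weighted_ensemble_predict(score_list, weights):
    """Measurement-level fusion of base classifiers.

    score_list: list of (n_samples, n_classes) score matrices,
                one per base classifier
    weights:    (n_classifiers, n_classes) per-classifier, per-class weights
    Returns the predicted class index for each sample.
    """
    # weighted sum of scores, class by class, across classifiers
    fused = sum(w * s for w, s in zip(weights, score_list))
    return fused.argmax(axis=1)
```

A GP search would then evaluate candidate weight matrices by the fused accuracy on a validation set and keep the fittest chromosomes.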

Human-Robot Interaction
Vision-based activity recognition has important applications in human-robot interaction (HRI). Giving robots the ability to recognize human activities is very important for HRI. This makes robots useful in industrial setups as well as in domestic environments as personal assistants. In the domestic environment, one application of HRI can be seen in humanoid robots that recognize human emotions from sequences of images. A method based on neural networks was proposed in [190] for a humanoid robot. In this method, six basic emotions (neutral, happiness, sadness, fear, disgust, and anger) were recognized from facial expressions, and the topics embedded in the conversation were also identified and analysed. This method is effective for recognizing basic emotions but not for compound emotions such as a combination of anger and surprise. In [191], an activity recognition method was proposed for HRI in industrial settings. It was based on spatial and temporal features extracted from skeletal data of human workers performing an assembly task. The 3D coordinates of skeletal joints were acquired from Red Green Blue-Depth (RGB-D) data with the help of two Microsoft Kinect sensors. In order to select the best features, the random forests algorithm was applied. Finally, three groups of activities (movement, gestures, and object handling) were recognized using a hidden Markov model. Results indicated an average accuracy of 74.82%. However, this method was unable to recognize more complex activities such as entering and leaving the scene.
Since robots are becoming part of our lives, it is very important that robots should understand human's emotions, intentions, and behaviour.This problem is termed as robot-centric activity recognition.A method for robot-centric activity recognition was proposed in [192] from the videos recorded during HRI.The purpose of this method was recognition of the human activities from the actor's own viewpoint.Unlike conventional third-person activity recognition, the actor (robot) wearing the camera, was involved in ongoing activity.This is not only recognition of activity in real time but also recognizing the activity before its completion, which is really a challenging task.A method for this type of activity recognition was proposed in [193] from RGB-D videos captured by the robot while physically interacting with the humans.The authors used four different descriptors (spatio-temporal points in RGB and depth data, 3D optical flow, and body posture descriptors) as features.With the combination of these descriptors and SVM, authors achieved recognition accuracy of 85.60%.
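Combining heterogeneous descriptors before classification, as in [193], is commonly done by L2-normalising each descriptor and concatenating them so that no single descriptor dominates the fused feature vector. This is a generic sketch of that idea, not necessarily the exact scheme of [193]:

```python
import numpy as np

def fuse_descriptors(*descriptors):
    """Early fusion: L2-normalise each per-video descriptor, then
    concatenate into one feature vector for the classifier (e.g., an SVM)."""
    parts = []
    for d in descriptors:
        d = np.asarray(d, dtype=float)
        norm = np.linalg.norm(d)
        # each normalised part has unit length before concatenation
        parts.append(d / norm if norm > 0 else d)
    return np.concatenate(parts)
```

The fused vectors would then be used to train a multiclass SVM over the activity categories.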

Entertainment
Human activity recognition systems are used for recognition of entertainment activities such as dance [114] and sports [194]. In [194], an object-based method was proposed for sports video sequences. In this method, bowling, pitching, golf swing, downhill skiing, and ski jump actions were recognized. Dynamic Bayesian Networks (DBNs) were employed for recognition of these actions.
The modelling of a player's actions in a game has received much attention from the sports community in recent years due to its important implications. In addition, modelling the behaviour of a player in real time can be helpful in adapting to changes in the game as they occur. In [195], a method for tracking the behaviour of players during interaction with a game was proposed using an incremental learning technique. The authors used a stream mining change detection technique based on novelty detection and incremental learning, and performed a set of simulations on the commercial game UT2004.

Intelligent Driving
Human activity recognition techniques are also employed to assist drivers by providing different cues regarding the state of the driver while driving a vehicle. It has been reported that secondary tasks performed by drivers, such as answering the phone, sending or receiving text messages, or eating or drinking while operating a vehicle, cause inattentiveness which may lead to accidents [196,197]. A multi-modal vision framework for driver activity recognition was proposed in [198]. This method extracted head, eye, and hand cues to describe the state of the driver. These cues were fused using a support vector machine for activity classification. Another technique for driver activity recognition was proposed in [199]. This method is based on head and eye tracking of drivers while operating the vehicle.

Conclusions
In this review, we provide a comprehensive survey of state-of-the-art human action representation and recognition approaches, including both handcrafted and learning-based representations. Handcrafted action representation approaches have been around for quite a long time. These approaches have achieved remarkable results on different publicly available benchmark datasets. However, the most successful handcrafted representation methods are based on local densely-sampled descriptors, which incur a high computational cost. In these approaches, important features are extracted from the sequence of image frames to build a feature vector using human-engineered feature detectors and descriptors. Classification is then performed by training a generic classifier. These approaches include space-time, appearance, local binary pattern, and fuzzy logic-based approaches.
On the other hand, learning-based action representation approaches use trainable feature extractors followed by a trainable classifier, which leads to the concept of end-to-end learning, i.e., learning from the pixel level to the identification of action categories. This eliminates the need for the handcrafted feature detectors and descriptors used for action representation. These approaches include evolutionary (GP-based), dictionary learning, and deep learning-based approaches. Recently, the research community has paid a lot of attention to these approaches, mainly due to their high performance compared to their handcrafted counterparts on some challenging datasets. However, fully data-driven deep models, often referred to as "black boxes", have some limitations. Firstly, it is difficult to incorporate problem-specific prior knowledge into these models. Secondly, some of the best performing deep learning-based methods are still dependent on handcrafted features, and the performance of pure learning-based methods is still not up to the mark. This is mainly due to the unavailability of huge datasets for action recognition, unlike in object recognition where a huge dataset such as ImageNet is available. Recently, the large-scale dataset ActivityNet has been developed, which is expected to fill this gap. This dataset contains over 200 action categories with 849 hours of video data.
In order to provide further insight into the field, we have presented the well-known public datasets for activity recognition. These datasets include: KTH, Weizmann, IXMAS, UCF Sports, Hollywood2, YouTube, HMDB-51, UCF-101, and ActivityNet. In addition, we have also presented the important applications of human activity recognition, such as intelligent video surveillance, ambient assisted living, human-robot interaction, entertainment, and intelligent driving.

Figure 1. Categorization of different levels of activities.

Figure 2. Example of handcrafted representation-based approach.

Figure 3. Example of learning-based representation approach.

Figure 4. Traditional action representation and recognition approach.

Figure 6. Example of fuzzy view estimation framework.

The stratified pooling architecture comprises: (a) adjustment of a pre-trained CNN on the target dataset; (b) extraction of features at frame level; (c) principal component analysis (PCA) for dimensionality reduction; (d) stratified pooling of frame-level features into video-level features; and (e) an SVM for multiclass classification. This architecture is shown in Figure 10.

Figure 10. An example of stratified pooling with CNN.

Figure 11. One frame example of each action in the Weizmann dataset.

Figure 12. One frame example of each action from four different scenarios in the KTH dataset.

Figure 13. One frame example for each action from five different camera views in the IXMAS (INRIA Xmas Motion Acquisition Sequences) dataset.

Figure 17. Exemplar frames from the sports action dataset.

Figure 18. Exemplar frames of 11 sports actions from the YouTube action dataset.

Table 1. Comparison of Space-Time-based approaches for activity recognition on different datasets.

Table 2. Comparison of appearance, LBP (Local Binary Pattern), and fuzzy logic-based approaches.

Table 3. Comparison of learning-based action representation approaches.

HMDB-51 is one of the largest datasets available for activity recognition, developed by the Serre Lab, Brown University, USA, in 2011. It consists of 51 types of daily life actions comprising 6849 video clips collected from different sources such as movies, YouTube, and Google videos.

Table 4. Well-known public datasets for human activity recognition.
