A Discriminative Framework for Action Recognition Using f-HOL Features

Inspired by the success of Histogram of Oriented Gradients (HOG) features in many vision tasks, this paper presents a compact feature descriptor for action recognition called the fuzzy Histogram of Oriented Lines (f-HOL), a distinct variant of the HOG descriptor. The idea behind these features rests on the observation that the slide area of the human body skeleton can be viewed as a spatiotemporal 3D surface when an action is observed in a video. The f-HOL descriptor is robust to small geometric transformations, since small translations and rotations cause no large fluctuations in the histogram values, and it is also relatively insensitive to varying illumination conditions. The extracted features are fed into a discriminative conditional model based on Latent-Dynamic Conditional Random Fields (LDCRFs) that learns to recognize actions from video frames. When tested on the benchmark Weizmann dataset, the proposed framework compares favorably with most existing state-of-the-art approaches, achieving an overall recognition rate of 98.2%. Furthermore, owing to its low computational demands, the framework is well suited to integration into real-time applications.


Introduction
In recent years, automatic recognition of human activities from both still images and video sequences has attracted tremendous research interest, owing to its potential for applications in many fields [1,2]. Among these applications, perhaps the most widespread is Human Computer Interaction (HCI), where no explicit user actions (e.g., keystrokes and mouse clicks) are available to capture user input. Instead, interactions are more likely to occur through human actions and/or gestures [3]. Recognizing human actions is effortless for human beings, but it remains a very challenging task for computers. Action recognition shares many challenges with other tasks in computer vision and pattern recognition, such as object detection and tracking and motion recognition. These common challenges include illumination conditions, occlusions, background clutter, object deformations, intra-/inter-class variations, pose variations, and camera point of view.
Furthermore, while human vision recognizes human actions from video data efficiently and with a high degree of accuracy, it is arduous for a computer to do so in a similar manner. First, the same action is performed by different people at different speeds, which poses a serious technical problem for any automatic system striving to recognize actions reliably. Moreover, moving shadows caused by poor lighting conditions can degrade the tracking of human body parts. Some body parts may be occluded owing to the camera viewpoint, which adds further difficulty to the task. In addition, small moving objects in the background (i.e., distractors) are another problematic issue. For example, in a crowded street scene, swinging trees and/or blinking shop advertisements in the background complicate motion detection and tracking.
While early research into human motion modeling and recognition dates back to the pioneering work of Johansson [4] in the mid-1970s, research on human action recognition did not move to the forefront until the early 1990s. About a decade later, by the end of the 1990s, research in human action recognition was still in its infancy [5,6]. The past ten years or so have witnessed an increasing number of research efforts in human action recognition. However, the resulting improvements have been modest, and off-the-shelf technology for human action recognition is still far from mature. Experimental action recognition systems are now appearing at a very limited number of locations (e.g., airports and other public places).
The primary objective of this paper is to improve the recognition of human actions in video sequences by developing an efficient and reliable method for modeling and recognizing them. In our approach, we present a new feature descriptor called the fuzzy Histogram of Oriented Lines (f-HOL), a distinct variant of Histogram of Oriented Gradients (HOG) features. The extracted features are then used by a discriminative Latent-Dynamic Conditional Random Fields (LDCRFs) model to recognize actions from the video frames. A set of validation experiments is conducted on the benchmark Weizmann action recognition dataset. The preliminary results achieved with this approach are promising and compare very favorably with those published by other investigators.
The rest of this paper proceeds as follows. In Section 2, we present an overview of related work to provide relevant background on the problem domain. The architecture of the proposed framework for human action recognition is detailed in Section 3. Section 4 summarizes our experiments, comparing the results achieved by the proposed approach with those of similar state-of-the-art methods from the literature. Finally, we conclude and suggest possible directions for future research in Section 5.

Related Literature
Over the course of nearly two decades, a large body of literature on video-based human action recognition has been reported by researchers in computer vision and pattern recognition [3,7-11], motivated by a wide spectrum of real-world applications, such as intelligent human-computer interfaces, detection of abnormal events, video retrieval, and autonomous video surveillance. Broadly speaking, human actions can be recognized using various visual cues, such as motion [12,13] and shape [14]. Literature surveys reveal a growing corpus of prior work on human action recognition that uses spatial-temporal keypoints and local feature descriptors [1,15,16]. The local features are extracted from the region around each keypoint found by the keypoint detection process. These features are then quantized to provide a discrete set of visual words before they are fed into the classification module.
In [9], Blank et al. define actions as 3D space-time shapes generated by accumulating the detected foreground human figures, assuming a fixed camera and known background appearance. Their method depends heavily on moving object tracking, where various space-time features (e.g., local space-time saliency, action dynamics, shape structures, and orientation) are extracted for action recognition. Another thread of research analyzes patterns of motion to recognize human actions. For instance, in [17], the authors analyze the periodic structure of optical flow patterns for gait recognition. In the same vein, in [18], periodic motions are detected and classified to recognize actions from video sequences. Alternatively, several researchers have proposed using both motion and shape cues. For example, in [19], the authors detect the similarity between video segments using a space-time correlation model. In [20], Bobick and Davis use temporal templates, including motion-energy images and motion-history images, to recognize human movement. Rodriguez et al. [21] present a template-based approach using a Maximum Average Correlation Height (MACH) filter to capture intra-class variabilities. Moreover, in more recent work, Fernando et al. [22] propose an approach that represents action videos using a ranking machine. More specifically, given the frame descriptors, the action video is represented by the hyperplane that ranks the frames according to their temporal order.
Furthermore, in [23], Kantorov and Laptev exploit optical flow measurements from videos to encode pixel motion. As a spatiotemporal descriptor for action representation, they use the Histogram of Optical Flow (HoF) over local regions. Since measuring optical flow is computationally intensive, the authors opted to use video decompression techniques. More specifically, rather than obtaining the HoF descriptor (or its more recent extension, the MBH descriptor) from an estimate of the original optical flow fields, they make use of the motion fields in MPEG compression. This motion field, the so-called MPEG Flow, can be obtained virtually for free during video decoding.
A substantial amount of research has also been conducted on modeling and understanding human motion by constructing elaborate temporal dynamic models [24,25]. In addition, an increasing body of research reports promising results using generative topic models for visual recognition based on the so-called Bag-of-Words (BoW) models. The key concept of a BoW model is that video sequences are represented by counting the number of occurrences of descriptor prototypes, the so-called "visual words". Topic models are built and then applied to the BoW representation. Three of the most widely used topic models are Correlated Topic Models (CTM) [26], Latent Dirichlet Allocation (LDA) [27], and probabilistic Latent Semantic Analysis (pLSA) [28].

Proposed Methodology
In this section, the proposed approach for video-based action recognition is described. A step-by-step overview of the proposed methodology is presented in Figure 1. As illustrated in the figure, the general framework of our approach proceeds as follows. The first step consists of separating moving objects (i.e., human body parts) from the background in a given video sequence. Adaptive background subtraction is well suited to this motion detection goal. The segmented body parts can be further refined by standard morphological operations, such as erosion and dilation. Then, a set of low-dimensional local features is extracted from the silhouettes of the moving objects. Finally, a 3D vector representation is formed from the extracted features and fed into an LDCRF classifier for action classification. The details of each part of the proposed technique are described in the remainder of this section.

Background Subtraction and Shadow Removal
In this step, we aim at robust background subtraction, including shadow detection and removal, to track human movements in video sequences. To achieve this goal, we employ an effective algorithm for background modeling, subtraction, update, and shadow removal [29]. The fundamental idea of the algorithm consists of modeling the background color distribution with adaptive Gaussian mixtures, coupled with color-based shadow detection. Adaptive Gaussian mixtures are employed in the combined input space of luminance-invariant color to distinguish moving foreground objects from their moving cast shadows in video sequences.
In its most general form, background subtraction is a commonly used paradigm for detecting moving objects in a scene captured by a stationary camera. It involves two distinct processes that operate in a closed loop: background modeling and foreground detection. A standard approach to background modeling constructs a model of the background in the field of view of the camera. The background model is then periodically updated to account for illumination changes. In foreground detection, a decision is made as to whether each observation fits the background model. The resulting change label field is fed back into background modeling so that no foreground intensities contaminate the background model. With respect to Gaussian mixtures, it is worth noting that Gaussian Mixture Models (GMMs) are an instance of a larger class of density models that have several functions as additive components [30].
In this work, Gaussian mixture models are used to model the background. In this model, each pixel in the scene is modeled by a mixture of $K$ Gaussian distributions ($K$ is usually set from three to five; we used $K = 3$ in our experiments). The persistence and variance of each Gaussian in the mixture are used to determine which Gaussians most probably correspond to background colors. Pixels whose color values do not fit the background distributions are detected as part of the foreground, i.e., as moving objects. More formally, let $\{X_1, \ldots, X_t\}$ be the history associated with a pixel at time $t$, where $X_i$ ($i = 1, \ldots, t$) are measurements of the RGB (Red, Green, Blue) values at time $i$. The recent history of each pixel can be modeled reasonably well with a mixture of $K$ Gaussian distributions. Thus, the probability of observing the current pixel value is defined as follows:

$$P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\, \eta(X_t, \mu_{i,t}, \Sigma_{i,t}),$$

where $\omega_{i,t}$, $\mu_{i,t}$, and $\Sigma_{i,t}$ are estimates of the weight, the mean value, and the covariance matrix of the $i$-th Gaussian in the mixture at time $t$, respectively, and $\eta$ is a Gaussian probability density function:

$$\eta(X_t, \mu, \Sigma) = \frac{1}{(2\pi)^{3/2}\, |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(X_t - \mu)^{\top} \Sigma^{-1} (X_t - \mu)\right).$$

Assuming that the color channels are independent and share the same variance, $\Sigma_{i,t}$ can be expressed as:

$$\Sigma_{i,t} = \sigma_{i,t}^{2}\, I.$$

Thereafter, an online approximation is used to update the model iteratively as follows. At each pixel, all parameters of the best-matching Gaussian are updated via an online K-means approximation, whereas for the other Gaussians only the weights are updated, while their means and variances remain unchanged (full details of the model parameter updates can be found in [5]). Figure 2 shows sample results for a scene containing a human when the video segmentation technique with the Gaussian mixture model (GMM) and shadow removal is applied, where the moving object is extracted along with its shadow.
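As a rough illustration of the online update described above, the following minimal sketch maintains the mixture for a single grayscale pixel. It is not the implementation used in this work: the function name `update_pixel_mixture`, the learning rate `alpha`, the matching rule (a value matches a component if it lies within `match_sigmas` standard deviations of its mean), the initial variance `init_var`, and the variance floor `min_var` are all assumptions chosen for illustration, and a single learning rate is used for both weights and component parameters.

```python
import math


def update_pixel_mixture(x, weights, means, variances,
                         alpha=0.05, match_sigmas=2.5,
                         init_var=900.0, min_var=4.0):
    """Update a per-pixel Gaussian mixture in place with one new intensity x.

    Returns the index of the matched component, or -1 if no component
    matched (the pixel would then be treated as foreground).
    All default parameter values are illustrative assumptions.
    """
    k_range = range(len(weights))
    # Check the most background-like components first (large weight, small spread).
    order = sorted(k_range,
                   key=lambda k: weights[k] / math.sqrt(variances[k]),
                   reverse=True)
    matched = -1
    for k in order:
        if abs(x - means[k]) <= match_sigmas * math.sqrt(variances[k]):
            matched = k
            break

    if matched < 0:
        # No component explains x: replace the least probable component.
        worst = order[-1]
        weights[worst], means[worst], variances[worst] = alpha, float(x), init_var
    else:
        # All weights are updated; only the matched component's mean and
        # variance change, the others keep theirs.
        for k in k_range:
            weights[k] = (1.0 - alpha) * weights[k] + (alpha if k == matched else 0.0)
        means[matched] += alpha * (x - means[matched])
        diff = x - means[matched]
        variances[matched] = max(
            min_var, (1.0 - alpha) * variances[matched] + alpha * diff * diff)

    # Keep the weights a proper distribution.
    total = sum(weights)
    for k in k_range:
        weights[k] /= total
    return matched
```

In a full pipeline this function would be called once per pixel and per frame, and a pixel would be labeled background only if its matched component is among those with the highest weight-to-spread ratio.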
It is generally recognized that a color model able to separate the brightness component from the chromaticity component can robustly remove shadows from images [31]. Formally, for a given pixel, let $B_t = [\mu_r, \mu_g, \mu_b]$ be the expected background value, estimated from a set of initial training frames. For each incoming frame $X_t = [x_r(t), x_g(t), x_b(t)]$, the brightness distortion $\alpha_t$ and the chromaticity distortion $\chi_t$ are computed with respect to the background value. From a geometrical point of view, in RGB space, the chromaticity distortion $\chi_t$ can be seen as the length of the vector that goes from the pixel value $X_t$ to the plane perpendicular to the line connecting the background value $\mu$ with the zero-intensity point. It is a scalar that measures how much the pixel color deviates from the background color. The chromaticity distortion is normalized by dividing it by its variation evaluated during the training phase over the $\tau$ training frames. In this context, the pixel noise is used to normalize the variation between the pixel value and the background value. Hence, the full statistical model allows a single threshold $\tau_\chi$ to be applied to the variation. Following the original approach, a suitable value for the threshold can be gauged by plotting the histogram of the normalized chromaticity distortion over the training sequence, which is free of foreground perturbations. In practice, the detection rate observed by the user plays a crucial role in choosing the value of $\tau_\chi$. Depending on their normalized brightness and chromaticity distortions, pixels are then categorized into four classes: background, cast shadow, highlight, or foreground. Pixels with a small normalized brightness distortion and a small normalized chromaticity distortion belong to the background, whereas pixels that have a small normalized chromaticity distortion and a lower or higher brightness than the background value are labeled as cast shadow or highlight, respectively. All remaining pixels are assigned to the foreground, i.e., to moving objects. More specifically, a pixel is detected as a cast shadow only if it satisfies both constraints jointly: its normalized chromaticity distortion is small and its brightness is lower than that of the background. After shadow elimination, some small isolated regions and noise may remain in the binary image and affect the extracted foreground objects. Therefore, a median filter is first applied to suppress small artifacts and remove noise.
Afterwards, an adaptive "close-opening" filter (i.e., a closing operator followed by an opening operator) is applied to improve the quality of the segmentation and preserve the edges of the moving objects. Structuring elements of size 3 × 3 and 5 × 5 were tried. Finally, a typical blob analysis process scans the entire segmented image to detect all of the moving objects (or blobs) in the image and builds a detailed report on each moving object [32].
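The pixel labeling rule described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: the brightness distortion is taken to be the scale factor that projects the pixel onto the line through the origin and the background value, the chromaticity distortion is taken to be the residual distance to that line, the distortions are left unnormalized, and the thresholds `chroma_thresh` and `bright_tol` are hypothetical values rather than those used in the experiments. The background value is assumed to be non-black.

```python
import math


def distortions(pixel, background):
    """Return (brightness distortion, chromaticity distortion) for one RGB pixel.

    Assumed definitions: alpha scales the background value so that it is as
    close as possible to the pixel; chroma is the remaining distance.
    """
    dot = sum(p * b for p, b in zip(pixel, background))
    norm_sq = sum(b * b for b in background)  # must be non-zero
    alpha = dot / norm_sq
    chroma = math.sqrt(sum((p - alpha * b) ** 2
                           for p, b in zip(pixel, background)))
    return alpha, chroma


def classify_pixel(pixel, background, chroma_thresh=15.0, bright_tol=0.1):
    """Label a pixel as background, cast shadow, highlight, or foreground."""
    alpha, chroma = distortions(pixel, background)
    if chroma > chroma_thresh:
        return "foreground"      # color differs too much from the background
    if abs(alpha - 1.0) <= bright_tol:
        return "background"      # similar color and similar brightness
    # Similar color, but darker or brighter than the background.
    return "cast shadow" if alpha < 1.0 else "highlight"
```

For example, with a background value of `(100, 120, 140)`, a pixel of `(50, 60, 70)` keeps the background color at half the brightness and is labeled a cast shadow.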

Feature Extraction
For successful action feature extraction, accurate segmentation of the action silhouette points is essential. In this section, we present a compact representation for action recognition based on skeleton dynamics that captures several kinds of action information, such as static pose, motion, and overall dynamics. To represent a human action in full, we propose capturing body shape variations from the action skeleton. To this end, we focus on the gradients of the one-dimensional line representation of action skeletons, namely dynamic features.
Generally, the skeletonization process aims at reducing foreground regions (i.e., objects of interest) in a segmented binary image to a skeletal remnant that maintains the connectivity of the objects in the original image. The skeleton is a "central spine", i.e., a 1D line representation of a 2D object. The skeleton retains the topology of the original shape, and the original shape can be reconstructed from its skeleton. To find the skeleton of a given silhouette image, a set of morphological operations is applied to the segmented silhouette. First, several morphological dilation operations are applied, followed by the same number of morphological erosion operations. Applying dilation followed by erosion fills the holes generated during the segmentation process, which improves the skeletonization and restores the shape of the human body. As a result, we obtain the skeletons of the moving human body parts. We implemented this step mainly with the real-time OpenCV library and its C++ interface. Figure 3 shows a sample sequence of skeleton images for a video sequence of a running action. Once the binary skeletons of a given action video have been obtained, the probabilistic Hough transform algorithm described in [33] can be employed to detect straight lines in each frame of the action video. An example of straight-line detection in a video of a running action is given in Figure 4. It is worth noting that our line detection function is less expensive than the traditional OpenCV Hough detectors.
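The authors' OpenCV/C++ routine is not reproduced here. As a stand-in, the sketch below computes a classical morphological skeleton (the union, over successive erosions, of the pixels removed by an opening) in plain Python on a small binary grid. The cross-shaped structuring element and the function names are assumptions made for illustration. Note that, unlike thinning-based methods, this particular construction does not guarantee a connected skeleton.

```python
CROSS = ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))  # (dy, dx) offsets


def _pixel(image, y, x):
    """Read a pixel, treating everything outside the image as background."""
    if 0 <= y < len(image) and 0 <= x < len(image[0]):
        return image[y][x]
    return 0


def erode(image):
    """Keep a pixel only if it and its four neighbors are all foreground."""
    return [[int(all(_pixel(image, y + dy, x + dx) for dy, dx in CROSS))
             for x in range(len(image[0]))] for y in range(len(image))]


def dilate(image):
    """Set a pixel if it or any of its four neighbors is foreground."""
    return [[int(any(_pixel(image, y + dy, x + dx) for dy, dx in CROSS))
             for x in range(len(image[0]))] for y in range(len(image))]


def morphological_skeleton(image):
    """Return the morphological skeleton of a binary image (lists of 0/1)."""
    height, width = len(image), len(image[0])
    skeleton = [[0] * width for _ in range(height)]
    current = [row[:] for row in image]
    while any(any(row) for row in current):
        eroded = erode(current)
        opened = dilate(eroded)  # opening of the current image
        for y in range(height):
            for x in range(width):
                # Pixels that the opening cannot recover belong to the skeleton.
                if current[y][x] and not opened[y][x]:
                    skeleton[y][x] = 1
        current = eroded
    return skeleton
```

Applied to a thick horizontal bar, the function returns the bar's center line together with its four corner pixels; on real silhouettes, the hole-filling step described above would be applied first.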
For local feature extraction, each video clip (i.e., action snippet) is first divided into several time slices defined by linguistic intervals. Each interval $j$ is described by a Gaussian membership function with center $\varepsilon_j$, width $\sigma$, and fuzzification factor $m$, where $s$ is the total number of time slices. It is important to note that all of these Gaussian membership functions are chosen to have the same shape, such that their cumulative sum is unity at every instant of time. In this work, we aim at introducing a new action representation based on computing rich descriptors from detected straight lines, which capture more local spatiotemporal information. A human action is generally composed of a sequence of temporal poses. Thus, a reasonable estimate of an action pose can be constructed from a finite set of detected lines. Formally, let $L$ be the set of line segments detected by the probabilistic Hough transform in an action snippet at time instant $t$. Each line is represented by a four-element vector $(x_1, y_1, x_2, y_2)$, where $(x_1, y_1)$ and $(x_2, y_2)$ are the end points of the detected line segment. The magnitude and orientation of the gradient are then calculated as:

$$\rho = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}, \qquad \theta = \arctan\!\left(\frac{y_2 - y_1}{x_2 - x_1}\right).$$

To form a local feature descriptor, the so-called fuzzy Histogram of Oriented Lines (f-HOL) is constructed for a given action snippet. A separate f-HOL is computed for each time slice, in which gradients whose magnitude falls below a predetermined threshold $\rho_m$ are removed. All the resulting 1D histograms are then normalized to achieve robustness to scale variations. Finally, these normalized histograms are concatenated into a single histogram to form the local feature vector of the action snippet. Figure 5 displays a sample of three action sequences from the Weizmann action dataset and their corresponding smoothed f-HOL descriptors. From top to bottom, the actions are "Bend", "Jack", and "Jump", respectively.

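To make the construction of the descriptor more concrete, the following sketch builds an f-HOL-like feature vector from per-frame line segments. It is a hedged illustration, not the exact formulation of this work: the membership function is assumed to have the form exp(-|(t - center) / sigma|^m) and is renormalized so that the memberships of all slices sum to one at each frame, the slice centers and width are spread uniformly over the snippet, votes are weighted by line length, and the names `slice_memberships`, `fhol`, `num_bins`, and `min_length` (standing in for the magnitude threshold) are hypothetical.

```python
import math


def slice_memberships(t, centers, sigma, m=2.0):
    """Fuzzy membership of frame t in each time slice (sums to one)."""
    raw = [math.exp(-abs((t - c) / sigma) ** m) for c in centers]
    total = sum(raw)
    return [r / total for r in raw]


def fhol(lines_per_frame, num_slices=3, num_bins=9, min_length=2.0, m=2.0):
    """Concatenate one normalized orientation histogram per fuzzy time slice.

    lines_per_frame: list over frames; each frame is a list of
    (x1, y1, x2, y2) line segments, e.g. from a Hough line detector.
    """
    num_frames = len(lines_per_frame)
    width = num_frames / num_slices          # assumed slice width
    centers = [(j + 0.5) * width for j in range(num_slices)]
    hists = [[0.0] * num_bins for _ in range(num_slices)]

    for t, lines in enumerate(lines_per_frame):
        mu = slice_memberships(t, centers, width, m)
        for x1, y1, x2, y2 in lines:
            length = math.hypot(x2 - x1, y2 - y1)
            if length < min_length:
                continue                     # discard small-magnitude lines
            # Lines are undirected, so fold the angle into [0, pi).
            angle = math.atan2(y2 - y1, x2 - x1) % math.pi
            b = min(int(angle / math.pi * num_bins), num_bins - 1)
            for j in range(num_slices):
                hists[j][b] += mu[j] * length

    descriptor = []
    for hist in hists:
        total = sum(hist)
        descriptor.extend(v / total if total > 0 else 0.0 for v in hist)
    return descriptor
```

With `num_slices=3` and `num_bins=9`, for instance, each snippet is mapped to a 27-dimensional vector regardless of its length in frames.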
Fusion of Local and Global Action Features
The previous subsection described the local features extracted from action snippets using f-HOL. Global features, on the other hand, have been applied extensively to a wide range of object recognition problems with surprisingly good results. This motivates us to extract global features and fuse them with the local features to build a more expressive and discriminative action representation. The extracted global features are based on computing the center of gravity (COG), which delivers the center of motion. Hence, the motion sequence can be obtained from the COG trajectory, where the center of motion is given by

$$\mathbf{c} = \frac{1}{n} \sum_{i=1}^{n} z_i,$$

where $z_i$ ($i = 1, 2, \ldots, n$) are the moving pixels in the current frame. The global features are potentially informative not only about the type of motion (e.g., translational or oscillatory), but also about the rate of motion (i.e., velocity). In addition, in our experiments, these features are sufficiently discriminative to distinguish, for example, between an action where motion occurs over a relatively large area (e.g., running) and an action localized in a smaller region, where only small body parts are in motion (e.g., boxing).
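The center-of-gravity computation and the velocity cue derived from its trajectory can be sketched as follows. The function names are hypothetical, the binary mask is assumed to mark moving pixels with non-zero values, and the default frame rate of 25 fps is taken from the description of the Weizmann sequences.

```python
def center_of_gravity(mask):
    """Mean (x, y) position of the moving pixels in a binary mask, or None."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None  # no motion detected in this frame
    return (sum_x / count, sum_y / count)


def cog_velocities(centers, fps=25.0):
    """Frame-to-frame COG displacement in pixels per second."""
    return [((x1 - x0) * fps, (y1 - y0) * fps)
            for (x0, y0), (x1, y1) in zip(centers, centers[1:])]
```

A translational action such as running yields velocities with a consistent sign along the horizontal axis, whereas an in-place action yields small or oscillating values.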

Action Classification
In this section, the details of the feature classification module in our action recognition system are described. Generally, the main purpose of the classification module is to classify a given action into one of a set of predefined classes, depending on the extracted features. The classification module depends on the availability of a set of previously labeled or classified actions. This set of actions is termed the training set, and the resulting learning strategy is called supervised learning. For the task of action classification, many classification techniques have been reported in the literature, including Naïve Bayesian (NB), k-Nearest Neighbor (k-NN), Support Vector Machines (SVMs), Neural Networks (NN), and Conditional Random Fields (CRFs). In this work, we opted for Latent-Dynamic Conditional Random Fields (LDCRFs) for action classification.
Because they build on CRFs, LDCRFs are discriminative models that can describe the substructure of a label and learn the dynamics between labels. Moreover, LDCRFs have been found to perform well on several large-scale recognition problems and to be superior to other learning methods (e.g., Hidden Markov Models (HMMs)) at learning relevant context and integrating it with visual observations [34,35]. LDCRFs were conceived as an extension of CRFs that learns the hidden interaction between features, and they can be interpreted as undirected graphical models capable of labeling sequential data. Therefore, they can be applied directly to sequential data, avoiding the need to window the signal. In this manner, each label (or state) suggests a specific gesture. As LDCRFs include a class label for each observation, they are able to classify unsegmented gestures. Furthermore, the LDCRF model can infer the action sequences in both the training and test phases.
Formally, the basic task of the LDCRF model, as described by Morency et al. [36], is to learn a mapping between a sequence of observations $x = x_1, x_2, \ldots, x_m$ and a sequence of labels $y = y_1, y_2, \ldots, y_m$. Each $y_j$ is the class label of the $j$-th observation in a sequence and is a member of a set $Y$ of possible class labels. A feature vector $\phi(x_j) \in \mathbb{R}^d$ is used to represent each image observation $x_j$. For each sequence, let $h = h_1, h_2, \ldots, h_m$ be a vector of substructure variables that are not observed in the training examples. Hence, these variables form a set of "hidden" variables in the model, as shown in Figure 6. Given the above definitions, a latent conditional model can be formulated as follows:

$$P(y \mid x, \theta) = \sum_{h} P(y \mid h, x, \theta)\, P(h \mid x, \theta),$$

where $\theta$ is the set of model parameters. Given a set of sequences, each labeled with its correct class name, $\{(x_i, y_i),\ i = 1, \ldots, n\}$, the training objective is then to learn the model parameters $\theta$ using the following objective function [37]:

$$L(\theta) = \sum_{i=1}^{n} \log P(y_i \mid x_i, \theta) - \frac{1}{2\sigma^2} \|\theta\|^2,$$

where $n$ is the total number of training sequences. The first term on the right-hand side represents the conditional log-likelihood of the training samples, whereas the second term is the log of a Gaussian prior with variance $\sigma^2$. To estimate the optimal model parameters, an iterative gradient ascent algorithm is employed to maximize the objective function:

$$\theta^{*} = \arg\max_{\theta} L(\theta).$$

Once the model parameters $\theta^{*}$ are learned, given an unseen (test) sample $x$, the predicted class label $y^{*}$ can be obtained via inference in the model as follows:

$$y^{*} = \arg\max_{y} P(y \mid x, \theta^{*}).$$

For further details concerning the training and inference of LDCRFs, the interested reader is referred to the full description given in [36].
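The marginalization over hidden states can be illustrated with a small sketch for a linear chain. It assumes, as in the formulation of [36], that each class label owns a disjoint set of hidden states, so that the probability of a label sequence is the total score of all hidden sequences compatible with it divided by the total score of all hidden sequences. The precomputed score tables `node_scores` and `trans_scores` stand in for the learned potentials and are hypothetical, as are all function names; the brute-force `predict` is only practical for very short sequences.

```python
import itertools
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def chain_log_score(node_scores, trans_scores, allowed):
    """Log of the summed scores of all hidden sequences with h_j in allowed[j].

    node_scores[j][h]  : score of hidden state h at position j
    trans_scores[g][h] : score of moving from hidden state g to h
    """
    alpha = {h: node_scores[0][h] for h in allowed[0]}
    for j in range(1, len(node_scores)):
        alpha = {h: node_scores[j][h]
                    + log_sum_exp([alpha[g] + trans_scores[g][h] for g in alpha])
                 for h in allowed[j]}
    return log_sum_exp(list(alpha.values()))


def label_sequence_prob(labels, node_scores, trans_scores, states_of):
    """P(y | x): hidden paths restricted to each label's states, over all paths."""
    all_states = sorted(s for group in states_of.values() for s in group)
    length = len(node_scores)
    log_z = chain_log_score(node_scores, trans_scores, [all_states] * length)
    log_num = chain_log_score(node_scores, trans_scores,
                              [states_of[y] for y in labels])
    return math.exp(log_num - log_z)


def predict(node_scores, trans_scores, states_of):
    """Brute-force arg max over label sequences (tiny examples only)."""
    candidates = itertools.product(sorted(states_of), repeat=len(node_scores))
    return max(candidates,
               key=lambda y: label_sequence_prob(y, node_scores,
                                                 trans_scores, states_of))
```

Because the hidden-state sets partition the state space, the probabilities of all label sequences of a given length sum to one, which is a convenient sanity check for the forward recursion.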

Experiments and Results
In this section, the experiments conducted to evaluate the proposed approach for action recognition are described and the obtained results are discussed. We start with a brief description of the action recognition dataset and the evaluation protocol used to assess the performance of the recognition framework. After introducing the experimental settings and the evaluation protocol, we compare the proposed approach with related existing recognition methods. In the present work, the proposed recognition framework has been tested on the Weizmann dataset, which is regarded as one of the most widely used datasets for action recognition. This dataset was first presented by Blank et al. [9] in 2005 and was thereafter made publicly available to researchers without restriction or access charge. The Weizmann dataset consists of a total of 10 action categories, namely "walk", "run", "jump-forward-on-two-legs" (or "jump" for short), "jumping-in-place-on-two-legs" (or "p-jump"), "jumping-jack" (or "jack"), "gallop side ways" (or "side"), "bend", "skip", "wave-one-hand" (or "wave1"), and "wave-two-hands" (or "wave2"). Each action is performed by nine persons. The action sequences were captured with a static camera over a static background at a rate of 25 fps, with a relatively low spatial resolution of 180 × 144 pixels and a color depth of 24 bits per pixel. It should be noted that the action video clips (i.e., the so-called action snippets) are of short duration; each snippet generally lasts only a few seconds. A sample frame for each action in the Weizmann dataset is shown in Figure 7.
In order to provide an unbiased estimate of the generalization abilities of the proposed method, the leave-one-out cross-validation (LOOCV) technique was applied for the validation process. As the name

Method                    Recognition Rate
Our method                98.20%
Fathi and Mori [7]        100.00%
Sadek et al. [41]         97.80%
Bregonzio et al. [39]     96.60%
Zhang et al. [40]         92.80%
Niebles et al. [42]       90.00%

It should finally be noted that all of the experiments reported in this work were carried out on a 3.2 GHz Intel dual-core machine with 4 GB of RAM, running Microsoft Windows 7 Professional. The recognition system was implemented using Microsoft Visual Studio 2013 Professional Edition (C++) and the OpenCV library for feature detection and classification. As a final remark, we emphasize that the most significant feature of the proposed approach is its speed; our action recognizer runs comfortably in real time in almost all of the experiments (i.e., at roughly 24 fps on average). This supports the expectation that the proposed method can be used in real-world settings and is amenable to real-time applications.

Conclusions
In this paper, a discriminative framework for action recognition has been presented, based on a novel feature descriptor called 2D f-HOL for action skeletons and a discriminative LDCRF model for feature classification. The proposed approach has been evaluated on the popular Weizmann dataset. The obtained results demonstrate that the approach performs competitively compared with most existing approaches, with an average recognition rate of 98.2%. Moreover, the approach is efficient and applicable in real-time scenarios. An important aspect of future work will be further validation of the approach on more realistic datasets that present additional technical challenges, such as object occlusion and significant background clutter.

Figure 1. A schematic diagram of the proposed methodology for action recognition.

Figure 2. An example of shadow and motion detection in a real outdoor video: (a) an input frame; (b) subtracted background and detected shadow, where foreground pixels are marked in red and shadow pixels are marked in green; (c) silhouette after removing shadow; and (d) moving object detection.

Figure 3. Sample sequence of skeleton images for a video of running action.

Figure 4. An example of straight-line detection in a video sequence of running action.

Figure 5. A sample of three actions along with their corresponding smoothed f-HOL features; from top to bottom, the actions are "Bend", "Jack", and "Jump", respectively.

Figure 6. Graphical representation of the Latent-Dynamic Conditional Random Fields (LDCRFs) model. $x_j$ represents the $j$-th observation, $h_j$ denotes the hidden state assigned to $x_j$, and $y_j$ is the class label of $x_j$. The nodes in gray circles represent the observed variables.

Figure 7. Sample actions from the benchmark Weizmann dataset.

Table 2. Comparison with a number of related results reported in recent literature.