Localized Trajectories for 2D and 3D Action Recognition

The Dense Trajectories concept is one of the most successful approaches in action recognition, suitable for scenarios involving a significant amount of motion. However, due to noise and background motion, many generated trajectories are irrelevant to the actual human activity and can potentially lead to performance degradation. In this paper, we propose Localized Trajectories as an improved version of Dense Trajectories where motion trajectories are clustered around human body joints provided by RGB-D cameras and then encoded by local Bag-of-Words. As a result, the Localized Trajectories concept provides an advanced discriminative representation of actions. Moreover, we generalize Localized Trajectories to 3D by using the depth modality. One of the main advantages of 3D Localized Trajectories is that they describe radial displacements that are perpendicular to the image plane. Extensive experiments and analysis were carried out on five different datasets.


Introduction
Human action recognition is an active research topic with several applications in surveillance and security [1], healthcare and assisted living [2,3], and human-computer interaction [4]. Nevertheless, due to large intra-class differences, viewpoint variations, occlusions and changes in lighting conditions, action recognition remains a challenging problem.
Consequently, there is a wide variety of action recognition approaches in the literature. One way to categorize them is by the area on which features are computed: global approaches, where the entire image is used to generate features [5,6], and local approaches, where specific regions of interest are selected to generate features. One of the most popular approaches in the second category is Dense Trajectories [7], in which every action is represented by a set of motion trajectories along which features are aligned and encoded using the Bag-of-Words (BoW) model [8].
Approaches based on Dense Trajectories are particularly effective when the amount of motion is high [9]. This is mainly because the images of a video are densely sampled and the sampled points are tracked to generate the trajectories. However, Dense Trajectories, by definition, include trajectories of points that are irrelevant for action recognition due to background motion, noise, etc., which introduces irrelevant information. Furthermore, Dense Trajectories are typically generated using optical flow, which fails to describe motion with radial orientation with respect to the image plane. Therefore, taking advantage of the availability of RGB-D cameras, we propose to redefine Dense Trajectories by giving them a local description power. This is achieved by clustering Dense Trajectories around the human body joints provided by RGB-D sensors; we refer to the resulting trajectories as Localized Trajectories henceforth.
The proposed approach offers two main advantages. First, since we only consider trajectories that are localized around human body joints, our approach is more robust to large irrelevant motion estimates. As a consequence, actions which have similar motion patterns but involve different body parts are more easily distinguished. Second, our approach allows the description of the relationship "action-motion-joint", i.e., an action is associated with both a type of motion and a joint location, in contrast to classical Dense Trajectories, which are described by the relationship "action-motion", where an action is associated with a type of motion only. This is done by generating features around the Localized Trajectories based on the concept of local BoWs [10]. One codebook is therefore constructed per group of Localized Trajectories. Each codebook corresponds to a specific body joint.
For a better description of radial motion, we further propose to explore Localized Trajectories using the three modalities provided by RGB-D cameras. Specifically, we introduce the 3D Localized Trajectories concept, which requires the estimation of scene flow, the displacement vector field in 3D, instead of optical flow. Coupling 3D Trajectories and the corresponding motion descriptors with Localized Trajectories offers richer localized motion information, in both lateral and radial directions, allowing a better discrimination of actions. However, scene flow estimation is generally noisier, resulting in a less accurate temporal tracking of points. Thus, we propose to construct local codebooks by sampling trajectory-aligned features based on confidence and ambiguity metrics [11].
This paper is an extended version of [12]. Compared to our previous work, the main contribution is the generalization of the proposed Localized Trajectories to 3D using RGB-D data. This extension is combined with a novel codebook construction scheme, suitable for tackling noisy feature samples. Moreover, an extensive comparison with state-of-the-art approaches is presented, along with an evaluation on multiple datasets and novel discussions and analysis.
In summary, the contributions of this paper are as follows:
1. A novel 2D Localized Trajectories concept is introduced, which utilizes body pose information in order to spatially group similar trajectories together.
2. Localized Trajectories are extended from 2D to 3D thanks to the availability of depth data, which are directly used for 3D motion estimation.
3. A novel feature selection concept for a robust codebook construction is introduced.
4. An extensive experimental evaluation on several RGB-D datasets is presented to validate the discriminative power of the proposed approach.
The remainder of the paper is organized as follows: in Section 2, a literature review of related works is given, followed by a detailed overview of background material in Section 3. The proposed approach is described in Section 4 and Section 5. In Section 6, descriptions of the different datasets, experimental setups, and results are presented. Finally, Section 7 concludes the paper and provides a perspective on future research directions.

Related Work
In this section, we present some of the established action recognition approaches in the literature. First, we focus on representations inspired by Dense Trajectories, which are directly related to our work. Then, we give a general overview of RGB-D based action recognition approaches.

Dense Trajectories Related Approaches
Initially introduced by Wang et al. [7], Dense Trajectories are classically generated by computing motion and texture features around motion trajectories. Due to their popularity, many researchers have extended this original formulation in order to enhance its performance [13,9,14,15,16].
As a first attempt, Wang et al. [13] proposed to reinforce Dense Trajectories by using the Random Sample Consensus (RANSAC) algorithm to reduce the noise caused by camera motion. In addition, they replaced the Bag-of-Visual-Words representation with Fisher Vectors.
Then, Koperski et al. [9] suggested enriching motion trajectories using depth information. They proposed a model grouping the videos into two types: videos with a high level of motion and videos with a low amount of motion. For the first group, an extension of the Trajectory Shape Descriptor [7] which includes depth information has been used, while for the second group a descriptor built on Speeded Up Robust Features (SURF) has been introduced in order to generate local depth patterns.
On the other hand, in [15], a novel approach to encode relations between motion trajectories has been presented. Global and local reference points have been used to compute Dense Trajectories, offering robustness to camera motion.
Finally, Ni et al. [16] had the idea of focusing on the trajectory groups which contribute most to a specific action by defining an optimization problem. In the same direction, Jhuang et al. [19] proposed the extraction of features around joint trajectories, increasing the discriminative power of the original Dense Trajectories approach [7].
Although all the aforementioned methods have shown their effectiveness, they unfortunately lack locality information related to the human body. This piece of information is crucial when actions include similar motion patterns performed by different body parts. For this reason, we propose a novel dense trajectory-based approach which takes into consideration the local spatial distribution of motion with respect to the human body.

Action Recognition From RGB-D Data
With the recent availability of affordable RGB-D cameras, a large effort has been made in action recognition using both RGB and depth modalities. For a more comprehensive state of the art, we refer the reader to a recent survey [20], where RGB-D based action recognition methods have been grouped into two distinct categories (according to the nature of the descriptor), namely, learned representations [21,22,23] and hand-crafted representations [11,24,25]. Since this work is concerned with the description of actions using Dense Trajectories, we mainly focus on hand-crafted approaches. In turn, they can be classified as follows: depth-based approaches, skeleton-based approaches and hybrid approaches.
The first class of methods directly extracts human motion information from depth maps [26,27,28,29,30,24,31,32,33]. The second group gathers approaches which make use of the 3D skeletons extracted from depth maps.
Compared to depth-based descriptors, skeleton-based descriptors require low computational time, are easier to manipulate and can better discriminate local motions. However, they are more sensitive to noise since they largely depend on the quality of the skeleton. Thus, to reinforce action recognition, a third class of methods, called hybrid, makes use of more than one modality. These approaches usually exploit the skeleton information to compute local features using RGB and/or depth images. These local RGB-D based features have shown noteworthy potential [11,25,41]. Inspired by this concept, which aims at computing local depth-based and RGB-based features around specific joints, we propose to adapt the same idea to Dense Trajectories, which have proven to be one of the most powerful action representations.

Background: Dense Trajectories for Action Recognition
Dense Trajectories were initially introduced by Wang et al. [7]. They are constructed by densely tracking sampled points over an RGB video stream and constructing representative features around the detected trajectories. As mentioned in Section 1, Dense Trajectories have proven to be very effective in action recognition. They mainly owe their success to the fact that they incorporate low-level motion information. Below, we overview the Dense Trajectories approach.
Let $V$ be a sequence of $N$ images. Representative points are sampled from each image on a grid with a constant step size; we denote each sampled grid position at frame $t$ as $p_t = (x_t, y_t)$. The point $p_t$ is then estimated in the next frame using a motion field $(u_t, v_t)$, derived by optical flow estimation [42], such that:
$$p_{t+1} = (x_{t+1}, y_{t+1}) = (x_t, y_t) + \kappa * (u_t, v_t), \tag{1}$$
where $\kappa$ is a median filter kernel at the position $p_{t+1}$. As a result, large motion changes between subsequent frames are smoothed. Furthermore, to avoid drifting, trajectories longer than the assigned fixed length are rejected. Applying (1) over $L$ frames results in a smoothed trajectory estimate of the point $p_t = (x_t, y_t)$.
We denote the $m$-th dense trajectory as:
$$P^m = \{p^m_{t_0}, p^m_{t_0+1}, \ldots, p^m_{t_0+L}\}, \tag{2}$$
with $m \in \{1, \ldots, M\}$, $t_0$ the first frame of the sequence $V$ and $M$ the total number of generated trajectories.
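The tracking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of [7]: the function names, the 3 × 3 median window and the nested-list flow layout (`flow[y][x] = (u, v)`) are assumptions made for the example, and the optical flow fields are taken as given.

```python
from statistics import median

def median_filtered_flow(flow, x, y, radius=1):
    """Median of the flow vectors in a (2*radius+1)^2 window centred on pixel (x, y)."""
    h, w = len(flow), len(flow[0])
    us, vs = [], []
    for yy in range(max(0, y - radius), min(h, y + radius + 1)):
        for xx in range(max(0, x - radius), min(w, x + radius + 1)):
            u, v = flow[yy][xx]
            us.append(u)
            vs.append(v)
    return median(us), median(vs)

def track_point(flows, p0, length):
    """Follow p0 through at most `length` flow fields and return the visited positions."""
    trajectory = [p0]
    x, y = p0
    for flow in flows[:length]:
        h, w = len(flow), len(flow[0])
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= xi < w and 0 <= yi < h):
            break  # the point left the image, so tracking stops early
        u, v = median_filtered_flow(flow, xi, yi)
        x, y = x + u, y + v
        trajectory.append((x, y))
    return trajectory
```

With a constant flow of (1.0, 0.5) pixels per frame, a point starting at (1, 1) is moved to (4, 2.5) after three frames, which is the behaviour expected from repeatedly applying the update rule.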
The set of $M$ trajectories generated in (2) is used to construct descriptors aligned along a spatio-temporal volume. In [7], four types of descriptors are used: the Trajectory Shape Descriptor (TSD) [7], the Histogram of Oriented Gradients (HOG) [17], the Histogram of Optical Flow (HOF) [18], and the Motion Boundary Histogram (MBH) [7]. Each of these descriptors is designed to capture distinctive spatio-temporal features of the occurring motion.
As a final step, all of the descriptors are aggregated and encoded using BoWs: one codebook of visual words per descriptor is constructed using K-means clustering, so that the final features are represented by a unified histogram of word occurrences.
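To make the encoding step concrete, the following sketch quantizes a set of descriptors against an already learned codebook and builds the histogram of word occurrences. It assumes the K-means centroids are available as a list of vectors; the function names and the L1 normalization are choices made for this example and are not specified in the text.

```python
def nearest_word(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of visual-word occurrences (normalisation is an assumption)."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]
```

In the global setting of [7], one such histogram is computed per descriptor type over all trajectories of a video.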
One of the main drawbacks of Dense Trajectories is that points on the image grid are sampled uniformly, which potentially leads to the inclusion of a significant amount of noise. Furthermore, the generated Dense Trajectories do not take into account the spatial structure of the human body. Thus, actions with similar motion patterns can potentially be confused during classification.

Localized Trajectories for Action Recognition
To enhance robustness to irrelevant information, a reformulation of Dense Trajectories, called Localized Trajectories, is proposed. The main idea of this new approach consists in giving Dense Trajectories a local description in order to: (1) track the motion in specific and relevant spatial regions of the human body, more specifically around the joints; and (2) remove redundant and irrelevant motion information, which can negatively affect the classifier performance.
To that end, the pose information provided by estimated 3D skeletons is used as prior information to estimate an optimal clustering configuration.
Let us consider the human skeleton extracted from RGB-D cameras, composed of $J$ joints, and let us denote the trajectory of each skeleton joint $j$ as $Q^j = \{q^j_1, \ldots, q^j_N\}$. Note that we assume that the joints are always well detected. We use the distance proposed by Raptis et al. [43] to group the Dense Trajectories of an action around joints. Given a dense trajectory $P^m$ and a joint trajectory $Q^j$ which co-exist in the temporal range $\tau$, the spatio-temporal distance between the two trajectories is expressed using (3) as follows:
$$d(P^m, Q^j) = \max_{t \in \tau} s_t \cdot \frac{1}{|\tau|} \sum_{t \in \tau} r_t, \tag{3}$$
such that $s_t = \|p^m_t - q^j_t\|_2$ is the spatial distance and $r_t = \|(p^m_t - p^m_{t-1}) - (q^j_t - q^j_{t-1})\|_2$ is the velocity difference between trajectories $P^m$ and $Q^j$. Then, an affinity matrix is computed between every pair of trajectories $(P^m, Q^j)$ using (3) as:
$$w(P^m, Q^j) = \exp\big(-d(P^m, Q^j)\big), \tag{4}$$
where the measure $d(P^m, Q^j)$ penalizes trajectories with significant variation in spatial location and velocity. After a hierarchical clustering procedure based on the affinity score [43], a membership indicator function specifies the cluster $G^{j^*}$ of joint $j^*$ to which each trajectory belongs.
Furthermore, trajectories whose distance exceeds a certain threshold are rejected. This condition ensures that irrelevant and noise-induced trajectories, e.g., background motion, are not considered.
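The grouping of trajectories around joints can be illustrated with the short sketch below. Several points are assumptions rather than details taken from the text: the distance is taken to be the product form of Raptis et al. [43] (maximum spatial distance times mean velocity difference), the hierarchical clustering is replaced by a simpler nearest-joint assignment, and the rejection rule is implemented as a hypothetical spatial cutoff `max_spatial`.

```python
import math

def st_distance(P, Q):
    """Assumed Raptis-style distance: max spatial gap times mean velocity difference.

    P and Q are equally long lists (at least two entries) of (x, y) positions
    sampled over the frames where both trajectories exist.
    """
    s = [math.dist(p, q) for p, q in zip(P, Q)]
    r = [
        math.dist(
            (P[t][0] - P[t - 1][0], P[t][1] - P[t - 1][1]),
            (Q[t][0] - Q[t - 1][0], Q[t][1] - Q[t - 1][1]),
        )
        for t in range(1, len(P))
    ]
    return max(s) * (sum(r) / len(r))

def affinity(P, Q):
    """Affinity that decays exponentially with the spatio-temporal distance."""
    return math.exp(-st_distance(P, Q))

def assign_to_joint(P, joint_trajs, max_spatial):
    """Index of the closest joint trajectory, or None if P is too far from every joint."""
    best_j, best_d = None, math.inf
    for j, Q in enumerate(joint_trajs):
        if max(math.dist(p, q) for p, q in zip(P, Q)) > max_spatial:
            continue  # hypothetical spatial cutoff: skip joints that are not nearby
        d = st_distance(P, Q)
        if d < best_d:
            best_j, best_d = j, d
    return best_j
```

The explicit spatial cutoff matters for the product form: a background point moving with exactly the same velocity as a joint has a zero velocity term, so without the cutoff its distance would also be zero regardless of how far away it is.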
Feature Representation: As discussed in [7], features can be computed along each trajectory and BoWs can be used to aggregate and encode the information. In such a case, however, a descriptor associated with a trajectory carries no locality information. In contrast, we propose to exclusively assign trajectories and their corresponding descriptors to trajectory clusters. The main advantage of such a construction is that every trajectory-aligned descriptor not only captures the spatio-temporal characteristics of the trajectory but also carries its location. Thus, we construct a local codebook for each trajectory group $G^j$. During feature encoding, one histogram is constructed per joint cluster and per descriptor, denoted by $H^j$:
$$H^j = [H^j_{TSD}, H^j_{HOG}, H^j_{HOF}, H^j_{MBH}]. \tag{6}$$
The subscripts of the individual histograms identify the type of descriptors.
Finally, an action video is represented by the concatenation of the individual joint histograms in a final histogram $H$, as follows:
$$H = [H^1, H^2, \ldots, H^J]. \tag{7}$$
The general overview of our approach is illustrated in Fig. 1.
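The local encoding can be sketched as follows for a single descriptor type. The data layout (a dictionary mapping a joint index to the descriptors of its cluster, and one codebook per joint) and the function names are assumptions made for this illustration; raw counts are used instead of a particular normalization.

```python
def nearest_word(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))

def local_bow(descriptors_by_joint, codebooks_by_joint):
    """Concatenate one word-count histogram per joint cluster, in joint order."""
    H = []
    for j, codebook in enumerate(codebooks_by_joint):
        counts = [0] * len(codebook)
        for d in descriptors_by_joint.get(j, []):
            counts[nearest_word(d, codebook)] += 1
        H.extend(counts)  # the block position in H identifies the joint
    return H
```

Because each joint owns a fixed block of the final vector, the same motion pattern produces different representations when it is observed around different joints, which is the "action-motion-joint" relationship the local codebooks are meant to capture.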

3D Trajectories and Aligned Descriptors
Dense Trajectories, generated via optical flow, offer adequate performance when used for tracking movements that are lateral to the image plane. However, they struggle to track motion that happens radially, because such motion appears subtle when projected onto the 2D image plane. Consequently, in this subsection, we propose to extend Localized Trajectories to RGB-D input video streams by replacing optical flow with scene flow. The generated 3D trajectories are suitable for tracking motion in both lateral and radial directions, as illustrated in Fig. 2.

Scene Flow Estimation Using RGB-D Data
To generalize the concept of Dense Trajectories from 2D to 3D, we propose to make use of the 3D extension of optical flow, called scene flow. Thanks to the emergence of RGB-D cameras, numerous approaches have been proposed to estimate scene flow from depth maps, e.g., the Primal-Dual Framework for Real-Time Dense RGB-D Scene Flow (PD-Flow) algorithm [44], the Dense semi-rigid scene flow estimation [45] and the Layered RGBD scene flow estimation [46].
The scene flow $\Omega$ is linearly dependent on the depth motion field $S = (u, v, w)$, where $w$ is the range flow. It is computed by mapping $S$ to the 3D world coordinate system as below:
$$\Omega = \begin{pmatrix} \dot{X} \\ \dot{Y} \\ \dot{Z} \end{pmatrix} = \begin{pmatrix} Z/f_x & 0 & X/Z \\ 0 & Z/f_y & Y/Z \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} u \\ v \\ w \end{pmatrix}, \tag{8}$$
where $f_x$ and $f_y$ are the camera focal lengths, and $X, Y, Z$ are the 3D world coordinates of a specific point. The depth motion field is in turn estimated as the solution of a global variational problem, defined as:
$$\min_{S} \{E_D(S) + E_R(S)\}, \tag{9}$$
where $E_D(S)$ is a data term defined as the combined measure of the photometric and geometric inconsistency of successive depth and intensity images, and $E_R(S)$ is a regularization term.
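The linear mapping from the depth motion field to the scene flow can be checked numerically. The sketch below assumes a standard pinhole camera, for which differentiating $x = f_x X / Z + c_x$ and $y = f_y Y / Z + c_y$ with respect to time gives the relation used in the function; the function name and argument order are illustrative only.

```python
def scene_flow(u, v, w, X, Y, Z, fx, fy):
    """Map image-plane motion (u, v) and range flow w to a 3D velocity (dX, dY, dZ).

    Assumes a pinhole camera with Z measured along the optical axis.
    """
    dX = (Z / fx) * u + (X / Z) * w
    dY = (Z / fy) * v + (Y / Z) * w
    dZ = w
    return dX, dY, dZ
```

For a purely radial motion of a point on the optical axis ($X = Y = 0$, $u = v = 0$), the function returns $(0, 0, w)$: the displacement is invisible to optical flow but fully captured by the range flow.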
We choose PD-Flow [44] to estimate a dense scene flow field from an RGB-D video stream, since it has been shown to be one of the fastest and most accurate algorithms.

3D Localized Trajectories
To estimate the 3D trajectories using the scene flow, we start by uniformly sampling points from the 2D image grid. In this context, we define pixel coordinates as $(x, y)$. Similar to [7], we reject points belonging to homogeneous areas. Next, each of the sampled points is mapped to a standard 3D world coordinate system using the inverse of the intrinsic camera parameter matrix, as described below:
$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = D \begin{pmatrix} (x - c_x)/f_x \\ (y - c_y)/f_y \\ 1 \end{pmatrix}, \tag{10}$$
where $c_x, c_y$ are the coordinates of the image plane central point, $f_x$ and $f_y$ are the respective $x$ and $y$ components of the focal length, and $D$ is the depth value.
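The mapping between pixels with depth and 3D points, and its inverse, can be written as two small functions. This is a sketch under the usual pinhole model without lens distortion; the function names and the example intrinsics are assumptions, not values taken from the text.

```python
def backproject(x, y, D, fx, fy, cx, cy):
    """Map a pixel (x, y) with depth D to 3D camera coordinates (X, Y, Z)."""
    return ((x - cx) * D / fx, (y - cy) * D / fy, D)

def project(X, Y, Z, fx, fy, cx, cy):
    """Inverse mapping: recover the pixel position and depth (x, y, D) of a 3D point."""
    return (fx * X / Z + cx, fy * Y / Z + cy, Z)
```

The second function is what is needed after a 3D point has been displaced by the scene flow: projecting it back yields the pixel and depth at which the next flow vector should be read.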
Subsequently, trajectories of the mapped 3D points are estimated using (1), except that the motion field is now based on an estimated scene flow. The estimated 3D Dense Trajectories are denoted as: where $\Omega_i$ is the scene flow field. Correspondence between the estimated 3D points, displaced by the scene flow, and image pixels is derived by solving (10) in terms of $(x, y, D)^T$.
The above procedure is repeated until each of the 3D trajectories reaches the fixed temporal length we have set. Similar to [7], trajectories with sudden displacements or small overall spatial length are considered irrelevant and are removed.
In depth maps, texture information is not present. Thus, in our case, only motion descriptors are considered. Three types of descriptors are used: the 3D Trajectory Shape Descriptor (3DTSD), the Histogram of Scene Flow (HSF) [47], and the 3D Motion Boundary Histogram (3DMBH). 3DTSD is based on the original idea of the TSD for Dense Trajectories [7]. For each trajectory, the normalized displacement vector is computed. The HSF descriptor captures the orientation and the magnitude of the local scene flow field. For a spatio-temporal volume aligned around a 3D trajectory, the orientation of the 3D displacement is calculated using the azimuth $\theta_{x,y}$ and elevation $\theta_{y,z}$ angles formed by consecutive points as: For the histogram construction, the 4D space is quantized into a fixed number of bins. Similarly, the 3DMBH is based on the same idea as HSF. First, the derivative of the scene flow field is computed and then, for every pair of coordinates, the orientation angle is estimated.
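As an illustration of the 3DTSD, the sketch below computes the normalized displacement vector of a 3D trajectory. It assumes that the normalization mirrors the 2D TSD of [7], i.e., that the displacements are divided by the sum of their magnitudes; the function name is hypothetical.

```python
import math

def trajectory_shape_3d(points):
    """Displacements between consecutive 3D points, divided by the total path length."""
    disp = [tuple(b - a for a, b in zip(p, q)) for p, q in zip(points, points[1:])]
    total = sum(math.hypot(*d) for d in disp)  # multi-argument hypot needs Python 3.8+
    if total == 0:
        return [(0.0, 0.0, 0.0)] * len(disp)  # a static trajectory has no shape
    return [tuple(c / total for c in d) for d in disp]
```

Dividing by the total path length makes the descriptor invariant to the overall scale of the motion, so that the same gesture performed closer to or further from the camera yields a comparable shape.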
3D Trajectories are adapted to 3D Localized Trajectories by following the procedure described in Section 4. As before, we propose to enhance the discriminative power of 3D Trajectories by grouping them around 3D body joints. Hence, (3), (4) and (5) are adapted accordingly to incorporate all three dimensions of the 3D trajectories $P^m_{3D}$ and the 3D joint trajectories $Q^j_{3D}$. Then, during feature encoding, every histogram of the joint clusters $G^j$ defined in (6) is modified to include the descriptors used in this context, becoming:
$$H^j = [H^j_{3DTSD}, H^j_{HSF}, H^j_{3DMBH}].$$

Feature Selection for Codebook Construction
While 3D Trajectories are advantageous in capturing radial motion, they are notably noisier than Dense Trajectories, due to the scene flow estimation. As a result, the quality of the codebooks is degraded, which negatively affects the general performance of the proposed approach. In order to enhance it, we propose to select features according to two probabilistic metrics of the classifier: confidence and ambiguity. Confidence is the classifier's ability to quantify the reliability of its predictions, while ambiguity indicates the number of classes the classifier outputs for every prediction. The goal is to encode trajectory features $F_m$ using codebooks constructed by sampling, from a training set $M_r$, the features which maximize the classifier confidence $C$ and minimize the ambiguity $A$.
The confidence $C$ and ambiguity $A$ metrics are defined as: and where $\Pr(l_m = a \mid F_m)$ is the posterior probability of label $a$ given feature $F_m$.

Experimental Evaluation
In this section, we evaluate the proposed approaches on 5 challenging datasets: MSRDailyActivity3D [11], Online RGB-D Action (ORGBD) [48], G3D Gaming Action [49], Watch-n-Patch [50] and KARD [51]. First, a brief description of each dataset is given, followed by a description of the experimental setups. Then, the obtained results are reported and extensively analyzed.

Datasets and Experimental Settings:
The first dataset used for the experimental evaluation is MSRDailyActivity3D [11]. In this dataset, 10 actors perform 16 daily activities, which in some cases involve human-object interaction. The dataset is captured by the Kinect v1 device and therefore provides RGB, depth and skeleton modalities. A distinctive characteristic of this dataset is that every actor repeats each action twice, once sitting and once standing. For the experiments, we follow a cross-splitting protocol as in [11], where half of the subjects are used for training and the rest for testing.
The second dataset is called Online RGB-D Action (ORGBD) [48]. It can be used for both action recognition and action detection and includes 7 common types of human-object interaction related to the living room environment. Three sets of video sequences are collected using a Kinect sensor. Thus, RGB, depth and skeleton modalities are available. The first set is captured in the context of action recognition in the Same Environment, whereas the second set is acquired for cross-environment action recognition and the third for online action detection. The splitting protocol requires two-fold cross-validation for the same-environment scenario, whereas, for cross-environment action recognition, the training and testing sets should include different environments [48].
One challenging dataset used for the evaluation is the G3D Gaming Action Dataset [49]. This Kinect-acquired dataset can be used for both action recognition and temporal action detection.
For each one of the aforementioned datasets, we report the recognition accuracy obtained using the proposed Localized Trajectories and compare it to the classical Dense Trajectories and recent state-of-the-art approaches. In the following, we denote the original dense trajectory approach [7] by Dense Trajectories. We refer to the proposed 2D approach as 2D Localized Trajectories. Similarly, the proposed 3D extensions of the classical and the local Dense Trajectories are respectively called 3D Dense Trajectories and 3D Localized Trajectories.
The number of skeleton joints defines the number of clusters. In the MSRDailyActivity3D, ORGBD and G3D datasets, the skeletons are composed of 20 joints, while in the Watch-n-Patch and KARD datasets, they are formed by 25 and 15 joints, respectively. We also empirically choose 2000 trajectories per video in order to construct the codebooks, and 128 words per cluster and per descriptor for every dataset.
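These settings determine the length of the final video representation. The short calculation below assumes that every joint cluster contributes one 128-bin histogram per descriptor, as the setup implies; the dictionary and function names are illustrative.

```python
WORDS_PER_CODEBOOK = 128  # words per cluster and per descriptor

# Number of skeleton joints (= number of clusters) per dataset
JOINTS = {
    "MSRDailyActivity3D": 20,
    "ORGBD": 20,
    "G3D": 20,
    "Watch-n-Patch": 25,
    "KARD": 15,
}

DESCRIPTORS_2D = ("TSD", "HOG", "HOF", "MBH")
DESCRIPTORS_3D = ("3DTSD", "HSF", "3DMBH")

def histogram_length(dataset, descriptors):
    """Length of the concatenated histogram: joints x descriptors x words."""
    return JOINTS[dataset] * len(descriptors) * WORDS_PER_CODEBOOK
```

Under this assumption, a video from a 20-joint dataset is described by 20 × 4 × 128 = 10240 bins in the 2D case and by 20 × 3 × 128 = 7680 bins in the 3D case.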

Performance of 2D Localized Trajectories
In this subsection, an analysis of the obtained results is provided. First, we compare the performance of our approach against Dense Trajectories and other state-of-the-art methods. Later, we discuss some of the limitations of 2D Localized Trajectories.

2D Localized Trajectories vs Dense Trajectories
Since the aim of this work is to improve the discriminative power of classical Dense Trajectories, we start by comparing our proposed 2D Localized Trajectories with them. The results obtained on the five benchmarks demonstrate the superiority of the proposed 2D Localized Trajectories. As reported in Table 1,
The reported results reflect the ability of 2D Localized Trajectories to distinguish actions with similar motion patterns that are performed by different body parts. This is shown in various cases when comparing the confusion matrices obtained for 2D Localized Trajectories and Dense Trajectories. For instance, in the confusion matrices of the G3D dataset in Fig. 4, 2D Localized Trajectories boost the performance on the following action pairs: Punch Right-Punch Left and Kick Right-Kick Left. Also, in the same dataset, the recognition accuracy of the Tennis Swing Backhand and Throwing Bowling Ball activities, which include similar motion shapes, is improved by 20% and 6%, respectively. Furthermore, the accuracy of the Drinking and Reading Book classes in the ORGBD dataset is increased by 33% and 31%, respectively (see Fig. 5).
Another example of this enhancement is the pair of actions Defend and Aim & Fire Gun in the G3D dataset. The motion shapes of both action classes are similar, since both of them include arm raising. Nevertheless, the first is performed using both arms and the second using only one arm. As we can see in Fig. 4, the performance obtained for the action Defend is improved by 13% and the confusion with the action Aim & Fire Gun is reduced by 14%. In addition, in the same dataset, the actions Wave and Clap have similar lateral motion, which made their distinction challenging with the classical Dense Trajectories. However, with the use of 2D Localized Trajectories, motion trajectories were assigned to only one hand cluster in the Wave action and to both hands in the Clap action, reducing the confusion between these classes. This results in an accuracy boost of 13% in the Wave class, as shown in Fig. 4.
Moreover, in scenarios with full-body motion, such as the kitchen environment in the Watch-n-Patch dataset, 2D Localized Trajectories outperform the Dense Trajectories approach, as shown in Fig. 6. Clusters isolate the specific motion of body parts; therefore, motion patterns related to the action can be identified more effectively.

Comparison with 3D-Based State-of-the-Art Approaches
Our 2D Localized Trajectories approach has shown competitive performance compared to 3D-based state-of-the-art approaches. On the ORGBD dataset, we achieve the second best performance in the Same Environment setting (Table 3). We manage to match the state-of-the-art results of [11] in the Cross Environment setting and, at the same time, increase the mean accuracy by 16% over the Dense Trajectories.
On the Watch-n-Patch dataset, the 2D Localized Trajectories improved the performance of the Dense Trajectories by 2.3% in the office environment and by 25.3% in the kitchen environment, as illustrated in Table 4. The discriminative power of our approach boosts the performance of every action class, especially in the kitchen environment, as can be observed in Fig. 6. On this dataset, we compare our work only with Dense Trajectories. To the best of our knowledge, there is no work in the literature reporting offline action recognition accuracy on it, since this dataset was initially acquired for action detection.
On the G3D dataset, the results in Table 2 indicate that the 2D Localized Trajectories approach performs adequately compared to state-of-the-art 3D concepts, despite the fact that the dataset includes a significant amount of radial motion. The results in Table 2 also show that our method was the third best performing, without utilizing depth or 3D skeleton modalities.
In KARD dataset, our approach based on the 2D Localized Trajectories outperforms almost all state-of-the-art approaches, with a score of 98.2%, except of difference.
The 2D Localized Trajectories approach offers the second largest improvement on the MSRDailyActivity3D dataset, namely 10% compared to Dense Trajectories, as depicted in Table 1. Apart from that, its performance was slightly inferior to that of other state-of-the-art approaches, since it came third in average accuracy, behind Local HON4D [24] and Skeleton & LoP [11]. [...] were consequently low, as demonstrated in Fig. 3a and Fig. 3b. For that reason, as mentioned earlier, the proposed 3D Localized Trajectories present a good alternative to solve these two issues. The performance of the 3D Localized Trajectories is reported in the next section.

Performance of 3D Localized Trajectories
The proposed 3D Localized Trajectories approach was evaluated on the MSRDailyActivity3D dataset. The results reported in Table 1 show its superiority over Dense Trajectories and 2D Localized Trajectories. In fact, the accuracies of Dense Trajectories and 2D Localized Trajectories are improved by 11.9% and 1.9%, respectively. The performance improvement happens mainly because of the inclusion of depth information in 3D trajectories. This helps in distinguishing actions which are performed radially with respect to the camera. The latter is particularly reflected in the confusion matrix of the MSRDailyActivity3D dataset in Fig. 3, where actions like play game and play guitar are more effectively discriminated using 3D information. The reported accuracies for the actions play game and play guitar are significantly improved. In particular, from 20% and 20% using Dense Trajectories and 40% and 40% using 2D Localized Trajectories, the accuracy climbed to 60% and 70%, respectively, with the use of 3D Localized Trajectories.
These promising results highlight the potential of our first attempt to generalize Dense Trajectories to 3D and open up new perspectives. Indeed, many components of this 3D concept can be reinforced to increase its effectiveness. For example, 3D trajectories are slightly noisier than Dense Trajectories, mainly because depth sensors introduce additional noise. This noise translates into a significant number of background points which appear to move radially, creating many irrelevant 3D trajectories. Most importantly, the scene flow estimation is not optimal, since it relies on two different modalities which often appear to be misaligned. This fact is reflected in the performance of the 3D Trajectories (without locality), whose accuracy is notably lower than that of the Dense Trajectories, as demonstrated in Table 1. Nevertheless, the trajectory clustering around body joints is still able to remove a significant amount of noisy and irrelevant trajectories in the 3D Localized Trajectories case.

Global BoW vs. Local BoW
To experimentally motivate the use of local BoWs, we compare the results obtained for 2D Localized Trajectories using a global BoW and local BoWs. The experiments are conducted on the cross-environment scenario of the ORGBD dataset. With a global BoW, the mean accuracy is notably lower than with local BoWs, reaching 53.6% vs. 59.8%. The results suggest that trajectory clustering combined with local BoWs contributes significantly to the local discriminative power of the overall approach. They also suggest that the local encoding is more effective, since the codebooks are constructed using features which are specific to the motion of each body part.

Conclusion
In this paper, we proposed to solve two major shortcomings of the original Dense Trajectories approach using additional modalities provided by RGB-D cameras: the lack of locality information and the ineffectiveness in describing radial motion. Our contribution is two-fold. First, we enhanced the discrim-

Figure 1 :
Figure 1: Proposed 2D Localized Trajectories approach.From an RGB sequence, Dense Trajectories are generated and, then, clustered around body joints using RGB-D pose information (only 2D information is used).Finally, local codebooks, for every cluster G j , are constructed for the histogram representation of features.This feature representation is used in both training and testing phases of the classification.

Figure 2 :
Figure 2: Scene flow-generated motion trajectories.Three phases of the same action are illustrated: In (a), (b) and (c) the frontal view of a subject drinking water is displayed as a point cloud, along with the corresponding motion trajectories in red.The same sequence is illustrated from the side in (d), (e) and (f).The capture of both lateral and radial motion shape is clearly depicted.
nition and temporal action detection.It consists of 10 subjects performing 20 gaming actions which are grouped into 7 gaming scenarios, which are: Fighting, playing golf, playing tennis, bowling, first person shooter, driving a car and miscellaneous.The first 5 actors are used for training and the rest are used for testing[49].Watch-n-Patch[50] dataset, which was introduced by the Cornell University, is also utilized.This dataset includes 21 types of actions (10 in an office and 11 in a kitchen) which involve interactions with 23 types of objects.7 subjects perform 2-7 actions in every of the 458 videos.The dataset was recorded using a Kinect v2 camera.This dataset distinguishes itself by a high intra-class variability since the subjects perform different combinations of actions and order them differently each time.For the experiments, we use the provided splitting protocol proposed in[50], where, for every environment, almost half of the videos are used for training and the rest for testing.The last dataset used for evaluation is called Kinect Activity Recognition Dataset (KARD)[51].It contains 18 action classes which are performed by 10 subjects (9 males and 1 female) where half of them are used for training and the other half for testing, as proposed in[51].The dataset was captured by a Kinect device and consequently contains the three RGB-D modalities: RGB images, depth maps and 3D skeletons.Implementation Details: For extracting Dense Trajectories and features from videos, we use the implementation provided by the authors in[7] 1 .The trajectory temporal length is fixed to 15 frames.The features are computed on a spatio-temporal volume of 32 × 32 × 15 aligned on the trajectory, as suggested in[7].This volume is further divided into 2 × 2 × 3 cells, where the histograms of the descriptors are computed.In the case of 3D trajectories, we use the same parameters for the spatio-temporal volume.The number of histogram bins for the 2D trajectories is set to 8 for HOG 
and MBH descriptors and 9 for HOF descriptor, whereas for 3D trajectories case we use 9-bin histograms for every descriptor.The distance threshold for each trajectory is set to 0.02.Moreover, a linear SVM is employed for classification.

6. 1 . 3 .
Limitations of 2D Localized Dense TrajectoriesDespite its strong performances, 2D Localized trajectories action representation suffers from two limitations.First, 2D Localized Trajectories approach presents low performance when the motion amount is small.This attribute is inherited from Dense Trajectories approach and is clearly depicted in action classes such as Call Cellphone in both MSR DailyActivity 3D and ORGBD as it is shown in Fig.3and Fig 5, respectively, and Write on a Paper in MSR Dai-lyActivity 3D.Nonetheless, Sit Still class achieves adequate performance with the use of 2D Localized Trajectories, since it is an action class with almost no motion.Second, 2D Localized Trajectories approach does not capture radial motion sufficiently.Action classes such as Playing the guitar in MSRDailyActivity3D dataset include a notable amount of radial motion and the accuracy results inative power and locality-awareness of Dense Trajectories by clustering them around human body joints.This method is coupled with the local Bag-of-Words concept, strengthening further the framework.Second, we constructed 3D Localized Trajectories for action recognition.For this purpose, we used a) scene flow instead of optical flow for the generation of the 3D Trajectories and b) 4D extension of the originally used spatio-temporal descriptors.The reported results show the robustness of the two proposed representations in various challenging datasets.As future work, we intend to develop an automatic way of choosing the optimal parameters.In addition, we intend to estimate more reliable and robust to noise 3D Trajectories directly from point cloud data for the purposes of enhancing our current approach and extending it to view-invariant action recognition.
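As a quick arithmetic check of the implementation details reported above (a 2 × 2 × 3 cell grid with 8 or 9 orientation bins), the per-trajectory descriptor lengths can be computed as in the sketch below. The helper `descriptor_length` is hypothetical, and treating MBH as two channels (horizontal and vertical flow components) follows the original Dense Trajectories descriptor and is an assumption here; the number of channels of the 3D descriptors is not stated in this section, so only the per-channel length is computed for them.

```python
def descriptor_length(cells=(2, 2, 3), bins=8, channels=1):
    """Length of a cell-wise orientation histogram descriptor."""
    nx, ny, nt = cells
    return nx * ny * nt * bins * channels

# 2D trajectories: 8 bins for HOG and MBH, 9 bins for HOF.
dims_2d = {
    "HOG": descriptor_length(bins=8),
    "HOF": descriptor_length(bins=9),
    "MBH": descriptor_length(bins=8, channels=2),  # assumption: MBHx and MBHy
}

# 3D trajectories: 9 bins for every descriptor (length per channel only).
dim_3d_per_channel = descriptor_length(bins=9)

print(dims_2d, dim_3d_per_channel)
```

Under these assumptions, each of the 12 cells contributes one histogram, so the descriptor length grows linearly with the number of bins and channels.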

Table 1: Mean accuracy of recognition (%) on the MSR DailyActivity 3D dataset for the Dense Trajectories and 2D Localized Trajectories approaches against the literature.

Table 2: Mean accuracy of recognition (%) on the G3D dataset for the Dense Trajectories and 2D Localized Trajectories approaches against the literature.

Table 3: Mean accuracy of recognition (%) on the ORGBD dataset for the Dense Trajectories and 2D Localized Trajectories approaches against the literature, in both the Same Environment and Cross Environment settings.

Table 4: Mean accuracy of recognition (%) on Watch-n-Patch in both kitchen and office settings for the Dense Trajectories and 2D Localized Trajectories approaches.

Table 5: Mean accuracy of recognition (%) of the Dense Trajectories and 2D Localized Trajectories approaches on the KARD dataset.