Article

Spatio–Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks †

1 Cerema, Project team STI, 1 avenue du Colonel Roche, F-31400 Toulouse, France
2 Informatics Research Institute of Toulouse (IRIT), Paul Sabatier University, Toulouse 31062, France
3 Aparnix, La Gioconda 4355, 10B, Las Condes, Santiago 7550076, Chile
4 Cortexica Vision Systems Ltd., London SE1 9LQ, UK
5 School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK
6 Department of Computer Science, University Carlos III of Madrid, 28903 Leganés, Spain
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper, Pham, H. H., Khoudour, L., Crouzil, A., Zegers, P., & Velastin, S. A. “Skeletal Movement to Color Map: A Novel Representation for 3D Action Recognition with Inception Residual Networks.” published in the 25th IEEE International Conference on Image Processing (ICIP). In the evaluation section, we also reproduce results from our paper, Pham, H. H., Khoudour, L., Crouzil, A., Zegers, P., & Velastin, S. A. “Learning to recognise 3D human action from a new skeleton-based representation using deep convolutional neural network” published in the IET Computer Vision Journal in 2018 and compare them with the method described in this paper.
Sensors 2019, 19(8), 1932; https://doi.org/10.3390/s19081932
Submission received: 6 March 2019 / Revised: 10 April 2019 / Accepted: 17 April 2019 / Published: 24 April 2019
(This article belongs to the Special Issue Deep Learning-Based Image Sensors)

Abstract

Designing motion representations for 3D human action recognition from skeleton sequences is an important yet challenging task. An effective representation should be robust to noise, invariant to viewpoint changes and result in good performance with low computational demand. Two main challenges in this task are how to efficiently represent the spatio–temporal patterns of skeletal movements and how to learn their discriminative features for classification tasks. This paper presents a novel skeleton-based representation and a deep learning framework for 3D action recognition using RGB-D sensors. We propose to build an action map called SPMF (Skeleton Posture-Motion Feature), which is a compact image representation built from skeleton poses and their motions. An Adaptive Histogram Equalization (AHE) algorithm is then applied to the SPMF to enhance its local patterns and form an enhanced action map, namely the Enhanced-SPMF. For the learning and classification tasks, we exploit Deep Convolutional Neural Networks based on the DenseNet architecture to learn directly an end-to-end mapping between input skeleton sequences and their action labels via the Enhanced-SPMFs. The proposed method is evaluated on four challenging benchmark datasets, covering individual actions, interactions, multi-view and large-scale settings. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches on all benchmark tasks, whilst requiring low computational time for training and inference.

1. Introduction

Human action recognition [1] is one of the most important and challenging tasks in computer vision. Correctly detecting and recognizing what humans do in unknown videos serves as a key component of many real-world applications such as smart surveillance [2,3], human–object interaction [4,5], autonomous vehicle technology [6,7], etc. Although significant progress has been achieved over two decades of research, video-based human action recognition remains challenging due to a number of obstacles, e.g., changes in camera viewpoint, occlusions, background clutter, surrounding distractions, and diversity in the length and speed of actions [8].
As with many other visual recognition tasks, traditional approaches to human action recognition [9] have focused on extracting hand-crafted local features and building local descriptors from RGB sequences provided by 2D cameras. Some typical examples that have been widely exploited with success are SIFT [10,11], HOG/HOF [12,13], HOG-3D [14], Cuboids [15], SURF [16] and Extended SURF [17]. Since these approaches typically recognize actions based on the appearance and movement of human body parts in a monocular RGB video sequence, they lack the 3D structure of the scene. Therefore, single-modality human action recognition based only on RGB videos is not sufficient to overcome the current challenges.
The availability of low-cost and easy-to-use depth sensors such as the Microsoft Kinect™ sensor [18] has helped the computer vision community improve action recognition. These sensors are able to provide detailed 3D structural information of human motion, which is difficult to capture with traditional 2D cameras. Many action recognition approaches using RGB-D cameras have been proposed and have advanced the state-of-the-art [19,20,21,22,23,24,25]. In particular, most current depth-sensing cameras integrate real-time skeleton estimation and tracking frameworks [26,27], which facilitates the collection of skeleton sequences. This data source is a high-level representation that describes human actions in a precise and effective way, well suited to the problem of action analysis and recognition. Skeleton-based human action recognition is a time-series problem: the skeletal data comprise the 3D coordinates of the key joints in the human body over time. This is an effective representation for structured motion [28] because each human action can be represented through the movement of skeleton sequences, and a large set of actions can be distinguished from these movements [29]. 3D skeletal data are not only invariant to camera viewpoint but can also be estimated in real-time, and they are available for most depth-based action datasets [30]. Hence, exploiting this data source for 3D human action recognition opens up opportunities for addressing the limitations of RGB- and depth-based solutions, and many skeleton-based action recognition approaches have therefore been proposed [19,23,31,32,33]. Our goal is to exploit the potential of low-cost consumer depth cameras for identifying salient spatio–temporal patterns in skeleton sequences and then explore them for improving the recognition of human actions using deep learning models.
In the literature on skeleton-based action recognition, there are two main issues that need to be solved. The first challenge is to find a skeleton-based representation that transforms the raw skeletal data into a form that effectively captures the spatio–temporal dynamics of human skeleton joints. The second challenge is to model and recognize actions that are complex, variable and have large intra-class correlation from this skeleton-based representation. Previous studies [19,23,31,34,35,36,37,38,39,40,41,42] on this topic can be divided into two main categories: skeleton-based action recognition based on hand-crafted features and skeleton-based action recognition using deep neural networks. The first group of methods uses hand-crafted local features and probabilistic graphical models such as Hidden Markov Models (HMMs) [43], Conditional Random Fields (CRFs) [34], or the Fourier Temporal Pyramid (FTP) [23] to model and classify actions. However, almost all of these approaches are shallow, data-dependent and require a lot of feature engineering. The second group of methods considers skeletal data as time-series patterns and proposes the use of Recurrent Neural Networks (RNNs) [44], especially Recurrent Neural Networks with Long Short-Term Memory units (RNN-LSTMs) [45,46], to analyze and model the contextual information contained in the skeleton sequences. They are considered the most popular deep learning based approach for skeleton-based action recognition and have achieved high-level performance. Although able to model the long-term temporal dynamics of human motion, RNN-LSTMs [45,46] treat skeleton sequences as a kind of low-level feature by feeding raw skeletal data directly into the network. The huge number of input features makes these networks complex and time-consuming to train, and may easily lead to overfitting. Moreover, almost all of these networks act merely as classifiers and do not extract high-level features for recognition tasks [47].
A practical human action recognition system should be able to detect and recognize actions from different viewpoints, be robust to noise and operate in real-time. We believe that an efficient and effective representation of 3D human motion plays a decisive role in improving recognition performance. Motivated by the success of our previous work on the SPMF (Skeleton Posture-Motion Feature) representation [48] for video-based human action recognition, in this paper we aim to find a new skeleton-based representation and take full advantage of the capacity of Deep Convolutional Neural Networks (D-CNNs) to learn highly hierarchical image features, in order to build an end-to-end learning framework for 3D human action recognition from skeletal data. Specifically, we propose a new 3D motion representation, termed Enhanced-SPMF (Enhanced Skeleton Posture-Motion Feature). Similarly to the SPMF [48], the proposed Enhanced-SPMF has a 2D image structure with three color channels, built from a set of spatio–temporal stages combining 3D skeleton poses and their motions. An Adaptive Histogram Equalization (AHE) algorithm [49] is then applied to the color images to enhance their local patterns and generate more discriminative features for the classification task. Figure 1 illustrates an overview of the proposed Enhanced-SPMF. To learn image features and recognize action labels from the proposed representation, different D-CNN models based on the DenseNet architecture [50] have been designed and evaluated.
Five important hypotheses motivate us to propose a new skeleton-based representation and design DenseNets [50] for 3D human action recognition with skeletal data. First, human actions can be correctly represented through skeleton movements [28,29]. Second, compared to RGB and depth streams that contain thousands of pixels per frame, skeletal data has a high-level abstraction with much less complexity, which makes the training and inference processes much simpler and faster. Third, as shown in our previous works [48,51], the spatio–temporal dynamics of skeleton sequences can be transformed into color images, a kind of 3D tensor-structured representation that can be effectively learned by representation learning models such as D-CNNs. Fourth, many different action classes share a great number of similar primitives, which interferes with action classification; extracting the essential spatio–temporal patterns of skeleton movements therefore plays a key role in this task. Last, recent research results indicate that CNNs have achieved outstanding performance in many image recognition tasks [52,53], and there are many signs indicating that the learning performance of CNNs can be significantly improved by increasing the depth of their architectures [54,55,56,57]. In particular, D-CNNs with architectures such as DenseNet [50] can improve accuracy in image recognition since this kind of network is able to prevent overfitting and degradation phenomena [58] by maximizing information flow and facilitating feature reuse, as each layer in its architecture has direct access to the features from the previous layers. Therefore, we explore the use of DenseNet in this work and optimize this architecture for learning and recognizing human actions from the proposed image-based representation.
The effectiveness of the proposed method is evaluated on four public benchmark RGB-D datasets: the MSR Action3D [59], KARD [60], SBU Kinect Interaction [61] and NTU-RGB+D [39] datasets. The hypotheses above were reinforced by the experimental results, which show that we achieve state-of-the-art performance on all the reported benchmarks. Furthermore, we also report the effectiveness of this approach in terms of computational cost, for both training time and inference latency. Overall, the main contributions of our study include two aspects:
  • Firstly, we present the Enhanced-SPMF, a new skeleton-based representation for 3D human action recognition from skeletal data. This work is an extended version of our paper published in the 25th IEEE International Conference on Image Processing (ICIP) [48], in which the Enhanced-SPMF is an extension of the SPMF (Skeleton Pose-Motion Feature). Compared to our previous work, the current work aims to improve the efficiency of the 3D motion representation via a smoothing filter and a color enhancement technique. The smoothing filter reduces the effect of noise on the skeletal data, while the color enhancement technique makes the proposed Enhanced-SPMF more robust and discriminative for the recognition task. An ablation study on the Enhanced-SPMF demonstrates that the new representation leads to better overall action recognition performance than the SPMF [48].
  • Secondly, we present a deep learning framework (the implementation and models will be made publicly available at https://github.com/cerema-lab/Sensors-2018-HAR-SPMF) based on the DenseNet architecture [50] for learning discriminative features from the proposed Enhanced-SPMF and performing action classification. The framework directly learns an end-to-end mapping between skeleton sequences and their action labels with little pre-processing. We evaluate the proposed method on four highly competitive benchmark datasets and demonstrate significant improvements over existing state-of-the-art approaches. Our computational efficiency evaluations show that the proposed method achieves a high level of performance whilst requiring low computational time for both the training and inference stages. Compared to our previous work that exploited the Residual Inception v2 network [48], the current work uses a more powerful deep learning model for the action recognition task.
The rest of this paper is organized as follows: Section 2 discusses related works. Section 3 presents the details of the proposed approach. Datasets and experiments are described in Section 4. The experimental results and analyses are provided in Section 5. Section 6 concludes the paper.

2. Related Work

In this section, we briefly review the existing literature most closely related to deep learning based approaches for 3D human action recognition from skeleton sequences, covering skeleton-based action recognition using hand-crafted features and deep learning-based action recognition. We encourage readers to refer to the extensive review by Han et al. [62] for a more comprehensive picture of this topic.

2.1. Hand-Crafted Approaches for Skeleton-Based Human Action Recognition

Earlier studies on skeleton-based human action recognition focused on finding well-designed hand-crafted features and using temporal graphical models to analyze the global temporal evolution of skeleton joints. Since the first work on 3D human action recognition from depth data was introduced [59], many approaches for skeleton-based action recognition have been proposed [19,23,31,34,35,36]. The common characteristic of these approaches is that they extract geometric features of 3D joint movements and model their temporal information with a generative model. For instance, Wang et al. [19] represented human motion by means of the pairwise relative positions of the skeleton joints to generate more discriminative features. The Fourier Temporal Pyramid (FTP) [19] was then proposed to model the temporal dynamics of the actions from Local Occupancy Patterns (LOPs). Vemulapalli et al. [23] represented the 3D geometric relationships of body parts as points in a Lie group and then exploited Dynamic Time Warping (DTW) [63] and the Fourier Temporal Pyramid (FTP) [19] to model their temporal dynamics. Xia et al. [31] extracted and computed histograms of 3D joint locations (HOJ-3D) to represent actions via posture visual words. The temporal evolutions of those words are modeled by a discrete Hidden Markov Model (HMM) [64]. Instead of modeling the temporal evolution of skeletons, Luo et al. [35] proposed a discriminative dictionary learning algorithm (called DL-GSGC) that incorporates both group sparsity and geometry constraints to learn motion features from the 3D joint positions. An encoding technique called Temporal Pyramid Matching (TPM) [35] was then used to keep the temporal information and perform action classification.
Although promising results have been achieved, the above approaches have some limitations that are difficult to overcome. For instance, in many cases they require pre-processing of the input data, in which the skeleton sequences need to be segmented or aligned. Unlike these approaches, we propose a skeleton-based representation and a deep learning framework for 3D human action recognition that learns to recognize actions directly from the original skeletons in an end-to-end manner, without dependence on the length of actions. Moreover, the proposed solution is general and can be applied to other data modalities such as motion capture data [65] and the output of pose estimation algorithms [66,67].

2.2. Deep Learning Approaches for Skeleton-Based Human Action Recognition

Approaches based on Recurrent Neural Networks with Long Short-Term Memory units (RNN-LSTMs) [45,68] are the most popular deep learning approach for skeleton-based action recognition and have achieved high-level performance on video-based action recognition tasks [37,38,39,40,41,42]. The temporal evolutions of skeletons are spatio–temporal patterns; thus, they can be modeled by the memory cells in the structure of RNN-LSTMs [45,68]. For instance, Du et al. [37] proposed a hierarchical RNN to model the long-term contextual information of skeletal data, in which the human skeleton is divided into five parts according to its physical structure. Each low-level part is modeled by an RNN and then combined into the final representation of high-level parts for action classification. Shahroudy et al. [39] introduced a part-aware LSTM action learning model by splitting the long-term memory of the entire motion into part-based cells. The long-term context of each body part is learned independently, and the output of the network is formed as a combination of the independent body part contexts. Liu et al. [40] presented a spatio–temporal LSTM network, called ST-LSTM, for 3D action recognition from skeletal data. They proposed a skeleton-based tree traversal technique to feed the structure of the skeletal data into a sequential LSTM network and improved the performance of the ST-LSTM by adding trust gates. Recently, Liu et al. [42] focused on selecting the most informative skeleton joints by using a new class of LSTM network, namely the Global Context-Aware Attention LSTM (GCA-LSTM), for 3D skeleton-based action recognition. Two LSTM layers are used: the first encodes the input sequences and generates an initial global context memory for these sequences, while the second performs attention over the input sequences with the assistance of the obtained global context memory. The attention representation is then fed back to refine the global context. Multiple attention iterations are executed and the final global contextual information is used for the action classification task.
Compared to approaches based on hand-crafted local features, RNN-LSTM based approaches and their variants have shown superior action recognition performance. However, they tend to overemphasize the temporal information and lose the spatial information of the skeletons [37,38,39,40,41,42]. RNN-LSTM based approaches also still struggle to cope with the complex spatio–temporal variations of skeletal movements due to a number of issues such as jitter and movement speed variability. Another drawback of the RNN-LSTM networks [45,68] is that they only model the overall temporal dynamics of actions without considering their detailed temporal dynamics. To overcome these limitations, we propose in this study a CNN-based approach that is able to extract discriminative features of actions and model the various temporal dynamics of skeleton sequences via the proposed Enhanced-SPMF representation, covering short-term, medium-term and long-term actions. We summarize the advantages and disadvantages of our proposed method in comparison with some previous approaches in Table 1.

3. Method

The details of the proposed approach are presented in this section. Figure 2 illustrates the key components of the proposed learning framework for recognizing actions from skeleton sequences. We first show how skeleton pose and motion features can be combined to build an action map in the form of an image-based representation (Section 3.1), and how a color enhancement technique can improve the discriminative ability of the proposed representation (Section 3.2). We then introduce an end-to-end deep learning framework based on DenseNets to learn and classify actions from the enhanced representations (Section 3.3). Before that, in order to put the proposed approach into context, it is useful to review the central ideas behind the original DenseNet architecture (Section 3.3.1).

3.1. SPMF: Building Action Map from Skeletal Data

One of the major challenges in exploiting D-CNNs for skeleton-based action recognition is how the spatio–temporal patterns of skeleton movements can be effectively represented and fed to D-CNNs for representation learning. As D-CNNs work well on image representations [73], our idea is therefore to encode the whole skeleton sequence into a single 2D image as a global representation of the action sequence. In general, the two essential elements that determine a human action are poses and their motions. Hence, we decided to transform these two elements into the static spatial structure of a color image with three channels (R, G, B). Specifically, we propose a new representation, namely the Enhanced-SPMF (Enhanced Skeleton Pose-Motion Feature), which is built from pose and motion vectors extracted from the skeleton joints. Note that combining multiple kinds of geometric features, such as joint coordinates, lines and planes determined by the joints, leads to lower performance than using only a single type of feature or several main types of features [74]. Moreover, it has been reported [61] that joint features such as joint-joint distance and joint-joint motion are among the strongest features.

3.1.1. Pose Features (PFs) Computation

Given a skeleton sequence $\mathcal{S}$ with $N$ frames, denoted by $\mathcal{S} = \{F_t\}$, where $t = 1, 2, 3, \ldots, N$, let $p_j^t$ and $p_k^t$ be the 3D coordinates of the $j$-th and $k$-th joints in $F_t$. The Joint-Joint Distance $JJD_{jk}^t$ between $p_j^t$ and $p_k^t$ at timestamp $t$ is computed as
$$JJD_{jk}^t = \| p_j^t - p_k^t \|_2, \quad (t = 1, 2, 3, \ldots, N), \qquad (1)$$
where $\| \cdot \|_2$ denotes the Euclidean distance between the two joints. The joint distances obtained by Equation (1) over all actions of a specific dataset range from $D_{\min} = 0$ to $D_{\max} = \max\{JJD_{jk}^t\}$. We denote this distance space as $D_{\mathrm{original}}$. In fact, $D_{\mathrm{original}}$ could be transformed into a tensor structure and fed directly to D-CNNs for learning action features. However, since $D_{\mathrm{original}}$ is a high-dimensional space, it could lead D-CNNs to overfit as well as being time-consuming. Thus, we need to describe the input skeleton sequences as low-dimensional signals that are easy to parameterize by learning models and discriminative enough for a classification task. To do that, we normalize all elements of $D_{\mathrm{original}}$ to the range $[0, 1]$, denoted as $D_{[0,1]}$. To reflect the change in joint distances, we encode $D_{[0,1]}$ into a color space using a sequential discrete color palette called the JET color map (a JET color map is based on the order of colors in the spectrum of visible light, ranging from blue to red and passing through cyan, yellow, and orange). The encoding process converts the joint distances $JJD_{jk}^t \in D_{[0,1]}$ for all possible combinations of $j$ and $k$ into color points $JJD_{RGB}^t \in \mathbb{N}_{[0,255]}^3$ using a 256-color JET scale. To this end, we first normalize the distance values with respect to the maximum and minimum values of a grayscale image ranging from 0 to 1. As illustrated in Figure 3, the scalar distances are then converted to a three-channel map via the JET mapping. This technique is similar to the depth encoding method presented in [75]. The use of a discrete color palette allows us to reduce the complexity of the input features, which helps accelerate the convergence rate of deep learning networks during the training stage. Moreover, it should be noted that point-point distances are invariant when they are moved into a new coordinate system in 3D Euclidean space. Therefore, the use of the Joint-Joint Distance $JJD_{jk}^t$ helps our final representation be more independent of the camera viewpoint.
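To make the encoding step concrete, the following minimal Python sketch (our own illustration, not the authors' released code) computes the pairwise joint distances of Equation (1), normalizes them by a dataset-level maximum distance, and maps them to RGB color points with the JET palette; the helper name jjd_to_rgb and the use of Matplotlib's jet colormap are assumptions made for illustration.
```python
import numpy as np
import matplotlib.pyplot as plt

def jjd_to_rgb(skeleton, d_max):
    """skeleton: (J, 3) array of 3D joint coordinates for one frame F_t.
    d_max: maximum joint-joint distance D_max observed over the dataset."""
    # Pairwise Euclidean distances JJD^t_{jk} for all joint pairs (Equation (1)).
    diff = skeleton[:, None, :] - skeleton[None, :, :]
    jjd = np.linalg.norm(diff, axis=-1)
    # Normalize the distances from D_original to the range [0, 1].
    jjd_01 = np.clip(jjd / d_max, 0.0, 1.0)
    # Map each scalar distance to a color point in [0, 255]^3 via the JET palette.
    jet = plt.get_cmap('jet')                              # 256-entry JET colormap
    rgb = (jet(jjd_01)[..., :3] * 255).astype(np.uint8)    # drop the alpha channel
    return rgb                                             # shape (J, J, 3)
```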
Apart from the distance information, the orientation between joints is also important for describing human motions. The Joint-Joint Orientation $JJO_{jk}^t$ from joint $p_j^t$ to $p_k^t$ at timestamp $t$ is computed as
$$JJO_{jk}^t = \overrightarrow{p_j^t p_k^t}, \quad (t = 1, 2, 3, \ldots, N). \qquad (2)$$
The $JJO_{jk}^t$ is a vector whose components $p$ can each be normalized to the range $[0, 255]$. This is done via the following transformation
$$p_{\mathrm{norm}} = \mathrm{floor}\left( 255 \times \frac{p - c_{\min}}{c_{\max} - c_{\min}} \right), \qquad (3)$$
where $p_{\mathrm{norm}}$ indicates the normalized value, and $c_{\max}$ and $c_{\min}$ are the maximum and minimum values of all coordinates over the training set, respectively. The function $\mathrm{floor}(\cdot)$ rounds down to the nearest integer. We consider the three components $(x, y, z)$ of $JJO_{jk}^t$ after normalization as the corresponding three components $(R, G, B)$ of a color pixel and build $JJO_{RGB}^t$ as a 3D array formed by all $JJO_{jk}^t$ values. We then define "a human pose" at timestamp $t$ by the vector $PF^t$ that describes the distance and orientation relationships between skeleton joints,
$$PF^t = JJD_{RGB}^t \oplus JJO_{RGB}^t, \quad (t = 1, 2, 3, \ldots, N). \qquad (4)$$
Here, the symbol $\oplus$ horizontally concatenates the vectors $JJD_{RGB}^t$ and $JJO_{RGB}^t$.
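A possible implementation of the pose feature of Equations (2)–(4) is sketched below, reusing the jjd_to_rgb helper sketched above; the function names and the clipping step are our own choices, and c_min/c_max stand for the extreme coordinate values over the training set.
```python
import numpy as np

def jjo_to_rgb(skeleton, c_min, c_max):
    """Joint-joint orientation vectors mapped to RGB (Equations (2) and (3))."""
    # Vector from joint j to joint k for every pair of joints.
    jjo = skeleton[None, :, :] - skeleton[:, None, :]            # (J, J, 3)
    norm = np.floor(255.0 * (jjo - c_min) / (c_max - c_min))     # Equation (3)
    return np.clip(norm, 0, 255).astype(np.uint8)                # (x, y, z) -> (R, G, B)

def pose_feature(skeleton, d_max, c_min, c_max):
    """PF^t = JJD_RGB^t (+) JJO_RGB^t, concatenated horizontally (Equation (4))."""
    jjd_rgb = jjd_to_rgb(skeleton, d_max)
    jjo_rgb = jjo_to_rgb(skeleton, c_min, c_max)
    return np.concatenate([jjd_rgb, jjo_rgb], axis=1)            # (J, 2J, 3)
```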

3.1.2. Motion Features (MFs) Computation

Let $p_j^t$ and $p_k^{t+1}$ denote the 3D coordinates of the $j$-th and $k$-th joints at two consecutive frames $F_t$ and $F_{t+1}$. Similarly to $JJD_{jk}^t$ in Equation (1), the Joint-Joint Distance $JJD_{jk}^{t,t+1}$ between $p_j^t$ and $p_k^{t+1}$ is computed as
$$JJD_{jk}^{t,t+1} = \| p_j^t - p_k^{t+1} \|_2, \quad (t = 1, 2, 3, \ldots, N-1). \qquad (5)$$
Also, similarly to Equation (2), the Joint-Joint Orientation $JJO_{jk}^{t,t+1}$ from joint $p_j^t$ to $p_k^{t+1}$ is computed as
$$JJO_{jk}^{t,t+1} = \overrightarrow{p_j^t p_k^{t+1}}, \quad (t = 1, 2, \ldots, N-1). \qquad (6)$$
We define "a human motion" from $t$ to $t+1$ by the vector $MF^{t \to t+1}$, in which
$$MF^{t \to t+1} = JJD_{RGB}^{t,t+1} \oplus JJO_{RGB}^{t,t+1}, \quad (t = 1, 2, \ldots, N-1), \qquad (7)$$
where $JJD_{RGB}^{t,t+1}$ and $JJO_{RGB}^{t,t+1}$ are color-encoded in the same manner as $JJD_{RGB}^t$ and $JJO_{RGB}^t$, respectively.

3.1.3. Building Global Action Map from PFs and MFs

Based on the obtained PFs and MFs, we propose a skeleton-based representation called SPMF for 3D human action recognition. To this end, all PFs and MFs computed from the skeleton sequence S are concatenated into a single feature vector in temporal order from the beginning to the end of the action. It is a global representation for the whole skeleton sequence S without dependence on the range of action and can be obtained by
$$SPMF = \left[ PF^1 \oplus MF^{1 \to 2} \oplus PF^2 \oplus \cdots \oplus PF^t \oplus MF^{t \to t+1} \oplus PF^{t+1} \oplus \cdots \oplus PF^{N-1} \oplus MF^{N-1 \to N} \oplus PF^N \right]. \qquad (8)$$
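The sketch below illustrates how the global action map of Equation (8) could be assembled under the assumptions of the earlier sketches: pose_feature is the helper defined above, motion_feature is an assumed analogous helper implementing Equations (5)–(7) between consecutive frames, and the final resize to 32 × 32 pixels matches the image size used later in the experiments.
```python
import numpy as np
import cv2

def build_spmf(sequence, d_max, c_min, c_max):
    """sequence: (N, J, 3) array of skeleton frames F_1..F_N; returns one SPMF image."""
    columns = []
    for t in range(len(sequence) - 1):
        columns.append(pose_feature(sequence[t], d_max, c_min, c_max))       # PF^t
        columns.append(motion_feature(sequence[t], sequence[t + 1],
                                      d_max, c_min, c_max))                  # MF^{t->t+1}
    columns.append(pose_feature(sequence[-1], d_max, c_min, c_max))          # PF^N
    spmf = np.concatenate(columns, axis=1)       # single color image for the whole sequence
    return cv2.resize(spmf, (32, 32), interpolation=cv2.INTER_AREA)
```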
Figure 4 (top row) shows some SPMFs obtained from the MSR Action3D dataset [59], in which all images are resized to 32 × 32 pixels. Before computing the SPMF, a Savitzky-Golay smoothing filter [37,76] is applied to reduce the effect of noise on the skeletal data. In the experiments, we use the filter
$$f_t = \frac{-3c_{t-2} + 12c_{t-1} + 17c_t + 12c_{t+1} - 3c_{t+2}}{35}, \qquad (9)$$
where $c_t$ denotes the skeleton joint coordinates of frame $F_t$ $(t = 1, 2, \ldots, N)$ and $f_t$ denotes the filtered result. This filter design is described in detail in Appendix A.
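Assuming SciPy is available, this smoothing step can be reproduced with the library's Savitzky-Golay implementation: the 5-point quadratic filter of Equation (9) corresponds to window_length=5 and polyorder=2, applied along the temporal axis of the joint coordinates (a sketch, not the authors' code).
```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_skeleton(sequence):
    """sequence: (N, J, 3) array of raw joint coordinates over N frames."""
    # Filter each joint coordinate along the temporal axis only.
    return savgol_filter(sequence, window_length=5, polyorder=2, axis=0)
```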

3.2. Enhanced-SPMF: Building Enhanced Action Map

The skeleton-based representations obtained by Equation (8) mainly reflect the spatio–temporal distribution of skeleton joints. Visualizing these representations, we observe that they tend to be low-contrast images, as shown in Figure 4 (top row). In this case, a color enhancement method can be useful for increasing contrast and highlighting the texture and edges of the motion maps. Therefore, it is necessary to enhance the local features of the generated color images after encoding. Adaptive Histogram Equalization (AHE) [49] is a common approach for this task and is capable of enhancing the local features of an image. Mathematically, let I be a given digital image, represented as an $r \times c$ matrix of integer pixels with intensity levels in the range $[0, L-1]$. The histogram of image I is defined by
$$H_k = n_k, \qquad (10)$$
where $n_k$ is the number of pixels in I with intensity $k$. The probability of occurrence of intensity level $k$ in I can be estimated by
$$p_k = \frac{n_k}{r \times c}, \quad (k = 0, 1, 2, \ldots, L-1). \qquad (11)$$
The histogram-equalized image is obtained by transforming the pixel intensities $n$ of I with the function
$$T(n) = \mathrm{floor}\left( (L-1) \sum_{k=0}^{n} p_k \right), \quad (n = 0, 1, 2, \ldots, L-1). \qquad (12)$$
This Histogram Equalization (HE) method increases the global contrast of the image; however, it cannot increase the local contrast. To overcome this limitation, the image is divided into R regions and HE is applied in each of these regions. This technique is called the Adaptive Histogram Equalization (AHE) algorithm [49]. The bottom row of Figure 4 shows samples of the enhanced motion maps with R = 8 on 32 × 32 images, which we refer to as the Enhanced-SPMF, for some actions from the MSR Action3D dataset [59].
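A hedged sketch of this enhancement step is given below. It relies on scikit-image's equalize_adapthist, which implements CLAHE, a clip-limited variant of AHE; the tile size (here one eighth of the image side) and the uint8 round trip are our own assumptions, since the paper does not specify the exact implementation used.
```python
import numpy as np
from skimage import exposure

def enhance_spmf(spmf):
    """spmf: (H, W, 3) uint8 action map; returns the Enhanced-SPMF as uint8."""
    img = spmf.astype(np.float64) / 255.0
    # Region-wise histogram equalization; kernel_size controls the tile size.
    enhanced = exposure.equalize_adapthist(img, kernel_size=spmf.shape[0] // 8)
    return (enhanced * 255).astype(np.uint8)
```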

3.3. Deep Learning Model

3.3.1. Densely Connected Convolutional Networks

DenseNet [50], considered the current state-of-the-art CNN architecture, has some interesting properties. In this architecture [50], each layer is connected to all the others within a dense block, so all layers can access the feature maps of their preceding layers. Besides, each layer receives direct information flow from the loss function through shortcut connections. These properties make DenseNet [50] less prone to overfitting in supervised learning problems. Mathematically, traditional CNN architectures, e.g., AlexNet [52] or VGGNet [54], connect the output feature maps $x_{l-1}$ of the $(l-1)$-th layer as input to the $l$-th layer and try to learn a mapping function
$$x_l = H_l(x_{l-1}), \qquad (13)$$
where $H_l(\cdot)$ is a non-linear transformation, usually implemented as a series of operations such as Convolution (Conv.), Rectified Linear Unit (ReLU) [77], Pooling [78], and Batch Normalization (BN) [79]. When increasing the depth of the network, the training process becomes difficult due to the vanishing-gradient problem and the degradation phenomenon [58] (please see Appendix B for more details). To solve these problems, He et al. introduced ResNet [56]. The key idea behind the ResNet architecture [56] is the presence of shortcut connections that bypass the non-linear transformations $H_l(\cdot)$ with an identity function $\mathrm{id}(x) = x$. This way, each ResNet building block [56] produces a feature map $x_l$ by performing the computation
$$x_l = H_l(x_{l-1}) + x_{l-1}. \qquad (14)$$
Inspired by the philosophy of ResNet [56] and to maximize the information flow through layers, Huang et al. proposed DenseNet [50] with a simple connectivity pattern: the $l$-th layer in a dense block receives the feature maps of all preceding layers as inputs. That means
$$x_l = H_l([x_0, x_1, x_2, \ldots, x_{l-1}]), \qquad (15)$$
where $[x_0, x_1, x_2, \ldots, x_{l-1}]$ is a single tensor constructed by concatenating the output feature maps of the previous layers. Additionally, all layers in the architecture receive direct supervision signals from the loss function through the shortcut connections. In this manner, the network is easy to optimize and resistant to overfitting. In DenseNet [50], multiple dense blocks are connected via transition layers. Each transition layer consists of a convolutional layer followed by an average pooling layer that changes the size of the feature maps (the concatenation operation used in Equation (15) is not viable when the size of the feature maps changes). Each layer within a dense block produces k feature maps, where the parameter k is called the "growth rate" of the network. The non-linear function $H_l(\cdot)$ in the original work [50] is a composite function of three consecutive operations: BN-ReLU-Conv.
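The connectivity pattern of Equation (15) can be written compactly in Keras, the framework used later in Section 4.2. The following sketch is illustrative only: it uses the original BN-ReLU-Conv composite function, and the layer counts and filter sizes are placeholders rather than the configurations evaluated in this paper.
```python
from tensorflow.keras import layers

def dense_block(x, num_layers, growth_rate):
    """Each layer receives the concatenation of all preceding feature maps."""
    for _ in range(num_layers):
        h = layers.BatchNormalization()(x)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, kernel_size=3, padding='same')(h)
        x = layers.Concatenate()([x, h])     # x_l = H_l([x_0, x_1, ..., x_{l-1}])
    return x

def transition_layer(x, num_filters):
    """Connects two dense blocks and changes the feature-map size."""
    h = layers.BatchNormalization()(x)
    h = layers.Conv2D(num_filters, kernel_size=1)(h)
    return layers.AveragePooling2D(pool_size=2)(h)
```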

3.3.2. Network Design

We design and optimize deep DenseNets [50] for learning and classifying human actions from the Enhanced-SPMFs. To study how recognition performance varies with architecture size, we explore different network configurations. The following configurations are used in our experiments: DenseNet (L = 100, k = 12), DenseNet (L = 250, k = 24) and DenseNet (L = 190, k = 40), where L is the depth of the network and k is the network growth rate. On all datasets, we use three dense blocks on 32 × 32 input images. In this design, $H_l(\cdot)$ is defined as Batch Normalization (BN) [79], followed by an advanced activation layer called the Exponential Linear Unit (ELU) [80] and a 3 × 3 Convolution (Conv.). Dropout [80] with a rate of 0.2 is used after each convolution to prevent overfitting. After the feature extraction stage, a Fully Connected (FC) layer is used for the classification task, in which the number of neurons of this FC layer is equal to the number of action classes in each dataset. The proposed networks can be trained in an end-to-end manner by gradient descent using the Adam update rule [81]. During the training stage, we minimize a cross-entropy loss function, measured by the difference between the true action label $y$ and the predicted action $\hat{y}$ produced by the networks over the training samples $\mathcal{X}$. In other words, the network is trained to solve the following optimization problem
$$\arg\min_{\mathbf{W}} \, \mathcal{L}_{\mathcal{X}}(y, \hat{y}) = \arg\min_{\mathbf{W}} \left( -\frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij} \right), \qquad (16)$$
where $\mathbf{W}$ is the set of weights optimized by the model, $M$ denotes the number of samples in the training set $\mathcal{X}$ and $C$ is the number of action classes.
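A minimal end-to-end sketch of such a network is given below, under stated assumptions: it follows the BN-ELU-Conv composite with 0.2 dropout and three dense blocks described above and minimizes the cross-entropy of Equation (16) with Adam, but the growth rate and the number of layers per block are small placeholders rather than the L = 100/190/250 configurations used in the experiments.
```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, growth_rate):
    """BN-ELU-Conv composite with dropout, densely connected to its inputs."""
    h = layers.BatchNormalization()(x)
    h = layers.ELU()(h)
    h = layers.Conv2D(growth_rate, 3, padding='same')(h)
    h = layers.Dropout(0.2)(h)
    return layers.Concatenate()([x, h])

def build_model(num_classes, growth_rate=12, layers_per_block=4):
    inputs = layers.Input(shape=(32, 32, 3))              # Enhanced-SPMF input
    x = layers.Conv2D(2 * growth_rate, 3, padding='same')(inputs)
    for block in range(3):                                # three dense blocks
        for _ in range(layers_per_block):
            x = conv_block(x, growth_rate)
        if block < 2:                                     # transition between blocks
            x = layers.Conv2D(int(x.shape[-1]) // 2, 1)(x)
            x = layers.AveragePooling2D(2)(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)  # FC classifier
    model = Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='categorical_crossentropy',        # Equation (16)
                  metrics=['accuracy'])
    return model
```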

4. Experiments

We investigate the effectiveness of the proposed approach using four public benchmark action recognition datasets, MSR Action3D [59], KARD [60], SBU Kinect Interaction [61] and NTU-RGB+D [39], comparing our method with current state-of-the-art models for each benchmark. We refer the reader to the survey by Zhang et al. [30] for a full description of current RGB-D based action recognition datasets. The detailed description of each dataset is provided in Section 4.1. The implementation and training methodology are described in Section 4.2.

4.1. Datasets and Settings

MSR Action3D dataset [59]: This Kinect 1 captured dataset contains 20 actions performed by 10 subjects. Each skeleton is composed of 20 joints. The MSR Action3D dataset [59] is challenging due to its high inter-action similarity. There are 567 action sequences in total; however, 10 sequences are not valid because their skeletons are missing. Thus, our experiments were conducted on the 557 valid sequences. We follow the standard protocol proposed by Li et al. [59]. Specifically, the whole dataset is divided into three subsets: AS1, AS2 and AS3. Table 2 provides the list of actions in each subset, in which the subjects with IDs 1, 3, 5, 7, 9 are selected for training and the remaining subjects with IDs 2, 4, 6, 8, 10 are used for testing. Very deep neural networks such as the deep DenseNet architecture require a lot of data to train and optimize. Unfortunately, there are only 557 skeleton sequences in the MSR Action3D dataset [59]. Therefore, some data augmentation techniques, i.e., random cropping, vertical flipping, and rotation with α = 90°, have been applied on this dataset to minimize overfitting. Figure 5 illustrates the three data augmentation techniques used in our experiments.
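These three augmentation operations could be realized, for example, with TensorFlow image ops as in the sketch below (our own illustration; the padding amount before the random crop is an assumption, since the paper does not give the cropping parameters).
```python
import tensorflow as tf

def augment(image):
    """image: (32, 32, 3) tensor holding one Enhanced-SPMF."""
    padded = tf.image.resize_with_crop_or_pad(image, 36, 36)   # pad before cropping
    image = tf.image.random_crop(padded, size=[32, 32, 3])     # random cropping
    image = tf.image.flip_up_down(image)                       # vertical flipping
    image = tf.image.rot90(image, k=1)                         # rotation, alpha = 90 degrees
    return image
```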
Kinect Activity Recognition Dataset (KARD) [60]: The KARD [60] is a Kinect 1 dataset that contains 18 actions and 540 video sequences in total. Each action is performed three times by each of 10 subjects. It is composed of RGB, depth and skeleton frames, in which each skeleton frame contains 15 key joints. The authors of the dataset [60] proposed to divide it into three subsets (i.e., Action Set 1, Action Set 2, and Action Set 3), as listed in Table 3. For each subset, three experiments have been proposed. Specifically, the first experiment (Experiment A) uses one-third of the dataset for training and the rest for testing; the second experiment (Experiment B) uses two-thirds of the dataset for training and the rest for testing; and the last experiment (Experiment C) uses half of the dataset for training and the other half for testing. As was the case for the MSR Action3D dataset [59], the data augmentation techniques (i.e., random cropping, vertical flipping, and rotation with α = 90°) were also applied.
SBU Kinect Interaction dataset [61]: This dataset was collected using the Kinect v1 sensor. It contains 282 skeleton sequences and 6822 frames performed by 7 participants. Each frame of the SBU Kinect dataset [61] contains the skeleton joints of the two subjects involved in an interaction, and each skeleton has 15 key joints. There are 8 interactions in total: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. This dataset is challenging because the joint coordinates exhibit low accuracy. Moreover, it contains non-periodic actions as well as very similar body movements. For instance, some pairs of actions are difficult to distinguish, such as exchanging objects vs. shaking hands or pushing vs. punching. We randomly split the whole dataset into 5 folds, in which 4 folds are used for training and the remaining fold is used for testing. It should be noted that each skeleton frame provided by the SBU dataset [61] contains two separate subjects. Therefore, we consider them as two data samples and the feature computation is conducted separately for the two skeletons. Additionally, data augmentation (i.e., random cropping, vertical flipping, and rotation with α = 90°) has also been applied on the SBU dataset [61].
NTU-RGB+D dataset [39]: This Kinect 2 captured dataset is a very large-scale RGB-D dataset. To the best of our knowledge, the NTU-RGB+D dataset [39] is currently the largest state-of-the-art benchmark dataset providing skeletal data for human action analysis. It contains more than 56 thousand video samples and 4 million frames, collected from 40 distinct subjects for 60 different action classes. The following actions are provided by the NTU-RGB+D dataset [39] (please see Figure 6 for some examples): drinking, eating, brushing teeth, brushing hair, dropping, picking up, throwing, sitting down, standing up, clapping, reading, writing, tearing up paper, wearing a jacket, taking off a jacket, wearing a shoe, taking off a shoe, putting on glasses, taking off glasses, putting on a hat/cap, taking off a hat/cap, cheering up, hand waving, kicking something, reaching into a pocket, hopping, jumping up, making/answering a phone call, playing with a phone, typing, pointing to something, taking a selfie, checking time, rubbing two hands together, bowing, shaking head, wiping face, saluting, putting palms together, crossing hands in front, sneezing/coughing, staggering, falling down, touching head, touching chest, touching back, touching neck, vomiting, fanning self, punching/slapping another person, kicking another person, pushing another person, patting another person's back, pointing to another person, hugging, giving something to another person, touching another person's pocket, handshaking, walking towards each other, and walking apart from each other. In the NTU-RGB+D dataset [39], each skeleton contains the 3D coordinates of 25 body joints. The authors [39] of this dataset suggested two evaluation criteria: Cross-Subject evaluation and Cross-View evaluation. For the Cross-Subject setting, the sequences performed by 20 subjects (with IDs 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35, and 38) are used for training and the remaining sequences are used for testing. In the Cross-View setting, the sequences provided by cameras 2 and 3 are used for training while the sequences from camera 1 are used for testing. This setting allows evaluating the ability of the proposed skeleton-based representation to recognize actions under multiple viewpoints. We do not apply any data augmentation technique on the NTU-RGB+D dataset [39] due to its very large scale.

4.2. Implementation Details

For all datasets, the proposed Enhanced-SPMF representations are computed directly from the raw skeleton sequences without using a fixed number of frames. For computational efficiency, all the image representations are resized to 32 × 32 pixels. The three network configurations, DenseNet (L = 100, k = 12), DenseNet (L = 250, k = 24) and DenseNet (L = 190, k = 40), were implemented and evaluated in Python with the support of the Keras framework using TensorFlow as the back-end. During the training stage, we use mini-batches of 32 images for all networks. The weights are initialized with the He initialization technique [82]. The Adam optimizer [81] is used with its default parameters (i.e., β1 = 0.9 and β2 = 0.999). Additionally, we use a dynamic learning rate during training: the initial learning rate is set to 0.01 and is decreased by a factor of 0.1 after every 50 epochs. All networks are trained for 300 epochs from scratch.
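The step-wise learning rate schedule described above can be expressed with a standard Keras callback, as in this brief sketch (our own code; build_model refers to the illustrative network sketched in Section 3.3.2, and the dataset tensors are placeholders).
```python
import tensorflow as tf

def step_decay(epoch):
    # Initial learning rate 0.01, divided by 10 after every 50 epochs.
    return 0.01 * (0.1 ** (epoch // 50))

lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay)

# model = build_model(num_classes=20)   # e.g., 20 action classes for MSR Action3D
# model.fit(train_images, train_labels, batch_size=32, epochs=300,
#           validation_data=(test_images, test_labels), callbacks=[lr_callback])
```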

5. Experimental Result and Analysis

5.1. Results and Comparisons with the State-of-the-Art

Results on MSR Action3D dataset: The experimental results and comparisons of the proposed method with current state-of-the-art approaches on the MSR Action3D dataset [59] are summarized in Table 4. We compare the proposed method with Bag of 3D Points [59], Depth Motion Maps [69], Bi-LSTM [72], Lie Group Representation [23], FTP-SVM [72], Hierarchical LSTM [37], ST-LSTM Trust Gates [40], Graph-Based Motion [36], ST-NBNN [70], ST-NBMIM [83], S-T Pyramid [84], Ensemble TS-LSTM v2 [71] and our previous model SPMF Inception-ResNet-222 [48], using the same evaluation protocol. The proposed DenseNet (L = 100, k = 12) and DenseNet (L = 190, k = 40) achieve average accuracies of 98.76% and 98.94%, respectively, while the best recognition accuracy is obtained by the proposed DenseNet (L = 250, k = 24) with a total average accuracy of 99.10%. This result outperforms many previous approaches [23,36,37,40,59,69,70,72,83,84], demonstrating the superiority of the proposed method. Figure 7 (first row) shows the learning curves of the proposed DenseNets on the AS1 subset of the MSR Action3D dataset [59]. The recognition accuracy for each action class in the AS1 subset obtained by the DenseNet (L = 250, k = 24) is provided in Figure 8 via its confusion matrix.
Results on KARD dataset: We performed a total of nine experiments, covering experiments A, B, and C on each of the three subsets of the KARD dataset [60]. Table 5 summarizes the obtained results. We compute the average recognition accuracy over the three experiments and compare it with existing techniques, including Hand-crafted Features [60], Posture Feature+Multi-class SVM [85], and Key Postures+Multi-class SVM [86]. As can be seen in Table 5, the proposed DenseNet (L = 250, k = 24) improves the state-of-the-art accuracy by 9.15% over Hand-crafted Features [60], 2.78% over Posture Feature+Multi-class SVM [85] and 0.68% over Key Postures+Multi-class SVM [86]. This result confirms that the proposed deep learning framework trained on the Enhanced-SPMFs achieves better action recognition performance than hand-crafted feature based approaches.
Results on SBU Kinect Interaction dataset: As reported in Table 6, the proposed DenseNet (L = 250, k = 40) achieved an accuracy of 97.86% and outperforms many existing state-of-the-art approaches, including Raw Skeleton [61], Joint Features [61], HBRNN [37], CHARM [87], Deep LSTM [41], Joint Features [88], ST-LSTM [40], Co-occurrence+Deep LSTM [41], STA-LSTM [89], ST-LSTM+Trust Gates [40], ST-NBMIM [83], Clips+CNN+MTLN [90], Two-stream RNN [91], and the GCA-LSTM network [92]. Using only the skeleton modality, the proposed method outperforms hand-crafted feature based approaches such as Raw Skeleton [61] and Joint Features [61], as well as recent state-of-the-art RNN-based approaches [37,40,41,89,91,92]. In particular, the proposed method achieves a significant accuracy gain of 2.96% compared to the nearest competitor, the GCA-LSTM network [92]. This result demonstrates that the proposed deep learning framework is able to learn the discriminative spatio–temporal features of skeleton joints contained in the proposed motion representation for the classification task.
Results on NTU-RGB+D dataset: For the NTU-RGB+D dataset [39], the best configuration, DenseNet (L = 250, k = 40), achieves an accuracy of 80.11% on the Cross-Subject evaluation and 86.82% on the Cross-View evaluation, as summarized in Table 7. These results demonstrate the effectiveness of the proposed representation and deep learning framework, since they surpass previous state-of-the-art techniques such as Lie Group Representation [23], Hierarchical RNN [37], Dynamic Skeletons [93], Two-Layer P-LSTM [39], ST-LSTM Trust Gates [40], Geometric Features [74], Two-Stream RNN [91], Enhanced Skeleton [94], Lie Group Skeleton+CNN [95], and GCA-LSTM [92]. The experimental results also show that the proposed method leads to better overall action recognition performance than our previous models, including the Skeleton-based ResNet [51] and SPMF Inception-ResNet-222 [48]. With a high recognition rate on the Cross-View evaluation (86.82%), where the sequences provided by cameras 2 and 3 are used for training and the sequences from camera 1 are used for testing, the proposed method shows its effectiveness in dealing with the view-independent action recognition problem. Figure 7 (last row) shows the training loss and test accuracy of the DenseNet (L = 250, k = 24) on this dataset.

5.2. An Ablation Study on the Proposed Enhanced-SPMF Representation

We believe that the use of the AHE algorithm [49] and the Savitzky-Golay smoothing filter [37,76] makes the proposed representation more discriminative, which improves recognition accuracy. To verify this hypothesis, we carried out an ablation study of the Enhanced-SPMF representation on the SBU Kinect Interaction dataset [61]. Specifically, we trained the proposed DenseNet (L = 250, k = 24) on both the SPMFs and the Enhanced-SPMFs. During training, the same hyper-parameters and training methodology were applied. The experimental results indicate that the proposed deep network achieves better recognition accuracy when trained on the Enhanced-SPMFs. As reported in Figure 9, applying the AHE algorithm [49] and the Savitzky–Golay smoothing filter [37,76] improves the accuracy by 4.09%. This result validates the hypothesis above.

5.3. Visualization of Deep Feature Maps

Different action classes have different discriminative characteristics. To better understand the internal operation of the proposed deep networks and to study what they learned from the skeleton-based representation, we input different Enhanced-SPMFs corresponding to different action classes of the MSR Action3D dataset [59] to the DenseNet ( L = 100 , k = 12 ) and visualize the individual feature maps learned by the network at the end of a dense block (intermediate layer). We observe that the designed network is able to extract discriminative features from the Enhanced-SPMF representations. This is expressed through the color of each learned feature map, as can be seen in Figure 10. These discriminative features play a key role in classifying actions.

5.4. Computational Efficiency Evaluation

In this section, we take the AS1 subset of the MSR Action3D dataset [59] and the DenseNet (L = 100, k = 12) to evaluate the computational efficiency of the proposed method. Figure 11 illustrates the three main stages of the deep learning framework for learning and recognizing actions from skeleton sequences: an encoding process from input skeleton sequences to color images (Stage 1), a supervised training stage (Stage 2), and an inference stage (Stage 3). The implementation is in Python/Keras; when training on a single GeForce GTX 1080 Ti GPU, the proposed deep network has only 6.0M parameters and takes less than six hours to reach convergence. The latency for predicting the action of a new skeleton sequence (including encoding it to a color image, executed on a CPU) is about $74.8 \times 10^{-3}$ seconds per sequence. Additionally, it should be noted that the computation of the Enhanced-SPMFs can be implemented and optimized on a GPU for real-time applications. Please see Table 8 for further details. This result verifies the effectiveness of the proposed learning framework in terms of computational cost.

5.5. Limitations

The use of the Savitzky-Golay filter [76] helps reduce the effect of noise on the raw skeleton sequences. However, the proposed approach cannot overcome the problem of missing data. In other words, since the Enhanced-SPMF is a global representation of the whole skeleton sequence, data errors in local fragments of the input sequences could reduce the recognition rate. Another open problem of the proposed approach is how to cope with the Online Action Recognition (OAR) task, i.e., how to detect and recognize human actions from unsegmented streams in a continuous manner, where the boundaries between different kinds of actions within the stream are unknown. A common solution for OAR is sliding window based methods [97,98], which consider the temporal coherence within the window for prediction. We could also apply this idea to the current problem: during the online inference phase, a sliding window is run over the original skeleton sequences or over the image-coded representations (i.e., Enhanced-SPMFs), and the action is then predicted by the pretrained deep learning model, as shown in Figure 11 (Stage 3). However, the performance of this approach is sensitive to the window size: a window that is either too large or too small could lead to a significant drop in recognition performance. Another solution is to use Temporal Attention Networks (TANs) [99,100,101,102] that incorporate temporal attention models for video-based action recognition.
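As a rough illustration of the sliding-window idea discussed above (not an evaluated component of this work), the following sketch scores each window of an incoming skeleton stream with the pretrained model, reusing the encoding helpers sketched in Section 3; the window length and stride are placeholders, and as noted the results would be sensitive to these choices.
```python
import numpy as np

def online_predict(stream, model, window=64, stride=16, **encode_args):
    """stream: (T, J, 3) array of skeleton frames arriving over time."""
    predictions = []
    for start in range(0, len(stream) - window + 1, stride):
        segment = smooth_skeleton(stream[start:start + window])        # noise reduction
        image = enhance_spmf(build_spmf(segment, **encode_args))        # Enhanced-SPMF
        probs = model.predict(image[np.newaxis].astype(np.float32) / 255.0)
        predictions.append((start, int(np.argmax(probs))))              # (frame index, label)
    return predictions
```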

6. Conclusions

In this paper, we have presented an efficient and effective deep learning framework for 3D human action recognition from skeleton sequences. A novel motion representation, termed Enhanced-SPMF, which captures the spatio–temporal information of skeleton movements and transforms it into color images, has been proposed. We exploited the Adaptive Histogram Equalization (AHE) technique to enhance the local textures of the color images and generate more discriminative features for the learning and classification tasks. Different Deep Convolutional Neural Networks (D-CNNs) based on the DenseNet architecture have been designed and optimized to learn and recognize actions from the proposed representation in an end-to-end manner. Extensive empirical evaluations on four challenging public datasets demonstrate the effectiveness of the proposed approach on individual actions, interactions, multi-view and large-scale datasets. In particular, we also show that the proposed method is invariant to viewpoint changes and requires low computational cost for training and inference. We hope that this study opens up a new door to exploiting the great potential of skeletal data, helping to address the current challenges in building real-world action recognition applications.

Author Contributions

Conceptualization, methodology, software, validation, investigation, resources, visualization, writing—original draft preparation, H.H.P.; Data curation, formal analysis, writing—review and editing, all authors; Supervision; project administration; funding acquisition, A.C., P.Z., L.K. and S.A.V.

Funding

Sergio A. Velastin is grateful for funding received from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement No. 600371, el Ministerio de Economía, Industria y Competitividad (COFUND2013-51509), el Ministerio de Educación, Cultura y Deporte (CEI-15-17) and Banco Santander.

Acknowledgments

This research was carried out at the Cerema Research Center and the Informatics Research Institute of Toulouse, Paul Sabatier University, France. The authors would like to express their thanks to all the people who have made helpful comments and suggestions on a previous draft. S.A. Velastin is grateful for funding received from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement No. 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509) and Banco Santander.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Savitzky-Golay Smoothing Filter

The Savitzky-Golay (S-G) filter is a low-pass filter based on local least-squares polynomial approximation that is often used to smooth noisy data. The 3D skeleton joints obtained from depth cameras can be considered as a series of equally spaced data in the time domain; applying the S-G filter to the raw skeletal data helps reduce the level of noise while maintaining the 3D geometric characteristics of the input sequences.
Consider a sequence of $N = 2M + 1$ input data points $x[n]$ centered at $n = 0$, given by
$$\mathbf{x} = [x_{-M}, \ldots, x_{-1}, x_0, x_1, \ldots, x_M]^T. \qquad (A1)$$
The $N$ data samples of $\mathbf{x}$ can be fitted by a polynomial
$$p(n) = \sum_{k=0}^{N} c_k n^k. \qquad (A2)$$
To best fit the given data $\mathbf{x}$, Savitzky and Golay [76] proposed a method of data smoothing based on finding the vector of polynomial coefficients $\mathbf{c} = [c_0, c_1, \ldots, c_N]^T$ that minimizes the mean-squared approximation error
$$E_N = \sum_{n=-M}^{M} \left( \sum_{k=0}^{N} c_k n^k - x[n] \right)^2. \qquad (A3)$$
To this end, one determines the set of coefficients for which the partial derivatives are equal to zero,
$$\frac{\partial E_N}{\partial c_i} = \sum_{n=-M}^{M} 2 n^i \left( \sum_{k=0}^{N} c_k n^k - x[n] \right) = 0, \quad i = 0, 1, \ldots, N. \qquad (A4)$$
Equation (A4) is equivalent to
$$\sum_{k=0}^{N} \left( \sum_{n=-M}^{M} n^{i+k} \right) c_k = \sum_{n=-M}^{M} n^i \, x[n]. \qquad (A5)$$
Define the matrix $\mathbf{A} = \{\alpha_{n,i}\}$ with elements
$$\alpha_{n,i} = n^i, \qquad (A6)$$
where $-M \le n \le M$ and $i = 0, 1, \ldots, N$. The matrix $\mathbf{A}$ is called the design matrix for the polynomial approximation problem. Note that the transpose of $\mathbf{A}$ is $\mathbf{A}^T = \{\alpha_{i,n}\}$ and the product matrix $\mathbf{B} = \mathbf{A}^T \mathbf{A}$ is a symmetric matrix with elements
$$\beta_{i,k} = \sum_{n=-M}^{M} \alpha_{i,n} \, \alpha_{n,k} = \sum_{n=-M}^{M} n^{i+k}. \qquad (A7)$$
Therefore, Equation (A5) can be rewritten in matrix form as
$$\mathbf{B} \cdot \mathbf{c} = \mathbf{A}^T \cdot \mathbf{A} \cdot \mathbf{c} = \mathbf{A}^T \cdot \mathbf{x}. \qquad (A8)$$
The polynomial coefficients can then be determined as
$$\mathbf{c} = (\mathbf{A}^T \cdot \mathbf{A})^{-1} \cdot (\mathbf{A}^T \cdot \mathbf{x}). \qquad (A9)$$
For example, for smoothing by a 5-point quadratic polynomial ($N = 5$, $M = 2$, $n = -2, -1, 0, 1, 2$), the $t$-th filtering result $y_t$ is given by
$$y_t = \frac{-3x_{t-2} + 12x_{t-1} + 17x_t + 12x_{t+1} - 3x_{t+2}}{35}. \qquad (A10)$$
Equation (A10) was used in our experiments to reduce the effect of noise on the raw skeleton data.
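As a quick numerical check of Equations (A8)–(A10) (our own sketch, for the quadratic case actually used here, i.e., a degree-2 fit over five points), the first row of $(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$ reproduces the smoothing weights $(-3, 12, 17, 12, -3)/35$:
```python
import numpy as np

n = np.arange(-2, 3)                       # n = -2, -1, 0, 1, 2  (M = 2)
A = np.vander(n, 3, increasing=True)       # design matrix with columns n^0, n^1, n^2
H = np.linalg.inv(A.T @ A) @ A.T           # maps the samples x to the coefficients c
print(H[0] * 35)                           # c_0 weights: [-3. 12. 17. 12. -3.]
```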

Appendix B. Degradation Phenomenon in Training Very Deep Neural Networks

Very deep neural networks have demonstrated high performance on many visual tasks [54,55,56,57]. However, they are very difficult to optimize. One of the main challenges in training deeper networks is the vanishing and exploding gradient problem [103]. Specifically, when the network is deep enough, the supervision signals from the output layer can be completely attenuated, or can explode, on their way back towards the earlier layers. In that case, the network cannot learn its parameters effectively. These obstacles can be addressed by recent advances in deep learning such as Normalized Initialization [104] or Batch Normalization [79]. However, when deep networks start converging, a degradation phenomenon can occur: the training and test errors increase when more layers are added to a deep architecture. This is called the degradation phenomenon. Figure A1 shows an experimental result [56] related to this phenomenon.
Figure A1. Degradation phenomenon during training D-CNNs. (a) Training error and (b) test error on CIFAR-10 [105] with 20-layer and 56-layer CNNs reported by He et al. [56]. The deeper network has higher error for both training and test phases. Figure was reproduced from the work of He et al. [56] and used with permission from IEEE.
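Architectures with short skip paths, such as ResNet [56] and the DenseNet [50] adopted in this work, were designed to alleviate this degradation effect by letting gradients reach early layers directly. The Keras sketch below is a minimal illustration of the dense-block pattern (assuming a TensorFlow/Keras environment; the layer sizes and hyperparameters are arbitrary and not those of our trained models): each layer receives the concatenated feature maps of all preceding layers, which keeps the gradient paths short.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    """Minimal DenseNet-style block: every layer sees all preceding feature maps."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, kernel_size=3, padding="same")(y)
        x = layers.Concatenate()([x, y])      # dense (skip) connection
    return x

inputs = tf.keras.Input(shape=(32, 32, 3))
stem = layers.Conv2D(24, kernel_size=3, padding="same")(inputs)
outputs = dense_block(stem)
model = tf.keras.Model(inputs, outputs)
```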

References

  1. Aggarwal, J.; Ryoo, M. Human Activity Analysis: A Review. ACM Comput. Surv. 2011, 43, 16. [Google Scholar] [CrossRef]
  2. Boiman, O.; Irani, M. Detecting Irregularities in Images and in Video. Int. J. Comput. Vis. 2007, 74, 17–31. [Google Scholar] [CrossRef] [Green Version]
  3. Lin, W.; Sun, M.T.; Poovandran, R.; Zhang, Z. Human activity recognition for video surveillance. In Proceedings of the IEEE International Symposium on Circuits and Systems, Seattle, WA, USA, 18 May–21 August 2008. [Google Scholar]
  4. Gupta, A.; Kembhavi, A.; Davis, L.S. Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 1775–1789. [Google Scholar] [CrossRef]
  5. Yao, B.; Fei-Fei, L. Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1691–1703. [Google Scholar]
  6. Dagli, I.; Brost, M.; Breuel, G. Action Recognition and Prediction for Driver Assistance Systems Using Dynamic Belief Networks. In Agent Technologies, Infrastructures, Tools, and Applications for E-Services; Springer: Berlin/Heidelberg, Germany, 2003; pp. 179–194. [Google Scholar]
  7. Fridman, L.; Brown, D.E.; Glazer, M.; Angell, W.; Dodd, S.; Jenik, B.; Terwilliger, J.; Kindelsberger, J.; Ding, L.; Seaman, S.; et al. MIT Autonomous Vehicle Technology Study: Large-Scale Deep Learning Based Analysis of Driver Behavior and Interaction with Automation. arXiv 2017, arXiv:1711.06976. [Google Scholar]
  8. Poppe, R. A survey on vision-based human action recognition. Image Vis. Comput. 2010, 28, 976–990. [Google Scholar] [CrossRef]
  9. Weinland, D.; Ronfard, R.; Boyer, E. A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 2011, 115, 224–241. [Google Scholar] [CrossRef] [Green Version]
  10. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
  11. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef] [Green Version]
  12. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  13. Laptev, I.; Marszalek, M.; Schmid, C.; Rozenfeld, B. Learning realistic human actions from movies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
  14. Klaeser, A.; Marszalek, M.; Schmid, C. A Spatio-Temporal Descriptor Based on 3D-Gradients. In Proceedings of the British Machine Vision Conference, Leeds, UK, 1–4 September 2008; pp. 1–10. [Google Scholar]
  15. Dollar, P.; Rabaud, V.; Cottrell, G.; Belongie, S. Behavior recognition via sparse spatio-temporal features. In Proceedings of the IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Breckenridge, CO, USA, 7 January 2005; pp. 65–72. [Google Scholar]
  16. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  17. Willems, G.; Tuytelaars, T.; Van Gool, L. An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2008; pp. 650–663. [Google Scholar]
  18. Zhang, Z. Microsoft Kinect Sensor and Its Effect. IEEE MultiMed. 2012, 19, 4–10. [Google Scholar] [CrossRef] [Green Version]
  19. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1290–1297. [Google Scholar]
  20. Oreifej, O.; Liu, Z. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 716–723. [Google Scholar]
  21. Xia, L.; Aggarwal, J.K. Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2834–2841. [Google Scholar]
  22. Rahmani, H.; Mahmood, A.; Huynh, D.Q.; Mian, A. HOPC: Histogram of Oriented Principal Components of 3D Pointclouds for Action Recognition. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2014; pp. 742–757. [Google Scholar]
  23. Vemulapalli, R.; Arrate, F.; Chellappa, R. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 588–595. [Google Scholar]
  24. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Learning Actionlet Ensemble for 3D Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 914–927. [Google Scholar] [CrossRef] [PubMed]
  25. Yang, X.; Tian, Y. Super Normal Vector for Human Activity Recognition with Depth Cameras. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1028–1039. [Google Scholar] [CrossRef]
  26. Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1297–1304. [Google Scholar]
  27. Ye, M.; Shen, Y.; Du, C.; Pan, Z.; Yang, R. Real-Time Simultaneous Pose and Shape Estimation for Articulated Objects Using a Single Depth Camera. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1517–1532. [Google Scholar] [CrossRef] [Green Version]
  28. Gu, J.; Ding, X.; Wang, S.; Wu, Y. Action and Gait Recognition From Recovered 3-D Human Joints. IEEE Trans. Syst. Man Cybern. Part B 2010, 40, 1021–1033. [Google Scholar]
  29. Johansson, G. Visual motion perception. Sci. Am. 1975, 232, 76–89. [Google Scholar] [CrossRef]
  30. Zhang, J.; Li, W.; Ogunbona, P.O.; Wang, P.; Tang, C. RGB-D-based action recognition datasets: A survey. Pattern Recognit. 2016, 60, 86–105. [Google Scholar] [CrossRef] [Green Version]
  31. Xia, L.; Chen, C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27. [Google Scholar]
  32. Chaudhry, R.; Ofli, F.; Kurillo, G.; Bajcsy, R.; Vidal, R. Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 471–478. [Google Scholar]
  33. Ding, W.; Liu, K.; Fu, X.; Cheng, F. Profile HMMs for skeleton-based human action recognition. Signal Process. Image Commun. 2016, 42, 109–119. [Google Scholar] [CrossRef]
  34. Han, L.; Wu, X.; Liang, W.; Hou, G.; Jia, Y. Discriminative human action recognition in the learned hierarchical manifold space. Image Vis. Comput. 2010, 28, 836–849. [Google Scholar] [CrossRef]
  35. Luo, J.; Wang, W.; Qi, H. Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps. In Proceedings of the IEEE International Conference on Computer Vision, Portland, OR, USA, 23–28 June 2013; pp. 1809–1816. [Google Scholar]
  36. Wang, P.; Yuan, C.; Hu, W.; Li, B.; Zhang, Y. Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 370–385. [Google Scholar]
  37. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar]
  38. Veeriah, V.; Zhuang, N.; Qi, G. Differential Recurrent Neural Networks for Action Recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 4041–4049. [Google Scholar]
  39. Shahroudy, A.; Liu, J.; Ng, T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  40. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 816–833. [Google Scholar]
  41. Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 3697–3703. [Google Scholar]
  42. Liu, J.; Wang, G.; Hu, P.; Duan, L.; Kot, A.C. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3671–3680. [Google Scholar]
  43. Lv, F.; Nevatia, R. Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 359–372. [Google Scholar]
  44. Schuster, M.; Paliwal, K. Bidirectional Recurrent Neural Networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
  45. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  46. Graves, A.; Fernández, S.; Schmidhuber, J. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. In Artificial Neural Networks: Formal Models and Their Applications; Springer: Berlin/Heidelberg, Germany, 2005; pp. 799–804. [Google Scholar]
  47. Sainath, T.N.; Vinyals, O.; Senior, A.; Sak, H. Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, Australia, 19–24 April 2015; pp. 4580–4584. [Google Scholar]
  48. Pham, H.; Khoudour, L.; Crouzil, A.; Zegers, P.; Velastin, S.A. Skeletal Movement to Color Map: A Novel Representation for 3D Action Recognition with Inception Residual Networks. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 3483–3487. [Google Scholar]
  49. Pizer, S.M.; Amburn, E.P.; Austin, J.D.; Cromartie, R.; Geselowitz, A.; Greer, T.; ter Haar Romeny, B.; Zimmerman, J.B.; Zuiderveld, K. Adaptive histogram equalization and its variations. Comput. Vis. Graph. Image Process. 1987, 39, 355–368. [Google Scholar] [CrossRef]
  50. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  51. Pham, H.; Khoudour, L.; Crouzil, A.; Zegers, P.; Velastin, S.A. Learning to recognise 3D human action from a new skeleton-based representation using deep convolutional neural networks. IET Comput. Vis. 2019, 13, 319–328. [Google Scholar] [CrossRef]
  52. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  53. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 1725–1732. [Google Scholar]
  54. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  55. Szegedy, C.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  57. Telgarsky, M. Benefits of depth in neural networks. arXiv 2016, arXiv:1602.04485. [Google Scholar]
  58. He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5353–5360. [Google Scholar]
  59. Li, W.; Zhang, Z.; Liu, Z. Action recognition based on a bag of 3D points. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 9–14. [Google Scholar]
  60. Gaglio, S.; Re, G.L.; Morana, M. Human Activity Recognition Process Using 3-D Posture Data. IEEE Trans. Hum.-Mach. Syst. 2015, 45, 586–597. [Google Scholar] [CrossRef]
  61. Yun, K.; Honorio, J.; Chattopadhyay, D.; Berg, T.L.; Samaras, D. Two-person interaction detection using body-pose features and multiple instance learning. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 28–35. [Google Scholar]
  62. Han, F.; Reily, B.; Hoff, W.; Zhang, H. Space-time representation of people based on 3D skeletal data: A review. Comput. Vis. Image Underst. 2017, 158, 85–105. [Google Scholar] [CrossRef] [Green Version]
  63. Berndt, D.J.; Clifford, J. Using Dynamic Time Warping to Find Patterns in Time Series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining; AAAI Press: Seattle, WA, USA, 1994; pp. 359–370. [Google Scholar]
  64. Eddy, S.R. Hidden Markov models. Curr. Opin. Struct. Biol. 1996, 6, 361–365. [Google Scholar] [CrossRef]
  65. Kirk, A.G.; O’Brien, J.F.; Forsyth, D.A. Skeletal parameter estimation from optical motion capture data. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; Volume 2, p. 1185. [Google Scholar]
  66. Cao, Z.; Simon, T.; Wei, S.; Sheikh, Y. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1302–1310. [Google Scholar]
  67. Bearman, A.; Dong, C. Human Pose Estimation and Activity Classification Using Convolutional Neural Networks. CS231n Course Project Reports. 2015. Available online: http://www.catherinedong.com/pdfs/231n-paper.pdf (accessed on 22 April 2019).
  68. Graves, A. Supervised Sequence Labelling with Recurrent Neural Networks; Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2012; Volume 385. [Google Scholar]
  69. Chen, C.; Liu, K.; Kehtarnavaz, N. Real-time Human Action Recognition Based on Depth Motion Maps. J. Real-Time Image Process. 2016, 12, 155–163. [Google Scholar] [CrossRef]
  70. Weng, J.; Weng, C.; Yuan, J. Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 445–454. [Google Scholar]
  71. Lee, I.; Kim, D.; Kang, S.; Lee, S. Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1012–1020. [Google Scholar]
  72. Tanfous, A.B.; Drira, H.; Amor, B.B. Coding Kendall’s Shape Trajectories for 3D Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 2840–2849. [Google Scholar]
  73. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436. [Google Scholar] [CrossRef]
  74. Zhang, S.; Liu, X.; Xiao, J. On Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Santa Rosa, CA, USA, 24–31 March 2017; pp. 148–157. [Google Scholar]
  75. Eitel, A.; Springenberg, J.T.; Spinello, L.; Riedmiller, M.; Burgard, W. Multimodal deep learning for robust RGB-D object recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; pp. 681–687. [Google Scholar]
  76. Savitzky, A.; Golay, M.J. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 1964, 36, 1627–1639. [Google Scholar] [CrossRef]
  77. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; Volume 15, pp. 315–323. [Google Scholar]
  78. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
  79. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  80. Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by Exponential Linear Units (ELUs). arXiv 2015, arXiv:1511.07289. [Google Scholar]
  81. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  82. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1026–1034. [Google Scholar]
  83. Weng, J.; Weng, C.; Yuan, J.; Liu, Z. Discriminative Spatio-Temporal Pattern Discovery for 3D Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 1077–1089. [Google Scholar] [CrossRef]
  84. Xu, H.; Chen, E.; Liang, C.; Qi, L.; Guan, L. Spatio-Temporal Pyramid Model based on depth maps for action recognition. In Proceedings of the IEEE 17th International Workshop on Multimedia Signal Processing, Xiamen, China, 19–21 October 2015; pp. 1–6. [Google Scholar]
  85. Cippitelli, E.; Gasparrini, S.; Gambi, E.; Spinsante, S. A human activity recognition system using skeleton data from RGB-D sensors. Comput. Intell. Neurosci. 2016, 2016. [Google Scholar] [CrossRef]
  86. Ling, J.; Tian, L.; Li, C. 3D Human Activity Recognition Using Skeletal Data from RGB-D Sensors. In Advances in Visual Computing; Springer International Publishing: Cham, Switzerland, 2016; pp. 133–142. [Google Scholar]
  87. Li, W.; Wen, L.; Chuah, M.C.; Lyu, S. Category-Blind Human Action Recognition: A Practical Recognition System. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 4444–4452. [Google Scholar]
  88. Ji, Y.; Ye, G.; Cheng, H. Interactive body part contrast mining for human interaction recognition. In Proceedings of the IEEE International Conference on Multimedia and Expo Workshops, Chengdu, China, 14–18 July 2014; pp. 1–6. [Google Scholar]
  89. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 1, pp. 4263–4270. [Google Scholar]
  90. Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A New Representation of Skeleton Sequences for 3D Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4570–4579. [Google Scholar]
  91. Wang, H.; Wang, L. Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3633–3642. [Google Scholar]
  92. Liu, J.; Wang, G.; Duan, L.; Abdiyeva, K.; Kot, A.C. Skeleton-Based Human Action Recognition with Global Context-Aware Attention LSTM Networks. IEEE Trans. Image Process. 2018, 27, 1586–1599. [Google Scholar] [CrossRef] [PubMed]
  93. Hu, J.; Zheng, W.; Lai, J.; Zhang, J. Jointly Learning Heterogeneous Features for RGB-D Activity Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2186–2200. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  94. Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362. [Google Scholar] [CrossRef]
  95. Rahmani, H.; Bennamoun, M. Learning Action Recognition Model from Depth and Skeleton Videos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5833–5842. [Google Scholar]
  96. Tas, Y.; Koniusz, P. CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps. In Proceedings of the British Machine Vision Conference 2018, Newcastle, UK, 3–6 September 2018; p. 158. [Google Scholar]
  97. Kulkarni, K.; Evangelidis, G.; Cech, J.; Horaud, R. Continuous Action Recognition Based on Sequence Alignment. Int. J. Comput. Vis. 2015, 112, 90–114. [Google Scholar] [CrossRef]
  98. Kviatkovsky, I.; Rivlin, E.; Shimshoni, I. Online action recognition using covariance of shape and motion. Comput. Vis. Image Underst. 2014, 129, 15–26. [Google Scholar] [CrossRef]
  99. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2204–2212. [Google Scholar]
  100. Xu, K.; Ba, J.L.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  101. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025. [Google Scholar]
  102. Zang, J.; Wang, L.; Liu, Z.; Zhang, Q.; Hua, G.; Zheng, N. Attention-Based Temporal Weighted Convolutional Neural Network for Action Recognition. In Artificial Intelligence Applications and Innovations; Springer International Publishing: Cham, Switzerland, 2018; pp. 97–108. [Google Scholar] [Green Version]
  103. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  104. LeCun, Y.; Bottou, L.; Orr, G.B.; Müller, K.R. Efficient backprop. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998; pp. 9–50. [Google Scholar]
  105. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
Figure 1. Overview of the proposed Enhanced-SPMF representation. Each skeleton sequence is transformed into a single RGB image, a motion map called SPMF [48]. A color enhancement technique [49] is then used to highlight the motion map and form the Enhanced-SPMF, which is learned and classified by a deep learning model. Before computing the SPMF, a smoothing filter is applied to reduce the effect of noise on the skeletal data. Section 3 describes the details of the proposed approach.
Figure 2. Schematic overview of the proposed approach. Each skeleton sequence is encoded in a single color image via a skeleton-based representation called SPMF. Each SPMF is built from pose vectors (PFs) and motion vectors (MFs) extracted from skeleton joints. They are then enhanced by an Adaptive Histogram Equalization (AHE) [49] algorithm and fed to a D-CNN for learning discriminative features and performing action classification. To achieve high-level learning performance during the training phase, we design and optimize different D-CNN models based on deep DenseNet [50], a recent state-of-the-art architecture for image recognition tasks.
Figure 3. Illustration of the encoding process that converts joint-joint distance values to color points using a JET colormap.
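As a minimal illustration of this encoding step, the Python sketch below maps joint-joint distances to RGB color points with Matplotlib's JET colormap; the helper name and the min–max rescaling to [0, 1] are choices made for this sketch rather than the exact implementation used in the paper.

```python
import numpy as np
from matplotlib import cm

def distances_to_jet(distances):
    """Map joint-joint distance values to RGB color points with the JET colormap."""
    d = np.asarray(distances, dtype=float)
    d_norm = (d - d.min()) / (d.max() - d.min() + 1e-8)   # rescale to [0, 1]
    return cm.jet(d_norm)[..., :3]                         # RGBA -> RGB

# Example: distances between one joint pair over five frames -> five color points.
rgb_points = distances_to_jet([0.12, 0.35, 0.48, 0.51, 0.20])
```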
Figure 4. Results of the skeleton-to-image mapping process. The top row shows the proposed SPMF representations obtained from some samples of the MSR Action3D dataset [59]. The change in color reflects the change of distance and orientation between the joints. The bottom row shows generated images after applying the AHE algorithm [49].
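For the enhancement step illustrated above, one possible realization is contrast-limited adaptive histogram equalization (CLAHE), a variant of AHE [49] available in scikit-image. The sketch below applies it channel-wise to a placeholder array standing in for an SPMF image; the array contents, the channel-wise application, and the clip limit are assumptions made only for this example.

```python
import numpy as np
from skimage import exposure

# Placeholder SPMF action map (H x W x 3, values in [0, 1]); a real SPMF would be used here.
spmf = np.random.rand(32, 64, 3)

# Apply CLAHE (a contrast-limited variant of AHE [49]) to each color channel to
# enhance the local patterns of the map, analogous to the Enhanced-SPMF step.
enhanced_spmf = np.stack(
    [exposure.equalize_adapthist(spmf[..., c], clip_limit=0.03) for c in range(3)],
    axis=-1,
)
```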
Figure 5. Illustration of data augmentation techniques used to generate more training samples.
Figure 6. Some action classes of the NTU-RGB+D dataset [39]. Video samples were captured concurrently by three Microsoft Kinect™ v2 sensors at 30 FPS. The 3D skeletal data contain the three-dimensional locations of 25 major body joints at each frame. Figure was reproduced from the work of Shahroudy et al. [39] and used with permission from IEEE.
Figure 7. Training curves of the proposed DenseNet (L = 250, k = 24) on the MSR Action3D [59], KARD [60], SBU Kinect Interaction [61], and NTU-RGB+D [39] datasets. Almost all of the designed networks reach their optimal weights within the first 100 epochs. The symbols k and L denote the “growth rate” and the depth of the network, respectively.
Figure 8. Confusion matrix of the proposed DenseNet (L = 250, k = 24) on the MSR Action3D/AS1 dataset. Ground-truth action labels are on the rows and predictions by the proposed method are on the columns. Readers are encouraged to zoom in to view the figure clearly.
Figure 9. Training loss and test accuracy of the proposed DenseNet (L = 100, k = 12) on the SBU dataset [61]. (a) shows the obtained result when trained on SPMFs, while (b) reports the obtained result when trained on Enhanced-SPMFs. The symbols k and L denote the “growth rate” and the depth of the network, respectively.
Figure 10. Visualization of feature maps learned by the proposed DenseNet (L = 100, k = 12) from several samples of the MSR Action3D dataset [59]. Best viewed in color.
Figure 11. Three main stages of the proposed deep learning framework for recognizing human actions from skeleton sequences.
Table 1. Summary of advantages and disadvantages of previous methods and our proposed method.
Method & Authors | Data Modalities | Year | Advantages/Disadvantages
Bag of 3D Points [59] | Depth maps | 2010 | Simple and fast/Low accuracy, viewpoint dependent.
Lie Group Representation [23] | Skeletal data | 2014 | Robust to temporal misalignment and noise/Low accuracy.
Hierarchical LSTM [37] | Skeletal data | 2015 | Fast and high accuracy/Easy to overfit.
Depth Motion Maps [69] | Depth maps | 2016 | Real-time latency/Low accuracy.
ST-LSTM Trust Gates [40] | Skeletal data | 2016 | View-invariant representation, robust to noise and occlusion/High computational cost.
Graph-Based Motion [36] | Skeletal data | 2016 | Robust to noise, high accuracy/Complex, parameter-dependent.
ST-NBNN [70] | Depth maps | 2017 | Simple and low computational cost/Parameter-dependent.
Ensemble TS-LSTM v2 [71] | Skeletal data | 2017 | High accuracy, robust to scale, rotation and translation/Data-hungry, high computational cost.
Bi-LSTM [72] | Skeletal data | 2018 | Invariant representation/Complex, low accuracy.
Our proposed method | Skeletal data | 2019 | View-invariant representation, real-time latency, high accuracy/Data-hungry, sensitive to data error of local fragments.
Table 2. The list of actions in the three subsets AS1, AS2, and AS3 of the MSR Action3D dataset [59].
AS1 | AS2 | AS3
[a02] Horizontal arm wave | [a01] High arm wave | [a06] High throw
[a03] Hammer | [a04] Hand catch | [a14] Forward kick
[a05] Forward punch | [a07] Draw x | [a15] Side kick
[a06] High throw | [a08] Draw tick | [a16] Jogging
[a10] Hand clap | [a09] Draw circle | [a17] Tennis swing
[a13] Bend | [a11] Two hand wave | [a18] Tennis serve
[a18] Tennis serve | [a12] Forward kick | [a19] Golf swing
[a20] Pickup & Throw | [a14] Side-boxing | [a20] Pickup & Throw
Table 3. The list of actions in the three subsets of the KARD dataset [60].
Action Set 1 | Action Set 2 | Action Set 3
Horizontal arm wave | High arm wave | Draw tick
Two-hand wave | Side kick | Drink
Bend | Catch cap | Sit down
Phone call | Draw tick | Phone call
Stand up | Hand clap | Take umbrella
Forward kick | Forward kick | Toss paper
Draw X | Bend | High throw
Walk | Sit down | Horizontal arm wave
Table 4. Experimental results and comparison of the proposed method with state-of-the-art approaches on the MSR Action3D dataset [59]. The list is ordered by recognition performance; results that outperform previous works are shown in bold and the best accuracies in blue. Our previous work on SPMF [48] is marked in red.
Method (Protocol of [59]) | Year | AS1 | AS2 | AS3 | Aver.
Bag of 3D Points [59] | 2010 | 72.90% | 71.90% | 71.90% | 74.70%
Depth Motion Maps [69] | 2016 | 96.20% | 83.20% | 92.00% | 90.47%
Bi-LSTM [72] | 2018 | 92.72% | 84.93% | 97.89% | 91.84%
Lie Group Representation [23] | 2014 | 95.29% | 83.87% | 98.22% | 92.46%
FTP-SVM [72] | 2018 | 95.87% | 86.72% | 100.0% | 94.19%
Hierarchical LSTM [37] | 2015 | 99.33% | 94.64% | 95.50% | 94.49%
ST-LSTM Trust Gates [40] | 2016 | N/A | N/A | N/A | 94.80%
Graph-Based Motion [36] | 2016 | 93.60% | 95.50% | 95.10% | 94.80%
ST-NBNN [70] | 2017 | 91.50% | 95.60% | 97.30% | 94.80%
ST-NBMIM [83] | 2018 | 92.50% | 95.60% | 98.20% | 95.30%
S-T Pyramid [84] | 2015 | 99.10% | 92.90% | 96.40% | 96.10%
Ensemble TS-LSTM v2 [71] | 2017 | 95.24% | 96.43% | 100.0% | 97.22%
SPMF Inception-ResNet-222 [48] | 2018 | 97.54% | 98.73% | 99.41% | 98.56%
Enhanced-SPMF DenseNet (L = 100, k = 12) (ours) | 2018 | 98.52% | 98.66% | 99.09% | 98.76%
Enhanced-SPMF DenseNet (L = 250, k = 24) (ours) | 2018 | 98.83% | 99.06% | 99.40% | 99.10%
Enhanced-SPMF DenseNet (L = 190, k = 40) (ours) | 2018 | 98.60% | 98.87% | 99.36% | 98.94%
Table 5. Average recognition accuracies (%) over the three experiments A, B, and C, and comparison with previous works on the KARD dataset [60]. The best accuracies are in blue. Results that surpass previous works are in bold.
Method (Protocol of [60]) | Year | Acc. (%)
Hand-crafted Features [60] | 2015 | 90.83%
Posture Feature+Multi-class SVM [85] | 2016 | 97.20%
Key Postures+Multi-class SVM [86] | 2016 | 99.30%
Enhanced-SPMF DenseNet (L = 100, k = 12) (ours) | 2018 | 99.74%
Enhanced-SPMF DenseNet (L = 250, k = 24) (ours) | 2018 | 99.98%
Enhanced-SPMF DenseNet (L = 190, k = 40) (ours) | 2018 | 99.88%
Table 6. Action recognition accuracies (%) and comparison with previous works on the SBU Kinect Interaction dataset [61]. The best accuracies are in blue. Results that surpass previous works are in bold.
Method (Protocol of [61]) | Year | Acc. (%)
Raw Skeleton [61] | 2012 | 49.70%
Joint Features [61] | 2012 | 80.30%
HBRNN [37] (reported in [91]) | 2015 | 80.40%
CHARM [87] | 2015 | 83.90%
Deep LSTM [41] | 2017 | 86.03%
Joint Features [88] | 2014 | 86.90%
ST-LSTM [40] | 2016 | 88.60%
Co-occurrence+Deep LSTM [41] | 2018 | 90.41%
STA-LSTM [89] | 2017 | 91.51%
ST-LSTM+Trust Gates [40] | 2018 | 93.30%
ST-NBMIM [83] | 2018 | 93.30%
Clips+CNN+MTLN [90] | 2017 | 93.57%
CNN Kernel Feature Map [96] | 2018 | 94.36%
Two-stream RNN [91] | 2017 | 94.80%
GCA-LSTM network [92] | 2018 | 94.90%
Enhanced-SPMF DenseNet (L = 100, k = 12) (ours) | 2018 | 94.81%
Enhanced-SPMF DenseNet (L = 250, k = 24) (ours) | 2018 | 96.67%
Enhanced-SPMF DenseNet (L = 190, k = 40) (ours) | 2018 | 97.86%
Table 7. Experimental results and comparison of the proposed method with previous approaches on the NTU-RGB+D dataset [39]. The best accuracies are in blue. Results that surpass previous works are in bold. Our previous works [48,51] are marked in red.
Method (Protocol of [39]) | Year | Cross-Subject | Cross-View
Lie Group Representation [23] | 2014 | 50.10% | 52.80%
Hierarchical RNN [37] | 2016 | 59.07% | 63.97%
Dynamic Skeletons [93] | 2015 | 60.20% | 65.20%
Two-Layer P-LSTM [39] | 2016 | 62.93% | 70.27%
ST-LSTM Trust Gates [40] | 2016 | 69.20% | 77.70%
Skeleton-based ResNet [51] | 2018 | 73.40% | 80.40%
Geometric Features [74] | 2017 | 70.26% | 82.39%
Two-Stream RNN [91] | 2017 | 71.30% | 79.50%
Enhanced Skeleton [94] | 2017 | 75.97% | 82.56%
Lie Group Skeleton+CNN [95] | 2017 | 75.20% | 83.10%
CNN Kernel Feature Map [96] | 2018 | 75.35% | N/A
GCA-LSTM [92] | 2018 | 76.10% | 84.00%
SPMF Inception-ResNet-222 [48] | 2018 | 78.89% | 86.15%
Enhanced-SPMF DenseNet (L = 100, k = 12) (ours) | 2018 | 79.31% | 86.64%
Enhanced-SPMF DenseNet (L = 250, k = 24) (ours) | 2018 | 80.11% | 86.82%
Enhanced-SPMF DenseNet (L = 190, k = 40) (ours) | 2018 | 79.28% | 86.68%
Table 8. Execution time of each stage of the proposed deep learning framework.
Stage | Average Processing Time (Seconds/Sequence)
1 | 20.8 × 10⁻³ (Intel Core i7 3.2 GHz CPU)
2 | 0.164 (GTX 1080 Ti GPU)
3 | 74.8 × 10⁻³ (CPU + GPU time)
