Spatio–Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks

Designing motion representations for 3D human action recognition from skeleton sequences is an important yet challenging task. An effective representation should be robust to noise, invariant to viewpoint changes and result in a good performance with low-computational demand. Two main challenges in this task include how to efficiently represent spatio–temporal patterns of skeletal movements and how to learn their discriminative features for classification tasks. This paper presents a novel skeleton-based representation and a deep learning framework for 3D action recognition using RGB-D sensors. We propose to build an action map called SPMF (Skeleton Posture-Motion Feature), which is a compact image representation built from skeleton poses and their motions. An Adaptive Histogram Equalization (AHE) algorithm is then applied on the SPMF to enhance their local patterns and form an enhanced action map, namely Enhanced-SPMF. For learning and classification tasks, we exploit Deep Convolutional Neural Networks based on the DenseNet architecture to learn directly an end-to-end mapping between input skeleton sequences and their action labels via the Enhanced-SPMFs. The proposed method is evaluated on four challenging benchmark datasets, including both individual actions, interactions, multiview and large-scale datasets. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches on all benchmark tasks, whilst requiring low computational time for training and inference.


Introduction
Human action recognition [1] is one of the most important and challenging tasks in computer vision. Correctly detecting and recognizing what humans do in unknown videos serves as a key component of many real-world applications such as smart surveillance [2,3], human-object interaction [4,5], and autonomous vehicle technology [6,7]. Although significant progress has been achieved over two decades of research, video-based human action recognition remains a challenging issue due to a number of obstacles, e.g., changes in camera viewpoint, occlusions, background, surrounding distractions, and diversity in the length and speed of actions [8].
As with many other visual recognition tasks, traditional approaches to human action recognition [9] have focused on extracting hand-crafted local features and building local descriptors from RGB sequences provided by 2D cameras. Some typical examples that have been widely exploited with success are SIFT [10,11], HOG/HOF [12,13], HOG-3D [14], Cuboids [15], SURF [16] and Extended SURF [17]. Since these approaches typically recognize actions based on the appearance and movement of human body parts in a monocular RGB video sequence, they tend to lack the 3D structure of the scene. Therefore, single-modality human action recognition based only on RGB videos is not enough to overcome the current challenges.
The availability of low-cost and easy-to-use depth sensors such as the Microsoft Kinect™ sensor [18] has helped the computer vision community improve action recognition. These sensors are able to provide detailed 3D structural information of human motion, which is difficult to obtain with traditional 2D cameras. Many action recognition approaches using RGB-D cameras have been proposed and have advanced the state-of-the-art [19][20][21][22][23][24][25]. In particular, most current depth-sensing cameras have integrated real-time skeleton estimation and tracking frameworks [26,27], which facilitate the collection of skeleton sequences. This data source is a high-level representation that describes human action in a more precise and effective way, which makes it suitable for the problem of action analysis and recognition. Skeleton-based human action recognition is a time-series problem. The skeletal data comprises 3D coordinates of the key joints in the human body over time. This is an effective representation for structured motion [28] because each human action can be represented through the movement of skeleton sequences. Moreover, a large set of actions can be distinguished from these movements [29]. 3D skeletal data is not only invariant to camera viewpoint but can also be estimated in real-time. Moreover, it is available for most depth-based action datasets [30]. Hence, exploiting this data source for 3D human action recognition opens up opportunities for addressing the limitations of solutions based on RGB and depth modalities, and many skeleton-based action recognition approaches have been proposed [19,23,31–33]. Our goal is to exploit the potential of low-cost consumer depth cameras for identifying salient spatio-temporal patterns in skeleton sequences and then explore them for improving the recognition of human actions using deep learning models.
In the literature on skeleton-based action recognition, there are two main issues that need to be solved. The first challenge is to find a skeleton-based representation that transforms the raw skeletal data into a form that effectively captures the spatio-temporal dynamics of human skeleton joints. The second challenge is to model and recognize actions that are complex, variable and have large intra-class correlation, from the skeleton-based representation. Previous studies [19,23,31,34–42] on this topic can be divided into two main categories: skeleton-based action recognition based on hand-crafted features and skeleton-based action recognition using deep neural networks. The first group of methods uses hand-crafted local features and probabilistic graphical models such as Hidden Markov Model (HMM) [43], Conditional Random Field (CRF) [34], or Fourier Temporal Pyramid (FTP) [23] to model and classify actions. However, almost all of these approaches are shallow, data-dependent and require a lot of feature engineering. The second group of methods considers skeletal data as time-series patterns and proposes the use of Recurrent Neural Networks (RNNs) [44], especially Recurrent Neural Networks with Long Short-Term Memory units (RNN-LSTMs) [45,46], to analyze and model the contextual information contained in the skeleton sequences. They are considered the most popular deep learning based approach for skeleton-based action recognition and have achieved high-level performance. Although able to model the long-term temporal dynamics of human motion, RNN-LSTMs [45,46] just consider skeleton sequences as a kind of low-level feature by feeding raw skeletal data directly into the network input. The huge number of input features makes them complex and time-consuming, and may easily lead to overfitting. Moreover, almost all of these networks act just as classifiers and do not extract high-level features for recognition tasks [47].
A practical human action recognition system should be able to detect and recognize actions from different viewpoints, be robust to noise, and operate in real-time. We believe that an efficient and effective representation for 3D human motion plays a decisive role in improving recognition performance. Motivated by the success of our previous work on the SPMF (Skeleton Posture-Motion Feature) representation [48] for video-based human action recognition, in this paper we aim to find a new skeleton-based representation and take full advantage of the ability of Deep Convolutional Neural Networks (D-CNNs) to learn highly hierarchical image features, in order to build an end-to-end learning framework for 3D human action recognition from skeletal data. Specifically, we propose a new 3D motion representation, termed Enhanced-SPMF (Enhanced Skeleton Posture-Motion Feature). Similar to the SPMF [48], the proposed Enhanced-SPMF has a 2D image structure with three color channels, which is built from a set of spatio-temporal stages, combining 3D skeleton poses and their motions. Moreover, an Adaptive Histogram Equalization (AHE) algorithm [49] is then applied to the color images to enhance their local patterns and generate more discriminative features for the classification task. Figure 1 illustrates an overview of the proposed Enhanced-SPMF. To learn image features and recognize action labels from the proposed representation, different D-CNN models based on the DenseNet architecture [50] have been designed and evaluated.

Figure 1.
Overview of the proposed Enhanced-SPMF representation. Each skeleton sequence is transformed into a single RGB image, a motion map called SPMF [48]. A color enhancement technique [49] is then used to highlight the motion map and form the Enhanced-SPMF, which will be learned and classified by a deep learning model. Before computing the SPMF, a smoothing filter is adopted to reduce the effect of noise on skeletal data. Section 3 describes the details of the proposed approach.
There are five important hypotheses that motivate us to propose a new skeleton-based representation and design DenseNets [50] for 3D human action recognition with skeletal data. First, human actions can be correctly represented through the skeleton movements [28,29]. Second, compared to RGB and depth streams that contain thousands of pixels per frame, skeletal data has a high-level abstraction with much less complexity. This makes the training and inference processes much simpler and faster. Third, as shown in our previous works [48,51], the spatio-temporal dynamics of skeleton sequences can be transformed into color images, a kind of 3D tensor-structured representation that can be effectively learned by representation learning models such as D-CNNs. Fourth, many different action classes share a great number of similar primitives, which interferes with action classification. Therefore, extracting essential spatio-temporal patterns from skeleton movements plays a key role in this task. Last, recent research results indicate that CNNs have achieved outstanding performance in many image recognition tasks [52,53]. There are many signs that the learning performance of CNNs can be significantly improved by increasing the depth of their architectures [54][55][56][57]. In particular, D-CNNs with architectures such as DenseNet [50] can improve accuracy in image recognition tasks, since this kind of network is able to prevent overfitting and degradation phenomena [58] by maximizing information flow and facilitating feature reuse: each layer in its architecture has direct access to the features from previous layers. Therefore, we explore the use of DenseNet in this work and optimize this architecture for learning and recognizing human actions on the proposed image-based representation.
The effectiveness of the proposed method is evaluated on four public benchmark RGB-D datasets, including MSR Action3D [59], KARD [60], SBU Kinect Interaction [61] and NTU-RGB+D datasets [39]. The hypotheses above were reinforced since the experimental results show that we achieve state-of-the-art performance on all the reported benchmarks. Furthermore, we also report the effectiveness of this approach in terms of computational cost, for both training time and inference latency. Overall, the main contributions of our study include two aspects:
• Firstly, we present Enhanced-SPMF, a new skeleton-based representation for 3D human action recognition from skeletal data. This work is an extended version of our paper published in the 25th IEEE International Conference on Image Processing (ICIP) [48], in which the Enhanced-SPMF is an extension of SPMF (Skeleton Posture-Motion Feature). Compared to our previous work, the current work aims to improve the efficiency of the 3D motion representation via a smoothing filter and a color enhancement technique. The smoothing filter helps us to reduce the effect of noise on skeletal data, while the color enhancement technique makes the proposed Enhanced-SPMF more robust and discriminative for the recognition task. An ablation study on the Enhanced-SPMF demonstrated that the new representation leads to better overall action recognition performance than the SPMF [48].
• Secondly, we present a deep learning framework based on the DenseNet architecture [50] for learning discriminative features from the proposed Enhanced-SPMF and performing action classification (the implementation and models will be made publicly available at https://github.com/cerema-lab/Sensors-2018-HAR-SPMF). The framework directly learns an end-to-end mapping between skeleton sequences and their action labels with little pre-processing. We evaluate the proposed method on four highly competitive benchmark datasets and demonstrate significant improvement over existing state-of-the-art approaches. Our computational efficiency evaluations show that the proposed method is able to achieve a high level of performance whilst requiring low computational time for both the training and inference stages. Compared to our previous work that exploited the Residual Inception v2 network [48], the current work uses a more powerful deep learning model for the action recognition task.
The rest of this paper is organized as follows: Section 2 discusses related works. Section 3 presents the details of the proposed approach. Datasets and experiments are described in Section 4. The experimental results and analyses are provided in Section 5. Section 6 concludes the paper.

Related Work
In this section, we briefly review the existing literature closely related to 3D human action recognition from skeleton sequences, including skeleton-based action recognition using hand-crafted features and deep learning-based action recognition. We encourage readers to refer to the extensive review by Han et al. [62] for a more comprehensive picture of this topic.

Hand-Crafted Approaches for Skeleton-Based Human Action Recognition
Earlier studies on skeleton-based human action recognition focus on finding well-designed hand-crafted features and using temporal graphical models to analyze the global temporal evolution of skeleton joints. Since the first work on 3D human action recognition from depth data was introduced [59], many approaches for skeleton-based action recognition have been proposed [19,23,31,34–36]. The common characteristic of these approaches is that they extract geometric features of 3D joint movements and model their temporal information with a generative model. For instance, Wang et al. [19] represented human motion by means of the pairwise relative positions of the skeleton joints in order to generate more discriminative features. Fourier Temporal Pyramid (FTP) [19] was then proposed to model the temporal dynamics of the actions from LOPs. Vemulapalli et al. [23] represented the 3D geometric relationships of body parts as points in a Lie Group and then exploited Dynamic Time Warping (DTW) [63] and Fourier Temporal Pyramid (FTP) [19] to model their temporal dynamics. Xia et al. [31] extracted and computed histograms of 3D joint locations (HOJ-3D) to represent actions via posture visual words. The temporal evolutions of those words are modeled by discrete Hidden Markov Models (HMMs) [64]. Instead of modeling the temporal evolution of skeletons, Luo et al. [35] proposed a discriminative dictionary learning algorithm (called DL-GSGC) that incorporated both group sparsity and geometry constraints to learn motion features from the 3D joint positions. An encoding technique called Temporal Pyramid Matching (TPM) [35] was then used for keeping the temporal information and performing action classification.
Although promising results have been achieved, the above approaches have some limitations that are difficult to overcome. For instance, in many cases they require pre-processing of the input data, in which the skeleton sequences need to be segmented or aligned. Unlike these approaches, we propose a skeleton-based representation and a deep learning framework for 3D human action recognition that learns to recognize actions directly from the original skeletons in an end-to-end manner, without dependence on the length of actions. Moreover, the proposed solution is general and can be applied to other data modalities such as motion capture data [65] and the output of pose estimation algorithms [66,67].

Deep Learning Approaches for Skeleton-Based Human Action Recognition
Approaches based on Recurrent Neural Network with Long Short-Term Memory units (RNN-LSTM) [45,68] are the most popular deep learning approach for skeleton-based action recognition and have achieved high-level performance for video-based action recognition tasks [37][38][39][40][41][42]. The temporal evolutions of skeletons are spatio-temporal patterns. Thus, they can be modeled by memory cells in the structure of RNN-LSTMs [45,68]. For instance, Du et al. [37] proposed to use a hierarchical RNN to model the long-term contextual information of skeletal data, in which the human skeleton was divided into five parts according to its physical structure. Each low-level part was modeled by an RNN and then combined into the final representation of high-level parts for action classification. Shahroudy et al. [39] introduced a part-aware LSTM human action learning model by splitting a long-term memory of the entire motion to part-based cells. The long-term context of each body part was learned independently. The output of the network was then formed as a combination of independent body part context information. Liu et al. [40] presented a spatio-temporal LSTM network, called ST-LSTM, for 3D action recognition from skeletal data. They proposed a skeleton-based tree traversal technique to feed the structure of the skeletal data into a sequential LSTM network and improved the performance of the ST-LSTM by adding more trust gates. Recently, Liu et al. [42] focused on selecting the most informative skeleton joints by using a new class of LSTM network, namely Global Context-Aware Attention LSTM (GCA-LSTM), for 3D skeleton-based action recognition. Two LSTM layers were used. The first layer encodes the input sequences and generates an initial global context memory for these sequences. Meanwhile, the second layer performs attention over the input sequences with the assistance of obtained global context memory. 
The attention representation was then used back to refine the global context. Multiple attention iterations are executed and the final global contextual information is used for action classification task.
Compared to the approaches based on hand-crafted local features, the RNN-LSTM based approaches and their variants have shown superior action recognition performance. However, they tend to overemphasize the temporal information and lose the spatial information of skeletons [37][38][39][40][41][42]. RNN-LSTM based approaches still struggle to cope with the complex spatio-temporal variations of skeletal movements due to a number of issues such as jitter and movement speed variability. Another drawback of the RNN-LSTM networks [45,68] is that they just model the overall temporal dynamics of actions without considering their detailed temporal dynamics. To overcome these limitations, we propose in this study a CNN-based approach that is able to extract discriminative features of actions and model various temporal dynamics of skeleton sequences via the proposed Enhanced-SPMF representation, including short-term, medium-term, and long-term actions. We summarize the advantages and disadvantages of our proposed method in comparison with some previous approaches in Table 1.

Method
The details of the proposed approach are presented in this section. Figure 2 illustrates the key components of the proposed learning framework for recognizing actions from skeleton sequences. We first show how skeleton pose and motion features can be combined to build an action map in the form of an image-based representation (Section 3.1), and how to use a color enhancement technique for improving the discriminative ability of the proposed representation (Section 3.2). We then introduce an end-to-end deep learning framework based on DenseNets to learn and classify actions from the enhanced representations (Section 3.3). Before that, in order to put the proposed approach into context, it is useful to review the central ideas behind the original DenseNet architecture (Section 3.3.1).

Figure 2.
Schematic overview of the proposed approach. Each skeleton sequence is encoded in a single color image via a skeleton-based representation called SPMF. Each SPMF is built from pose vectors (PFs) and motion vectors (MFs) extracted from skeleton joints. They are then enhanced by an Adaptive Histogram Equalization (AHE) [49] algorithm and fed to a D-CNN for learning discriminative features and performing action classification. To achieve high-level learning performance during the training phase, we design and optimize different D-CNN models based on deep DenseNet [50], a recent state-of-the-art architecture for image recognition tasks.

SPMF: Building Action Map from Skeletal Data
One of the major challenges in exploiting D-CNNs for skeleton-based action recognition is how the spatio-temporal patterns of skeleton movements can be effectively represented and fed to D-CNNs for representation learning. As D-CNNs work well on image representations [73], our idea is to encode the whole skeleton sequence into a single 2D image as a global representation of the action sequence. In general, two essential elements that determine a human action are poses and their motions. Hence, we decide to transform these two important elements into the static spatial structure of a color image with three R, G, B channels. Specifically, we propose a new representation, namely Enhanced-SPMF (Enhanced Skeleton Posture-Motion Feature), which is built from pose and motion vectors extracted from the skeleton joints. Note that combining multiple kinds of geometric features such as joint coordinates, lines and planes determined by the joints will lead to lower performance than using only a single type of feature or several main types of features [74]. Moreover, it has been reported [61] that joint features such as joint-joint distance and joint-joint motion are the strongest features among many others.

Pose Features (PFs) Computation
Given a skeleton sequence $S$ with $N$ frames, denoted by $S = \{F_t\}$, where $t = 1, 2, 3, \ldots, N$, let $p_j^t$ and $p_k^t$ be the 3D coordinates of the $j$-th and $k$-th joints in $F_t$. The Joint-Joint Distance $JJD_{jk}^t$ between $p_j^t$ and $p_k^t$ at timestamp $t$ is computed as

$$JJD_{jk}^t = \| p_j^t - p_k^t \|_2, \quad (t = 1, 2, \ldots, N), \tag{1}$$

where $\| \cdot \|_2$ denotes the Euclidean distance between two joints. The joint distances obtained by Equation (1) for all types of actions of a specific dataset range from $D_{min} = 0$ to $D_{max} = \max\{JJD_{jk}^t\}$. We denote this distance space as $D_{original}$. In fact, $D_{original}$ can be transformed into a tensor structure and fed directly to D-CNNs for learning action features. However, since $D_{original}$ is a high-dimensional space, it could lead D-CNNs to overfit and make training time-consuming. Thus, we need to describe the input skeleton sequences as low-dimensional signals such that they are easy to parameterize by learning models and discriminative enough for a classification task. To do that, we normalize all elements of $D_{original}$ to the range [0, 1], denoted as $D_{[0,1]}$. To reflect the change in joint distances, we encode $D_{[0,1]}$ into a color space using a sequential discrete color palette called the JET color map (a JET color map is based on the order of colors in the spectrum of visible light, ranging from blue to red and passing through cyan, yellow, and orange). The encoding process converts the joint distances $JJD_{jk}^t \in D_{[0,1]}$ for all possible combinations of $j$ and $k$ into color points $JJD_{RGB}^t \in \mathbb{N}^3_{[0,255]}$ using a 256-color JET scale. To this end, we first normalize the distance values with respect to the maximum and minimum values of a grayscale image ranging from 0 to 1. As illustrated in Figure 3, the scalar distances are converted to a three-channel map via a JET mapping. This technique is similar to the depth encoding method presented in [75]. The use of a discrete color palette allows us to reduce the complexity of the input features. This helps accelerate the convergence rate of deep learning networks during the training stage. Moreover, it should be noted that point-point distances are invariant when the points are moved into a new coordinate system in 3D Euclidean space. Therefore, the use of the Joint-Joint Distance $JJD_{jk}^t$ can help our final representation be more independent of the camera viewpoint.

Apart from the distance information, the orientation between joints is also important for describing human motions. The Joint-Joint Orientation $JJO_{jk}^t$ from joint $p_j^t$ to $p_k^t$ at timestamp $t$ is computed as

$$JJO_{jk}^t = p_j^t - p_k^t, \quad (t = 1, 2, \ldots, N). \tag{2}$$

$JJO_{jk}^t$ is a vector in which every component $p$ can be normalized to the range [0, 255]. This can be done via the following transformation

$$p_{norm} = \mathrm{floor}\left( 255 \times \frac{p - c_{min}}{c_{max} - c_{min}} \right), \tag{3}$$

where $p_{norm}$ indicates the normalized value, and $c_{max}$ and $c_{min}$ are the maximum and minimum values of all coordinates over the training set, respectively. The function $\mathrm{floor}(\cdot)$ rounds down to the nearest integer. We consider the three components $(x, y, z)$ of $JJO_{jk}^t$ after normalization as the corresponding three components $(R, G, B)$ of a color pixel and build $JJO_{RGB}^t$ as a 3D array that is formed by all $JJO_{jk}^t$ values. We then define "a human pose" at timestamp $t$ by the vector $PF^t$ that describes the distance and orientation relationships between skeleton joints,

$$PF^t = [JJD_{RGB}^t +\!\!+ JJO_{RGB}^t], \quad (t = 1, 2, \ldots, N). \tag{4}$$

Here the symbol $(+\!\!+)$ horizontally concatenates the vectors $JJD_{RGB}^t$ and $JJO_{RGB}^t$.
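As an illustration, the pose feature computation of Equations (1)-(4) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, `jet_color` is a common piecewise-linear approximation of the JET palette rather than the exact 256-color table, joint pairs are taken as all unordered pairs with j < k, and `c_min`/`c_max` are treated as bounds on the values being normalized (out-of-range values are clamped).

```python
import math
from itertools import combinations

def jet_color(v):
    """Approximate JET palette: map v in [0, 1] to an (R, G, B) triple in 0..255."""
    v = min(max(v, 0.0), 1.0)
    clamp = lambda x: min(max(x, 0.0), 1.0)
    r = clamp(1.5 - abs(4.0 * v - 3.0))
    g = clamp(1.5 - abs(4.0 * v - 2.0))
    b = clamp(1.5 - abs(4.0 * v - 1.0))
    return (int(255 * r), int(255 * g), int(255 * b))

def joint_joint_distance(p, q):
    """Equation (1): Euclidean distance between two 3D joints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def joint_joint_orientation(p, q):
    """Equation (2): component-wise difference p - q."""
    return tuple(a - b for a, b in zip(p, q))

def normalize_component(p, c_min, c_max):
    """Equation (3): map p from [c_min, c_max] to an integer in [0, 255]."""
    p = min(max(p, c_min), c_max)  # clamping is an assumption of this sketch
    return math.floor(255 * (p - c_min) / (c_max - c_min))

def pose_feature(frame, d_max, c_min, c_max):
    """Equation (4): color-encoded distances followed by color-encoded orientations.

    frame is a list of (x, y, z) joints; d_max is the largest joint-joint
    distance in the dataset, used to scale distances into [0, 1].
    """
    pairs = list(combinations(range(len(frame)), 2))  # assumed pairing: j < k
    jjd_rgb = [jet_color(joint_joint_distance(frame[j], frame[k]) / d_max)
               for j, k in pairs]
    jjo_rgb = [tuple(normalize_component(c, c_min, c_max)
                     for c in joint_joint_orientation(frame[j], frame[k]))
               for j, k in pairs]
    return jjd_rgb + jjo_rgb  # one column of RGB pixels
```

Because the distance term depends only on differences between joints, translating every joint by the same offset leaves it unchanged, which is the property the text relies on for viewpoint robustness.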

Motion Features (MFs) Computation
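Let $p_j^t$ and $p_k^{t+1}$ denote the 3D coordinates of the $j$-th and $k$-th joints at two consecutive frames $F_t$ and $F_{t+1}$. Similarly to $JJD_{jk}^t$ in Equation (1), the Joint-Joint Distance $JJD_{jk}^{t,t+1}$ between $p_j^t$ and $p_k^{t+1}$ is computed as

$$JJD_{jk}^{t,t+1} = \| p_j^t - p_k^{t+1} \|_2, \quad (t = 1, 2, \ldots, N - 1). \tag{5}$$

Also, similarly to Equation (2), the Joint-Joint Orientation $JJO_{jk}^{t,t+1}$ from joint $p_j^t$ to $p_k^{t+1}$ is computed as

$$JJO_{jk}^{t,t+1} = p_j^t - p_k^{t+1}, \quad (t = 1, 2, \ldots, N - 1). \tag{6}$$

Similarly to $PF^t$ in Equation (4), the motion from $t$ to $t + 1$ is described by the vector $MF^{t \to t+1}$, formed by concatenating the color-encoded distances and orientations,

$$MF^{t \to t+1} = [JJD_{RGB}^{t,t+1} +\!\!+ JJO_{RGB}^{t,t+1}], \quad (t = 1, 2, \ldots, N - 1). \tag{7}$$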
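The cross-frame pairing can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and taking all ordered pairs (j, k), including j = k (the displacement of a single joint), is an assumption, since the text does not state which pairs are used. Color encoding of the raw values would then proceed as for the pose features.

```python
import math

def motion_pairs(frame_t, frame_t1):
    """Raw JJD and JJO between joint j at time t and joint k at time t+1.

    Returns two dicts keyed by (j, k): distances and orientation vectors.
    """
    jjd, jjo = {}, {}
    for j, p in enumerate(frame_t):
        for k, q in enumerate(frame_t1):
            diff = tuple(a - b for a, b in zip(p, q))      # p_j^t - p_k^{t+1}
            jjo[(j, k)] = diff
            jjd[(j, k)] = math.sqrt(sum(d * d for d in diff))
    return jjd, jjo
```

A joint that does not move between the two frames yields a zero distance and a zero orientation vector for its own (j, j) entry.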

Building Global Action Map from PFs and MFs
Based on the obtained PFs and MFs, we propose a skeleton-based representation called SPMF for 3D human action recognition. To this end, all PFs and MFs computed from the skeleton sequence $S$ are concatenated into a single feature vector in temporal order from the beginning to the end of the action. It is a global representation of the whole skeleton sequence $S$ without dependence on the range of the action and can be obtained by

$$SPMF = [PF^1 +\!\!+ MF^{1 \to 2} +\!\!+ PF^2 +\!\!+ \cdots +\!\!+ PF^t +\!\!+ MF^{t \to t+1} +\!\!+ PF^{t+1} +\!\!+ \cdots +\!\!+ PF^{N-1} +\!\!+ MF^{N-1 \to N} +\!\!+ PF^N]. \tag{8}$$

Figure 4 (top row) shows some SPMFs obtained from the MSR Action3D dataset [59], in which all images are resized to 32 × 32 pixels. Before computing the SPMF, a Savitzky-Golay smoothing filter [37,76] is adopted to reduce the effect of noise on skeletal data. In the experiments, the filter takes the skeleton joint coordinates $c_t$ of frame $F_t$ $(t = 1, 2, \ldots, N)$ and produces the filtering result $f_t$. The filter design method is described in detail in Appendix A.

Figure 4.
Results of the skeleton-to-image mapping process. The top row shows the proposed SPMF representations obtained from some samples of the MSR Action3D dataset [59]. The change in color reflects the change of distance and orientation between the joints. The bottom row shows generated images after applying the AHE algorithm [49].
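The pipeline of this subsection (smooth the coordinates, interleave PFs and MFs in temporal order, resize the result) can be sketched as below. This is a hedged sketch, not the paper's code. The smoothing kernel (-3, 12, 17, 12, -3)/35 is the standard 5-point quadratic Savitzky-Golay kernel and is used here as an assumed example; the filter actually used is specified in Appendix A. Leaving the first and last two samples unfiltered and resizing by nearest-neighbour sampling are likewise assumptions, and all function names are hypothetical.

```python
SG_WEIGHTS = (-3, 12, 17, 12, -3)  # 5-point quadratic Savitzky-Golay kernel, divided by 35

def smooth(series):
    """Smooth one coordinate track over time; border samples are left unchanged."""
    out = list(series)
    for t in range(2, len(series) - 2):
        window = series[t - 2:t + 3]
        out[t] = sum(w * c for w, c in zip(SG_WEIGHTS, window)) / 35.0
    return out

def build_spmf(pose_features, motion_features):
    """Equation (8): PF^1, MF^{1->2}, PF^2, ..., MF^{N-1->N}, PF^N as image columns."""
    if len(motion_features) != len(pose_features) - 1:
        raise ValueError("expected N pose features and N - 1 motion features")
    columns = []
    for t, pf in enumerate(pose_features):
        columns.append(pf)
        if t < len(motion_features):
            columns.append(motion_features[t])
    return columns

def resize_nearest(columns, width, height):
    """Nearest-neighbour resize of a column-major image to width x height."""
    src_w = len(columns)
    out = []
    for x in range(width):
        col = columns[x * src_w // width]
        out.append([col[y * len(col) // height] for y in range(height)])
    return out
```

For a sequence of N frames the interleaved map has 2N - 1 columns before resizing, so the fixed 32 × 32 output is what removes the dependence on sequence length.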

Enhanced-SPMF: Building Enhanced Action Map
The skeleton-based representations obtained by Equation (8) mainly reflect the spatio-temporal distribution of skeleton joints. We visualize these representations and observe that they tend to be low-contrast images, as shown in Figure 4 (top row). In this case, a color enhancement method can be useful for increasing contrast and highlighting the texture and edges of the motion maps. Therefore, it is necessary to enhance the local features of the generated color images after encoding. Adaptive Histogram Equalization (AHE) [49] is a common approach for this task, as it is capable of enhancing the local features of an image. Mathematically, let $I$ be a given digital image, represented as an $r$-by-$c$ matrix of integer pixels with intensity levels in the range $[0, L - 1]$. The histogram of image $I$ is defined by

$$\mathrm{Hist}_k = n_k, \quad k = 0, 1, \ldots, L - 1, \tag{10}$$

where $n_k$ is the number of pixels in $I$ with intensity $k$. The probability of occurrence of intensity level $k$ in $I$ can be estimated by

$$p_k = \frac{n_k}{r \times c}, \quad k = 0, 1, \ldots, L - 1. \tag{11}$$

The histogram-equalized image is defined by transforming the pixel intensities, $n$, of $I$ by the function

$$T(n) = \mathrm{floor}\left( (L - 1) \sum_{k=0}^{n} p_k \right). \tag{12}$$

The Histogram Equalization (HE) method is used for increasing the global contrast of the image. However, it cannot solve the problem of increasing local contrast. To overcome this limitation, the image is divided into $R$ regions and HE is then applied to each of these regions. This technique is called the Adaptive Histogram Equalization (AHE) algorithm [49]. The bottom row of Figure 4 shows samples of the enhanced motion map with $R = 8$ on 32 × 32 images, which we refer to as Enhanced-SPMF, for some actions from the MSR Action3D dataset [59].
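The equalization step can be illustrated with a short Python sketch. It implements the transform of Equation (12) on a single channel and then applies it independently to each tile of a regular grid, which is the simplified region-wise scheme described above. The function names, the `tiles_per_side` parameter (one possible reading of R), the absence of interpolation between tiles, and per-channel application to color images are all assumptions of this sketch rather than details confirmed by the text.

```python
def equalize(pixels, levels=256):
    """Equation (12) on a flat list of integer intensities in [0, levels - 1]."""
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    cumulative, running = [], 0
    for count in hist:
        running += count
        cumulative.append(running)
    # Integer arithmetic gives an exact floor((L - 1) * sum_k p_k).
    return [(levels - 1) * cumulative[v] // total for v in pixels]

def adaptive_equalize(image, tiles_per_side, levels=256):
    """Equalize each tile of a 2D single-channel image independently.

    Assumes the image height and width are divisible by tiles_per_side.
    """
    h, w = len(image), len(image[0])
    th, tw = h // tiles_per_side, w // tiles_per_side
    out = [row[:] for row in image]
    for ty in range(tiles_per_side):
        for tx in range(tiles_per_side):
            ys = range(ty * th, (ty + 1) * th)
            xs = range(tx * tw, (tx + 1) * tw)
            flat = [image[y][x] for y in ys for x in xs]
            equalized = iter(equalize(flat, levels))
            for y in ys:
                for x in xs:
                    out[y][x] = next(equalized)
    return out
```

Within every tile the brightest input level is mapped to L - 1, which is why a low-contrast region gains contrast even when the rest of the image is bright.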

Densely Connected Convolutional Networks
DenseNet [50], considered a state-of-the-art CNN architecture, has some interesting properties. In this architecture [50], each layer is connected to all the others within a dense block and all layers can access the feature maps from their preceding layers. Besides, each layer receives direct information flow from the loss function through the shortcut connections. These properties help DenseNet [50] to be less prone to overfitting for supervised learning problems. Mathematically, traditional CNN architectures, e.g., AlexNet [52] or VGGNet [54], connect the output feature maps $x_{l-1}$ of the $(l - 1)$-th layer as input to the $l$-th layer and try to learn a mapping function

$$x_l = H_l(x_{l-1}), \tag{13}$$

where $H_l(\cdot)$ is a non-linear transformation, usually implemented via a series of operations such as Convolution (Conv.), Rectified Linear Unit (ReLU) [77], Pooling [78], and Batch Normalization (BN) [79]. When increasing the depth of the network, the network training process becomes complex due to the vanishing-gradient problem and the degradation phenomenon [58] (please see Appendix B for more details). To solve these problems, He et al. introduced ResNet [56]. The key idea behind the ResNet architecture [56] is the presence of shortcut connections that bypass the non-linear transformations $H_l(\cdot)$ with an identity function $\mathrm{id}(x) = x$. This way, each ResNet building block [56] produces a feature map $x_l$ by performing the following computation

$$x_l = H_l(x_{l-1}) + \mathrm{id}(x_{l-1}) = H_l(x_{l-1}) + x_{l-1}. \tag{14}$$

Inspired by the philosophy of ResNet [56], to maximize information flow through layers, Huang et al. proposed DenseNet [50] with a simple connectivity pattern: the $l$-th layer in a dense block receives the feature maps of all preceding layers as inputs. That means

$$x_l = H_l([x_0 +\!\!+ x_1 +\!\!+ x_2 +\!\!+ \cdots +\!\!+ x_{l-1}]), \tag{15}$$

where $[x_0 +\!\!+ x_1 +\!\!+ x_2 +\!\!+ \cdots +\!\!+ x_{l-1}]$ is a single tensor constructed by concatenating the output feature maps of the previous layers. Additionally, all layers in the architecture receive direct supervision signals from the loss function through the shortcut connections. In this manner, the network is easy to optimize and resistant to overfitting. In DenseNet [50], multiple dense blocks are connected via transition layers. Each transition layer consists of a convolutional layer followed by an average pooling layer that changes the size of the feature maps (the concatenation operation used in Equation (15) is not viable when the size of the feature maps changes). Each layer $H_l(\cdot)$ within a dense block produces $k$ feature maps, and the parameter $k$ is called the "growth rate" of the network. The non-linear function $H_l(\cdot)$ in the original work [50] is a composite function of three consecutive operations: BN-ReLU-Conv.
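The dense connectivity of Equation (15) and the role of the growth rate can be illustrated with a toy Python sketch. Feature maps are represented as plain lists of numbers and `toy_layer` is a hypothetical stand-in for BN-ReLU-Conv, so this only demonstrates the wiring and the channel arithmetic: with k0 input maps and growth rate k, the l-th layer of a block sees k0 + k(l - 1) maps.

```python
def dense_block(x0, layers):
    """Run a dense block: each layer receives the concatenation of all earlier outputs.

    x0 is a list of feature maps; layers is a list of callables mapping a list
    of feature maps to a list of new feature maps.
    Returns the concatenated block output and the input width seen by each layer.
    """
    features = [x0]
    input_widths = []
    for layer in layers:
        concatenated = [m for group in features for m in group]  # [x_0 ++ ... ++ x_{l-1}]
        input_widths.append(len(concatenated))
        features.append(layer(concatenated))
    return [m for group in features for m in group], input_widths

def toy_layer(growth_rate):
    """Hypothetical stand-in for H_l: emits growth_rate maps, each the sum of its inputs."""
    return lambda maps: [sum(maps)] * growth_rate

# Example: k0 = 2 input maps, growth rate k = 3, two layers.
output, widths = dense_block([1.0, 2.0], [toy_layer(3), toy_layer(3)])
```

In this example the two layers see 2 and 5 input maps respectively, and the block outputs 2 + 3 × 2 = 8 maps.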

Network Design
We propose to design and optimize deep DenseNets [50] for learning and classifying human actions on the Enhanced-SPMFs. To study how recognition performance varies with architecture size, we explore different network configurations. The following configurations are used in our experiments: DenseNet (L = 100, k = 12); DenseNet (L = 250, k = 24); and DenseNet (L = 190, k = 40), where L is the depth of the network and k is the network growth rate. On all datasets, we use three dense blocks on 32 × 32 input images. In this design, $H_l(\cdot)$ is defined as Batch Normalization (BN) [79], followed by an advanced activation layer called Exponential Linear Unit (ELU) [80] and a 3 × 3 Convolution (Conv.). Dropout [80] with a rate of 0.2 is used after each Convolution to prevent overfitting. After the feature extraction stage, a Fully Connected (FC) layer is used for the classification task, in which the number of neurons is equal to the number of action classes in each dataset. The proposed networks can be trained in an end-to-end manner by gradient descent using the Adam update rule [81]. During the training stage, we minimize a cross-entropy loss function, which measures the difference between the true action label $y$ and the action $\hat{y}$ predicted by the networks over the training samples $\mathcal{X}$. In other words, the network will be trained to solve the following optimization problem

$$\arg\min_{W} \left( -\frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij} \right), \tag{16}$$

where $W$ is the set of weights that will be optimized by the model, $M$ denotes the number of samples in the training set $\mathcal{X}$ and $C$ is the number of action classes.
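For concreteness, the ELU activation and the cross-entropy objective of Equation (16) can be written in a few lines of plain Python. This is a minimal numerical sketch with hypothetical function names; the ELU scale `alpha = 1.0` is an assumed default, since the text does not state the value used, and the softmax output layer is assumed rather than stated.

```python
import math

def elu(x, alpha=1.0):
    """Exponential Linear Unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def softmax(logits):
    """Convert raw class scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(labels, predictions):
    """Equation (16) objective: mean over M samples of -sum_j y_ij * log(yhat_ij).

    labels are one-hot vectors of length C; predictions are probability vectors.
    """
    m = len(labels)
    loss = 0.0
    for y, y_hat in zip(labels, predictions):
        loss -= sum(yi * math.log(pi) for yi, pi in zip(y, y_hat) if yi > 0)
    return loss / m
```

A uniform prediction over C classes gives a loss of log C, which is a convenient sanity check for the initial loss of an untrained classifier.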

Experiments
We investigate the effectiveness of the proposed approach using four public benchmark action recognition datasets: MSR Action3D [59], KARD [60], SBU Kinect Interaction [61], and NTU-RGB+D [39]. On each benchmark, we compare our method with current state-of-the-art models. We refer the reader to the survey by Zhang et al. [30] for a full description of current RGB-D based action recognition datasets. A detailed description of each dataset is provided in Section 4.1. The implementation and training methodology are described in Section 4.2.

Datasets and Settings
MSR Action3D dataset [59]: This dataset, captured with a Kinect 1 sensor, contains 20 actions performed by 10 subjects. Each skeleton is composed of 20 joints. The MSR Action3D dataset [59] is challenging due to its high inter-action similarities. There are 567 action sequences in total; however, 10 sequences are not valid because the skeletons are missing. Thus, our experiments were conducted on the 557 valid sequences. We follow the standard protocol proposed by Li et al. [59]. Specifically, the whole dataset is divided into three subsets: AS1, AS2 and AS3. Table 2 provides the list of actions in each subset. In each subset, all subjects with IDs 1, 3, 5, 7, 9 are selected for training and the remaining subjects with IDs 2, 4, 6, 8, 10 are used for testing. Very deep neural networks such as the deep DenseNet architecture require a lot of data to train and optimize. Unfortunately, there are only 557 skeleton sequences in the MSR Action3D dataset [59]. Therefore, data augmentation techniques (i.e., random cropping, vertical flipping, and rotation with α = 90°) have been applied to this dataset to reduce overfitting. Figure 5 illustrates the three data augmentation techniques that were used in our experiments.
Kinect Activity Recognition Dataset (KARD) [60]: KARD [60] is a Kinect 1 dataset that contains 18 actions and 540 video sequences in total. Each action is performed three times by 10 subjects. The dataset is composed of RGB, depth and skeleton frames, in which each skeleton frame contains 15 key joints. The authors of the dataset [60] proposed to divide it into three subsets (i.e., Action Set 1, Action Set 2, and Action Set 3), as listed in Table 3. For each subset, three experiments have been proposed. Specifically, the first experiment (Experiment A) uses one-third of the dataset for training and the rest for testing. The second experiment (Experiment B) uses two-thirds of the dataset for training and the rest for testing. The last experiment (Experiment C) uses half of the dataset for training and the other half for testing. As was the case for the MSR Action3D dataset [59], data augmentation techniques (i.e., random cropping, vertical flipping, and rotation with α = 90°) were also applied.
SBU Kinect Interaction dataset [61]: This dataset contains the skeleton joints of two subjects involved in an interaction; each skeleton has 15 key joints. There are 8 interactions in total: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. This dataset is challenging because the joint coordinates exhibit low accuracy. Moreover, it contains non-periodic actions as well as very similar body movements. For instance, some pairs of actions are difficult to distinguish, such as exchanging objects and shaking hands, or pushing and punching. We randomly split the whole dataset into 5 folds, in which 4 folds are used for training and the remaining fold is used for testing. It should be noted that each skeleton frame provided by the SBU dataset [61] contains two separate subjects. Therefore, we consider them as two data samples, and feature computation is conducted separately for the two skeletons. Additionally, data augmentation (i.e., random cropping, vertical flipping, and rotation with α = 90°) has also been applied to the SBU dataset [61].
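The three augmentation operations named above can be sketched on an image stored as a list of rows. This is only an illustration: the crop size, the rotation direction (clockwise) and the reading of "vertical flipping" as a top-to-bottom flip are assumptions, since the text does not specify them.

```python
import random


def vertical_flip(image):
    """Flip an image (list of rows) top to bottom."""
    return image[::-1]


def rotate_90(image):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


# Toy 2 x 2 "image".
img = [[1, 2],
       [3, 4]]
flipped = vertical_flip(img)   # [[3, 4], [1, 2]]
rotated = rotate_90(img)       # [[3, 1], [4, 2]]
```

In practice a cropped image would be resized back to the network input size before training.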
NTU-RGB+D dataset [39]: This dataset, captured with a Kinect 2 sensor, is a very large-scale RGB-D dataset. To the best of our knowledge, the NTU-RGB+D dataset [39] is currently the largest benchmark dataset with skeletal data for human action analysis. It provides more than 56 thousand video samples and 4 million frames, collected from 40 distinct subjects for 60 different action classes. The following actions are provided by the NTU-RGB+D dataset [39] (please see Figure 6 for some examples): drinking, eating, brushing teeth, brushing hair, dropping, picking up, throwing, sitting down, standing up, clapping, reading, writing, tearing up paper, wearing jacket, taking off jacket, wearing a shoe, taking off a shoe, putting on glasses, taking off glasses, putting on a hat/cap, taking off a hat/cap, cheering up, hand waving, kicking something, reaching into self pocket, hopping, jumping up, making/answering a phone call, playing with phone, typing, pointing to something, taking selfie, checking time, rubbing two hands together, bowing, shaking head, wiping face, saluting, putting palms together, crossing hands in front, sneezing/coughing, staggering, falling down, touching head, touching chest, touching back, touching neck, vomiting, fanning self, punching/slapping other person, kicking other person, pushing other person, patting other person's back, pointing to the other person, hugging, giving something to other person, touching other person's pocket, handshaking, walking towards each other, and walking apart from each other. In the NTU-RGB+D dataset [39], each skeleton contains the 3D coordinates of 25 body joints. The authors [39] of this dataset suggested two different evaluation criteria: Cross-Subject evaluation and Cross-View evaluation. For the Cross-Subject setting, the sequences performed by 20 subjects (with IDs 1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35, and 38) are used for training and the remaining sequences are used for testing. In the Cross-View setting, the sequences provided by cameras 2 and 3 are used for training, while the sequences from camera 1 are used for testing. This setting allows us to evaluate the ability of the proposed skeleton-based representation to recognize actions under multiple viewpoints. We do not apply any data augmentation technique to the NTU-RGB+D dataset [39] because of its very large scale.
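The two NTU-RGB+D protocols amount to a simple filter on subject or camera ID. The sketch below assumes each sample is a dictionary with `subject` and `camera` keys; these key names and the function name are hypothetical and only serve the example.

```python
# Training subject IDs for the Cross-Subject protocol, as listed in the text.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16,
                  17, 18, 19, 25, 27, 28, 31, 34, 35, 38}
# Cross-View: cameras 2 and 3 are used for training, camera 1 for testing.
TRAIN_CAMERAS = {2, 3}


def split_samples(samples, protocol):
    """Split samples into (train, test) lists.

    Each sample is a dict with hypothetical keys 'subject' and 'camera'.
    """
    if protocol == "cross_subject":
        is_train = lambda s: s["subject"] in TRAIN_SUBJECTS
    elif protocol == "cross_view":
        is_train = lambda s: s["camera"] in TRAIN_CAMERAS
    else:
        raise ValueError(f"unknown protocol: {protocol}")
    train = [s for s in samples if is_train(s)]
    test = [s for s in samples if not is_train(s)]
    return train, test


# Toy samples: subject 1 seen by camera 1, subject 3 seen by camera 2.
samples = [{"subject": 1, "camera": 1}, {"subject": 3, "camera": 2}]
cs_train, cs_test = split_samples(samples, "cross_subject")
cv_train, cv_test = split_samples(samples, "cross_view")
```

The same sample can land in the training set under one protocol and in the test set under the other, as the toy samples show.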

Implementation Details
For all the datasets, the proposed Enhanced-SPMF representations are computed directly from the raw skeleton sequences without using a fixed number of frames. For computational efficiency, all the image representations are resized to 32 × 32 pixels. The three network configurations, DenseNet (L = 100, k = 12), DenseNet (L = 250, k = 24), and DenseNet (L = 190, k = 40), were implemented and evaluated in Python with the Keras framework using TensorFlow as back-end. During the training stage, we use mini-batches of 32 images for all networks. The weights are initialized with the He initialization technique [82]. The Adam optimizer [81] is used with default parameters (i.e., β1 = 0.9 and β2 = 0.999). Additionally, we use a dynamic learning rate during training. The initial learning rate is set to 0.01 and is decreased by a factor of 0.1 after every 50 epochs. All networks are trained from scratch for 300 epochs.
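The step-wise learning-rate schedule described above can be written as a small function. This is a sketch under one stated assumption: epochs are counted from zero, so epochs 0 to 49 use the initial rate. The function and parameter names are illustrative.

```python
def learning_rate(epoch, initial_lr=0.01, drop=0.1, epochs_per_drop=50):
    """Step decay: multiply the rate by `drop` after every `epochs_per_drop` epochs.

    `epoch` is zero-based, so epochs 0-49 use the initial rate.
    """
    return initial_lr * drop ** (epoch // epochs_per_drop)


# Rates at the start of each 50-epoch stage of a 300-epoch run.
schedule = [learning_rate(e) for e in range(0, 300, 50)]
```

Over 300 epochs the rate goes through six stages, from 0.01 down to roughly 1e-7. A function of this shape could be passed to a per-epoch scheduler callback in the training framework.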
Results on KARD dataset: We performed a total of 9 experiments on the KARD dataset [60], covering Experiments A, B, and C on each of the three action sets. Table 5 summarizes the obtained results on this dataset. We compute the average recognition accuracy over the three experiments and compare it with existing techniques including Hand-crafted Features [60], Posture Feature+Multi-class SVM [85], and Key Postures+Multi-class SVM [86]. As can be seen in Table 5, the proposed DenseNet (L = 250, k = 24) improves accuracy by 9.15% over Hand-crafted Features [60], 2.78% over Posture Feature+Multi-class SVM [85] and 0.68% over Key Postures+Multi-class SVM [86]. This result confirms that the proposed deep learning framework trained on the Enhanced-SPMFs achieves better action recognition performance than approaches based on hand-crafted features.
Figure 7. Training loss and test accuracy of the proposed DenseNets on the MSR Action3D [59], KARD [60], SBU Kinect Interaction [61], and NTU-RGB+D [39] datasets. Almost all designed networks are able to reach the optimal weights after the first 100 epochs. The symbols k and L denote the "growth rate" and the depth of the network, respectively.
Table 4. Experimental results and comparison of the proposed method with state-of-the-art approaches on the MSR Action3D dataset [59]. The list is ordered by recognition performance, in which results that outperform previous works are in bold, while the best accuracies are in blue. Our previous work on SPMF [48] is marked in red.

Results on NTU-RGB+D dataset: For the NTU-RGB+D dataset [39], the best configuration, DenseNet (L = 250, k = 24), achieves an accuracy of 80.11% on the Cross-Subject evaluation and 86.82% on the Cross-View evaluation, as summarized in Table 7. These results demonstrate the effectiveness of the proposed representation and deep learning framework, since they surpass previous state-of-the-art techniques such as Lie Group Representation [23], Hierarchical RNN [37], Dynamic Skeletons [93], Two-Layer P-LSTM [39], ST-LSTM Trust Gates [40], Geometric Features [74], Two-Stream RNN [91], Enhanced Skeleton [94], Lie Group Skeleton+CNN [95], and GCA-LSTM [92]. The experimental results also show that the proposed method leads to better overall action recognition performance than our previous models, including Skeleton-based ResNet [51] and SPMF Inception-ResNet-222 [48]. With a high recognition rate on the Cross-View evaluation (86.82%), where the sequences provided by cameras 2 and 3 are used for training and the sequences from camera 1 are used for testing, the proposed method shows its effectiveness in dealing with the view-independent action recognition problem. Figure 7 (last row) shows the training loss and test accuracy of the DenseNet (L = 250, k = 24) on this dataset.
Table 6. Action recognition accuracies (%) and comparison with previous works on the SBU Kinect Interaction dataset [61]. The best accuracies are in blue. Results that surpass previous works are in bold.

Table 7. Experimental results and comparison of the proposed method with previous approaches on the NTU-RGB+D dataset [39]. The best accuracies are in blue. Results that surpass previous works are in bold. Our previous works [48,51] are marked in red.

An Ablation Study on the Proposed Enhanced-SPMF Representation
We believe that the use of the AHE algorithm [49] and the Savitzky-Golay smoothing filter [37,76] makes the proposed representation more discriminative, which improves recognition accuracy. To verify this hypothesis, we carried out an ablation study on the Enhanced-SPMF representation using the SBU Kinect Interaction dataset [61]. Specifically, we trained the proposed DenseNet (L = 250, k = 24) on both the SPMFs and the Enhanced-SPMFs. The same hyper-parameters and training methodology were applied in both cases. The experimental results indicate that the proposed deep network achieves better recognition accuracy when trained on the Enhanced-SPMFs. As reported in Figure 9, applying the AHE algorithm [49] and the Savitzky-Golay smoothing filter [37,76] improves the accuracy by 4.09%. This result validates our hypothesis above.

Visualization of Deep Feature Maps
Different action classes have different discriminative characteristics. To better understand the internal operation of the proposed deep networks and to study what they learn from the skeleton-based representation, we feed Enhanced-SPMFs corresponding to different action classes of the MSR Action3D dataset [59] to the DenseNet (L = 100, k = 12) and visualize the individual feature maps learned by the network at the end of a dense block (intermediate layer). We observe that the designed network is able to extract discriminative features from the Enhanced-SPMF representations. This is expressed through the color of each learned feature map, as can be seen in Figure 10. These discriminative features play a key role in classifying actions.
Figure 10. Visualization of feature maps learned by the proposed DenseNet (L = 100, k = 12) from several samples of the MSR Action3D dataset [59]. Best viewed in color.

Computational Efficiency Evaluation
In this section, we take the AS1 subset of the MSR Action3D dataset [59] and the DenseNet (L = 100, k = 12) to evaluate the computational efficiency of the proposed method. Figure 11 illustrates the three main stages of the deep learning framework for learning and recognizing actions from skeleton sequences: an encoding process from input skeleton sequences to color images (Stage 1), a supervised training stage (Stage 2), and an inference stage (Stage 3). The implementation is in Python/Keras. The proposed deep network has only 6.0M parameters and, when trained on a single GeForce GTX 1080 Ti GPU, takes less than six hours to reach convergence. The latency of predicting an action for a new skeleton sequence (including encoding it into color images, executed on a CPU) is about 74.8 × 10⁻³ seconds per sequence. Additionally, it should be noted that the computation of the Enhanced-SPMFs can be implemented and optimized on a GPU for real-time applications. Please see Table 8 for further details. This result verifies the effectiveness of the proposed learning framework in terms of computational cost.

Limitations
The use of the Savitzky-Golay filter [76] helps reduce the effect of noise on the raw skeleton sequences. However, the proposed approach cannot overcome the problem of missing data. In other words, as the Enhanced-SPMF is a global representation of the whole skeleton sequence, data errors in local fragments of the input sequences could reduce the recognition rate. Another open problem of the proposed approach is how to cope with the Online Action Recognition (OAR) task, that is, how to detect and recognize human actions from unsegmented streams in a continuous manner, where the boundaries between different kinds of actions within the stream are unknown. A common solution for OAR is the family of sliding window based methods [97,98]. These approaches consider the temporal coherence within the window for prediction. We can also apply this idea to the current problem. For example, during the online inference phase, we can slide a window over the original skeleton sequences or over the image-coded representations (i.e., Enhanced-SPMFs) and then predict the action with the pretrained deep learning model, as shown in Figure 11 (Stage 3). However, the performance of this approach is sensitive to the window size: a window that is either too large or too small could lead to a significant drop in recognition performance. Another solution is to use Temporal Attention Networks (TANs) [99][100][101][102], which incorporate temporal attention models for video-based action recognition.
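The sliding-window idea can be sketched as follows. This is not an evaluated part of the method: the window size, the stride and the `classify` callable (a stand-in for encoding a window as an Enhanced-SPMF and running the pretrained network) are all hypothetical.

```python
def sliding_windows(frames, window_size, stride):
    """Yield (start_index, window) pairs over a stream of skeleton frames."""
    for start in range(0, len(frames) - window_size + 1, stride):
        yield start, frames[start:start + window_size]


def online_predict(frames, window_size, stride, classify):
    """Label each window with `classify`, a stand-in for the pretrained model.

    In a full pipeline, `classify` would encode the window as an
    Enhanced-SPMF and run the network; here it is any callable.
    """
    return [(start, classify(window))
            for start, window in sliding_windows(frames, window_size, stride)]


# Toy stream of 10 "frames" and a dummy classifier that returns the window mean.
stream = list(range(10))
predictions = online_predict(stream, window_size=4, stride=3,
                             classify=lambda w: sum(w) / len(w))
print(predictions)  # [(0, 1.5), (3, 4.5), (6, 7.5)]
```

The choice of `window_size` directly controls how much of an action each prediction sees, which is the sensitivity noted above.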

Conclusions
In this paper, we have presented an efficient and effective deep learning framework for 3D human action recognition from skeleton sequences. We proposed a novel motion representation, termed Enhanced-SPMF, which captures the spatio-temporal information of skeleton movements and transforms it into color images. We exploited the Adaptive Histogram Equalization (AHE) technique to enhance the local textures of the color images and generate more discriminative features for learning and classification tasks. Different Deep Convolutional Neural Networks (D-CNNs) based on the DenseNet architecture have been designed and optimized to learn and recognize actions from the proposed representation in an end-to-end manner. Extensive empirical evaluations on four challenging public datasets demonstrate the effectiveness of the proposed approach on individual actions, interactions, multiview and large-scale datasets. In particular, the results indicate that the proposed method is invariant to viewpoint changes and requires low computational cost for training and inference. We hope that this study opens a new door to exploiting the potential of skeletal data and helps to address the current challenges in building real-world action recognition applications.

Appendix A. Savitzky-Golay Smoothing Filter
The Savitzky-Golay (S-G) filter is a low-pass filter based on local least-squares polynomial approximation that is often used to smooth noisy data. The 3D skeleton joints obtained from depth cameras can be considered as a series of equally spaced data in the time domain. Applying the S-G filter to raw skeletal data therefore helps reduce the level of noise while maintaining the 3D geometric characteristics of the input sequences.
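Consider a sequence of N = 2M + 1 input data points x[n], n = −M, ..., M, centered at n = 0. These points are fitted by a polynomial of degree P,

$$p(n) = \sum_{k=0}^{P} a_k n^k,$$

whose coefficients are chosen to minimize the squared approximation error

$$\varepsilon_P = \sum_{n=-M}^{M} \left( p(n) - x[n] \right)^2. \quad \text{(A3)}$$

To this end, one solution is to determine the set of coefficients for which the partial derivatives of the error are equal to zero,

$$\frac{\partial \varepsilon_P}{\partial a_i} = 0, \quad i = 0, 1, \dots, P,$$

and the polynomial coefficients follow from solving the resulting system of normal equations. For example, for smoothing by a 5-point quadratic polynomial with N = 5 (M = 2, n = −2, −1, 0, 1, 2), the t-th filtering result y_t is given by

$$y_t = \frac{1}{35}\left(-3x_{t-2} + 12x_{t-1} + 17x_t + 12x_{t+1} - 3x_{t+2}\right). \quad \text{(A10)}$$

Equation (A10) above was used in our experiments to reduce the effect of noise on the raw skeleton data.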
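For a 5-point quadratic fit, the S-G smoothing weights are the standard values (−3, 12, 17, 12, −3)/35. The sketch below applies them to a single coordinate of one joint over time. Leaving the first and last two samples unchanged is an assumption of this example, since the text does not say how sequence boundaries are handled.

```python
# Coefficients of the 5-point quadratic Savitzky-Golay smoothing filter.
SG5_COEFFS = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(signal):
    """Smooth a 1D sequence with the 5-point quadratic S-G filter.

    The first and last two samples are copied unchanged; this boundary
    handling is an assumption of the sketch.
    """
    out = list(signal)
    for t in range(2, len(signal) - 2):
        out[t] = sum(c * signal[t + j] for c, j in zip(SG5_COEFFS, range(-2, 3)))
    return out


# A single noisy spike is attenuated to 17/35 of its height at the center.
smoothed = savgol5([0.0, 0.0, 1.0, 0.0, 0.0])
```

Because the filter comes from a quadratic least-squares fit, it leaves any quadratic trajectory unchanged while damping isolated spikes. In a skeleton pipeline the function would be applied independently to the x, y and z trajectory of every joint.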

Appendix B. Degradation Phenomenon in Training Very Deep Neural Networks
Very deep neural networks have demonstrated high performance on many vision-related tasks [54][55][56][57]. However, they are very difficult to optimize. One of the main challenges in training deeper networks is the vanishing and exploding gradient problem [103]. Specifically, when the network is deep enough, the supervision signals from the output layer can be completely attenuated or can explode on their way back towards the earlier layers. Therefore, the network cannot learn its parameters effectively. These obstacles can be addressed by techniques such as Normalized Initialization [104] or Batch Normalization [79]. However, once deep networks start converging, another problem appears: the training and test errors increase when more layers are added to the architecture. This is called the degradation phenomenon. Figure A1 shows an experimental result [56] related to this phenomenon.
Figure A1. Degradation phenomenon reported in [56]. The deeper network has higher error in both the training and test phases. Figure reproduced from the work of He et al. [56] and used with permission from IEEE.