A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera

We present a deep learning-based multitask framework for joint 3D human pose estimation and action recognition from RGB sensors using simple cameras. The approach proceeds along two stages. In the first, a real-time 2D pose detector is run to determine the precise pixel location of important keypoints of the human body. A two-stream deep neural network is then designed and trained to map detected 2D keypoints into 3D poses. In the second stage, the Efficient Neural Architecture Search (ENAS) algorithm is deployed to find an optimal network architecture that is used for modeling the spatio-temporal evolution of the estimated 3D poses via an image-based intermediate representation and performing action recognition. Experiments on Human3.6M, MSR Action3D and SBU Kinect Interaction datasets verify the effectiveness of the proposed method on the targeted tasks. Moreover, we show that the method requires a low computational budget for training and inference. In particular, the experimental results show that by using a monocular RGB sensor, we can develop a 3D pose estimation and human action recognition approach that reaches the performance of RGB-depth sensors. This opens up many opportunities for leveraging RGB cameras (which are much cheaper than depth cameras and extensively deployed in private and public places) to build intelligent recognition systems.


Introduction
Human action recognition from videos has been researched for decades, since this topic plays a key role in various areas such as intelligent surveillance, human-robot interaction, robot vision and so on. Although significant progress has been achieved in the past few years, building an accurate, fast and efficient system for the recognition of actions in unseen videos is still a challenging task due to a number of obstacles, e.g. changes in camera viewpoint, occlusions, background, speed of motion, etc. Traditional approaches on video-based action recognition [1] have focused on extracting hand-crafted local features and building motion descriptors from RGB sequences. Many spatio-temporal representations of human motion have been proposed and widely exploited with success such as SIFT [2], HOF [3] or Cuboids [4]. However, one of the major limitations of these approaches is the lack of 3D structure from the scene, and recognizing human actions based only on RGB information is not enough to overcome the current challenges of the field.
The rapid development of depth-sensing time-of-flight camera technology has helped in dealing with this problem, which is considered complex for traditional cameras. Low-cost and easy-to-use depth cameras are able to provide detailed 3D structural information of human motion. In particular, most of the current depth cameras have integrated real-time skeleton estimation and tracking frameworks [5], facilitating the collection of skeletal data. This is a high-level representation of the human body, which is suitable for the problem of motion analysis. Hence, exploiting skeletal data for 3D action recognition opens up opportunities for addressing the limitations of RGB-based solutions, and many skeleton-based action recognition approaches have been proposed [6,7,8,9,10]. However, depth sensors have some significant drawbacks with respect to 3D pose estimation. For instance, they are only able to operate up to a limited distance and within a limited field of view. Moreover, a major drawback of depth cameras is the inability to work in bright light, especially sunlight [11]. Our focus in this paper is therefore to propose a 3D skeleton-based action recognition approach without depth sensors. Specifically, we are interested in building a unified deep framework for both 3D pose estimation and action recognition from RGB video sequences. As shown in Figure 1, our approach consists of two stages. In the first, estimation stage, the system recovers the 3D human poses from the input RGB video. In the second, recognition stage, an action recognition approach is developed and stacked on top of the 3D pose estimator in a unified framework, where the estimated 3D poses are used as inputs to learn the spatio-temporal motion features and predict action labels.
Figure 1: Overview of the proposed method. In the estimation stage, we first run OpenPose [12], a real-time, state-of-the-art multi-person 2D pose detector, to generate 2D human body keypoints. A deep neural network is then trained to produce 3D poses from the 2D detections. In the recognition stage, the estimated 3D poses are encoded into a compact image-based representation and finally fed into a deep convolutional network for supervised classification; this network is searched automatically by the ENAS algorithm [13].
There are four hypotheses that motivate us to build a deep learning framework for human action recognition from 3D poses. First, actions can be correctly represented through the 3D pose movements [14,15]. Second, the 3D human pose has a high level of abstraction with much less complexity compared to RGB and depth streams. This makes the training and inference processes much simpler and faster. Third, depth cameras are able to provide highly accurate skeletal data for 3D action recognition. However, they are expensive and not always available (e.g. for outdoor scenes). A fast and accurate approach to 3D pose estimation from only RGB input is therefore highly desirable. Fourth, state-of-the-art 2D pose detectors [12,16] are able to provide 2D poses with a high degree of accuracy in real-time. Meanwhile, deep networks have proved their capacity to learn complex functions from high-dimensional data. Hence, a simple network model can also learn a mapping to convert 2D poses into 3D.
The effectiveness of the proposed method is evaluated on public benchmark datasets (Human3.6M [17], MSR Action3D [18], and SBU [19]). Far beyond our expectations, the experimental results demonstrate state-of-the-art performance on the targeted tasks (Section 4.3) and support our hypotheses above. Furthermore, we show that this approach has a low computational cost (Section 4.4). Overall, our main contributions are as follows:
• First, we present a two-stream, lightweight neural network to recover 3D human poses from RGB images provided by a monocular camera. Our proposed method achieves state-of-the-art results on the 3D human pose estimation task and benefits action recognition.
• Second, we propose to put an action recognition approach on top of the 3D pose estimator to form a unified framework for 3D pose-based action recognition. It takes the estimated 3D poses as inputs, encodes them into a compact image-based representation and finally feeds them to a deep convolutional network, which is designed automatically by a neural architecture search algorithm. Surprisingly, our experiments show that we reach state-of-the-art results on this task, even when compared with methods using depth cameras.
The rest of this paper is organized as follows. We present a review of the related work in Section 2. The proposed method is explained in Section 3. Experiments are provided in Section 4 and Section 5 concludes the paper.

Related work
This section reviews two main topics that are directly related to ours, i.e. 3D pose estimation from RGB images and 3D pose-based action recognition. Due to the limited size of a conference paper, an extensive literature review is beyond the scope of this section. Instead, the interested reader is referred to the surveys of Sarafianos et al. [20] for recent advances in 3D human pose estimation and Presti et al. [21] for 3D skeleton-based action recognition.

3D human pose estimation
The problem of 3D human pose estimation has been intensively studied in recent years. Almost all early approaches for this task were based on feature engineering [17,22,23], while the current state-of-the-art methods are based on deep neural networks [24,25,26,27,28,29]. Many of them are regression-based approaches that directly predict 3D poses from RGB images via 2D/3D heatmaps. For instance, Li et al. [24] designed a deep convolutional network for human detection and pose regression. The regression network learns to predict 3D poses from single images using the output of a body part detection network. Tekin et al. [25] proposed to use a deep network to learn a regression mapping that directly estimates the 3D pose in a given frame of a sequence from a spatio-temporal volume centered on it. Pavlakos et al. [26] used multiple fully convolutional networks to construct a volumetric stacked hourglass architecture, which is able to recover 3D poses from RGB images. Pavllo et al. [27] exploited a temporal dilated convolutional network [30] for estimating 3D poses. However, this approach led to a significant increase in the number of parameters as well as the required memory. Mehta et al. [28] introduced a real-time approach to predict 3D poses from a single RGB camera. They used ResNets [31] to jointly predict 2D and 3D heatmaps as regression tasks. Recently, Katircioglu et al. [29] introduced a deep regression network for predicting 3D human poses from monocular images via 2D joint location heatmaps. This architecture is in fact an overcomplete autoencoder that learns a high-dimensional latent pose representation and accounts for joint dependencies, in which a Long Short-Term Memory (LSTM) network [32] is used to enforce temporal consistency on 3D pose predictions.
Several studies [26,28,29] have stated that regressing the 3D pose from 2D joint locations is difficult and not very accurate. However, motivated by Martinez et al. [33], we believe that a simple neural network can effectively learn a direct 2D-to-3D mapping. Therefore, this paper aims at proposing a simple, effective and real-time approach for 3D human pose estimation that benefits action recognition. To this end, we design and optimize a two-stream deep neural network that performs 3D pose predictions from the 2D human poses. These 2D poses are generated by a state-of-the-art 2D detector that is able to run in real-time for multiple people. We empirically show that although the proposed approach is computationally inexpensive, it is still able to improve the state-of-the-art.

3D pose-based action recognition
Human action recognition from skeletal data or 3D poses is a challenging task. Previous works on this topic can be divided into two main groups of methods. The first group [6,9,34] extracts hand-crafted features and uses probabilistic graphical models, e.g. Hidden Markov Model (HMM) [34] or Conditional Random Field (CRF) [35], to recognize actions. However, almost all of these approaches require a lot of feature engineering. The second group [36,37,38] considers 3D pose-based action recognition as a time-series problem and proposes to use Recurrent Neural Networks with Long-Short Term Memory units (RNN-LSTMs) [32] for modeling the dynamics of the skeletons. Although RNN-LSTMs are able to model the long-term temporal characteristics of motion and have advanced the state-of-the-art, this approach feeds raw 3D poses directly into the network and just considers them as a kind of low-level feature. The large number of input features makes RNNs very complex and may easily lead to overfitting. Moreover, many RNN-LSTMs act merely as classifiers and cannot extract high-level features for recognition tasks [39].
In the literature, 3D human pose estimation and action recognition are closely related. However, both problems are generally considered as two distinct tasks [40]. Although some approaches have been proposed for tackling the problem of jointly predicting 3D poses and recognizing actions in RGB images or video sequences [41,42,43], they are data-dependent and require a lot of feature engineering, except the work of Luvizon et al. [43]. Unlike in previous studies, we propose a multitask learning framework for 3D pose-based action recognition by reconstructing 3D skeletons from RGB images and exploiting them for action recognition in a joint way. Experimental results on public and challenging datasets show that our framework is able to solve the two tasks in an effective way.

Proposed Method
We explain the proposed method in this section. First, our approach for 3D human pose estimation is presented. We then introduce our solution for 3D pose-based action recognition.

Problem definition
Given an RGB video clip of a person who starts to perform an action at time $t = 0$ and ends at $t = T$, the problem studied in this work is to generate a sequence of 3D poses $P = (p_0, \ldots, p_T)$, where $p_i \in \mathbb{R}^{3 \times M}$, $i \in \{0, \ldots, T\}$, at the estimation stage. The generated $P$ is then used as input for the recognition stage to predict the corresponding action label $A$ by a supervised learning model. See Figure 1 for an illustration of the problem.

3D human pose estimation
Given an input RGB image $I \in \mathbb{R}^{W \times H \times 3}$, we aim to estimate the body joint locations in 3-dimensional space, denoted as $p_{3D} \in \mathbb{R}^{3 \times M}$. To this end, we first run the state-of-the-art human 2D pose detector, namely OpenPose [12], to produce a series of 2D keypoints $p_{2D} \in \mathbb{R}^{2 \times N}$.
To recover the 3D joint locations, we try to learn a direct 2D-to-3D mapping $f_r : p_{2D} \xrightarrow{f_r} p_{3D}$. This transformation can be implemented by a deep neural network in a supervised manner, where $\theta$ is a set of trainable parameters of the function $f_r$. To optimize $f_r$, we minimize the prediction error over a labelled dataset of $C$ poses by solving the optimization problem
$$\arg\min_{\theta} \frac{1}{C} \sum_{i=1}^{C} L\big(f_r(x_i; \theta),\, y_i\big).$$
Here $x_i$ and $y_i$ are the input 2D poses and the ground truth 3D poses, respectively; $L$ denotes a loss function. In our implementation the robust Huber loss [44] is used to deal with outliers.
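As an illustration of the objective above, the following minimal NumPy sketch evaluates a Huber loss and the resulting average error of a mapping over a set of pose pairs. The threshold `delta` and the helper names `huber_loss` and `empirical_risk` are assumptions made for this example; the text does not report the threshold that was used.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Huber loss averaged over all coordinates.

    `delta` is a hypothetical threshold; its actual value is not given in the text.
    """
    err = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64))
    quadratic = 0.5 * err ** 2               # applied where |err| <= delta
    linear = delta * (err - 0.5 * delta)     # applied where |err| > delta
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

def empirical_risk(f_r, xs, ys, delta=1.0):
    """Average Huber loss of the mapping f_r over C (2D pose, 3D pose) pairs."""
    return float(np.mean([huber_loss(f_r(x), y, delta) for x, y in zip(xs, ys)]))
```

Because large residuals are penalized linearly rather than quadratically, a few badly detected keypoints contribute less to the objective than they would under a squared error.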

Network design
State-of-the-art deep learning architectures such as ResNet [31], Inception-ResNet-v2 [45], DenseNet [46], or NASNet [47] have achieved an impressive performance in supervised learning tasks with high dimensional data, e.g. 2D or 3D images. However, the use of these architectures [31,45,46,47] on low dimensional data like the coordinates of the 2D human joints could lead to overfitting. Therefore, our design is based on a simple and lightweight multilayer network architecture without convolution operations. In the design process, we exploit some recent improvements in the optimization of modern deep learning models [31,46]. Concretely, we propose a two-stream network. Each stream comprises linear layers, Batch Normalization (BN) [48], Dropout [49], SELU [50] and Identity connections [31]. During the training phase, the first stream takes the ground truth 2D locations as input. The 2D human joints predicted by OpenPose [12] are fed to the second stream. The outputs of the two streams are then averaged. Figure 2 illustrates our network design. Note that learning with the ground truth 2D locations for both of these streams could lead to a higher level of performance. However, training with the 2D OpenPose detections could improve the generalization ability of the network and make it more robust during inference, when only OpenPose's 2D output is used to deal with action recognition in the wild.
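The design described above can be sketched as a forward pass in plain NumPy. This is a simplified illustration rather than the trained model: the layer ordering inside each block, the hidden width, the number of blocks, the omission of biases and learned BN parameters, and the use of batch statistics at inference are all assumptions made for this example, and every function name is hypothetical.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    # exp is only evaluated on the non-positive part to avoid overflow
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

def batch_norm(h, eps=1e-5):
    # Simplified BN: normalizes with batch statistics, no learned scale or shift
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def init_stream(rng, n_in, n_hidden, n_out, n_blocks=2):
    """He-initialized weights for one stream (sizes are hypothetical)."""
    def he(fan_in, fan_out):
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
    return {
        "w_in": he(n_in, n_hidden),
        "blocks": [(he(n_hidden, n_hidden), he(n_hidden, n_hidden)) for _ in range(n_blocks)],
        "w_out": he(n_hidden, n_out),
    }

def stream_forward(params, x, rng=None, drop_rate=0.25):
    """Linear -> BN -> SELU -> Dropout layers with identity connections."""
    def layer(h, w):
        h = selu(batch_norm(h @ w))
        if rng is not None:  # training mode: inverted dropout
            mask = rng.random(h.shape) >= drop_rate
            h = h * mask / (1.0 - drop_rate)
        return h
    h = layer(x, params["w_in"])
    for w1, w2 in params["blocks"]:
        h = h + layer(layer(h, w1), w2)  # identity (residual) connection
    return h @ params["w_out"]

def two_stream_predict(params_gt, params_det, x_gt, x_det):
    """Average the 3D predictions of the two streams."""
    return 0.5 * (stream_forward(params_gt, x_gt) + stream_forward(params_det, x_det))
```

With 2D inputs flattened to vectors of length 2N and 3D outputs of length 3M, a batch of B poses maps from shape (B, 2N) to (B, 3M).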

3D pose-based action recognition
In this section, we explain how to integrate the estimation stage with the recognition stage in a unified framework. Specifically, the proposed recognition approach is stacked on top of the 3D pose estimator. To explore the high-level information of the estimated 3D poses, we encode them into a compact image-based representation. These intermediate representations are then fed to a Deep Convolutional Neural Network (D-CNN) for learning and classifying actions. This idea has been proven effective in [51,52,53]. Thus, the spatio-temporal patterns of a 3D pose sequence are transformed into a single color image as a global representation called Enhanced-SPMF [53] via two important elements of a human movement: 3D poses and their motions. Due to the limited space available, a detailed description of the Enhanced-SPMF is not included. We refer the interested reader to the work described in [53] for further technical details. Figure 3 visualizes some Enhanced-SPMF representations from samples of the MSR Action3D dataset [18].
For learning and classifying the obtained images, we propose to use the Efficient Neural Architecture Search (ENAS) [13], a recent state-of-the-art technique for automatic design of deep neural networks. ENAS is in fact an extension of an important advance in deep learning called NAS [47], which is able to automate the design process of convolutional architectures on a dataset of interest. This method searches for optimal building blocks (called cells, including normal cells and reduction cells), and the final architecture is then constructed from the best cells found. In NAS, an RNN first samples a candidate architecture, called a child model. This child model is then trained to convergence on the desired task and reports its performance. Next, the RNN uses the performance as a guiding signal to find a better architecture. This process is repeated many times, making NAS computationally expensive and time-consuming (e.g. on CIFAR-10, NAS needs 4 days with 450 GPUs to discover the best architecture). ENAS has been proposed to improve the efficiency of NAS. The key idea of ENAS [13] is to share parameters among child models, which reduces the time otherwise spent training each child model from scratch to convergence. State-of-the-art performance has been achieved by ENAS on well known public datasets. We encourage the readers to refer to the original paper [13] for more details. Figure 4 illustrates the entire pipeline of our approach for the recognition stage.
Figure 2: Diagram of the proposed two-stream network for training our 3D pose estimator.
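To make the idea of an image-based intermediate representation concrete, the sketch below turns a 3D pose sequence and its frame-to-frame motions into a single color image by min-max normalization. It is not the Enhanced-SPMF of [53], whose details are not given here; it is a deliberately simplified stand-in, and the function names and the side-by-side layout of poses and motions are assumptions made for this example.

```python
import numpy as np

def to_rgb(values):
    """Linearly map an array of coordinates to integers in [0, 255]."""
    lo, hi = values.min(), values.max()
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return np.round((values - lo) * scale).astype(np.uint8)

def encode_sequence(poses):
    """Encode poses of shape (T, M, 3), T >= 2, as an (M, 2T - 1, 3) image.

    Rows index joints, columns index time, and the (x, y, z) coordinates
    become the three color channels. Pose columns come first, followed by
    motion columns (differences between consecutive frames).
    """
    poses = np.asarray(poses, dtype=np.float64)
    motions = poses[1:] - poses[:-1]
    pose_img = to_rgb(poses).transpose(1, 0, 2)      # (M, T, 3)
    motion_img = to_rgb(motions).transpose(1, 0, 2)  # (M, T - 1, 3)
    return np.concatenate([pose_img, motion_img], axis=1)
```

An image built this way can be resized to a fixed resolution and passed to any image classifier, which is what allows a convolutional architecture search to be applied to skeleton sequences.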

Datasets and settings
We evaluate the proposed method on three challenging datasets: Human3.6M, MSR Action3D and SBU Kinect Interaction. Human3.6M is used for evaluating 3D pose estimation. Meanwhile, the other two datasets are used for validating action recognition. The characteristics of each dataset are as follows.
Human3.6M [17]: This is a very large-scale dataset containing 3.6 million different 3D articulated poses captured from 11 actors for 17 actions, under 4 different viewpoints. For each subject, the dataset provides 32 body joints, from which only 17 joints are used for training and computing scores. In particular, 2D joint locations and 3D poses ground truth are available for evaluating supervised learning models.
MSR Action3D [18]: This dataset contains 20 actions, performed by 10 subjects. Our experiment was conducted on 557 video sequences of MSR Action3D, with the whole dataset divided into three subsets: AS1, AS2, and AS3. There are 8 action classes in each subset. Half of the data is selected for training and the rest is used for testing.
SBU Kinect Interaction [19]: This dataset contains a total of 300 interactions, performed by 7 participants for 8 actions. This is a challenging dataset due to the fact that it contains pairs of actions that are difficult to distinguish, such as exchanging objects vs. shaking hands, or pushing vs. punching. We randomly split the whole dataset into 5 folds, in which 4 folds are used for training and the remaining 1 fold is used for testing.
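The random 5-fold protocol described for SBU Kinect Interaction can be sketched as follows. The seed, the choice of which fold is held out, and the function name are assumptions made for this example; the text only states that the split is random with 4 folds for training and 1 for testing.

```python
import numpy as np

def five_fold_split(n_samples, test_fold=0, n_folds=5, seed=0):
    """Randomly partition sample indices into folds and hold one fold out."""
    rng = np.random.default_rng(seed)  # hypothetical seed, for reproducibility only
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    test_idx = folds[test_fold]
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != test_fold])
    return train_idx, test_idx

# With 300 interactions this yields 240 training and 60 test samples.
train_idx, test_idx = five_fold_split(300)
```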

Implementation details
The proposed networks were implemented in Python using Keras with a TensorFlow backend. The two streams of the 3D pose estimator are trained separately with the same hyperparameter settings, in which we use mini-batches of 128 poses with a dropout rate of 0.25. The weights are initialized by He initialization [54]. The Adam optimizer [55] is used with default parameters. The initial learning rate is set to 0.001 and is decreased by a factor of 0.5 after every 50 epochs. The network is trained for 300 epochs from scratch on the Human3.6M dataset [17]. For the action recognition task, we run OpenPose [12] to generate 2D detections on MSR Action3D [18] and SBU Kinect Interaction [19]. The 3D pose estimator pre-trained on Human3.6M [17] is then used to provide 3D poses. We use standard data pre-processing and augmentation techniques such as random cropping and flipping on these two datasets due to their small sizes. To discover optimal recognition networks, we use ENAS [13] with the same parameter setting as the original work. Concretely, the shared parameters ω are trained with Nesterov's accelerated gradient descent [56] using a cosine learning rate schedule [57]. The candidate architectures are initialized by He initialization [54] and trained by the Adam optimizer [55] with a learning rate of 0.00035. Additionally, each search is run for 200 epochs.
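The step schedule used for the pose estimator (initial rate 0.001, halved after every 50 epochs, 300 epochs in total) can be written as a small function. Counting epochs from zero is an assumption of this sketch, as is the function name.

```python
def learning_rate(epoch, initial=1e-3, factor=0.5, step=50):
    """Learning rate at a zero-indexed epoch under a step decay schedule."""
    return initial * factor ** (epoch // step)

# Rates at the start of each 50-epoch stage of a 300-epoch run
schedule = [learning_rate(e) for e in range(0, 300, 50)]
```

Over 300 epochs the rate is halved five times, so the final stage trains at 1/32 of the initial rate.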

Evaluation on 3D pose estimation
We evaluate the effectiveness of the proposed 3D pose estimation network using the standard protocol of the Human3.6M dataset [17,26,28,33]. Five subjects S1, S5, S6, S7, S8 are used for training and the remaining two subjects S9, S11 are used for evaluation. Experimental results are reported as the average error in millimeters between the ground truth and the corresponding predictions over all joints. Much to our surprise, our method outperforms the previous best result from the literature [33] by 3.1mm, corresponding to an error reduction of 6.8%, even when combining the ground truth 2D locations with the 2D OpenPose detections. This result shows that our network design can learn to recover the 3D pose from the 2D joint locations with a remarkably low error rate, which, to the best of our knowledge, establishes a new state-of-the-art on 3D human pose estimation (see Table 1 and Figure 5).
Table 1: Experimental results and comparison with previous state-of-the-art 3D pose estimation approaches on the Human3.6M dataset [17]. The symbol denotes that a 2D detector was used and the symbol † denotes the ground truth 2D joint locations were used.
Figure 5: Visualization of 3D output of the estimation stage with some samples on the test set of Human3.6M [17]. For each example, from left to right are 2D poses, 3D ground truths and our 3D predictions, respectively.
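The error measure used in this evaluation, the average distance in millimeters between predicted and ground truth joints, can be computed as in the sketch below. Using the Euclidean distance per joint is an assumption of this example (it is the usual convention on Human3.6M), and any alignment of the root joint is left to the caller; the function name is hypothetical.

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean Euclidean distance over all frames and joints.

    pred, gt: arrays of shape (n_frames, n_joints, 3), in millimeters.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```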

Evaluation on action recognition
Table 2 reports the experimental results and comparisons with state-of-the-art methods on the MSR Action3D dataset [18]. The ENAS algorithm [13] is able to explore a diversity of network architectures, and the best design is identified based on its validation score. The final architecture achieved a total average accuracy of 97.98% over the three subsets AS1, AS2 and AS3. This result outperforms many previous studies [9,18,36,37,64,65,66,67,68], many of which are depth sensor-based approaches. Figure 6 provides a schematic diagram of the best cells and optimal architecture found by ENAS on the AS1 subset [18].
For the SBU Kinect Interaction dataset [19], the best model achieved an accuracy of 96.30%, as shown in Table 3. These results indicate that, by using only the predicted 3D poses, we are able to outperform previous works reported in [36,69,70,71,72,73,74] and reach the state-of-the-art results of [53,75], which rely on accurate skeletal data provided by the Kinect v2 sensor.

Computational efficiency evaluation
On a single GeForce GTX 1080Ti GPU with 11GB memory, the runtime of OpenPose [12] is less than 0.1s per frame on images of 800 × 450 pixels. On the Human3.6M dataset [17], the 3D pose estimation stage takes around 15ms to complete a pass (forward + backward) through each stream with a mini-batch of size 128. Each epoch completes within 3 minutes. For the action recognition stage, our implementation of the ENAS algorithm takes about 2 hours to find the final architecture (∼2.3M parameters) on each subset of the MSR Action3D dataset [18], whilst it takes around 3 hours on the SBU Kinect Interaction dataset [19] to discover the best architecture (∼3M parameters). With small architecture sizes, the discovered networks require low computing time for the inference stage, making our approach more practical for large-scale problems and real-time applications.

Conclusions
In this paper, we presented a unified deep learning framework for joint 3D human pose estimation and action recognition from RGB video sequences. The proposed method first runs a state-of-the-art 2D pose detector to estimate 2D locations of body joints. A deep neural network is then designed and trained to learn a direct 2D-to-3D mapping and predict human poses in 3D space. Experimental results demonstrated that the 3D human poses can be effectively estimated by a simple network design and training methodology over 2D keypoints. We also introduced a novel action recognition approach based on a compact image-based representation and automated machine learning, in which an advanced neural architecture search algorithm is exploited to discover the best performing architecture for each recognition task. Our experiments on public and challenging action recognition datasets indicated that the proposed framework is able to reach state-of-the-art performance, whilst requiring a lower computation budget for training and inference. However, our method naturally depends on the quality of the output of the 2D detectors. Hence, a limitation is that it cannot recover 3D poses from failed 2D detections. To tackle this problem, we are currently expanding this study by adding more visual evidence to the network in order to achieve further gains in performance. The preliminary results are encouraging.

Figure 3: Intermediate image-based representations for the recognition stage.

Figure 4: Illustration of the proposed approach for 3D pose-based action recognition.

Figure 6: Diagram of the top performing normal cell (a) and reduction cell (b) discovered by ENAS [13] on the AS1 subset [18]. They were then used to construct the final network architecture (c). We refer interested readers to [13] to better understand this procedure.