Informatics
  • Article
  • Open Access

18 January 2021

Deep Full-Body HPE for Activity Recognition from RGB Frames Only

1 University of Tunis El Manar, National Engineering School of Tunis, 1002 Tunis, Tunisia
2 Université de Sousse, Ecole Nationale d’Ingénieurs de Sousse, LATIS-Laboratory of Advanced Technology and Intelligent Systems, 4023 Sousse, Tunisia
* Authors to whom correspondence should be addressed.
This article belongs to the Section Machine Learning

Abstract

Human Pose Estimation (HPE) is defined as the problem of localizing human joints (also known as keypoints: elbows, wrists, etc.) in images or videos. It can also be formulated as the search for a specific pose in the space of all articulated poses. HPE has recently received significant attention from the scientific community, mainly because pose estimation is considered a key step for many computer vision tasks. Although many approaches have reported promising results, the problem remains largely unsolved due to several challenges such as occlusions, small and barely visible joints, and variations in clothing and lighting. In the last few years, the power of deep neural networks has been demonstrated in a wide variety of computer vision problems, and especially in the HPE task. In this context, we present in this paper a Deep Full-Body-HPE (DFB-HPE) approach from RGB images only. Based on ConvNets, fifteen human joint positions are predicted and can be further exploited for a large range of applications such as gesture recognition, sports performance analysis, or human-robot interaction. To evaluate the proposed deep pose estimation model, we apply it to recognize the daily activities of a person in an unconstrained environment: the extracted features, represented by deep estimated poses, are fed to an SVM classifier. To validate the proposed architecture, our approach is tested on two publicly available benchmarks for pose estimation and activity recognition, namely the J-HMDB and CAD-60 datasets. The obtained results demonstrate the efficiency of the proposed method based on ConvNets and SVM and show how deep pose estimation can improve the recognition accuracy. Compared with state-of-the-art methods, we achieve the best HPE performance, as well as the best activity recognition precision on the CAD-60 dataset.

1. Introduction

Currently, the amount of available video data is expanding rapidly due to the pervasiveness of digital recording devices. Estimating human poses in those videos is one of the longstanding research topics in the computer vision community and has been extensively studied in recent years. Scientifically speaking, Human Pose Estimation (HPE) refers to the task of localizing the human body parts (3D pose) or their projection onto an image plane (2D pose). Video-based HPE has attracted increasing interest in recent years thanks to its wide range of applications, including human-computer interaction [,], sports performance analysis [], and video surveillance [,,]. Although research in this field has advanced, many challenges remain, such as the large variation in human body shapes, clothing and viewpoint variations, and the acquisition conditions (day and night illumination variations, occlusions, etc.).
Previous works on HPE have commonly used graphical models for estimating human poses. Generally, those models are composed of joints and rigid parts. Using image-based observations, most of these classic methods follow a two-step framework: the first step extracts hand-crafted features from raw data, and the second learns classifiers on the obtained features. In [], the authors presented a graphical model for HPE with image-dependent pairwise relations. They used local image measurements not only to detect joints, but also to predict the spatial relationships between them, in order to learn conditional probabilities for the presence of parts and their spatial relationships. Another approach was later proposed using puppets []: it estimates the body pose in one frame and then evaluates it in neighboring frames using optical flow.
Recently, following their significant progress in static image classification, Convolutional Neural Networks (CNNs/ConvNets) have been extended to take into account motion information in order to be exploited in video-based HPE. Compared with the conventional machine learning methods, deep learning techniques have a more powerful learning ability. They have shown remarkable progress due to their high precision and robustness.
In this work, we are particularly interested in estimating human poses and detecting different body parts under challenging conditions. Those human poses, which represent extracted features, will be fed to a classification stage using SVM in order to recognize daily activities. This paper presents the following novel contributions:
  • We present an end-to-end CNN that exploits RGB data only for a full-body pose estimation. The estimated person poses are then considered as discriminative features to recognize different human activities.
  • We extensively evaluate various aspects of our HPE architecture: we test different model parameters (including the iteration number, data augmentation techniques, and heat map size) and compare the proposed model with previous approaches on common benchmark datasets (i.e., J-HMDB and CAD-60), for which interesting results for HPE and activity recognition are reported.
  • We recognize human activities using human poses rather than RGB information. We conclude that the quality of the estimated poses significantly affects the recognition performance.
The remainder of this paper is organized as follows. In Section 2, we review recent work on 2D HPE, which can be divided into two main classes: traditional HPE approaches (Section 2.1) and deep learning-based ones (Section 2.2). Recent deep learning-based HAR approaches are explored in Section 2.3. Then, we describe the proposed DFB-HPE (Deep Full-Body-HPE) approach in Section 3 where different training details are explained. In Section 4, we present the datasets used (Section 4.1) and different evaluation metrics (Section 4.2). After that, we discuss the obtained results on the benchmarks used. Finally, we conclude our work in Section 5, where potential future studies are proposed.

3. Materials and Methods

The proposed DFB-HPE approach is inspired by []. The original HPE architecture consists of a two-stage process for upper-body pose estimation: (i) spatial layers and (ii) temporal layers. The first stage computes the upper-body joint positions from RGB video frames. The resulting joint heat maps are then fed to the second stage, the “temporal pooler”, which incorporates the temporal dimension through optical flow.
To handle full-body pose estimation, we modified the aforementioned architecture, considering that adding the lower-body joints should improve both the pose estimation results and the activity recognition rate [], opening up further application possibilities. The suggested architecture consists of several convolution, pooling, and loss layers. As depicted in Figure 2, the overall network is composed of two levels: (i) fully-convolutional layers and (ii) fusion layers. The input is a set of RGB video frames with a 320 × 240 resolution. For each frame, fifteen key joint positions are predicted. The output of the last loss layer (loss2) represents the 2D coordinates of the full-body joint positions. These positions are later fed to the SVM classifier in order to recognize the human activity. The first level of the proposed architecture is fully convolutional: eight convolution layers with a stride of 1, where the first two layers are each followed by a 2 × 2 max-pooling layer with a stride of 2. The output of the “conv8” layer is a set of heat maps of fixed size i × j × k, where i and j represent the heat map size and k is the number of joints to regress (here, 60 × 60 × 15). In order to learn the dependencies between the locations of human body parts, the convolution layer “conv7”, which holds the pre-heat-map activations, is concatenated with “conv3”, which acts as a skip layer. Indeed, training deep networks, especially with a small amount of data, can lead to vanishing and exploding gradients. To deal with this issue, we used skip connections, where activations taken from one layer are fed to another layer deeper in the network. This concatenation forms the input of the second level, the fusion layers. We note that the proposed network regresses heat maps for each joint instead of directly regressing the joint positions, as the latter is a highly non-linear problem.
Figure 2. Overview of our Deep Full-Body (DFB)-Human Pose Estimation (HPE) architecture from RGB frames.
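To make the two-level layout of Figure 2 concrete, the following sketch reproduces its overall structure in PyTorch: eight stride-1 convolutions with 2 × 2 max pooling after the first two, a skip connection taking “conv3” and concatenating it with the “conv7” pre-heat-map activations, and fusion layers regressing one heat map per joint. The kernel sizes, channel widths, and fusion depth are not specified above, so the values used here are illustrative assumptions rather than the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class DFBHPENet(nn.Module):
    """Illustrative sketch of the two-level DFB-HPE ConvNet.

    Level 1: eight stride-1 convolutions (conv1..conv8), the first two each
    followed by 2x2 max pooling; conv8 outputs one heat map per joint.
    Level 2: fusion layers applied to the concatenation of the conv3 skip
    connection and the conv7 pre-heat-map activations.
    Channel widths and kernel sizes are assumptions, not the paper's values.
    """

    def __init__(self, num_joints=15):
        super().__init__()

        def conv(cin, cout, k):
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, stride=1, padding=k // 2),
                nn.ReLU(inplace=True))

        self.conv1, self.pool1 = conv(3, 64, 5), nn.MaxPool2d(2, 2)
        self.conv2, self.pool2 = conv(64, 128, 5), nn.MaxPool2d(2, 2)
        self.conv3 = conv(128, 128, 5)               # skip-connection source
        self.conv4 = conv(128, 256, 5)
        self.conv5 = conv(256, 256, 3)
        self.conv6 = conv(256, 256, 3)
        self.conv7 = conv(256, 256, 3)               # pre-heat-map activations
        self.conv8 = nn.Conv2d(256, num_joints, 1)   # per-joint heat maps (loss1)
        # Fusion level: learns dependencies between joint locations from the
        # concatenation of conv3 (skip) and conv7.
        self.fusion = nn.Sequential(
            conv(128 + 256, 128, 7),
            conv(128, 128, 7),
            nn.Conv2d(128, num_joints, 1))           # per-joint heat maps (loss2)

    def forward(self, x):
        x = self.pool1(self.conv1(x))
        x = self.pool2(self.conv2(x))
        skip = self.conv3(x)
        x = self.conv7(self.conv6(self.conv5(self.conv4(skip))))
        heatmaps1 = self.conv8(x)                              # first loss target
        heatmaps2 = self.fusion(torch.cat([skip, x], dim=1))   # final heat maps
        return heatmaps1, heatmaps2
```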
As a loss function, the suggested architecture uses the Euclidean loss layer, which computes the sum of squares of differences between its two inputs, as shown in Equation (1). As our network is trained to regress the location of the human full-body joints, the l2 loss layer penalizes the l2 distance between the predicted joint positions and the Ground Truth (GT) ones.

$$\mathrm{loss} = \frac{1}{2N} \sum_{i=1}^{N} \left\lVert y_i^{1} - y_i^{2} \right\rVert_2^{2} \qquad (1)$$

where $N$ is the number of samples, $y_i^{1}$ represents the $i$th predicted joint location, and $y_i^{2}$ is the $i$th GT joint location.
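A minimal implementation of this loss over heat map tensors, matching the 1/(2N) normalization of Equation (1), could look as follows (the array shapes are assumptions):

```python
import torch

def euclidean_loss(pred, target):
    """Equation (1): half the summed squared L2 distance, averaged over N samples.

    pred, target: tensors of shape (N, num_joints, H, W) holding predicted
    and ground-truth heat maps (or (N, D) joint coordinate vectors).
    """
    n = pred.shape[0]
    return ((pred - target) ** 2).sum() / (2 * n)
```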
For the classification of the different human activities, we used a multi-class “one-against-one” SVM classifier, through the LIBSVM implementation with a polynomial kernel. The SVM input is the vector of the 2D positions of all fifteen joints computed in the previous pose estimation stage: each frame is associated with its fifteen 2D joint positions and its activity label. To obtain the best SVM configuration, we applied 10-fold cross-validation over the training and testing splits. The trained SVM model was then tested, and the accuracy rate as well as the confusion matrix were computed.
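For illustration, this classification stage can be reproduced with scikit-learn’s SVC, which wraps LIBSVM and applies the one-against-one scheme for multi-class problems. The 30-dimensional feature layout (fifteen 2D joints per frame) follows the text; the file names and the polynomial degree below are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical arrays: one row per frame with the fifteen estimated joints
# flattened into (x1, y1, ..., x15, y15), plus the activity label.
X = np.load("estimated_poses.npy")   # shape: (num_frames, 30)
y = np.load("activity_labels.npy")   # shape: (num_frames,)

# One-against-one multi-class SVM with a polynomial kernel (LIBSVM backend).
clf = SVC(kernel="poly", degree=3, decision_function_shape="ovo")

# 10-fold cross-validation to select/validate the configuration.
scores = cross_val_score(clf, X, y, cv=10)
print("mean CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

# Train on the training split, then predict activities for unseen frames.
clf.fit(X, y)
predicted = clf.predict(X[:5])
```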
To find a good set of weights for the mapping from inputs to outputs, we used Stochastic Gradient Descent (SGD). The search starts from a random point: all the weights were initialized to small random values, and this process was repeated multiple times to find the most effective configuration. For that purpose, we chose to train our network from scratch rather than fine-tuning the available pre-trained model, which allowed us to control the initialization of all parameters. We began with 150 K iterations and increased this number to observe its effect on the convergence of the loss towards 0. The network weights were learned using mini-batch stochastic gradient descent with the momentum set to 0.95. In each training iteration, fourteen training frames were taken randomly and used as a mini-batch. To present maximally varying input data to the network and avoid over-fitting, some data augmentation techniques were used: each frame, with a 320 × 240 input size, was randomly shuffled prior to training and randomly cropped to a 232 × 232 sub-image before being fed forward through the network to compute the human joint locations. The validation set was used for hyper-parameter estimation. At training time, the GT labels were heat maps synthesized for each joint separately by placing a Gaussian with a fixed variance at the ground-truth joint position. We then utilized an l2 loss, which penalizes the squared pixel-wise differences between the predicted heat map and the synthesized ground-truth one. To determine the best ConvNet parameter initialization, a 4-fold cross-validation was applied on the dataset used. The ConvNet training was performed on a single NVIDIA GTX Titan GPU using the Caffe framework [].
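The ground-truth synthesis step can be illustrated as follows: each joint gets its own heat map with a Gaussian of fixed variance centered at the annotated position. The heat map resolution and the sigma value below are assumptions chosen only for the sketch.

```python
import numpy as np

def make_gt_heatmaps(joints_xy, heatmap_size=(60, 60), image_size=(232, 232), sigma=1.5):
    """Synthesize one Gaussian heat map per joint as a training target.

    joints_xy: array of shape (num_joints, 2) with (x, y) joint positions
    in input-image pixels. image_size is (height, width).
    Returns an array of shape (num_joints, H, W).
    """
    h, w = heatmap_size
    scale_x = w / image_size[1]    # map image x-coordinates to heat map columns
    scale_y = h / image_size[0]    # map image y-coordinates to heat map rows
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(joints_xy), h, w), dtype=np.float32)
    for k, (jx, jy) in enumerate(joints_xy):
        cx, cy = jx * scale_x, jy * scale_y
        maps[k] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return maps
```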

4. Results

4.1. Datasets

We utilized two public well-known datasets: J-HMDB [] and CAD-60 [].
J-HMDB: Extracted from the HMDB51 dataset, J-HMDB contains 928 clips comprising 21 action categories. It is not only a human action dataset, but also a good benchmark for pose estimation and human detection. Each frame was annotated using a 2D articulated human puppet model [] providing: a scale, a pose, a segmentation, a coarse viewpoint, and a dense optical flow for humans in action.
CAD-60: It concerns 12 classes of daily-life actions (e.g., wearing contact lenses, opening a pill container, brushing teeth) in addition to two non-action classes corresponding to still and random behaviors. It was performed by only four actors and offers RGB and depth frames, besides the skeletal streams of 15 body joints. One of its main challenges is that one of the four actors is left-handed. The skeleton data are illustrated in Figure 3.
Figure 3. Key joint positions in the CAD-60 dataset.

4.2. Evaluation Metrics

In all pose estimation experiments, we compared the estimated joints against the GT ones. The GT joint positions were given in a real-world coordinate system. Thus, they were converted into image-plane coordinates (x, y). For any particular joint localization precision radius r (measured in a Euclidean pixel distance), we report the percentage of correct joints in the test set within this radius. Indeed, for a test set of size N, radius r, and a particular joint i, the accuracy is given by Equation (2):
$$\mathrm{acc}_i(r) = \frac{100}{N} \sum_{t=1}^{N} \mathbb{1}\!\left( \frac{\left\lVert \hat{y}_i^{t} - y_i^{t} \right\rVert_2}{h_t / 100} \le r \right) \qquad (2)$$

where $\hat{y}_i^{t}$ is the $i$th predicted joint location on test sample $t$, $y_i^{t}$ is the corresponding GT location, and $h_t$ represents the torso height of the $t$th sample.
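A direct translation of Equation (2) into code, assuming the predictions and GT are stored as simple arrays, is sketched below:

```python
import numpy as np

def joint_accuracy(pred, gt, torso_heights, r):
    """Per-joint accuracy of Equation (2).

    pred, gt: arrays of shape (num_samples, num_joints, 2) with (x, y) pixels.
    torso_heights: array of shape (num_samples,) with the torso height h_t.
    r: localization precision radius (Euclidean pixel distance, torso-normalized).
    Returns one accuracy value (in %) per joint.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)              # (N, J) pixel errors
    normalized = dists / (torso_heights[:, None] / 100.0)   # divide by h_t / 100
    return 100.0 * (normalized <= r).mean(axis=0)           # percentage per joint
```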
In addition to the accuracy evaluation metric, the Percentage of Correct Parts (PCP), the Percentage of Correct Keypoints (PCK), and the Percent of Detected Joints (PDJ) have been commonly used in recent pose estimation work:
  • PCP: It describes a broadly-adopted evaluation protocol that measures the percentage of correctly localized body parts. A candidate body part is labeled as correct if its segment endpoints lie within 50% of the length of the ground-truth annotated endpoints [,].
  • PCK: It defines a candidate keypoint to be correct if it falls within α × max ( h , w ) pixels of the GT keypoint, where h and w are respectively the height and width of the bounding box and α is the relative threshold for correctness [] (a minimal computation sketch is given after this list).
  • PDJ: A joint is considered detected if the distance between the predicted joint and the true one is within a certain fraction of the torso diameter. By varying this fraction, detection rates are obtained for varying degrees of localization precision. This metric alleviates the drawback of PCP since the detection criteria for all joints are based on the same distance threshold [].
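As an example of how such thresholds are applied in practice, a minimal PCK computation is sketched below (the array layouts are assumptions):

```python
import numpy as np

def pck(pred, gt, boxes_hw, alpha=0.5):
    """Percentage of Correct Keypoints.

    pred, gt: arrays of shape (num_samples, num_joints, 2) with (x, y) pixels.
    boxes_hw: array of shape (num_samples, 2) holding bounding-box (height, width).
    A keypoint counts as correct if its error is within alpha * max(h, w) pixels.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)                 # (N, J)
    thresholds = alpha * boxes_hw.max(axis=-1, keepdims=True)  # (N, 1)
    return 100.0 * (dists <= thresholds).mean()                # overall percentage
```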

4.3. Results of J-HMDB Dataset

Based on the work of Charles et al. [], a joint is considered to be correctly located if it is within a set distance of d pixels from a marked joint center in the GT. Accordingly, different results are presented as graphs that plot accuracy per joint type vs. distance from the GT in pixels in Figure 4.
Figure 4. Pose estimation results on J-HMDB: accuracy per joint type according to the allowed distance from the GT.
Those results are confirmed by those presented in Figure 5, which shows the PDJ results per joint type according to the normalized precision threshold. For upper-body joints, the detection rate reaches approximately 90% even at a 0.5 precision threshold. We note that our pose estimator performs well for almost all action classes, even though the dataset contains real-world occluded scenarios. For some actions, such as “brush hair” or “wave”, the accuracy is lower, mainly for the knee and ankle joints. Indeed, for those action classes, the provided RGB frames show only the upper body, which makes it difficult to estimate lower-body joints such as the ankle or the knee.
Figure 5. Percent of Detected Joints (PDJ) results on J-HMDB: detection rate per joint type according to the normalized precision threshold.
We compare the proposed approach with seven state-of-the-art methods tested on the same dataset in Table 1. The first two methods, Dense Trajectories (DTs) [] and the Spatial Temporal “And/Or” Graph Model (STAOGM) [], are hand-crafted. The remaining approaches are CNN based: Pose-CNN (P-CNN) [], Action-tubes (A-tubes) [], Semantic Region-based CNN (SR-CNN) [], Motion-Salient Region CNN (MSR-CNN) [], and Human-Related Multi-Stream CNN (HR-MSCNN) []. From the comparison with DTs and STAOGM, we find that deep learned features outperform hand-crafted ones for action recognition. For P-CNN, the pose estimator used does not always perform well. Our method achieves results close to those of A-tubes; however, its authors used an empirically selected parameter α, fixed as a constant, which might not be optimal for different kinds of videos. The two-stream SR-CNN algorithm is similar to our method: it incorporates semantic regions detected by Faster R-CNN [] into the original two-stream CNNs. This method uses all detected regions, not only the human body but also other foreground and background regions, and the features extracted in those regions may negatively impact the performance of SR-CNN. In contrast, our method focuses on the human body region, where the features are beneficial for the task of action recognition. Compared to MSR-CNN, the authors in [] used a spatio-temporal 3D convolutional method for fusion; thus, their network performs slightly better. Regarding the HR-MSCNN results, the proposed architecture combines two traditional streams, appearance (R1) and motion (R2), in addition to the captured tubes of the human-related regions (R3), which increases the computation time. In fact, it achieves a 62.98% accuracy rate when using only one region input (R1) and 71.17% when using all of them (R1 + R2 + R3).
Table 1. Comparison with state-of-the-art methods on the J-HMDB dataset. DTs, Dense Trajectories; STAOGM, Spatial Temporal “And/Or” Graph Model; P-CNN, Pose-CNN; A-tubes, Action-tubes; SR, Semantic Region; MSR, Motion-Salient Region; HR-MSCNN, Human-Related Multi-Stream CNN.

4.4. Results of the CAD-60 Dataset

For the CAD-60 dataset, different pose estimation results are presented in Figure 6 as accuracy graphs according to the allowed distance from the GT after applying the four-fold cross-validation process.
Figure 6. HPE results on CAD-60 with four-fold cross-validation: accuracy per joint type according to the allowed distance from the GT.
In Table 2, we report the different PCK-0.5 results on the CAD-60 dataset.
Table 2. PCK-0.5 results on the CAD-60 dataset.
For the upper-body parts of the CAD-60 dataset, the pose estimation results are good for the different joints. However, for lower-body parts, each iteration seems to be effective for a well-defined part of the human body. For example, in the fourth iteration of the cross-validation process, the pose prediction reaches about an 83.1% accuracy rate for “knee”. Despite the presence of a left-handed actor in the third iteration (k = 3), the estimation seems to be more effective for “foot”: nearly 100% accuracy. This contrast is mainly due to the joints provided with the CAD-60 dataset. In fact, coming from the Kinect (i.e., not manually annotated), the joints are generally sensitive to noise. In addition, the sensor’s ability to detect lower-body parts is very limited, since the distance between the camera and the person must not exceed a few meters. Those facts may explain the different failures observed, especially for lower-body parts. The accuracy results are confirmed by the PCK ones in Table 2, where the scores are reported for each key joint separately and for the whole body. HPE algorithms can be useful for various tasks in many areas, such as action recognition, human detection, human attribute recognition, and various gait processing tasks []. We chose the HAR task as it presents many challenges due to occlusions and overlapping scenes. For this purpose, a multi-class one-against-one SVM classifier was used through LIBSVM (Library for Support Vector Machines) [] to recognize the different activities. To determine the best configuration of this classifier, a four-fold cross-validation was applied, with a polynomial kernel function. The SVM input is the vector of the 2D positions of all 15 joints computed in the previous pose estimation stage. In the training stage, we used 14,294 sample frames out of 21,442 and left the rest for the testing stage.
As for the HAR results, we show the confusion matrix for the CAD-60 dataset in Figure 7. There are some confusion errors between “drinking water” and “talking on phone” and between “rinsing mouth with water” and “talking on phone”, with 0.02% and 0.03%, respectively. This is due to the great similarity between these activity classes. We recall that, in our work, we estimate a full-body pose directly from RGB images and then recognize the corresponding activity. Table 3 shows the competitiveness of our approach on the CAD-60 dataset. Using the accuracy measure, our solution ranks first and demonstrates a robust precision/recall ratio (95.4% and 95.6%, respectively). It reaches an accuracy of 95.5% in terms of correctly labeled samples and achieves the highest recall of 95.6%, as shown in the confusion matrix in Figure 7. Our approach achieves promising performance even in challenging cases (the left-handed actor in the CAD-60 dataset) and using only RGB frames as the system input.
Figure 7. CAD-60 confusion matrix for 12 activities.
Table 3. Comparison with state-of-the-art results on CAD-60. DBN, Dynamic Bayesian Network; MRF, Markov Random Field; BOW, Bag Of Words; GMM, Gaussian Mixture Modeling; HMM, Hidden Markov Model; STIP, Spatio-Temporal Interest Point. (* indicates which input data are used: skeleton, RGB, or depth.)
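For reference, the confusion matrix and the accuracy/precision/recall figures reported above can be computed with scikit-learn as sketched below; the label arrays are hypothetical placeholders for the SVM test outputs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical ground-truth and predicted activity labels for a few test frames.
y_true = np.array(["drinking water", "talking on phone", "brushing teeth", "drinking water"])
y_pred = np.array(["drinking water", "talking on phone", "brushing teeth", "talking on phone"])

cm = confusion_matrix(y_true, y_pred, normalize="true")  # row-normalized, as in Figure 7
acc = accuracy_score(y_true, y_pred)                     # proportion of correctly labeled frames
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
print(cm)
print("accuracy=%.3f  precision=%.3f  recall=%.3f" % (acc, prec, rec))
```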

5. Conclusions

In this work, we put forward a new approach for 2D full-body HPE. As pose estimation is a key step for a wide range of applications, the more precise it is, the more effective the recognition will be. That is why we took advantage of a deep ConvNet architecture, given its precision and robustness. The main contribution of our work is to estimate full-body human poses via a ConvNet architecture adapted to a regression problem. From RGB frames only, we extracted deep features represented by 15 key joint positions of the human body. In order to evaluate the proposed HPE model, we applied it to recognize the daily activities of a person in an unconstrained environment: the deep estimated poses were fed to an SVM classifier. The evaluation on challenging datasets (J-HMDB and CAD-60) and the comparison with the state of the art demonstrate that our method achieves a competitive ranking on the benchmarks used. The obtained results show the efficiency of using the ConvNet-based pose estimation technique to improve the activity recognition rate.
However, the proposed approach can be further improved. First, an interesting direction is the investigation of more data augmentation techniques such as image translation, color contrasting, and temporal variation [,]. Second, a straightforward perspective is to use better performing methods at the pose estimation level. In particular, we could exploit the temporal dimension of the input videos via 3D CNNs, which are better adapted to the continuous spatial and temporal characteristics of video data [].

Author Contributions

Conceptualization, S.N.B.; Formal analysis, N.E.B.A.; Investigation, S.N.B.; Methodology, S.N.B.; Project administration, N.E.B.A.; Software, S.N.B.; Supervision, N.E.B.A.; Validation, N.E.B.A.; Writing—original draft, S.N.B.; Writing—review and editing, S.N.B. and N.E.B.A. All authors read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Qiang, L.; Zhang, W.; Hongliang, L.; Ngan, K.N. Hybrid human detection and recognition in surveillance. Neurocomputing 2016, 194, 10–23. [Google Scholar]
  2. D’Eusanio, A.; Simoni, A.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R. Multimodal hand gesture classification for the human–car interaction. Informatics 2020, 7, 31. [Google Scholar] [CrossRef]
  3. Unzueta, L.; Goenetxea, J.; Rodriguez, M.; Linaza, M.T. Dependent 3D human body posing for sports legacy recovery from images and video. In Proceedings of the 2014 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 1–5 September 2014; pp. 361–365. [Google Scholar]
  4. Chen, C.; Yang, Y.; Nie, F.; Odobez, J.M. 3D human pose recovery from image by efficient visual feature selection. Comput. Vis. Image Underst. 2011, 115, 290–299. [Google Scholar] [CrossRef]
  5. Rahimi, M.; Alghassi, A.; Ahsan, M.; Haider, J. Deep Learning Model for Industrial Leakage Detection Using Acoustic Emission Signal. Informatics 2020, 4, 49. [Google Scholar] [CrossRef]
  6. Konstantaras, A. Deep Learning and Parallel Processing Spatio-Temporal Clustering Unveil New Ionian Distinct Seismic Zone. Informatics 2020, 4, 39. [Google Scholar] [CrossRef]
  7. Chen, X.; Yuille, A.L. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; pp. 1736–1744. [Google Scholar]
  8. Zuffi, S.; Romero, J.; Schmid, C.; Black, M.J. Estimating human pose with flowing puppets. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3312–3319. [Google Scholar]
  9. Seddik, B.; Gazzah, S.; Essoukri Ben Amara, N. Hybrid Multi-modal Fusion for Human Action Recognition. In Proceedings of the International Conference Image Analysis and Recognition, Montreal, QC, Canada, 5–7 July 2017; pp. 201–209. [Google Scholar]
  10. Seddik, B.; Gazzah, S.; Essoukri Ben Amara, N. Hands, face and joints for multi-modal human-action temporal segmentation and recognition. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 1143–1147. [Google Scholar]
  11. Mhalla, A.; Chateau, T.; Maamatou, H.; Gazzah, S.; Essoukri Ben Amara, N. SMC faster R-CNN: Toward a scene-specialized multi-object detector. Comput. Vis. Image Underst. 2017, 164, 3–15. [Google Scholar] [CrossRef]
  12. Seddik, B.; Gazzah, S.; Essoukri Ben Amara, N. Modalities combination for Italian sign language extraction and recognition. In International Conference on Image Analysis and Processing; Springer: Cham, Switzerland, 2015; pp. 710–721. [Google Scholar]
  13. Boualia, S.N.; Essoukri Ben Amara, N. Pose-based Human Activity Recognition: A review. In Proceedings of the 2019 15th International Wireless Communications Mobile Computing Conference (IWCMC), Tangier, Morocco, 24–28 June 2019; pp. 1468–1475. [Google Scholar] [CrossRef]
  14. Daubney, B.; Gibson, D.; Campbell, N. Estimating pose of articulated objects using low-level motion. Comput. Vis. Image Underst. 2012, 116, 330–346. [Google Scholar] [CrossRef]
  15. Ning, H.; Xu, W.; Gong, Y.; Huang, T. Discriminative learning of visual words for 3D human pose estimation. In Proceedings of the 2008 Computer Vision and Pattern Recognition—CVPR 2008, Anchorage, AK, USA, 24–26 June 2008; pp. 1–8. [Google Scholar]
  16. Ferrari, V.; Marin-Jimenez, M.; Zisserman, A. Progressive search space reduction for human pose estimation. In Proceedings of the Computer Vision and Pattern Recognition—CVPR 2008, Anchorage, AK, USA, 24–26 June 2008; pp. 1–8. [Google Scholar]
  17. Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 20–25 June 2011; pp. 1297–1304. [Google Scholar]
  18. Poppe, R. Evaluating example-based pose estimation: Experiments on the humaneva sets. In Proceedings of the CVPR 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, Minneapolis, MN, USA, 22 June 2007; pp. 1–8. [Google Scholar]
  19. Niyogi, S.; Freeman, W.T. Example-based head tracking. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, 14–16 October 1996; pp. 374–378. [Google Scholar] [CrossRef]
  20. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  21. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  22. Zhang, N.; Paluri, M.; Ranzato, M.; Darrell, T.; Bourdev, L. Panda: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1637–1644. [Google Scholar]
  23. Pishchulin, L.; Andriluka, M.; Gehler, P.; Schiele, B. Poselet conditioned pictorial structures. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 588–595. [Google Scholar]
  24. Carreira, J.; Agrawal, P.; Fragkiadaki, K.; Malik, J. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4733–4742. [Google Scholar]
  25. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 483–499. [Google Scholar]
  26. Belagiannis, V.; Zisserman, A. Recurrent human pose estimation. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 468–475. [Google Scholar]
  27. Lifshitz, I.; Fetaya, E.; Ullman, S. Human pose estimation using deep consensus voting. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 246–260. [Google Scholar]
  28. Zhou, X.; Zhu, M.; Leonardos, S.; Derpanis, K.G.; Daniilidis, K. Sparseness meets deepness: 3D human pose estimation from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4966–4975. [Google Scholar]
  29. Pfister, T.; Charles, J.; Zisserman, A. Flowing convnets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1913–1921. [Google Scholar]
  30. Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. 3d human pose estimation with 2d marginal heat maps. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1477–1485. [Google Scholar]
  31. Toyoda, K.; Kono, M.; Rekimoto, J. Post-Data Augmentation to Improve Deep Pose Estimation of Extreme and Wild Motions. arXiv 2019, arXiv:1902.04250. [Google Scholar]
  32. Kreiss, S.; Bertoni, L.; Alahi, A. PifPaf: Composite Fields for Human Pose Estimation. arXiv 2019, arXiv:1903.06593. [Google Scholar]
  33. Gärtner, E.; Pirinen, A.; Sminchisescu, C. Deep Reinforcement Learning for Active Human Pose Estimation. arXiv 2020, arXiv:2001.02024. [Google Scholar] [CrossRef]
  34. Mathis, M.W.; Mathis, A. Deep learning tools for the measurement of animal behavior in neuroscience. Curr. Opin. Neurobiol. 2020, 60, 1–11. [Google Scholar] [CrossRef] [PubMed]
  35. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 568–576. [Google Scholar]
  36. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y. Towards good practices for very deep two stream convnets. arXiv 2015, arXiv:1507.02159. [Google Scholar]
  37. Ijjina, E.P.; Chalavadi, K.M. Human action recognition using genetic algorithms and convolutional neural networks. Pattern Recognit. 2016, 59, 199–212. [Google Scholar] [CrossRef]
  38. Wang, K.; Wang, X.; Lin, L.; Wang, M.; Zuo, W. 3D human activity recognition with reconfigurable convolutional neural networks. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 97–106. [Google Scholar]
  39. Shao, J.; Kang, K.; Change Loy, C.; Wang, X. Deeply learned attributes for crowded scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4657–4666. [Google Scholar]
  40. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  41. Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517. [Google Scholar] [CrossRef] [PubMed]
  42. Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; Chang, S.F. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1417–1426. [Google Scholar]
  43. Neili, S.; Gazzah, S.; El Yacoubi, M.A.; Essoukri Ben Amara, N. Human posture recognition approach based on ConvNets and SVM classifier. In Proceedings of the 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Fez, Morocco, 22–24 May 2017; pp. 1–6. [Google Scholar]
  44. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; ACM: New York, NY, USA, 2014; pp. 675–678. [Google Scholar]
  45. Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 3192–3199. [Google Scholar]
  46. Sung, J.; Ponce, C.; Selman, B.; Saxena, A. Human Activity Detection from RGBD Images. Plan Act. Intent Recognit. 2011, 64, 47–55. [Google Scholar]
  47. Zuffi, S.; Freifeld, O.; Black, M.J. From pictorial structures to deformable structures. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3546–3553. [Google Scholar]
  48. Sapp, B.; Taskar, B. Modec: Multimodal decomposable models for human pose estimation. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3674–3681. [Google Scholar]
  49. Wang, H.; Kläser, A.; Schmid, C.; Liu, C.L. Action recognition by dense trajectories. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 3169–3176. [Google Scholar]
  50. Xiaohan Nie, B.; Xiong, C.; Zhu, S.C. Joint action recognition and pose estimation from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1293–1301. [Google Scholar]
  51. Chéron, G.; Laptev, I.; Schmid, C. P-cnn: Pose-based cnn features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3218–3226. [Google Scholar]
  52. Gkioxari, G.; Malik, J. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 759–768. [Google Scholar]
  53. Wang, Y.; Song, J.; Wang, L.; Van Gool, L.; Hilliges, O. Two-Stream SR-CNNs for Action Recognition in Videos. In Proceedings of the BMVC, York, UK, 19–22 September 2016. [Google Scholar]
  54. Tu, Z.; Cao, J.; Li, Y.; Li, B. MSR-CNN: Applying motion salient region based descriptors for action recognition. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3524–3529. [Google Scholar]
  55. Tu, Z.; Xie, W.; Qin, Q.; Poppe, R.; Veltkamp, R.C.; Li, B.; Yuan, J. Multi-stream CNN: Learning representations based on human-related regions for action recognition. Pattern Recognit. 2018, 79, 32–43. [Google Scholar] [CrossRef]
  56. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  57. Petrov, I.; Shakhuro, V.; Konushin, A. Deep probabilistic human pose estimation. IET Comput. Vis. 2018, 12, 578–585. [Google Scholar] [CrossRef]
  58. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27. [Google Scholar] [CrossRef]
  59. Sung, J.; Ponce, C.; Selman, B.; Saxena, A. Unstructured human activity detection from rgbd images. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation (ICRA), Saint Paul, MN, USA, 14–18 May 2012; pp. 842–849. [Google Scholar]
  60. Koppula, H.S.; Gupta, R.; Saxena, A. Learning human activities and object affordances from rgb-d videos. Int. J. Robot. Res. 2013, 32, 951–970. [Google Scholar] [CrossRef]
  61. Zhang, C.; Tian, Y. RGB-D camera-based daily living activity recognition. J. Comput. Vis. Image Process. 2012, 2, 12. [Google Scholar]
  62. Yang, X.; Tian, Y. Effective 3d action recognition using eigenjoints. J. Vis. Commun. Image Represent. 2014, 25, 2–11. [Google Scholar] [CrossRef]
  63. Piyathilaka, L.; Kodagoda, S. Gaussian mixture based HMM for human daily activity recognition using 3D skeleton features. In Proceedings of the 2013 8th IEEE Conference on Industrial Electronics and Applications (ICIEA), Melbourne, Australia, 19–21 June 2013; pp. 567–572. [Google Scholar]
  64. Ni, B.; Pei, Y.; Moulin, P.; Yan, S. Multilevel depth and image fusion for human activity detection. IEEE Trans. Cybern. 2013, 43, 1383–1394. [Google Scholar] [PubMed]
  65. Gupta, R.; Chia, A.Y.S.; Rajan, D. Human activities recognition using depth images. In Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Spain, 21 October 2013; ACM: New York, NY, USA, 2013; pp. 283–292. [Google Scholar]
  66. Wang, J.; Liu, Z.; Wu, Y. Learning actionlet ensemble for 3D human action recognition. In Human Action Recognition with Depth Cameras; Springer: Cham, Switzerland, 2014; pp. 11–40. [Google Scholar]
  67. Zhu, Y.; Chen, W.; Guo, G. Evaluating spatiotemporal interest point features for depth-based action recognition. Image Vis. Comput. 2014, 32, 453–464. [Google Scholar] [CrossRef]
  68. Faria, D.R.; Premebida, C.; Nunes, U. A probabilistic approach for human everyday activities recognition using body motion from RGB-D images. In Proceedings of the 2014 RO-MAN: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, Edinburgh, UK, 25–29 August 2014; pp. 732–737. [Google Scholar]
  69. Shan, J.; Akella, S. 3D human action segmentation and recognition using pose kinetic energy. In Proceedings of the 2014 IEEE Workshop on Advanced Robotics and Its Social Impacts (ARSO), Evanston, IL, USA, 11–13 September 2014; pp. 69–75. [Google Scholar]
  70. Gaglio, S.; Re, G.L.; Morana, M. Human activity recognition process using 3-D posture data. IEEE Trans. Hum. Mach. Syst. 2015, 45, 586–597. [Google Scholar] [CrossRef]
  71. Parisi, G.I.; Weber, C.; Wermter, S. Self-organizing neural integration of pose-motion features for human action recognition. Front. Neurorobotics 2015, 9, 3. [Google Scholar] [CrossRef]
  72. Cippitelli, E.; Gasparrini, S.; Gambi, E.; Spinsante, S. A human activity recognition system using skeleton data from RGBD sensors. Comput. Intell. Neurosci. 2016, 2016, 4351435. [Google Scholar] [CrossRef]
  73. Seddik, B.; Gazzah, S.; Essoukri Ben Amara, N. Human-action recognition using a multi-layered fusion scheme of Kinect modalities. IET Comput. Vis. 2017, 11, 530–540. [Google Scholar] [CrossRef]
  74. Rogez, G.; Schmid, C. Mocap-guided data augmentation for 3d pose estimation in the wild. Adv. Neural Inf. Process. Syst. 2016, 29, 3108–3116. [Google Scholar]
  75. Peng, X.; Tang, Z.; Yang, F.; Feris, R.S.; Metaxas, D. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2226–2234. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
