Informatics
  • Article
  • Open Access

18 January 2021

Deep Full-Body HPE for Activity Recognition from RGB Frames Only

1 University of Tunis El Manar, National Engineering School of Tunis, 1002 Tunis, Tunisia
2 Université de Sousse, Ecole Nationale d’Ingénieurs de Sousse, LATIS-Laboratory of Advanced Technology and Intelligent Systems, 4023 Sousse, Tunisia
* Authors to whom correspondence should be addressed.
This article belongs to the Section Machine Learning

Abstract

Human Pose Estimation (HPE) is defined as the problem of localizing human joints (also known as keypoints: elbows, wrists, etc.) in images or videos. It can also be formulated as the search for a specific pose in the space of all articulated poses. HPE has recently received significant attention from the scientific community, mainly because pose estimation is considered a key step for many computer vision tasks. Although many approaches have reported promising results, the problem remains largely unsolved due to several challenges such as occlusions, small and barely visible joints, and variations in clothing and lighting. In the last few years, the power of deep neural networks has been demonstrated in a wide variety of computer vision problems, and especially in the HPE task. In this context, we present in this paper a Deep Full-Body-HPE (DFB-HPE) approach from RGB images only. Based on ConvNets, fifteen human joint positions are predicted and can be further exploited for a large range of applications such as gesture recognition, sports performance analysis, or human-robot interaction. To evaluate the proposed deep pose estimation model, we apply it to recognize the daily activities of a person in an unconstrained environment: the extracted features, represented by deep estimated poses, are fed to an SVM classifier. To validate the proposed architecture, our approach is tested on two publicly available benchmarks for pose estimation and activity recognition, namely the J-HMDB and CAD-60 datasets. The obtained results demonstrate the efficiency of the proposed method based on ConvNets and SVM and show how deep pose estimation can improve the recognition accuracy. Compared with state-of-the-art methods, we achieve the best HPE performance, as well as the best activity recognition precision on the CAD-60 dataset.

1. Introduction

Currently, the amount of available video data is expanding rapidly due to the pervasiveness of digital recording devices. Estimating human poses in those videos is one of the longstanding research topics in the computer vision community and has been extensively studied in recent years. Scientifically speaking, Human Pose Estimation (HPE) refers to the task of localizing the human body parts (3D pose) or their projection onto an image plane (2D pose). Video-based HPE has attracted increasing interest in recent years thanks to its wide range of applications, including human-computer interaction [,], sports performance analysis [], and video surveillance [,,]. Although research in this field has advanced, many challenges remain, such as the large variation in human body shapes, clothing and viewpoint variations, and the acquisition conditions (day and night illumination variations, occlusions, etc.).
Previous works on HPE have commonly used graphical models for estimating human poses. Generally, those models are composed of joints and rigid parts. Using image-based observations, most of these classic methods follow a two-step framework: the first step extracts hand-crafted features from raw data, and the second learns classifiers on the obtained features. In [], the authors presented a graphical model for HPE with image-dependent pairwise relations. They used local image measurements not only to detect joints, but also to predict the spatial relationships between them, in order to learn conditional probabilities for the presence of parts and their spatial relationships. Another approach was later proposed using puppets []: it estimates the body pose in one frame and then evaluates it in neighboring frames using optical flow.
Recently, following their significant progress in static image classification, Convolutional Neural Networks (CNNs/ConvNets) have been extended to take into account motion information in order to be exploited in video-based HPE. Compared with the conventional machine learning methods, deep learning techniques have a more powerful learning ability. They have shown remarkable progress due to their high precision and robustness.
In this work, we are particularly interested in estimating human poses and detecting different body parts under challenging conditions. Those human poses, which represent extracted features, will be fed to a classification stage using SVM in order to recognize daily activities. This paper presents the following novel contributions:
  • We present an end-to-end CNN that exploits RGB data only for a full-body pose estimation. The estimated person poses are then considered as discriminative features to recognize different human activities.
  • We extensively evaluate various aspects of our HPE architecture: we test different model parameters (including the iteration number, data augmentation techniques, and heat map size) and compare the proposed model with previous approaches on common benchmark datasets (i.e., J-HMDB and CAD-60), for which interesting results for HPE and activity recognition are reported.
  • We recognize human activities using human poses rather than RGB information. We conclude that the quality of the estimated poses significantly affects the recognition performance.
The remainder of this paper is organized as follows. In Section 2, we review recent work on 2D HPE, which can be divided into two main classes: traditional HPE approaches (Section 2.1) and deep learning-based ones (Section 2.2). Recent deep learning-based HAR approaches are explored in Section 2.3. Then, we describe the proposed DFB-HPE (Deep Full-Body-HPE) approach in Section 3 where different training details are explained. In Section 4, we present the datasets used (Section 4.1) and different evaluation metrics (Section 4.2). After that, we discuss the obtained results on the benchmarks used. Finally, we conclude our work in Section 5, where potential future studies are proposed.

3. Materials and Methods

The proposed DFB-HPE approach is inspired by []. The original HPE architecture consists of a two-stage process for upper-body pose estimation: (i) spatial layers and (ii) temporal layers. The first stage computes the upper-body joint positions from RGB video frames. The resulting joint heat maps are then fed to the second stage, the “temporal pooler”, which incorporates the temporal dimension through optical flow.
To handle full-body pose estimation, we modified the aforementioned architecture, considering that adding the lower-body joints should improve both the pose estimation results and the activity recognition rate [], opening up further application possibilities. The suggested architecture consists of several convolution, pooling, and loss layers. As depicted in Figure 2, the overall network is composed of two levels: (i) fully-convolutional layers and (ii) fusion layers. The input is a set of RGB video frames with a 320 × 240 resolution. For each frame, fifteen key joint positions are predicted. The output of the last loss layer (loss2) represents the 2D coordinates of the full-body joint positions. These positions are later fed to the SVM classifier in order to recognize the human activity. The first level of the proposed architecture is fully convolutional: eight convolution layers with a stride of 1, where the first two layers are each followed by a 2 × 2 max-pooling layer with a stride of 2. The output of the “conv8” layer is a set of heat maps of fixed size i × j × k, where i and j represent the heat map size and k is the number of joints to regress (here, 60 × 60 × 15). In order to learn the dependencies between the locations of human body parts, the convolution layer “conv7”, which holds the pre-heat-map activations, is concatenated with “conv3”, which acts as a skip layer. Indeed, training deep networks, especially with a small amount of data, can lead to vanishing and exploding gradients. To deal with this issue, we used skip connections, where activations taken from one layer are fed to another layer deeper in the network. This concatenation forms the input of the second level, the fusion layers. We note that the proposed network regresses heat maps for each joint instead of directly regressing the joint positions, as the latter is a highly non-linear problem.
Figure 2. Overview of our Deep Full-Body (DFB)-Human Pose Estimation (HPE) architecture from RGB frames.
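To make the two-level layout of Figure 2 concrete, the following sketch reproduces its overall structure in PyTorch: eight stride-1 convolutions with 2 × 2 max pooling after the first two, a skip connection taking “conv3” and concatenating it with the “conv7” pre-heat-map activations, and fusion layers regressing one heat map per joint. The kernel sizes, channel widths, and fusion depth are not specified above, so the values used here are illustrative assumptions rather than the authors’ exact configuration.

```python
import torch
import torch.nn as nn

class DFBHPENet(nn.Module):
    """Illustrative sketch of the two-level DFB-HPE ConvNet.

    Level 1: eight stride-1 convolutions (conv1..conv8), the first two each
    followed by 2x2 max pooling; conv8 outputs one heat map per joint.
    Level 2: fusion layers applied to the concatenation of the conv3 skip
    connection and the conv7 pre-heat-map activations.
    Channel widths and kernel sizes are assumptions, not the paper's values.
    """

    def __init__(self, num_joints=15):
        super().__init__()

        def conv(cin, cout, k):
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, stride=1, padding=k // 2),
                nn.ReLU(inplace=True))

        self.conv1, self.pool1 = conv(3, 64, 5), nn.MaxPool2d(2, 2)
        self.conv2, self.pool2 = conv(64, 128, 5), nn.MaxPool2d(2, 2)
        self.conv3 = conv(128, 128, 5)               # skip-connection source
        self.conv4 = conv(128, 256, 5)
        self.conv5 = conv(256, 256, 3)
        self.conv6 = conv(256, 256, 3)
        self.conv7 = conv(256, 256, 3)               # pre-heat-map activations
        self.conv8 = nn.Conv2d(256, num_joints, 1)   # per-joint heat maps (loss1)
        # Fusion level: learns dependencies between joint locations from the
        # concatenation of conv3 (skip) and conv7.
        self.fusion = nn.Sequential(
            conv(128 + 256, 128, 7),
            conv(128, 128, 7),
            nn.Conv2d(128, num_joints, 1))           # per-joint heat maps (loss2)

    def forward(self, x):
        x = self.pool1(self.conv1(x))
        x = self.pool2(self.conv2(x))
        skip = self.conv3(x)
        x = self.conv7(self.conv6(self.conv5(self.conv4(skip))))
        heatmaps1 = self.conv8(x)                              # first loss target
        heatmaps2 = self.fusion(torch.cat([skip, x], dim=1))   # final heat maps
        return heatmaps1, heatmaps2
```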
As a loss function, the suggested architecture uses the Euclidean loss layer, which computes the sum of squares of differences between its two inputs, as shown in Equation (1). As our network is trained to regress the location of the human full-body joints, the l2 loss layer penalizes the l2 distance between the predicted joint positions and the Ground Truth (GT) ones.

$$\mathrm{loss} = \frac{1}{2N} \sum_{i=1}^{N} \left\lVert y_i^{1} - y_i^{2} \right\rVert_2^{2} \qquad (1)$$

where $N$ is the number of samples, $y_i^{1}$ represents the $i$th predicted joint location, and $y_i^{2}$ is the $i$th GT joint location.
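A minimal implementation of this loss over heat map tensors, matching the 1/(2N) normalization of Equation (1), could look as follows (the array shapes are assumptions):

```python
import torch

def euclidean_loss(pred, target):
    """Equation (1): half the summed squared L2 distance, averaged over N samples.

    pred, target: tensors of shape (N, num_joints, H, W) holding predicted
    and ground-truth heat maps (or (N, D) joint coordinate vectors).
    """
    n = pred.shape[0]
    return ((pred - target) ** 2).sum() / (2 * n)
```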
For the classification of the different human activities, we used a multi-class “one-against-one” SVM classifier, through the LIBSVM implementation with a polynomial kernel. The SVM input is the vector of the 2D positions of all fifteen joints computed in the previous pose estimation stage: each frame is associated with its fifteen 2D joint positions and its activity label. To obtain the best SVM configuration, we applied 10-fold cross-validation over the training and testing splits. The trained SVM model was then tested, and the accuracy rate as well as the confusion matrix were computed.
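For illustration, this classification stage can be reproduced with scikit-learn’s SVC, which wraps LIBSVM and applies the one-against-one scheme for multi-class problems. The 30-dimensional feature layout (fifteen 2D joints per frame) follows the text; the file names and the polynomial degree below are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical arrays: one row per frame with the fifteen estimated joints
# flattened into (x1, y1, ..., x15, y15), plus the activity label.
X = np.load("estimated_poses.npy")   # shape: (num_frames, 30)
y = np.load("activity_labels.npy")   # shape: (num_frames,)

# One-against-one multi-class SVM with a polynomial kernel (LIBSVM backend).
clf = SVC(kernel="poly", degree=3, decision_function_shape="ovo")

# 10-fold cross-validation to select/validate the configuration.
scores = cross_val_score(clf, X, y, cv=10)
print("mean CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

# Train on the training split, then predict activities for unseen frames.
clf.fit(X, y)
predicted = clf.predict(X[:5])
```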
To find a good set of weights for the mapping from inputs to outputs, we used Stochastic Gradient Descent (SGD). The search starts from a random point: all the weights were initialized to small random values, and this process was repeated multiple times to find the most effective configuration. For that purpose, we chose to train our network from scratch rather than fine-tuning the available pre-trained model, which allowed us to control the initialization of all parameters. We began with 150 K iterations and increased this number to observe its effect on the convergence of the loss towards 0. The network weights were learned using mini-batch stochastic gradient descent with the momentum set to 0.95. In each training iteration, fourteen training frames were taken randomly and used as a mini-batch. To present maximally varying input data to the network and avoid over-fitting, some data augmentation techniques were used: each frame, with a 320 × 240 input size, was randomly shuffled prior to training and randomly cropped to a 232 × 232 sub-image before being fed forward through the network to compute the human joint locations. The validation set was used for hyper-parameter estimation. At training time, the GT labels were heat maps synthesized for each joint separately by placing a Gaussian with a fixed variance at the ground-truth joint position. We then utilized an l2 loss, which penalizes the squared pixel-wise differences between the predicted heat map and the synthesized ground-truth one. To determine the best ConvNet parameter initialization, a 4-fold cross-validation was applied on the dataset used. The ConvNet training was performed on a single NVIDIA GTX Titan GPU using the Caffe framework [].
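The ground-truth synthesis step can be illustrated as follows: each joint gets its own heat map with a Gaussian of fixed variance centered at the annotated position. The heat map resolution and the sigma value below are assumptions chosen only for the sketch.

```python
import numpy as np

def make_gt_heatmaps(joints_xy, heatmap_size=(60, 60), image_size=(232, 232), sigma=1.5):
    """Synthesize one Gaussian heat map per joint as a training target.

    joints_xy: array of shape (num_joints, 2) with (x, y) joint positions
    in input-image pixels. image_size is (height, width).
    Returns an array of shape (num_joints, H, W).
    """
    h, w = heatmap_size
    scale_x = w / image_size[1]    # map image x-coordinates to heat map columns
    scale_y = h / image_size[0]    # map image y-coordinates to heat map rows
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(joints_xy), h, w), dtype=np.float32)
    for k, (jx, jy) in enumerate(joints_xy):
        cx, cy = jx * scale_x, jy * scale_y
        maps[k] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return maps
```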

4. Results

4.1. Datasets

We utilized two public well-known datasets: J-HMDB [] and CAD-60 [].
J-HMDB: Extracted from the HMDB51 dataset, J-HMDB contains 928 clips comprising 21 action categories. It is not only a human action dataset, but also a good benchmark for pose estimation and human detection. Each frame was annotated using a 2D articulated human puppet model [] providing: a scale, a pose, a segmentation, a coarse viewpoint, and a dense optical flow for humans in action.
CAD-60: It concerns 12 classes of daily-life actions (e.g., wearing contact lenses, opening a pill container, brushing teeth) in addition to two non-action classes corresponding to still and random behaviors. It was performed by only four actors and offers RGB and depth frames, besides the skeletal streams of 15 body joints. One of its main challenges is that one of the four actors is left-handed. The skeleton data are illustrated in Figure 3.
Figure 3. Key joint positions in the CAD-60 dataset.

4.2. Evaluation Metrics

In all pose estimation experiments, we compared the estimated joints against the GT ones. The GT joint positions were given in a real-world coordinate system. Thus, they were converted into image-plane coordinates (x, y). For any particular joint localization precision radius r (measured in a Euclidean pixel distance), we report the percentage of correct joints in the test set within this radius. Indeed, for a test set of size N, radius r, and a particular joint i, the accuracy is given by Equation (2):
$$\mathrm{acc}_i(r) = \frac{100}{N} \sum_{t=1}^{N} \mathbb{1}\!\left( \frac{\left\lVert \hat{y}_i^{t} - y_i^{t} \right\rVert_2}{h_t / 100} \le r \right) \qquad (2)$$

where $\hat{y}_i^{t}$ is the $i$th predicted joint location on test sample $t$, $y_i^{t}$ is the corresponding GT location, and $h_t$ represents the torso height of the $t$th sample.
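A direct translation of Equation (2) into code, assuming the predictions and GT are stored as simple arrays, is sketched below:

```python
import numpy as np

def joint_accuracy(pred, gt, torso_heights, r):
    """Per-joint accuracy of Equation (2).

    pred, gt: arrays of shape (num_samples, num_joints, 2) with (x, y) pixels.
    torso_heights: array of shape (num_samples,) with the torso height h_t.
    r: localization precision radius (Euclidean pixel distance, torso-normalized).
    Returns one accuracy value (in %) per joint.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)              # (N, J) pixel errors
    normalized = dists / (torso_heights[:, None] / 100.0)   # divide by h_t / 100
    return 100.0 * (normalized <= r).mean(axis=0)           # percentage per joint
```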
In addition to the accuracy evaluation metric, the Percentage of Correct Parts (PCP), the Percentage of Correct Keypoints (PCK), and the Percent of Detected Joints (PDJ) have been commonly used in recent pose estimation work:
  • PCP: It describes a broadly-adopted evaluation protocol that measures the percentage of correctly localized body parts. A candidate body part is labeled as correct if its segment endpoints lie within 50% of the length of the ground-truth annotated endpoints [,].
  • PCK: It defines a candidate keypoint to be correct if it falls within α × max ( h , w ) pixels of the GT keypoint, where h and w are respectively the height and width of the bounding box and α is the relative threshold for correctness [] (a minimal computation sketch is given after this list).
  • PDJ: A joint is considered detected if the distance between the predicted joint and the true one is within a certain fraction of the torso diameter. By varying this fraction, detection rates are obtained for varying degrees of localization precision. This metric alleviates the drawback of PCP since the detection criteria for all joints are based on the same distance threshold [].
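As an example of how such thresholds are applied in practice, a minimal PCK computation is sketched below (the array layouts are assumptions):

```python
import numpy as np

def pck(pred, gt, boxes_hw, alpha=0.5):
    """Percentage of Correct Keypoints.

    pred, gt: arrays of shape (num_samples, num_joints, 2) with (x, y) pixels.
    boxes_hw: array of shape (num_samples, 2) holding bounding-box (height, width).
    A keypoint counts as correct if its error is within alpha * max(h, w) pixels.
    """
    dists = np.linalg.norm(pred - gt, axis=-1)                 # (N, J)
    thresholds = alpha * boxes_hw.max(axis=-1, keepdims=True)  # (N, 1)
    return 100.0 * (dists <= thresholds).mean()                # overall percentage
```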

4.3. Results of J-HMDB Dataset

Based on the work of Charles et al. [], a joint is considered to be correctly located if it is within a set distance of d pixels from a marked joint center in the GT. Accordingly, different results are presented as graphs that plot accuracy per joint type vs. distance from the GT in pixels in Figure 4.
Figure 4. Pose estimation results on J-HMDB: accuracy per joint type according to the allowed distance from the GT.
Those results are confirmed by those presented in Figure 5, which shows the PDJ results per joint type according to the normalized precision threshold. For upper-body joints, the detection rate reaches approximately 90% even at a 0.5 precision threshold. We note that our pose estimator performs well for almost all action classes, even though the dataset contains real-world occluded scenarios. For some actions, such as “brush hair” or “wave”, the accuracy is lower, mainly for the knee and ankle joints. Indeed, for those action classes, the provided RGB frames show only the upper body, which makes it difficult to estimate lower-body joints such as the ankle or the knee.
Figure 5. Percent of Detected Joints (PDJ) results on J-HMDB: detection rate per joint type according to the normalized precision threshold.
We compare the proposed approach with seven state-of-the-art methods tested on the same dataset in Table 1. The first two methods, Dense Trajectories (DTs) [] and the Spatial Temporal “And/Or” Graph Model (STAOGM) [], are hand-crafted. The remaining approaches are CNN based: Pose-CNN (P-CNN) [], Action-tubes (A-tubes) [], Semantic Region-based CNN (SR-CNN) [], Motion-Salient Region CNN (MSR-CNN) [], and Human-Related Multi-Stream CNN (HR-MSCNN) []. From the comparison with DTs and STAOGM, we find that deep learned features outperform hand-crafted ones for action recognition. For P-CNN, the pose estimator used does not always perform well. Our method achieves results close to those of A-tubes; however, its authors used an empirically selected parameter α, fixed as a constant, which might not be optimal for different kinds of videos. The two-stream SR-CNN algorithm is similar to our method: it incorporates semantic regions detected by Faster R-CNN [] into the original two-stream CNNs. This method uses all detected regions, not only the human body but also other foreground and background regions, and the features extracted in those regions may negatively impact the performance of SR-CNN. In contrast, our method focuses on the human body region, where the features are beneficial for the task of action recognition. Compared to MSR-CNN, the authors in [] used a spatio-temporal 3D convolutional method for fusion; thus, their network performs slightly better. Regarding the HR-MSCNN results, the proposed architecture combines two traditional streams, appearance (R1) and motion (R2), in addition to the captured tubes of the human-related regions (R3), which increases the computation time. In fact, it achieves a 62.98% accuracy rate when using only one region input (R1) and 71.17% when using all of them (R1 + R2 + R3).
Table 1. Comparison with state-of-the-art methods on the J-HMDB dataset. DTs, Dense Trajectories; STAOGM, Spatial Temporal “And/Or” Graph Model; P-CNN, Pose-CNN; A-tubes, Action-tubes; SR, Semantic Region; MSR, Motion-Salient Region; HR-MSCNN, Human-Related Multi-Stream CNN.

4.4. Results of the CAD-60 Dataset

For the CAD-60 dataset, different pose estimation results are presented in Figure 6 as accuracy graphs according to the allowed distance from the GT after applying the four-fold cross-validation process.
Figure 6. HPE results on CAD-60 with four-fold cross-validation: accuracy per joint type according to the allowed distance from the GT.
In Table 2, we report the different PCK-0.5 results on the CAD-60 dataset.
Table 2. PCK-0.5 results on the CAD-60 dataset.
For the upper-body parts of the CAD-60 dataset, the pose estimation results are good for the different joints. However, for lower-body parts, each iteration seems to be effective for a well-defined part of the human body. For example, in the fourth iteration of the cross-validation process, the pose prediction reaches about an 83.1% accuracy rate for “knee”. Despite the presence of a left-handed actor in the third iteration (k = 3), the estimation seems to be more effective for “foot”: nearly 100% accuracy. This contrast is mainly due to the joints provided with the CAD-60 dataset. In fact, coming from the Kinect (i.e., not manually annotated), the joints are generally sensitive to noise. In addition, the sensor’s ability to detect lower-body parts is very limited, since the distance between the camera and the person must not exceed a few meters. Those facts may explain the different failures observed, especially for lower-body parts. The accuracy results are confirmed by the PCK ones in Table 2, where the scores are reported for each key joint separately and for the whole body. HPE algorithms can be useful for various tasks in many areas, such as action recognition, human detection, human attribute recognition, and various gait processing tasks []. We chose the HAR task as it presents many challenges due to occlusions and overlapping scenes. For this purpose, a multi-class one-against-one SVM classifier was used through LIBSVM (Library for Support Vector Machines) [] to recognize the different activities. To determine the best configuration of this classifier, a four-fold cross-validation was applied, with a polynomial kernel function. The SVM input is the vector of the 2D positions of all 15 joints computed in the previous pose estimation stage. In the training stage, we used 14,294 sample frames out of 21,442 and left the rest for the testing stage.
As for the HAR results, we show the confusion matrix for the CAD-60 dataset in Figure 7. There are some confusion errors between “drinking water” and “talking on phone” and between “rinsing mouth with water” and “talking on phone”, with 0.02% and 0.03%, respectively. This is due to the great similarity between these activity classes. We recall that, in our work, we estimate a full-body pose directly from RGB images and then recognize the corresponding activity. Table 3 shows the competitiveness of our approach on the CAD-60 dataset. Using the accuracy measure, our solution ranks first and demonstrates a robust precision/recall ratio (95.4% and 95.6%, respectively). It reaches an accuracy of 95.5% in terms of correctly labeled samples and achieves the highest recall of 95.6%, as shown in the confusion matrix in Figure 7. Our approach achieves promising performance even in challenging cases (the left-handed actor in the CAD-60 dataset) and using only RGB frames as the system input.
Figure 7. CAD-60 confusion matrix for 12 activities.
Table 3. Comparison with state-of-the-art results on CAD-60. DBN, Dynamic Bayesian Network; MRF, Markov Random Field; BOW, Bag Of Words; GMM, Gaussian Mixture Modeling; HMM, Hidden Markov Model; STIP, Spatio-Temporal Interest Point. (* indicates which input data are used: skeleton, RGB, or depth.)
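For reference, the confusion matrix and the accuracy/precision/recall figures reported above can be computed with scikit-learn as sketched below; the label arrays are hypothetical placeholders for the SVM test outputs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical ground-truth and predicted activity labels for a few test frames.
y_true = np.array(["drinking water", "talking on phone", "brushing teeth", "drinking water"])
y_pred = np.array(["drinking water", "talking on phone", "brushing teeth", "talking on phone"])

cm = confusion_matrix(y_true, y_pred, normalize="true")  # row-normalized, as in Figure 7
acc = accuracy_score(y_true, y_pred)                     # proportion of correctly labeled frames
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
print(cm)
print("accuracy=%.3f  precision=%.3f  recall=%.3f" % (acc, prec, rec))
```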

5. Conclusions

In this work, we put forward a new approach for 2D full-body HPE. As pose estimation is a key step for a wide range of applications, the more precise it is, the more effective the recognition will be. That is why we took advantage of a deep ConvNet architecture, given its precision and robustness. The main contribution of our work is to estimate full-body human poses via a ConvNet architecture adapted to a regression problem. From RGB frames only, we extracted deep features represented by 15 key joint positions of the human body. In order to evaluate the proposed HPE model, we applied it to recognize the daily activities of a person in an unconstrained environment: the deep estimated poses were fed to an SVM classifier. The evaluation on challenging datasets (J-HMDB and CAD-60) and the comparison with the state of the art demonstrate that our method achieves a competitive ranking on the benchmarks used. The obtained results show the efficiency of using the ConvNet-based pose estimation technique to improve the activity recognition rate.
However, the proposed approach can be further improved. First, an interesting direction is the investigation of more data augmentation techniques such as image translation, color contrasting, and temporal variation [,]. Second, a straightforward perspective is to use better performing methods at the pose estimation level. In particular, we could exploit the temporal dimension of the input videos via 3D CNNs, which are better adapted to the continuous spatial and temporal characteristics of video data [].

Author Contributions

Conceptualization, S.N.B.; Formal analysis, N.E.B.A.; Investigation, S.N.B.; Methodology, S.N.B.; Project administration, N.E.B.A.; Software, S.N.B.; Supervision, N.E.B.A.; Validation, N.E.B.A.; Writing—original draft, S.N.B.; Writing—review and editing, S.N.B. and N.E.B.A. All authors read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Qiang, L.; Zhang, W.; Hongliang, L.; Ngan, K.N. Hybrid human detection and recognition in surveillance. Neurocomputing 2016, 194, 10–23. [Google Scholar]
  2. D’Eusanio, A.; Simoni, A.; Pini, S.; Borghi, G.; Vezzani, R.; Cucchiara, R. Multimodal hand gesture classification for the human–car interaction. Informatics 2020, 7, 31. [Google Scholar] [CrossRef]
  3. Unzueta, L.; Goenetxea, J.; Rodriguez, M.; Linaza, M.T. Dependent 3D human body posing for sports legacy recovery from images and video. In Proceedings of the 2014 22nd European Signal Processing Conference (EUSIPCO), Lisbon, Portugal, 1–5 September 2014; pp. 361–365. [Google Scholar]
  4. Chen, C.; Yang, Y.; Nie, F.; Odobez, J.M. 3D human pose recovery from image by efficient visual feature selection. Comput. Vis. Image Underst. 2011, 115, 290–299. [Google Scholar] [CrossRef]
  5. Rahimi, M.; Alghassi, A.; Ahsan, M.; Haider, J. Deep Learning Model for Industrial Leakage Detection Using Acoustic Emission Signal. Informatics 2020, 4, 49. [Google Scholar] [CrossRef]
  6. Konstantaras, A. Deep Learning and Parallel Processing Spatio-Temporal Clustering Unveil New Ionian Distinct Seismic Zone. Informatics 2020, 4, 39. [Google Scholar] [CrossRef]
  7. Chen, X.; Yuille, A.L. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; pp. 1736–1744. [Google Scholar]
  8. Zuffi, S.; Romero, J.; Schmid, C.; Black, M.J. Estimating human pose with flowing puppets. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3312–3319. [Google Scholar]
  9. Seddik, B.; Gazzah, S.; Essoukri Ben Amara, N. Hybrid Multi-modal Fusion for Human Action Recognition. In Proceedings of the International Conference Image Analysis and Recognition, Montreal, QC, Canada, 5–7 July 2017; pp. 201–209. [Google Scholar]
  10. Seddik, B.; Gazzah, S.; Essoukri Ben Amara, N. Hands, face and joints for multi-modal human-action temporal segmentation and recognition. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 1143–1147. [Google Scholar]
  11. Mhalla, A.; Chateau, T.; Maamatou, H.; Gazzah, S.; Essoukri Ben Amara, N. SMC faster R-CNN: Toward a scene-specialized multi-object detector. Comput. Vis. Image Underst. 2017, 164, 3–15. [Google Scholar] [CrossRef]
  12. Seddik, B.; Gazzah, S.; Essoukri Ben Amara, N. Modalities combination for Italian sign language extraction and recognition. In International Conference on Image Analysis and Processing; Springer: Cham, Switzerland, 2015; pp. 710–721. [Google Scholar]
  13. Boualia, S.N.; Essoukri Ben Amara, N. Pose-based Human Activity Recognition: A review. In Proceedings of the 2019 15th International Wireless Communications Mobile Computing Conference (IWCMC), Tangier, Morocco, 24–28 June 2019; pp. 1468–1475. [Google Scholar] [CrossRef]
  14. Daubney, B.; Gibson, D.; Campbell, N. Estimating pose of articulated objects using low-level motion. Comput. Vis. Image Underst. 2012, 116, 330–346. [Google Scholar] [CrossRef]
  15. Ning, H.; Xu, W.; Gong, Y.; Huang, T. Discriminative learning of visual words for 3D human pose estimation. In Proceedings of the 2008 Computer Vision and Pattern Recognition—CVPR 2008, Anchorage, AK, USA, 24–26 June 2008; pp. 1–8. [Google Scholar]
  16. Ferrari, V.; Marin-Jimenez, M.; Zisserman, A. Progressive search space reduction for human pose estimation. In Proceedings of the Computer Vision and Pattern Recognition—CVPR 2008, Anchorage, AK, USA, 24–26 June 2008; pp. 1–8. [Google Scholar]
  17. Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 20–25 June 2011; pp. 1297–1304. [Google Scholar]
  18. Poppe, R. Evaluating example-based pose estimation: Experiments on the humaneva sets. In Proceedings of the CVPR 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, Minneapolis, MN, USA, 22 June 2007; pp. 1–8. [Google Scholar]
  19. Niyogi, S.; Freeman, W.T. Example-based head tracking. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, 14–16 October 1996; pp. 374–378. [Google Scholar] [CrossRef]
  20. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  21. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  22. Zhang, N.; Paluri, M.; Ranzato, M.; Darrell, T.; Bourdev, L. Panda: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1637–1644. [Google Scholar]
  23. Pishchulin, L.; Andriluka, M.; Gehler, P.; Schiele, B. Poselet conditioned pictorial structures. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 588–595. [Google Scholar]
  24. Carreira, J.; Agrawal, P.; Fragkiadaki, K.; Malik, J. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4733–4742. [Google Scholar]
  25. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 483–499. [Google Scholar]
  26. Belagiannis, V.; Zisserman, A. Recurrent human pose estimation. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 468–475. [Google Scholar]
  27. Lifshitz, I.; Fetaya, E.; Ullman, S. Human pose estimation using deep consensus voting. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 246–260. [Google Scholar]
  28. Zhou, X.; Zhu, M.; Leonardos, S.; Derpanis, K.G.; Daniilidis, K. Sparseness meets deepness: 3D human pose estimation from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4966–4975. [Google Scholar]
  29. Pfister, T.; Charles, J.; Zisserman, A. Flowing convnets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1913–1921. [Google Scholar]
  30. Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. 3d human pose estimation with 2d marginal heat maps. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1477–1485. [Google Scholar]
  31. Toyoda, K.; Kono, M.; Rekimoto, J. Post-Data Augmentation to Improve Deep Pose Estimation of Extreme and Wild Motions. arXiv 2019, arXiv:1902.04250. [Google Scholar]
  32. Kreiss, S.; Bertoni, L.; Alahi, A. PifPaf: Composite Fields for Human Pose Estimation. arXiv 2019, arXiv:1903.06593. [Google Scholar]
  33. Gärtner, E.; Pirinen, A.; Sminchisescu, C. Deep Reinforcement Learning for Active Human Pose Estimation. arXiv 2020, arXiv:2001.02024. [Google Scholar] [CrossRef]
  34. Mathis, M.W.; Mathis, A. Deep learning tools for the measurement of animal behavior in neuroscience. Curr. Opin. Neurobiol. 2020, 60, 1–11. [Google Scholar] [CrossRef] [PubMed]
  35. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 568–576. [Google Scholar]
  36. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y. Towards good practices for very deep two stream convnets. arXiv 2015, arXiv:1507.02159. [Google Scholar]
  37. Ijjina, E.P.; Chalavadi, K.M. Human action recognition using genetic algorithms and convolutional neural networks. Pattern Recognit. 2016, 59, 199–212. [Google Scholar] [CrossRef]
  38. Wang, K.; Wang, X.; Lin, L.; Wang, M.; Zuo, W. 3D human activity recognition with reconfigurable convolutional neural networks. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 97–106. [Google Scholar]
  39. Shao, J.; Kang, K.; Change Loy, C.; Wang, X. Deeply learned attributes for crowded scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4657–4666. [Google Scholar]
  40. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  41. Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517. [Google Scholar] [CrossRef] [PubMed]
  42. Shou, Z.; Chan, J.; Zareian, A.; Miyazawa, K.; Chang, S.F. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1417–1426. [Google Scholar]
  43. Neili, S.; Gazzah, S.; El Yacoubi, M.A.; Essoukri Ben Amara, N. Human posture recognition approach based on ConvNets and SVM classifier. In Proceedings of the 2017 International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Fez, Morocco, 22–24 May 2017; pp. 1–6. [Google Scholar]
  44. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; ACM: New York, NY, USA, 2014; pp. 675–678. [Google Scholar]
  45. Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 3192–3199. [Google Scholar]
  46. Sung, J.; Ponce, C.; Selman, B.; Saxena, A. Human Activity Detection from RGBD Images. Plan Act. Intent Recognit. 2011, 64, 47–55. [Google Scholar]
  47. Zuffi, S.; Freifeld, O.; Black, M.J. From pictorial structures to deformable structures. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3546–3553. [Google Scholar]
  48. Sapp, B.; Taskar, B. Modec: Multimodal decomposable models for human pose estimation. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3674–3681. [Google Scholar]
  49. Wang, H.; Kläser, A.; Schmid, C.; Liu, C.L. Action recognition by dense trajectories. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, USA, 20–25 June 2011; pp. 3169–3176. [Google Scholar]
  50. Xiaohan Nie, B.; Xiong, C.; Zhu, S.C. Joint action recognition and pose estimation from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1293–1301. [Google Scholar]
  51. Chéron, G.; Laptev, I.; Schmid, C. P-cnn: Pose-based cnn features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3218–3226. [Google Scholar]
  52. Gkioxari, G.; Malik, J. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 759–768. [Google Scholar]
  53. Wang, Y.; Song, J.; Wang, L.; Van Gool, L.; Hilliges, O. Two-Stream SR-CNNs for Action Recognition in Videos. In Proceedings of the BMVC, York, UK, 19–22 September 2016. [Google Scholar]
  54. Tu, Z.; Cao, J.; Li, Y.; Li, B. MSR-CNN: Applying motion salient region based descriptors for action recognition. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3524–3529. [Google Scholar]
  55. Tu, Z.; Xie, W.; Qin, Q.; Poppe, R.; Veltkamp, R.C.; Li, B.; Yuan, J. Multi-stream CNN: Learning representations based on human-related regions for action recognition. Pattern Recognit. 2018, 79, 32–43. [Google Scholar] [CrossRef]
  56. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  57. Petrov, I.; Shakhuro, V.; Konushin, A. Deep probabilistic human pose estimation. IET Comput. Vis. 2018, 12, 578–585. [Google Scholar] [CrossRef]
  58. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27. [Google Scholar] [CrossRef]
  59. Sung, J.; Ponce, C.; Selman, B.; Saxena, A. Unstructured human activity detection from rgbd images. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation (ICRA), Saint Paul, MN, USA, 14–18 May 2012; pp. 842–849. [Google Scholar]
  60. Koppula, H.S.; Gupta, R.; Saxena, A. Learning human activities and object affordances from rgb-d videos. Int. J. Robot. Res. 2013, 32, 951–970. [Google Scholar] [CrossRef]
  61. Zhang, C.; Tian, Y. RGB-D camera-based daily living activity recognition. J. Comput. Vis. Image Process. 2012, 2, 12. [Google Scholar]
  62. Yang, X.; Tian, Y. Effective 3d action recognition using eigenjoints. J. Vis. Commun. Image Represent. 2014, 25, 2–11. [Google Scholar] [CrossRef]
  63. Piyathilaka, L.; Kodagoda, S. Gaussian mixture based HMM for human daily activity recognition using 3D skeleton features. In Proceedings of the 2013 8th IEEE Conference on Industrial Electronics and Applications (ICIEA), Melbourne, Australia, 19–21 June 2013; pp. 567–572. [Google Scholar]
  64. Ni, B.; Pei, Y.; Moulin, P.; Yan, S. Multilevel depth and image fusion for human activity detection. IEEE Trans. Cybern. 2013, 43, 1383–1394. [Google Scholar] [PubMed]
  65. Gupta, R.; Chia, A.Y.S.; Rajan, D. Human activities recognition using depth images. In Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Spain, 21 October 2013; ACM: New York, NY, USA, 2013; pp. 283–292. [Google Scholar]
  66. Wang, J.; Liu, Z.; Wu, Y. Learning actionlet ensemble for 3D human action recognition. In Human Action Recognition with Depth Cameras; Springer: Cham, Switzerland, 2014; pp. 11–40. [Google Scholar]
  67. Zhu, Y.; Chen, W.; Guo, G. Evaluating spatiotemporal interest point features for depth-based action recognition. Image Vis. Comput. 2014, 32, 453–464. [Google Scholar] [CrossRef]
  68. Faria, D.R.; Premebida, C.; Nunes, U. A probabilistic approach for human everyday activities recognition using body motion from RGB-D images. In Proceedings of the 2014 RO-MAN: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, Edinburgh, UK, 25–29 August 2014; pp. 732–737. [Google Scholar]
  69. Shan, J.; Akella, S. 3D human action segmentation and recognition using pose kinetic energy. In Proceedings of the 2014 IEEE Workshop on Advanced Robotics and Its Social Impacts (ARSO), Evanston, IL, USA, 11–13 September 2014; pp. 69–75. [Google Scholar]
  70. Gaglio, S.; Re, G.L.; Morana, M. Human activity recognition process using 3-D posture data. IEEE Trans. Hum. Mach. Syst. 2015, 45, 586–597. [Google Scholar] [CrossRef]
  71. Parisi, G.I.; Weber, C.; Wermter, S. Self-organizing neural integration of pose-motion features for human action recognition. Front. Neurorobotics 2015, 9, 3. [Google Scholar] [CrossRef]
  72. Cippitelli, E.; Gasparrini, S.; Gambi, E.; Spinsante, S. A human activity recognition system using skeleton data from RGBD sensors. Comput. Intell. Neurosci. 2016, 2016, 4351435. [Google Scholar] [CrossRef]
  73. Seddik, B.; Gazzah, S.; Essoukri Ben Amara, N. Human-action recognition using a multi-layered fusion scheme of Kinect modalities. IET Comput. Vis. 2017, 11, 530–540. [Google Scholar] [CrossRef]
  74. Rogez, G.; Schmid, C. Mocap-guided data augmentation for 3d pose estimation in the wild. Adv. Neural Inf. Process. Syst. 2016, 29, 3108–3116. [Google Scholar]
  75. Peng, X.; Tang, Z.; Yang, F.; Feris, R.S.; Metaxas, D. Jointly optimize data augmentation and network training: Adversarial data augmentation in human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2226–2234. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
