A Comprehensive Survey of Vision-Based Human Action Recognition Methods
Abstract
1. Introduction
The main contributions of this work are as follows:

(1) We provide a comprehensive survey of human action feature representation methods covering different data types, including both handcrafted and deep learning-based feature representation methods for RGB, depth, and skeleton data.
(2) Unlike most previous reviews, which discussed only action modeling and action classification, we also summarize human–object interaction recognition and action detection methods.
(3) We offer suggestions for future research on human action recognition.
2. Overview of Human Action Recognition
3. Human Action Feature Representation Methods
3.1. Overview of Handcrafted Action Features for RGB Data
3.2. Overview of Handcrafted Action Features for Depth and Skeleton Data
3.3. Overview of Action Feature Representation Methods Based on Deep Learning
3.4. Performance Evaluation Criteria and Datasets
4. Overview of Interaction Recognition Methods
Interaction recognition methods generally follow four principles:

(1) Local interaction features should be dense enough to represent information at various locations in the image.
(2) The interaction between the human and the object(s) is modeled on the structure of body parts.
(3) The core of the interaction model is the co-occurrence and positional relationship between the human body and the object(s); a minimal sketch of such pairwise spatial features follows this list.
(4) Features with higher discriminative power are selected from the dense features.
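To make point (3) concrete, the snippet below is a minimal sketch of the kind of pairwise spatial features (relative offset, scale ratio, overlap) that co-occurrence and position-based interaction models typically compute between a detected human and a detected object. The box format, normalization, and feature set are illustrative assumptions, not the formulation of any specific method surveyed here.

```python
# Minimal sketch: pairwise spatial features between a human box and an
# object box, as used by co-occurrence/position-based interaction models.
# Box format (x1, y1, x2, y2) and the feature set are assumptions.
import math

def box_center(box):
    """Return the center (cx, cy) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def interaction_features(human_box, object_box):
    """Relative offset, log scale ratio, and IoU for one human-object pair."""
    hx, hy = box_center(human_box)
    ox, oy = box_center(object_box)
    hw = human_box[2] - human_box[0]
    hh = human_box[3] - human_box[1]

    # Offset of the object relative to the human, normalized by human size.
    dx = (ox - hx) / max(hw, 1e-6)
    dy = (oy - hy) / max(hh, 1e-6)

    # Log ratio of box areas (scale relationship between object and human).
    h_area = max(hw * hh, 1e-6)
    o_area = max((object_box[2] - object_box[0]) *
                 (object_box[3] - object_box[1]), 1e-6)
    scale = math.log(o_area / h_area)

    # Intersection-over-union (overlap/co-occurrence cue).
    ix1 = max(human_box[0], object_box[0]); iy1 = max(human_box[1], object_box[1])
    ix2 = min(human_box[2], object_box[2]); iy2 = min(human_box[3], object_box[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    iou = inter / (h_area + o_area - inter)

    return [dx, dy, scale, iou]

# Example: a person box and a small object box held near the person's hand.
print(interaction_features((100, 50, 220, 400), (190, 200, 230, 250)))
```

Such a vector would typically be concatenated with appearance features and fed to a classifier that scores each human-object pair.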
5. Overview of Human Action Detection Methods
6. Conclusions and Discussion
The main conclusions of this survey are as follows:

(1) In action recognition research, suitable data for capturing the action must first be selected; a reasonable algorithm should then be chosen to recognize the action.
(2) For action feature learning, deep learning-based methods show superior performance (see the sketch after this list).
(3) Beyond the classification of primitive, single-person actions, interaction recognition and action detection have become prominent new research topics.
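As one concrete illustration of point (2), the sketch below shows the core idea behind 3D convolutional feature learning in the spirit of C3D [20]: stacked spatiotemporal convolutions turn a short clip into a clip-level feature vector. This is a minimal PyTorch sketch; the layer count, channel widths, and class count are illustrative assumptions, not any surveyed architecture.

```python
# Minimal sketch (PyTorch): C3D-style spatiotemporal feature learning.
# Layer count and channel sizes are illustrative, not a surveyed model.
import torch
import torch.nn as nn

class TinyC3D(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            # 3x3x3 convolutions mix appearance (H, W) and motion (T).
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space only at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),          # then pool time and space
            nn.AdaptiveAvgPool3d(1),              # clip-level descriptor
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                      # clip: (N, 3, T, H, W)
        f = self.features(clip).flatten(1)        # (N, 64) clip feature
        return self.classifier(f)

# Example: one 16-frame 112x112 RGB clip.
logits = TinyC3D()(torch.randn(1, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([1, 101])
```

Real networks use many more layers and large-scale training; the point here is only that a single Conv3d kernel spans both space and time.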
Future research directions include the following:

(1) Multimodal visual perception and action representation based on multimodal fusion. In current research, human action features can be learned efficiently when accurate depth information and skeleton data are available. In most real scenes, however, the data acquisition platform provides only RGB data; while depth sensors can be used outdoors, their accuracy and cost make them unsuitable for monitoring scenarios. Recovering accurate modal data, such as depth information and skeleton data, from existing RGB data is not only the basis of efficient action recognition but also has important applications in many other visual analysis tasks. Integrating RGB, depth, and skeleton data through multimodal fusion is therefore a key issue in action recognition research (see the late-fusion sketch after this list).
(2) Interaction recognition. The recognition of interactions between humans and objects carries high-level semantic information, such as carrying dangerous items, leaving items behind, and waving equipment. Modeling human-object interaction from multimodal data and analyzing it quickly is not yet possible at an appropriate level of accuracy; this will be an important direction for future research on human action recognition.
(3) Fast action detection in the spatiotemporal dimension. Research on human action recognition has mostly addressed the classification of pre-segmented video content. Although there have been attempts to localize actions in time and space, both accuracy and speed fall short of current application requirements. Analyzing the features of different information channels from multimodal data and achieving fast, accurate action detection are key to the successful application of human action recognition. Several methods have been proposed in recent years, but this challenge remains unresolved (see the temporal detection sketch after this list).
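On direction (1), score-level (late) fusion is one simple way to integrate RGB, depth, and skeleton streams: each modality is classified separately and the per-class scores are combined with weights. The sketch below is a minimal illustration under assumed weights and toy scores, not a specific surveyed fusion method.

```python
# Minimal sketch: score-level (late) fusion of per-modality classifiers.
# The modality weights and toy scores are illustrative assumptions.
import numpy as np

def late_fusion(scores_by_modality, weights):
    """Weighted average of per-modality class-score vectors."""
    fused = np.zeros_like(next(iter(scores_by_modality.values())))
    for name, scores in scores_by_modality.items():
        fused += weights[name] * np.asarray(scores, dtype=float)
    return fused

# Example: softmax scores over 4 action classes from three streams.
scores = {
    "rgb":      [0.10, 0.60, 0.20, 0.10],
    "depth":    [0.05, 0.55, 0.30, 0.10],
    "skeleton": [0.10, 0.70, 0.10, 0.10],
}
weights = {"rgb": 0.4, "depth": 0.3, "skeleton": 0.3}  # assumed weights
fused = late_fusion(scores, weights)
print(fused, "predicted class:", int(np.argmax(fused)))
```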
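On direction (3), a classic baseline for temporal action detection, in the spirit of sliding-window approaches such as [50], scores fixed-length windows over per-frame action scores and then applies 1D non-maximum suppression to keep non-overlapping detections. The window length, stride, threshold, and toy scores below are illustrative assumptions.

```python
# Minimal sketch: temporal action detection by scoring sliding windows
# over per-frame action scores, then 1D non-maximum suppression (NMS).
# Window length, stride, threshold, and scores are assumptions.
import numpy as np

def window_proposals(frame_scores, win_len=8, stride=4):
    """Score every window of win_len frames by its mean frame score."""
    props = []
    for start in range(0, len(frame_scores) - win_len + 1, stride):
        score = float(np.mean(frame_scores[start:start + win_len]))
        props.append((start, start + win_len, score))
    return props

def temporal_nms(props, iou_thresh=0.5):
    """Keep highest-scoring windows, dropping ones that overlap too much."""
    props = sorted(props, key=lambda p: p[2], reverse=True)
    kept = []
    for s, e, sc in props:
        ok = True
        for ks, ke, _ in kept:
            inter = max(0, min(e, ke) - max(s, ks))
            union = (e - s) + (ke - ks) - inter
            if inter / union > iou_thresh:
                ok = False
                break
        if ok:
            kept.append((s, e, sc))
    return kept

# Example: 32 frames with one action burst around frames 10-20.
scores = np.zeros(32)
scores[10:20] = 0.9
print(temporal_nms(window_proposals(scores)))
```

Deep detectors replace the window scoring with learned proposals, but the suppression step is essentially the same.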
References
1. Aggarwal, J.K.; Ryoo, M.S. Human activity analysis: A review. ACM Comput. Surv. 2011, 43.
2. Ziaeefard, M.; Bergevin, R. Semantic human activity recognition: A literature review. Pattern Recognit. 2015, 48, 2329–2345.
3. Van Gemert, J.C.; Jain, M.; Gati, E.; Snoek, C.G. APT: Action localization proposals from dense trajectories. In Proceedings of the British Machine Vision Conference 2015: BMVC 2015, Swansea, UK, 7–10 September 2015; p. 4.
4. Zhu, H.; Vial, R.; Lu, S. Tornado: A spatio-temporal convolutional regression network for video action proposal. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017; pp. 5813–5821.
5. Papadopoulos, G.T.; Axenopoulos, A.; Daras, P. Real-time skeleton-tracking-based human action recognition using kinect data. In Proceedings of the International Conference on Multimedia Modeling, Dublin, Ireland, 6–10 January 2014; pp. 473–483.
6. Presti, L.L.; Cascia, M.L. 3D Skeleton-based Human Action Classification: A Survey. Pattern Recognit. 2016, 53, 130–147.
7. Paul, S.N.; Singh, Y.J. Survey on Video Analysis of Human Walking Motion. Int. J. Signal Process. Image Process. Pattern Recognit. 2014, 7, 99–122.
8. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
9. Dawn, D.D.; Shaikh, S.H. A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis. Comput. 2016, 32, 289–306.
10. Nguyen, T.V.; Song, Z.; Yan, S.C. STAP: Spatial-Temporal Attention-Aware Pooling for Action Recognition. IEEE Trans. Circ. Syst. Video Technol. 2015, 25, 77–86.
11. Shao, L.; Zhen, X.T.; Tao, D.C.; Li, X.L. Spatio-Temporal Laplacian Pyramid Coding for Action Recognition. IEEE Trans. Cybern. 2014, 44, 817–827.
12. Burghouts, G.J.; Schutte, K.; ten Hove, R.J.M.; van den Broek, S.P.; Baan, J.; Rajadell, O.; van Huis, J.R.; van Rest, J.; Hanckmann, P.; Bouma, H.; et al. Instantaneous threat detection based on a semantic representation of activities, zones and trajectories. Signal Image Video Process. 2014, 8, 191–200.
13. Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the ICCV, Sydney, NSW, Australia, 1–8 December 2013; pp. 3551–3558.
14. Yang, X.; Tian, Y.L. Super Normal Vector for Activity Recognition Using Depth Sequences. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 804–811.
15. Ye, M.; Zhang, Q.; Wang, L.; Zhu, J.; Yang, R.; Gall, J. A survey on human motion analysis from depth data. In Proceedings of the Dagstuhl 2012 Seminar on Time-of-Flight Imaging: Sensors, Algorithms, and Applications and Workshop on Imaging New Modalities, GCPR 2013, Saarbrucken, Germany, 21–26 October 2012; pp. 149–187.
16. Oreifej, O.; Liu, Z. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences. In Proceedings of the Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 716–723.
17. Li, M.; Leung, H.; Shum, H.P.H. Human action recognition via skeletal and depth based feature fusion. In Proceedings of the Motion in Games 2016, Burlingame, CA, USA, 10–12 October 2016; pp. 123–132.
18. Yang, X.; Tian, Y.L. Effective 3D action recognition using EigenJoints. J. Vis. Commun. Image Represent. 2014, 25, 2–11.
19. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. In Proceedings of the NIPS 2014: Neural Information Processing Systems Conference, Montreal, QC, Canada, 8–13 December 2014; pp. 568–576.
20. Tran, D.; Bourdev, L.D.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision 2015, Las Condes, Chile, 11–18 December 2015; pp. 4489–4497.
21. Güler, R.A.; Neverova, N.; Kokkinos, I. DensePose: Dense Human Pose Estimation in the Wild. arXiv 2018, arXiv:1802.00434.
22. Fang, H.-S.; Xie, S.; Tai, Y.-W.; Lu, C. RMPE: Regional Multi-Person Pose Estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 2334–2343.
23. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; p. 7.
24. Chen, C.; Liu, K.; Kehtarnavaz, N. Real-time human action recognition based on depth motion maps. J. Real-Time Image Process. 2016, 12, 155–163.
25. Guo, G.; Lai, A. A survey on still image based human action recognition. Pattern Recognit. 2014, 47, 3343–3361.
26. Meng, M.; Drira, H.; Boonaert, J. Distances evolution analysis for online and off-line human object interaction recognition. Image Vis. Comput. 2018, 70, 32–45.
27. Chao, Y.; Wang, Z.; He, Y.; Wang, J.; Deng, J. HICO: A Benchmark for Recognizing Human-Object Interactions in Images. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1017–1025.
28. Le, D.-T.; Uijlings, J.; Bernardi, R. TUHOI: Trento universal human object interaction dataset. In Proceedings of the Third Workshop on Vision and Language, Chicago, IL, USA, 22 August 2014; pp. 17–24.
29. Peng, X.; Schmid, C. Multi-region two-stream R-CNN for action detection. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 744–759.
30. Liu, J.; Li, Y.; Song, S.; Xing, J.; Lan, C.; Zeng, W. Multi-Modality Multi-Task Recurrent Neural Network for Online Action Detection. IEEE Trans. Circ. Syst. Video Technol. 2018.
31. Patrona, F.; Chatzitofis, A.; Zarpalas, D.; Daras, P. Motion analysis: Action detection, recognition and evaluation based on motion capture data. Pattern Recognit. 2018, 76, 612–622.
32. Herath, S.; Harandi, M.; Porikli, F. Going deeper into action recognition: A survey. Image Vis. Comput. 2017, 60, 4–21.
33. Subetha, T.; Chitrakala, S. A survey on human activity recognition from videos. In Proceedings of the IEEE 2016 International Conference on Information Communication and Embedded Systems, Chennai, India, 25–26 February 2016.
34. Vishwakarma, S.; Agrawal, A. A survey on activity recognition and behavior understanding in video surveillance. Vis. Comput. 2013, 29, 983–1009.
35. Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2018.
36. Kong, Y.; Fu, Y. Human Action Recognition and Prediction: A Survey. arXiv 2018, arXiv:1806.11230.
37. Liu, A.; Xu, N.; Nie, W.; Su, Y.; Wong, Y.; Kankanhalli, M.S. Benchmarking a Multimodal and Multiview and Interactive Dataset for Human Action Recognition. IEEE Trans. Syst. Man Cybern. 2017, 47, 1781–1794.
38. Liu, A.; Su, Y.; Nie, W.; Kankanhalli, M.S. Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition. IEEE Trans. Pattern Anal. 2017, 39, 102–114.
39. Gao, Z.; Zhang, Y.; Zhang, H.; Xue, Y.; Xu, G. Multi-dimensional human action recognition model based on image set and group sparsity. Neurocomputing 2016, 215, 138–149.
40. Fernando, B.; Gavves, E.; Oramas, M.J.; Ghodrati, A.; Tuytelaars, T. Modeling video evolution for action recognition. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5378–5387.
41. Zhang, J.; Li, W.; Ogunbona, P.; Wang, P.; Tang, C. RGB-D-based action recognition datasets. Pattern Recognit. 2016, 60, 86–105.
42. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In Proceedings of the ECCV, Amsterdam, The Netherlands, 8–16 October 2016; pp. 816–833.
43. Zhang, H.-B.; Lei, Q.; Zhong, B.-N.; Du, J.-X.; Peng, J.; Hsiao, T.-C.; Chen, D.-S. Multi-Surface Analysis for Human Action Recognition in Video. SpringerPlus 2016, 5.
44. Zhang, Z.; Liu, S.; Han, L.; Shao, Y.; Zhou, W. Human action recognition using salient region detection in complex scenes. In Proceedings of Third International Conference on Communications, Signal Processing, and Systems; Mu, J., Ed.; Springer: New York, NY, USA, 2015.
45. Wang, L.; Qiao, Y.; Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4305–4314.
46. Yan, S.; Xiong, Y.; Lin, D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. arXiv 2018, arXiv:1801.07455.
47. Zhao, Y.; Xiong, Y.; Wang, L.; Wu, Z.; Tang, X.; Lin, D. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
48. Zhou, Z.; Shi, F.; Wu, W. Learning Spatial and Temporal Extents of Human Actions for Action Detection. IEEE Trans. Multimed. 2015, 17, 512–525.
49. Zhang, H.B.; Li, S.Z.; Chen, S.Y.; Su, S.Z.; Lin, X.M.; Cao, D.L. Locating and recognizing multiple human actions by searching for maximum score subsequences. Signal Image Video Process. 2015, 9, 705–714.
50. Shu, Z.; Yun, K.; Samaras, D. Action Detection with Improved Dense Trajectories and Sliding Window. In Proceedings of ECCV; Springer: Cham, Switzerland, 2014; pp. 541–551.
51. Oneata, D.; Verbeek, J.J.; Schmid, C. Efficient Action Localization with Approximately Normalized Fisher Vectors. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2545–2552.
52. Chakraborty, B.K.; Sarma, D.; Bhuyan, M.K.; MacDorman, K.F. Review of constraints on vision-based gesture recognition for human–computer interaction. IET Comput. Vis. 2018, 12, 3–15.
53. Ibrahim, M.S.; Muralidharan, S.; Deng, Z.; Vahdat, A.; Mori, G. A Hierarchical Deep Temporal Model for Group Activity Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas Valley, NV, USA, 27–30 June 2016; pp. 1971–1980.
54. Bobick, A.F.; Davis, J.W. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. 2001, 23, 257–267.
55. Zhang, Z.; Hu, Y.; Chan, S.; Chia, L. Motion Context: A New Representation for Human Action Recognition. In Proceedings of European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2008; pp. 817–829.
56. Klaser, A.; Marszalek, M.; Schmid, C. A Spatio-Temporal Descriptor Based on 3D-Gradients. In Proceedings of the British Machine Vision Conference, Leeds, UK, 1–4 September 2008; pp. 1–10.
57. Somasundaram, G.; Cherian, A.; Morellas, V.; Papanikolopoulos, N. Action recognition using global spatio-temporal features derived from sparse representations. Comput. Vis. Image Underst. 2014, 123, 1–13.
58. Laptev, I. On space-time interest points. Int. J. Comput. Vis. 2005, 64, 107–123.
59. Chakraborty, B.; Holte, M.B.; Moeslund, T.B.; Gonzalez, J. Selective spatio-temporal interest points. Comput. Vis. Image Underst. 2012, 116, 396–410.
60. Peng, X.; Wang, L.; Wang, X.; Qiao, Y. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Comput. Vis. Image Underst. 2016, 150, 109–125.
61. Nazir, S.; Yousaf, M.H.; Velastin, S.A. Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition. Comput. Electr. Eng. 2018.
62. Gaidon, A.; Harchaoui, Z.; Schmid, C. Activity representation with motion hierarchies. Int. J. Comput. Vis. 2014, 107, 219–238.
63. Wang, H.; Oneata, D.; Verbeek, J.J.; Schmid, C. A Robust and Efficient Video Representation for Action Recognition. Int. J. Comput. Vis. 2016, 119, 219–238.
64. Peng, X.; Zou, C.; Qiao, Y.; Peng, Q. Action Recognition with Stacked Fisher Vectors. In Proceedings of European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 581–595.
65. Miao, J.; Jia, X.; Mathew, R.; Xu, X.; Taubman, D.; Qing, C. Efficient action recognition from compressed depth maps. In Proceedings of the International Conference on Image Processing, Phoenix, AZ, USA, 25–28 September 2016; pp. 16–20.
66. Chen, C.; Jafari, R.; Kehtarnavaz, N. Action Recognition from Depth Sequences Using Depth Motion Maps-Based Local Binary Patterns. In Proceedings of the Workshop on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; pp. 1092–1099.
67. Rahmani, H.; Mahmood, A.; Huynh, D.Q.; Mian, A. Real time action recognition using histograms of depth gradients and random decision forests. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, 24–26 March 2014; pp. 626–633.
68. Yang, X.; Zhang, C.; Tian, Y.L. Recognizing actions using depth motion maps-based histograms of oriented gradients. In Proceedings of the ACM International Conference on Multimedia, Nara, Japan, 29 October–2 November 2012; pp. 1057–1060.
69. Pazhoumanddar, H.; Lam, C.P.; Masek, M. Joint movement similarities for robust 3D action recognition using skeletal data. J. Vis. Commun. Image Represent. 2015, 30, 10–21.
70. Vemulapalli, R.; Arrate, F.; Chellappa, R. Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group. In Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 588–595.
71. Gowayyed, M.A.; Torki, M.; Hussein, M.E.; Elsaban, M. Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition. In Proceedings of the International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013; pp. 1351–1357.
72. Xia, L.; Chen, C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 20–27.
73. Keceli, A.S.; Can, A.B. Recognition of Basic Human Actions using Depth Information. Int. J. Pattern Recognit. Artif. Intell. 2014, 28, 1450004.
74. Liu, L.; Shao, L. Learning discriminative representations from RGB-D video data. In Proceedings of the International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013; pp. 1493–1500.
75. Chaaraoui, A.A.; Padillalopez, J.R.; Florezrevuelta, F. Fusion of Skeletal and Silhouette-Based Features for Human Action Recognition with RGB-D Devices. In Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, Sydney, NSW, Australia, 2–8 December 2013; pp. 91–97.
76. Chen, W.; Guo, G. TriViews: A general framework to use 3D depth data effectively for action recognition. J. Vis. Commun. Image Represent. 2015, 26, 182–191.
77. Shotton, J.; Sharp, T.; Fitzgibbon, A.; Blake, A.; Cook, M.; Kipman, A.; Finocchio, M.; Moore, R. Real-Time human pose recognition in parts from single depth images. Commun. ACM 2013, 56, 116–124.
78. Althloothi, S.; Mahoor, M.H.; Zhang, X.; Voyles, R.M. Human activity recognition using multi-features and multiple kernel learning. Pattern Recognit. 2014, 47, 1800–1812.
79. Sanchezriera, J.; Hua, K.; Hsiao, Y.; Lim, T.; Hidayati, S.C.; Cheng, W. A comparative study of data fusion for RGB-D based visual recognition. Pattern Recognit. Lett. 2016, 73, 1–6.
80. Jalal, A.; Kim, Y.; Kim, Y.; Kamal, S.; Kim, D. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognit. 2017, 61, 295–308.
81. Ni, B.; Pei, Y.; Moulin, P.; Yan, S. Multilevel Depth and Image Fusion for Human Activity Detection. IEEE Trans. Syst. Man Cybern. 2013, 43, 1383–1394.
82. Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-Time Action Recognition With Deeply Transferred Motion Vector CNNs. IEEE Trans. Image Process. 2018, 27, 2326–2339.
83. Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. 2017, 39, 677–691.
84. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional Two-Stream Network Fusion for Video Action Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas Valley, NV, USA, 27–30 June 2016; pp. 1933–1941.
85. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Gool, L.V. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Proceedings of ECCV; Springer: Cham, Switzerland, 2016; pp. 20–36.
86. Lan, Z.; Zhu, Y.; Hauptmann, A.G.; Newsam, S. Deep Local Video Feature for Action Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1219–1225.
87. Zhou, B.; Andonian, A.; Torralba, A. Temporal Relational Reasoning in Videos. arXiv 2017, arXiv:1711.08496.
88. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the 2017 Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733.
89. Diba, A.; Fayyaz, M.; Sharma, V.; Karami, A.H.; Arzani, M.M.; Yousefzadeh, R.; Van Gool, L. Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification. arXiv 2017, arXiv:1711.08200.
90. Zhu, J.; Zou, W.; Zhu, Z. End-to-end Video-Level Representation Learning for Action Recognition. arXiv 2017, arXiv:1711.04161.
91. Ng, Y.H.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond Short Snippets: Deep Networks for Video Classification. In Proceedings of the Computer Vision & Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
92. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5534–5542.
93. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. Spatio-Temporal Attention Based LSTM Networks for 3D Action Recognition and Detection. IEEE Trans. Image Process. 2018, 27, 3459–3471.
94. Wang, P.; Li, W.; Gao, Z.; Zhang, J.; Tang, C.; Ogunbona, P. Deep Convolutional Neural Networks for Action Recognition Using Depth Map Sequences. arXiv 2015, arXiv:1501.04686.
95. Ye, Y.; Tian, Y. Embedding Sequential Information into Spatiotemporal Features for Action Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1110–1118.
96. Zhu, Y.; Lan, Z.; Newsam, S.; Hauptmann, A.G. Hidden two-stream convolutional networks for action recognition. arXiv 2017, arXiv:1704.00389.
97. Marszalek, M.; Laptev, I.; Schmid, C. Actions in context. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, Miami, FL, USA, 20–25 June 2009; pp. 2929–2936.
98. Kuehne, H.; Jhuang, H.; Stiefelhagen, R.; Serre, T. HMDB51: A Large Video Database for Human Motion Recognition; Springer: Berlin/Heidelberg, Germany, 2013; pp. 571–582.
99. Niebles, J.C.; Chen, C.W.; Li, F.F. Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification. In Proceedings of European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2010; pp. 392–405.
100. Reddy, K.K.; Shah, M. Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 2013, 24, 971–981.
101. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes from Videos in The Wild. arXiv 2012, arXiv:1212.0402.
102. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P. The Kinetics Human Action Video Dataset. arXiv 2017, arXiv:1705.06950.
103. Li, W.; Zhang, Z.; Liu, Z. Action recognition based on a bag of 3D points. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops, CVPRW 2010, San Francisco, CA, USA, 13–18 June 2010; pp. 9–14.
104. Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 1290–1297.
105. Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.C. Cross-view Action Modeling, Learning and Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2649–2656.
106. Chen, C.; Jafari, R.; Kehtarnavaz, N. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 168–172.
107. Ni, B.; Wang, G.; Moulin, P. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In Proceedings of the ICCV Workshops, Barcelona, Spain, 6–13 November 2011; pp. 1147–1153.
108. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas Valley, NV, USA, 26 June–1 July 2016.
109. Wang, L.M.; Qiao, Y.; Tang, X.O. MoFAP: A Multi-level Representation for Action Recognition. Int. J. Comput. Vis. 2016, 119, 254–271.
110. Lan, Z.; Ming, L.; Li, X.; Hauptmann, A.G.; Raj, B. Beyond Gaussian Pyramid: Multi-skip Feature Stacking for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015.
111. Gaidon, A.; Harchaoui, Z.; Schmid, C. Temporal localization of actions with actoms. IEEE Trans. Pattern Anal. 2013, 35, 2782–2795.
112. Prest, A.; Ferrari, V.; Schmid, C. Explicit Modeling of Human-Object Interactions in Realistic Videos. IEEE Trans. Pattern Anal. 2013, 35, 835–848.
113. Yao, B.; Fei-Fei, L. Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Trans. Pattern Anal. 2012, 34, 1691–1703.
114. Desai, C.; Ramanan, D. Detecting actions, poses, and objects with relational phraselets. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012; pp. 158–172.
115. Meng, M.; Drira, H.; Daoudi, M.; Boonaert, J. Human Object Interaction Recognition Using Rate-Invariant Shape Analysis of Inter Joint Distances Trajectories. In Proceedings of the Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 999–1004.
116. Koppula, H.S.; Gupta, R.; Saxena, A. Learning human activities and object affordances from RGB-D videos. Int. J. Robot. Res. 2013, 32, 951–970.
117. Gupta, S.; Malik, J. Visual Semantic Role Labeling. arXiv 2015, arXiv:1505.04474.
118. Yu, G.; Liu, Z.; Yuan, J. Discriminative Orderlet Mining for Real-Time Recognition of Human-Object Interaction; Springer: Cham, Switzerland, 2014.
119. Chao, Y.-W.; Liu, Y.; Liu, X.; Zeng, H.; Deng, J. Learning to Detect Human-Object Interactions. arXiv 2017, arXiv:1702.05448.
120. Mallya, A.; Lazebnik, S. Learning models for actions and person-object interactions with transfer to question answering. In Proceedings of European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 414–428.
121. Gkioxari, G.; Girshick, R.; Dollár, P.; He, K. Detecting and recognizing human-object interactions. arXiv 2017, arXiv:1704.07333.
122. Gorban, A.; Idrees, H.; Jiang, Y.; Zamir, A.R.; Laptev, I.; Shah, M.; Sukthankar, R. THUMOS challenge: Action recognition with a large number of classes. In Proceedings of the CVPR Workshop, Boston, MA, USA, 7–12 June 2015.
123. Shou, Z.; Wang, D.; Chang, S.-F. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas Valley, NV, USA, 26 June–1 July 2016; pp. 1049–1058.
124. Yu, G.; Yuan, J. Fast action proposals for human action detection and search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015; pp. 1302–1311.
125. Weinzaepfel, P.; Harchaoui, Z.; Schmid, C. Learning to track for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision 2015, Las Condes, Chile, 11–18 December 2015; pp. 3164–3172.
126. Heilbron, F.C.; Escorcia, V.; Ghanem, B.; Niebles, J.C. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015.
127. A Large-Scale Video Benchmark for Human Activity Understanding. Available online: http://activity-net.org/index.html (accessed on 29 January 2019).
Benchmark datasets for human action recognition evaluation:

Dataset Name | Color | Depth | Skeleton | Samples | Classes
---|---|---|---|---|---|
Hollywood2 [97] | √ | × | × | 1707 | 12 |
HMDB51 [98] | √ | × | × | 6766 | 51 |
Olympic Sports [99] | √ | × | × | 783 | 16 |
UCF50 [100] | √ | × | × | 6618 | 50 |
UCF101 [101] | √ | × | × | 13,320 | 101 |
Kinetics [102] | √ | × | × | 306,245 | 400 |
MSR-Action3D [103] | × | √ | √ | 567 | 20 |
MSR-Daily Activity [104] | √ | √ | √ | 320 | 16 |
Northwestern-UCLA [105] | √ | √ | √ | 1475 | 10 |
UTD-MHAD [106] | √ | √ | √ | 861 | 27 |
RGBD-HuDaAct [107] | √ | √ | × | 1189 | 13 |
NTU RGB+D [108] | √ | √ | √ | 56,880 | 60 |
Recognition accuracy of representative methods on RGB video datasets:

Methods | Year | Hollywood2 | HMDB51 | Olympic Sports | UCF50 | UCF101 | Kinetics
---|---|---|---|---|---|---|---|
[82]D | 2018 | 86.4% | |||||
[46]D | 2018 | 81.5% | |||||
[61]D | 2018 | 68.1% | 94% | ||||
[96]D | 2017 | 78.7% | 97.1% | ||||
[89]D | 2017 | 63.5% | 93.2% | ||||
[86]D | 2017 | 75% | 95.3% | ||||
[102]D | 2017 | 79% | |||||
[90]D | 2017 | 74.8% | 95.8% | ||||
[88]D | 2017 | 80.2% | 97.9% | ||||
[95]D | 2016 | 55.2% | 85.4% | ||||
[85]D | 2016 | 69.4% | 94.2% | ||||
[63] | 2016 | 66.8% | 60.1% | 90.4% | 91.7% | 86% | |
[29]D | 2016 | 94.8% | 78.86% | ||||
[84]D | 2016 | 65.4% | 92.5% | ||||
[45]D | 2015 | 63.2% | 91.5% | ||||
[109] | 2016 | 61.7% | 88.3% | ||||
[110] | 2015 | 68% | 65.1% | 91.4% | 94.4% | 89.1% | |
[20]D | 2015 | 49.9% | 85.2% | 79.5% | |||
[40] | 2015 | 70% | 61.8% | ||||
[57] | 2014 | 37.3% | 86.04% | 70.1% | |||
[50] | 2014 | 57.2% | 85.9% | ||||
[19] | 2014 | 59.4% | 88% | 81.3% | |||
[62] | 2014 | 54.4% | 41.3% | 85.5% | |||
[100] | 2013 | 27.02% | 68.20% | ||||
[111] | 2013 | 49.3% |
Recognition accuracy of representative methods on depth and skeleton datasets:

Methods | Year | MSR-Action3D | MSR-Daily Activity | Northwestern-UCLA | UTD-MHAD | NTU RGB+D
---|---|---|---|---|---|---
[31] | 2018 | 96.2% | ||||
[93]D | 2018 | 73.4% | ||||
[46]D | 2018 | 30.7% | ||||
[80] | 2017 | 93.3% | 94.1% |||
[108]D | 2016 | 62.93% | ||||
[42]D | 2016 | 100% | 69.2% | |||
[94]D | 2015 | 100% | 81.88% | |||
[69] | 2015 | 91.2% | ||||
[76] | 2015 | 94.9% | 83.8% | |||
[106] | 2015 | 79.1% | ||||
[14] | 2014 | 93.09% | 86.25% | 31.82% | ||
[105] | 2014 | 73.1% | 81.6% | |||
[18] | 2014 | 82.30% | ||||
[70] | 2014 | 92.46% | 50.1% | |||
[67] | 2014 | 88.82% | 81.25% | |||
[78] | 2014 | 94.4% | 93.1% | |||
[16] | 2013 | 88.89% | 30.56% | |||
[74] | 2013 | 85.6% | ||||
[71] | 2013 | 91.26% | ||||
[75] | 2013 | 91.80% |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).