Fine-Grained Action Recognition by Motion Saliency and Mid-Level Patches
Abstract
1. Introduction
2. Related Work
3. Mid-Level Patch Mining Based on Motion Saliency
3.1. Motion Saliency-Based Motion Region Partition
3.2. Adaptive Motion Region Segmentation
3.3. Object Proposal Generation by the Idea of the Huffman Algorithm
Algorithm 1. Object proposal generation algorithm.
3.4. Unsupervised Mid-Level Patch Detector Training
4. Action Recognition with Graph Structure
4.1. The Graph Structure
4.2. Motion Cooperation Relationship of Mid-Level Patches
4.3. Action Recognition Model
5. Experiments
5.1. Parameter Selection for Object Proposal Generation
5.2. Action Recognition Accuracies of Different Features
5.3. Comparison with State-of-the-Art Methods
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Rohrbach, M.; Amin, S.; Andriluka, M.; Schiele, B. A database for fine grained activity detection of cooking activities. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1194–1201.
- Ni, B.; Yang, X.; Gao, S. Progressively Parsing Interactional Objects for Fine Grained Action Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 1020–1028.
- Fernando, B.; Gavves, E.; Mogrovejo, J.O.; Ghodrati, A.; Tuytelaars, T. Rank Pooling for Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 773–787.
- Perrett, T.; Damen, D. DDLSTM: Dual-Domain LSTM for Cross-Dataset Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–20 June 2019; pp. 7852–7861.
- Cherian, A.; Gould, S. Second-order Temporal Pooling for Action Recognition. Int. J. Comput. Vis. 2019, 127, 340–362.
- Wang, L.; Koniusz, P.; Huynh, D. Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition with CNNs. In Proceedings of the International Conference on Computer Vision (ICCV 2019), Seoul, South Korea, 27 October–2 November 2019; pp. 8697–8707.
- Ahad, M.A.R.; Antar, A.D.; Shahid, O. Vision-based Action Understanding for Assistive Healthcare: A Short Review. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 1–11.
- Feng, Y.; Wu, X.; Wang, H.; Liu, J. Multi-group Adaptation for Event Recognition from Videos. In Proceedings of the 22nd International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden, 24–28 August 2014; pp. 3915–3920.
- Yang, Z.; Ni, B.; Yan, S.; Moulin, P.; Qi, T. Pipelining Localized Semantic Features for Fine-Grained Action Recognition. In Proceedings of the European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, 6–12 September 2014; pp. 481–496.
- Yang, Z.; Ni, B.; Hong, R.; Meng, W.; Qi, T. Interaction part mining: A mid-level approach for fine-grained action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12 June 2015; pp. 3323–3331.
- Lan, T.; Zhu, Y.; Zamir, A.R.; Savarese, S. Action Recognition by Hierarchical Mid-Level Action Elements. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; pp. 4552–4560.
- Wang, H.; Kläser, A.; Schmid, C.; Liu, C. Action recognition by dense trajectories. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3169–3176.
- Liu, C.; Hou, J.; Wu, X.; Jia, Y. A discriminative structural model for joint segmentation and recognition of human actions. Multimed. Tools Appl. 2018, 77, 31627–31645.
- Liu, C.; Wu, X.; Jia, Y. A Hierarchical Video Description for Complex Activity Understanding. Int. J. Comput. Vis. 2016, 118, 240–255.
- Singh, S.; Gupta, A.; Efros, A.A. Unsupervised Discovery of Mid-Level Discriminative Patches. In Proceedings of the European Conference on Computer Vision (ECCV 2012), Florence, Italy, 7–13 October 2012; pp. 73–86.
- Cheng, M.; Zhang, Z.; Lin, W.; Torr, P.H.S. BING: Binarized Normed Gradients for Objectness Estimation at 300fps. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 3286–3293.
- Packer, B.; Saenko, K.; Koller, D. A combined pose, object, and feature model for action understanding. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1378–1385.
- Prest, A.; Ferrari, V.; Schmid, C. Explicit Modeling of Human-Object Interactions in Realistic Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 835–848.
- Wang, H.; Schmid, C. Action Recognition with Improved Trajectories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013), Sydney, Australia, 1–8 December 2013; pp. 3551–3558.
- Koppula, H.S.; Gupta, R.; Saxena, A. Learning human activities and object affordances from RGB-D videos. J. Robot. Res. 2013, 32, 951–970.
- Raptis, M.; Kokkinos, I.; Soatto, S. Discovering discriminative action parts from mid-level video representations. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1242–1249.
- Uijlings, J.R.R.; van de Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
- Ballas, N.; Yang, Y.; Lan, Z.; Delezoide, B.; Prêteux, F.J.; Hauptmann, A.G. Space-Time Robust Representation for Action Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013), Sydney, Australia, 1–8 December 2013; pp. 2704–2711.
- Sharma, G.; Jurie, F.; Schmid, C. Discriminative spatial saliency for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3506–3513.
- Zhou, F.; Kang, S.B.; Cohen, M.F. Time-Mapping Using Space-Time Saliency. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 3358–3365.
- Ni, B.; Paramathayalan, V.R.; Moulin, P. Multiple Granularity Analysis for Fine-Grained Action Detection. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 756–763.
- Rohrbach, M.; Rohrbach, A.; Regneri, M.; Amin, S.; Andriluka, M.; Pinkal, M.; Schiele, B. Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data. Int. J. Comput. Vis. 2016, 119, 346–373.
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, CA, USA, 20–26 June 2005; pp. 886–893.
- Dalal, N.; Triggs, B.; Schmid, C. Human Detection Using Oriented Histograms of Flow and Appearance. In Proceedings of the 9th European Conference on Computer Vision (ECCV 2006), Graz, Austria, 7–13 May 2006; pp. 428–441.
- Li, C.; Zhong, Q.; Xie, D.; Pu, S. Collaborative Spatiotemporal Feature Learning for Video Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–20 June 2019; pp. 7872–7881.
- Chéron, G.; Laptev, I.; Schmid, C. P-CNN: Pose-Based CNN Features for Action Recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; pp. 3218–3226.
- Li, W.; Feng, C.; Xiao, B.; Chen, Y. Binary Hashing CNN Features for Action Recognition. TIIS 2018, 12, 4412–4428.
- Cherian, A.; Sra, S.; Hartley, R. Sequence Summarization Using Order-constrained Kernelized Feature Subspaces. arXiv 2017, arXiv:1705.08583.
- LeCun, Y.; Boser, B.E.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.E.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551.
- Wang, J.; Li, Y.; Shan, J.; Bao, J.; Zong, C.; Zhao, L. Large-Scale Text Classification Using Scope-Based Convolutional Neural Network: A Deep Learning Approach. IEEE Access 2019, 7, 171548–171558.
- Srivastava, G.; Kumar, C.V.; Kavitha, V.; Parthiban, N.; Venkataramanparthiban, R. Two-Stage Data Encryption using Chaotic Neural Networks. J. Intell. Fuzzy Syst. 2019.
- Brendel, W.; Todorovic, S. Learning spatiotemporal graphs of human activities. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2011), Barcelona, Spain, 6–13 November 2011; pp. 778–785.
- Ma, S.; Zhang, J.; Ikizler-Cinbis, N.; Sclaroff, S. Action Recognition and Localization by Hierarchical Space-Time Segments. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013), Sydney, Australia, 1–8 December 2013; pp. 2744–2751.
- Weinzaepfel, P.; Harchaoui, Z.; Schmid, C. Learning to Track for Spatio-Temporal Action Localization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; pp. 3164–3172.
- Lan, T.; Chen, L.; Deng, Z.; Zhou, G.; Mori, G. Learning Action Primitives for Multi-level Video Event Understanding. In Proceedings of the Computer Vision - ECCV 2014 Workshops, Zurich, Switzerland, 6–7 and 12 September 2014; pp. 95–110.
- Ma, S.; Zhang, J.; Sclaroff, S.; Ikizler-Cinbis, N.; Sigal, L. Space-Time Tree Ensemble for Action Recognition and Localization. Int. J. Comput. Vis. 2018, 126, 314–332.
- Zitnick, C.L.; Dollár, P. Edge Boxes: Locating Object Proposals from Edges. In Proceedings of the Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 391–405.
- Arbeláez, P.A.; Pont-Tuset, J.; Barron, J.T.; Marqués, F.; Malik, J. Multiscale Combinatorial Grouping. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 328–335.
- Erhan, D.; Szegedy, C.; Toshev, A.; Anguelov, D. Scalable Object Detection Using Deep Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 2155–2162.
- Hosang, J.H.; Benenson, R.; Dollár, P.; Schiele, B. What Makes for Effective Detection Proposals? IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 814–830.
- Feng, Y.; Ma, L.; Liu, W.; Luo, J. Spatio-Temporal Video Re-Localization by Warp LSTM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–20 June 2019; pp. 1288–1297.
- Feng, Y.; Ma, L.; Liu, W.; Zhang, T.; Luo, J. Video Re-localization. In Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; pp. 55–70.
- Huang, X.; Zhang, J.; Wu, Q.; Fan, L.; Yuan, C. A Coarse-to-Fine Algorithm for Matching and Registration in 3D Cross-Source Point Clouds. IEEE Trans. Circuits Syst. Video Techn. 2018, 28, 2965–2977.
- Zhao, L.; Al-Dubai, A.; Zomaya, A.Y.; Min, G.; Hawban, A.; Li, J. Routing Schemes in Software-defined Vehicular Networks: Design, Open Issues and Challenges. IEEE Intell. Transp. Syst. Mag (Early Access). 2020.
- Hawbani, A.; Torbosh, E.; Wang, X.; Sincak, P.; Zhao, L.; Al-Dubai, A. Fuzzy based Distributed Protocol for Vehicle to Vehicle Communication. IEEE Trans. Fuzzy Syst (Early Access). 2020.
- Yeom, S. Multi-Level Segmentation of Infrared Images with Region of Interest Extraction. Int. J. Fuzzy Log. Intell. Syst. 2016, 16, 246–253.
- Huang, X.; Yuan, C.; Zhang, J. Graph Cuts Stereo Matching Based on Patch-Match and Ground Control Points Constraint. In Proceedings of the Pacific-Rim Conference on Multimedia (PCM 2015), Gwangju, South Korea, 16–18 September 2015; pp. 14–23.
- Huang, X.; Zhang, J.; Fan, L.; Wu, Q.; Yuan, C. A Systematic Approach for Cross-Source Point Cloud Registration by Preserving Macro and Micro Structures. IEEE Trans. Image Process. 2017, 26, 3261–3276.
- Cai, X.; Shang, J.; Jin, Z.; Liu, F.; Qiang, B.; Xie, W.; Zhao, L. DBGE: Employee Turnover Prediction based on Dynamic Bipartite Graph Embedding. IEEE Access 2020.
- Srivastava, G.; Citulsky, E.; Tilbury, K. The Effects of Ant Colony Optimization on the Anonymization of Graphs. J. Comput. (JoC) 2016, 5, 92–101.
- Srivastava, G.; Shumay, M.; Citulsky, E. Social Network Anonymity using Ant Colony Systems. In Proceedings of the International Conference on Computer Games, Multimedia & Allied Technology (CGAT), Singapore, 10–11 April 2017; pp. 64–73.
Color Space | mAP(%) |
---|---|
RGB | 53.7 |
nRGB | 54.4 |
Lab | 53.5 |
rgI | 52.1 |
HSV | 47.6 |
With or Without Size Factor | Weight of Color Space | Weight of Saliency | Weight of Size | mAP(%) |
---|---|---|---|---|
with size factor | 0.33 | 0.33 | 0.33 | 54.4 |
without size factor | 0.5 | 0.5 | 0 | 49.1 |
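
The three weight columns in the table above are consistent with a weighted sum of per-cue similarity scores when scoring a candidate region merge. The exact formula is not given here, so the following is only a minimal sketch under that assumption: the function name `merge_score`, the three similarity inputs, and their assumed range of [0, 1] are all hypothetical, and only the weight values come from the table.

```python
def merge_score(color_sim, saliency_sim, size_sim,
                weights=(0.33, 0.33, 0.33)):
    """Combine three similarity cues into one score (assumed weighted sum).

    The default weights are the 'with size factor' row of the table;
    the weighted-sum form itself is an assumption, not taken from the source.
    """
    w_color, w_saliency, w_size = weights
    return (w_color * color_sim
            + w_saliency * saliency_sim
            + w_size * size_sim)


# Hypothetical similarity values for one candidate pair of regions.
with_size = merge_score(0.8, 0.6, 0.2)
# 'without size factor' row: the size cue is ignored entirely.
without_size = merge_score(0.8, 0.6, 0.2, weights=(0.5, 0.5, 0.0))
```

With the size weight set to zero, the score depends only on the color and saliency cues, which is the configuration the table reports as scoring 49.1% mAP versus 54.4% with the size factor.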
Feature | Precision(%) | Recall(%) | mAP(%) |
---|---|---|---|
HOG | 43.5 | 38.2 | 59.0 |
HOF | 45.5 | 40.8 | 59.4 |
MBHx | 45.2 | 39.7 | 61.2 |
MBHy | 45.3 | 41.5 | 66.6 |
MBH | 49.0 | 42.5 | 67.4 |
Combined | 52.0 | 46.0 | 69.8 |
Method | mAP(%) |
---|---|
Holistic + Pose [1] | 57.9 |
Holistic Dense Trajectories [1] | 59.2 |
Hierarchical Mid-Level Actions [11] | 66.8 |
Interaction Part Mining (Max Pooling) [10] | 69.1 |
Interaction Part Mining (Max-N Pooling) [10] | 72.4 |
Semantic Features [9] | 70.5 |
IDT-FV [31] | 67.6 |
P-CNN [31] | 62.3 |
Binary Hashing Convolutional Neural Network (CNN) Features [32] | 63.8 |
Order-constrained Kernelized Feature [33] | 53.0 |
Our Model | 69.8 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Liu, F.; Zhao, L.; Cheng, X.; Dai, Q.; Shi, X.; Qiao, J. Fine-Grained Action Recognition by Motion Saliency and Mid-Level Patches. Appl. Sci. 2020, 10, 2811. https://doi.org/10.3390/app10082811