Hi-EADN: Hierarchical Excitation Aggregation and Disentanglement Frameworks for Action Recognition Based on Videos
Abstract
1. Introduction
- We propose a novel hierarchical excitation aggregation and disentanglement network (Hi-EADN) whose multiple frames excitation aggregation (MFEA) module better models long- and short-range motion by computing feature-level spatial-temporal differences across frames (a minimal sketch of this idea follows these highlights).
- The squeeze-and-excitation hierarchical disentanglement (SEHD) module excites motion-sensitive channels through a series of squeeze-and-excitation operations and enhances the interaction between different motion frames. It also aggregates spatial-temporal information from neighbouring frames, efficiently enlarging the range of short- and long-term motion that can be modelled.
- The two components of Hi-EADN are complementary: together they realize multiple information exchanges between adjacent frames and increase the interaction among neighbours. The proposed method outperforms other state-of-the-art methods on the UCF101 and HMDB51 benchmark datasets.
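The core of MFEA, as described above, is a feature-level spatial-temporal difference between sampled frames that is then aggregated back into the appearance features. The sketch below illustrates that general idea only; the tensor shapes, the zero-padding of the last frame, and the simple additive aggregation are assumptions for illustration, not the exact formulation used in the paper.

```python
import torch

def temporal_feature_difference(frame_feats: torch.Tensor) -> torch.Tensor:
    """Feature-level differences between adjacent frames.

    frame_feats: hypothetical per-frame feature tensor of shape (T, C, H, W).
    Returns a motion-sensitive tensor of the same shape; the last position is
    zero-padded because it has no successor frame to compare against.
    """
    diff = frame_feats[1:] - frame_feats[:-1]   # (T-1, C, H, W) adjacent differences
    pad = torch.zeros_like(frame_feats[:1])     # keep the temporal length unchanged
    return torch.cat([diff, pad], dim=0)

# Usage: fuse the motion cue with the original appearance features.
feats = torch.randn(8, 64, 28, 28)              # T = 8 sampled frames
motion = temporal_feature_difference(feats)
aggregated = feats + motion                     # simple excitation-style aggregation
print(aggregated.shape)                          # torch.Size([8, 64, 28, 28])
```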
2. Related Work
3. Our Proposed Method
3.1. Multiple Frames Excitation Aggregation (MFEA) Module
3.2. Squeeze-and-Excitation Hierarchical Disentanglement (SEHD) Module
Sharing the Squeeze-and-Excitation Layer
3.3. Hierarchical Integration Strategy
3.4. Reconstruction of the Loss Function
Algorithm 1: The recognition process of the proposed Hi-EAD framework for video-based action recognition.
4. Experimental Results and Related Discussion
4.1. Dataset Preparation
4.2. Training Parameters and Environment Configuration
4.3. Evaluation Metrics and Baseline Methods
- GD-RN [10]. A framework that integrates a Gaussian mixture model with a dilated-convolution residual network (GD-RN); it uses ResNet-101 to extract multi-scale features and then recognizes actions in videos.
- T3D-ConvNet [11]. An end-to-end two-stream ConvNet fusion network that recognizes actions in videos and captures multiple features of the action.
- SENet [9]. The method uses fully connected layers within SE blocks to produce modulation weights from the original features and applies these weights to rescale the features, improving recognition accuracy (see the SE-block sketch after this list).
- TS-ConvNet [12]. A two-stream ConvNet architecture that combines spatial and temporal networks, demonstrating that a ConvNet trained on multi-frame dense optical flow can perform well despite limited training data.
- CNN+LSTM [13]. The approach employs a recurrent neural network with long short-term memory (LSTM) cells connected to the output of the underlying CNN to recognize human actions in videos.
- BCA [14]. Concatenates the outputs of multiple attention units applied in parallel via attention clusters and introduces a shifting operation to capture more diverse signals.
- ISPA [15]. An interaction-aware spatio-temporal pyramid attention network that learns the correlation of local features at neighbouring spatial positions.
- ST Multiplier [16]. A general ConvNet architecture for video action recognition based on multiplicative interactions of space-time features; the appearance and motion pathways of a two-stream architecture are combined by motion gating and trained end-to-end.
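As referenced in the SENet entry above, a generic squeeze-and-excitation block can be sketched as follows: global average pooling squeezes each channel to a scalar, two fully connected layers produce per-channel modulation weights, and the input features are rescaled by those weights. The reduction ratio of 16 is a common default, not a value taken from the baseline or from Hi-EADN's SEHD module.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic squeeze-and-excitation block (illustrative, not the paper's exact configuration)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))                   # squeeze: (B, C) descriptors via global average pooling
        weights = self.fc(squeezed).view(b, c, 1, 1)    # excitation: per-channel weights in (0, 1)
        return x * weights                              # rescale the original features

# Usage
x = torch.randn(2, 64, 28, 28)
print(SEBlock(64)(x).shape)                             # torch.Size([2, 64, 28, 28])
```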
4.4. Experimental Results and Analysis
4.4.1. Comparison with Other State-of-the-Art Methods
4.4.2. Discussion of the Components
4.4.3. Analysis of the Interaction between Hierarchical and Cross-Modal Information
4.5. Ablation Study and Discussion
4.5.1. The Number of Training Samples
4.5.2. The Number of SE and SA Blocks
4.5.3. Visual Demonstration of Hi-EAD
4.6. Effectiveness of the Loss Function
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
Hi-EAD | Hierarchical Excitation Aggregation and Disentanglement Networks
MFEA | Multiple Frames Excitation Aggregation
SEHD | Squeeze-and-Excitation Hierarchical Disentanglement
GAP | Global Average Pooling
SA | Multi-Head Self-Guided Attention
References
- Yang, C.; Xu, Y.; Shi, J.; Dai, B.; Zhou, B. Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 591–600.
- Saponara, S.; Greco, M.S.; Gini, F. Radar-on-chip/in-package in autonomous driving vehicles and intelligent transport systems: Opportunities and challenges. IEEE Signal Process. Mag. 2019, 36, 71–84.
- An, F.P. Human action recognition algorithm based on adaptive initialization of deep learning model parameters and support vector machine. IEEE Access 2018, 6, 59405–59421.
- Yang, H.; Yuan, C.; Li, B.; Du, Y.; Xing, J.; Hu, W.; Maybank, S.J. Asymmetric 3D convolutional neural networks for action recognition. Pattern Recognit. 2019, 85, 1–12.
- Chen, X.; Weng, J.; Lu, W.; Xu, J.; Weng, J. Deep manifold learning combined with convolutional neural networks for action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 3938–3952.
- Jing, C.; Wei, P.; Sun, H.; Zheng, N. Spatiotemporal neural networks for action recognition based on joint loss. Neural Comput. Appl. 2020, 32, 4293–4302.
- Li, J.; Liu, X.; Zhang, W.; Zhang, M.; Song, J.; Sebe, N. Spatio-temporal attention networks for action recognition and detection. IEEE Trans. Multimed. 2020, 22, 2990–3001.
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010.
- Peng, Y.; Shu, T.; Lu, H. Weak integration of form and motion in two-stream CNNs for action recognition. J. Vis. 2020, 20, 615.
- Lin, Y.; Chi, W.; Sun, W.; Liu, S.; Fan, D. Human Action Recognition Algorithm Based on Improved ResNet and Skeletal Keypoints in Single Image. Math. Probl. Eng. 2020, 2020, 6954174.
- Bose, S.R.; Kumar, V.S. An Efficient Inception V2 based Deep Convolutional Neural Network for Real-Time Hand Action Recognition. IET Image Process. 2019, 14, 688–696.
- Li, W.; Feng, C.; Xiao, B.; Chen, Y. Binary Hashing CNN Features for Action Recognition. TIIS 2018, 12, 4412–4428.
- Rahman, S.; See, J.; Ho, C.C. Deep CNN object features for improved action recognition in low quality videos. Adv. Sci. Lett. 2017, 23, 11360–11364.
- Cherian, A.; Gould, S. Second-order Temporal Pooling for Action Recognition. Int. J. Comput. Vis. 2019, 127, 340–362.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Seemanthini, K.; Manjunath, S.S. Human Detection and Tracking using HOG for Action Recognition. Procedia Comput. Sci. 2018, 132, 1317–1326.
- Chen, W.Q.; Xiao, G.Q.; Tang, X.Q. An Action Recognition Model Based on the Bayesian Networks. Appl. Mech. Mater. 2014, 513, 886–1889.
- Tran, D.T.; Yamazoe, H.; Lee, J.H. Multi-scale affined-HOF and dimension selection for view-unconstrained action recognition. Appl. Intell. 2020, 50, 1–19.
- Wang, L.; Koniusz, P.; Huynh, D.Q. Hallucinating Bag-of-Words and Fisher Vector IDT terms for CNN-based Action Recognition. arXiv 2019, arXiv:1906.05910.
- Wang, L.; Zhi-Pan, W.U. A Comparative Review of Recent Kinect-based Action Recognition Algorithms. arXiv 2019, arXiv:1906.09955.
- Jagadeesh, B.; Patil, C.M. Video based action detection and recognition human using optical flow and SVM classifier. In Proceedings of the 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), Bangalore, India, 20–21 May 2016; IEEE: New York, NY, USA, 2016.
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. Available online: https://dl.acm.org/doi/10.1109/CVPR.2014.223 (accessed on 10 March 2021).
- Patil, G.G.; Banyal, R.K. Techniques of Deep Learning for Image Recognition. In Proceedings of the 2019 IEEE 5th International Conference for Convergence in Technology (I2CT), Pune, India, 29–31 March 2019; IEEE: New York, NY, USA, 2019.
- Kang, B.R.; Lee, H.; Park, K.; Ryu, H.; Kim, H.Y. BshapeNet: Object Detection and Instance Segmentation with Bounding Shape Masks. Pattern Recognit. Lett. 2020, 131, 449–455.
- Sungheetha, A.; Rajesh, S.R. Comparative Study: Statistical Approach and Deep Learning Method for Automatic Segmentation Methods for Lung CT Image Segmentation. J. Innov. Image Process. 2020, 2, 187–193.
- Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv 2014, arXiv:1406.2199.
- Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 677–691.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
- Feichtenhofer, C.; Pinz, A.; Wildes, R.P. Spatiotemporal Residual Networks for Video Action Recognition. 2017. Available online: https://papers.nips.cc/paper/2016/file/3e7e0224018ab3cf51abb96464d518cd-Paper.pdf (accessed on 10 March 2021).
- Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4694–4702.
- Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based Action Recognition with Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017.
- Liao, X.; He, L.; Yang, Z.; Zhang, C. Video-based Person Re-identification via 3D Convolutional Networks and Non-local Attention. In Asian Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2018.
- Kalfaoglu, M.E.; Kalkan, S.; Alatan, A. Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020.
- Anvarov, F.; Kim, D.H.; Song, B.C. Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention. Electronics 2020, 9, 147.
- Jalal, M.A.; Aftab, W.; Moore, R.K.; Mihaylova, L. Dual stream spatio-temporal motion fusion with self-attention for action recognition. In Proceedings of the 22nd International Conference on Information Fusion, Ottawa, ON, Canada, 2–5 July 2019.
- Purwanto, D.; Pramono, R.R.; Chen, Y.T.; Fang, W.H. Three-Stream Network with Bidirectional Self-Attention for Action Recognition in Extreme Low-Resolution Videos. IEEE Signal Process. Lett. 2019, 26, 1187–1191.
- Yu, T.; Guo, C.; Wang, L.; Gu, H.; Xiang, S.; Pan, C. Joint Spatial-Temporal Attention for Action Recognition. Pattern Recognit. Lett. 2018, 112, 226–233.
Model | UCF-101 Top-1 (%) | UCF-101 Top-5 (%) | HMDB-51 Top-1 (%) | HMDB-51 Top-5 (%)
---|---|---|---|---
GD-RN | 68.49 | 91.45 | 49.8 | 72.95
T3D-ConvNet | 67.92 | 88.51 | 47.16 | 71.33
SENet | 66.63 | 87.12 | 42.73 | 68.83
TS-ConvNet | 59.91 | 80.88 | 41.98 | 67.11
CNN+LSTM | 47.21 | 71.89 | 33.78 | 58.41
BCA | 59.01 | 80.46 | 39.01 | 63.34
ISPA | 54.85 | 75.83 | 35.83 | 59.42
ST Multiplier | 65.37 | 84.5 | 39.64 | 65.72
Hi-EAD (T = 8) | 71.97 | 93.5 | 58.78 | 76.96
Model | UCF-101 Top-1 (%) | UCF-101 Top-5 (%) | HMDB-51 Top-1 (%) | HMDB-51 Top-5 (%)
---|---|---|---|---
MFEA | 57.37 | 76.9 | 40.16 | 58.08
SEHD | 58.91 | 77.27 | 40.38 | 59.53
DenseNet(201) | 60.06 | 78.57 | 42.91 | 61.57
DenseNet(121) | 62.47 | 81.26 | 43.44 | 63.27
DenseNet(161) | 64.75 | 83.39 | 43.65 | 66.19
Hi-EAD (No-MFEA) | 68.75 | 84.64 | 44.55 | 66.49
Hi-EAD (No-SEHD) | 68.97 | 84.93 | 48.98 | 66.5
Hi-EAD (No-SE) | 69.97 | 85.54 | 50.75 | 72.01
Hi-EAD (No-SA) | 70.28 | 86.21 | 55.81 | 73.53
Hi-EAD (T = 16) | 70.65 | 87.44 | 56.16 | 74.01
Hi-EAD (T = 4) | 71.11 | 90.70 | 56.42 | 75.33
Hi-EAD (T = 8) | 71.97 | 93.5 | 58.78 | 76.96
Model | UCF-101 Top-1 (%) | UCF-101 Top-5 (%) | HMDB-51 Top-1 (%) | HMDB-51 Top-5 (%)
---|---|---|---|---
Hi-EAD (No-CMI) | 65.2 | 88.36 | 48.23 | 68.89
Hi-EAD (No-Hi) | 66.6 | 89.66 | 49.38 | 70.63
Hi-EAD (No-HR) | 67.18 | 90.96 | 51.15 | 72.39
Hi-EAD (MLP+FC) | 69.64 | 91.14 | 53.13 | 73.66
Hi-EAD (Conv1+GAP+FC) | 70.7 | 92.73 | 55.75 | 74.26
Hi-EAD (T = 8) | 71.97 | 93.5 | 58.78 | 76.96
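The tables above report Top-1 and Top-5 accuracy on UCF-101 and HMDB-51. For reference, a minimal sketch of how Top-k accuracy is typically computed is given below; it is not code from the paper, and the logits and labels are synthetic.

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices                 # (N, k) predicted class indices
    correct = (topk == labels.unsqueeze(1)).any(dim=1)   # True if the label is in the top k
    return correct.float().mean().item()

# Synthetic example with 4 samples and 101 classes (as in UCF-101).
logits = torch.randn(4, 101)
labels = torch.randint(0, 101, (4,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```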