Action recognition is an active research field that aims to recognize human actions and intentions from a series of observations of human behavior and the environment. Unlike image-based action recognition mainly using a two-dimensional (2D) convolutional neural network (CNN), one of the difficulties in video-based action recognition is that video action behavior should be able to characterize both short-term small movements and long-term temporal appearance information. Previous methods aim at analyzing video action behavior only using a basic framework of 3D CNN. However, these approaches have a limitation on analyzing fast action movements or abruptly appearing objects because of the limited coverage of convolutional filter. In this paper, we propose the aggregation of squeeze-and-excitation (SE) and self-attention (SA) modules with 3D CNN to analyze both short and long-term temporal action behavior efficiently. We successfully implemented SE and SA modules to present a novel approach to video action recognition that builds upon the current state-of-the-art methods and demonstrates better performance with UCF-101 and HMDB51 datasets. For example, we get accuracies of 92.5% (16f-clip) and 95.6% (64f-clip) with the UCF-101 dataset, and 68.1% (16f-clip) and 74.1% (64f-clip) with HMDB51 for the ResNext-101 architecture in a 3D CNN.
This is an open access article distributed under the Creative Commons Attribution License
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited