Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention

Abstract: Action recognition is an active research field that aims to recognize human actions and intentions from a series of observations of human behavior and the environment. Unlike image-based action recognition, which mainly uses a two-dimensional (2D) convolutional neural network (CNN), video-based action recognition must characterize both short-term small movements and long-term temporal appearance information. Previous methods analyze video action behavior using only a basic 3D CNN framework. However, these approaches are limited in analyzing fast action movements or abruptly appearing objects because of the limited coverage of the convolutional filter. In this paper, we propose the aggregation of squeeze-and-excitation (SE) and self-attention (SA) modules with a 3D CNN to analyze both short- and long-term temporal action behavior efficiently. We implemented SE and SA modules to present a novel approach to video action recognition that builds upon the current state-of-the-art methods and demonstrates better performance on the UCF-101 and HMDB51 datasets. For example, we obtain accuracies of 92.5% (16f-clip) and 95.6% (64f-clip) on UCF-101, and 68.1% (16f-clip) and 74.1% (64f-clip) on HMDB51, for the ResNext-101 architecture in a 3D CNN.


Introduction
One of the main objectives of artificial intelligence is to build a model that can accurately learn human actions and intentions [1]. Human action recognition is important because it has been applied in various domains, such as surveillance systems, health care systems, and social robots. Recently, three-dimensional (3D) convolutional neural networks (CNNs) with spatiotemporal convolutional kernels have achieved better action recognition performance than 2D CNNs, whose kernels cover only the spatial dimensions. The representative research in video-based action recognition is based on two-stream architectures [2], recurrent neural networks (RNN) [3], or spatiotemporal convolutions [4,5]. Two-stream approaches use two separate CNNs, one using red-green-blue (RGB) data, and the other using optical flow images to deal with movement.
Recently, most research has relied on modeling motion and temporal structure. The temporal segment network (TSN) [6] uses sparse segment sampling to model long-range temporal structure. Other 3D CNN methods [4,7,8] address temporal modeling by adding a convolution dimension along the temporal axis, with the aim that the models learn hierarchical motion patterns as in the image space. In [9,10], temporal modeling is further enhanced by tracking feature points and body joints over time. These methods demonstrated the advantages of the 3D CNN over the 2D CNN because of its ability to learn spatiotemporal characteristics. However, all previous methods are limited in analyzing action changes in consecutive frames because of the limited coverage of the convolutional filter. Since action recognition is a fine-grained recognition problem, identifying and differentiating small changes in consecutive frames and connecting all frames logically is important. Another challenge is that applying a 3D CNN to action recognition causes overfitting because of the huge number of parameters. To alleviate the overfitting problem, Carreira and Zisserman introduced the Kinetics dataset [11], which is large enough to train a 3D CNN successfully. Moreover, Hara et al. [12] trained residual architectures on four different action datasets and achieved outstanding results. However, the main aim of [12] was to check whether the datasets are large enough to support the huge number of parameters of a 3D CNN in action recognition. We therefore believe that more complex sequential-based architectures pretrained on large datasets could achieve better results.
Drawing inspiration from [12], we propose a sequential version of the squeeze-and-excitation (SE) and self-attention (SA) modules, validate the aggregation of both modules, and propose a new method for recognizing video action behavior. We assume that logically connecting spatial information with the temporal stream provides a better understanding of action behavior. Figure 1a shows ResNext-101 and ResNet-18. We chose these residual architectures because additional modules are simple to attach to them. Each architecture consists of four blocks, and each block consists of a convolution filter, batch normalization, Rectified Linear Unit (ReLU) activation, and max pooling. Figure 1b shows the residual architecture with our proposed modules, i.e., SE and SA. We placed the SE and SA modules after the last layer of each block. The SE module for channels comes first, followed by the SE module for the sequence (number of frames), and lastly the SA module. Applying the modules in exactly that order gave us the best performance. The reason is that the SE module, which explicitly captures the correlation between channels of convolutional features, first learns to selectively emphasize informative features and suppress less useful ones. After that, the SA module computes the response as a weighted sum of the features at all positions, which now contain more useful features (since the SE module has suppressed the less useful ones). In this way, the SE module improves the channel and sequence interdependencies, and the SA module attends to each part of the image, which provides useful information for identifying and differentiating small changes in consecutive frames. Note that the detailed architecture of the proposed network is given in Appendix A, as in [13]. To the best of our knowledge, this is the first time that sequential-based action recognition has used an adaptive feature aggregation scheme with large pretrained weights. In summary, our contributions are as follows:


We propose a sequential version of SE and SA modules and apply them to create a new approach for efficiently analyzing action behavior on 3D CNN.
We validate our proposed modules in both quantitative and qualitative ways and attain state-of-the-art results with a marginal computation.

Figure 1. The overall framework of our network. Squeeze-and-excitation and self-attention follow after each layer. The number of blocks is four.

Squeeze-and-Excitation
SE is a module designed to improve a network's representation ability by allowing dynamic channel-specific recalibration. VGG-Nets from the Visual Geometry Group at the University of Oxford [14] and Inception [15] showed that increasing the depth of a network can considerably increase the quality of representation. The same idea can be applied to the SE module to enhance the representation ability of feature analysis. Hu et al. [16] proposed a mechanism for explicitly modeling dynamic, non-linear dependencies between channels using global information, which can facilitate the learning process and enhance the representation ability of networks. Building on the earlier success of SE, we show how to construct a sequential version of the SE module and how it can be useful for video action recognition.

Self-Attention
SA is a commonly used module in computer vision and other fields of artificial intelligence. For example, the SA module has been widely used in many tasks, such as machine translation, skeleton hand-gesture recognition, and task-independent sentence representation [17][18][19]. Vaswani et al. [17] implemented a self-attention module in machine translation to analyze semantic and temporal relationships among words within a sentence. Chen et al. [18] demonstrated a self-attention mechanism represented by graphs to learn the spatiotemporal information contained in hand skeletons. Wang et al. [8] formalized self-attention as a non-local operation to define the spatiotemporal dependencies in a video sequence. Despite such progress, the self-attention module has not yet been implemented for video-based action recognition. Therefore, in the next section, we describe the basic SE and SA modules and how to modify them for sequential information.
In this section, we discuss the basics of the SE and SA modules and how we apply them to a 3D CNN. Moreover, we analyze the importance of active movement as one way of exploiting both short-term and long-term temporal information.

Proposed SE Module
SE is a module for enhancing channel interdependencies at a low computational cost. An SE module can improve feature representation by explicitly modeling channel interdependencies so that the network can increase its sensitivity to informative features. More specifically, we give the network access to global information and recalibrate filter responses in two steps, i.e., squeeze and excitation, before passing them to the next transformation.
We first consider the signal to each channel of the output features to address the problem of exploiting channel dependencies. Since each convolutional filter operates only within a local receptive field and cannot use information outside of this region, a global spatial squeeze operation is implemented to solve this problem. We use global average pooling to squeeze each channel into a single numeric value. The formula of the squeeze operation is:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W \times S} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{S} u_c(i, j, k)$$

where the transformation output, $U = [u_1, u_2, \ldots, u_C]$, is a collection of local descriptors that are expressive for the entire video. $H$, $W$, and $S$ represent height, width, and sequence, respectively. Bias terms are omitted to simplify the notation. To use the information aggregated by the squeeze operation, we follow it with a second operation, excitation, which aims to fully capture the channel-wise dependencies:

$$s = F_{ex}(z, W) = \sigma(W_2 \, \delta(W_1 z))$$

where $\delta$ represents the ReLU activation function, $\sigma$ is the sigmoid function, $F_{ex}$ is the excitation operation, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$.
In contrast to the original SE for a 2D CNN [16], we consider not only the channel but also the number of frames (Figure 2). We concatenate the channel and sequence operations sequentially after each layer in our network, which makes our module more effective. Figure 3 (left) shows the SE module for image-based action recognition, and Figure 3 (middle) shows the SE (channel) and SE (sequence) parts of the proposed method for video-based action recognition, which consider both the channel and the sequence of frames. Figure 3 (right) is an abstract block diagram of Figure 3 (middle), which consists of two parts. Each part consists of global pooling, two fully connected (FC) layers, ReLU, and sigmoid. In the first part, we apply the SE block to the channels. When we implement the squeeze operation, we reduce all dimensions (width, height, and sequence) to 1 except for the channel dimension, so SE operates only on the channels. The output of the SE block for channels ($\tilde{X}$) is obtained by rescaling with the activation $s$:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

where $F_{scale}(u_c, s_c)$ is the channel-wise multiplication of the scalar $s_c$ and the feature map $u_c \in \mathbb{R}^{H \times W \times S}$.
In the second part, to apply the same operation to the sequence, we swap the channel and sequence dimensions using the transpose function. Then the same process is applied as in the first part, but this time for the sequence. The output of the SE block for the sequence is obtained by:

$$\tilde{x}_s = F_{scale}(u_s, s_s) = s_s \cdot u_s$$

where $F_{scale}(u_s, s_s)$ is the sequence-wise multiplication of the scalar $s_s$ and the feature map $u_s \in \mathbb{R}^{H \times W \times C}$.

Figure. Squeeze-and-excitation (SE) module. We divide the SE module into two parts: channels and sequences. S represents sequence, and C represents channel. Reduction values: r = 16, d = 2.
In the end, we again apply the transpose function to restore the original order of width, height, sequence, and channels. The transpose is needed because further operations require the output of the SE block to have the same shape as its input. Then, we add the output of the transpose function to the initial X, which gives the final output. The reduction numbers (d and r) are hyper-parameters that allow us to vary the capacity and computational cost of the SE blocks in the network. To provide a good balance between performance and computational cost, we conducted experiments with the SE block for different r values. Setting r = 16 (for channel) and d = 2 (for sequence) gave us the best trade-off between performance and complexity. Note that by applying squeeze-and-excitation to both channel and sequence, we can consider both long- and short-term action correlations adaptively.
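The channel-then-sequence recalibration described above can be illustrated with a small NumPy sketch. This is only a toy illustration under stated assumptions, not the implementation used in the experiments: the feature layout (C, S, H, W), the helper names `se_gate`, `sequential_se`, and `init_params`, the random weight initialization, and the guard that keeps the bottleneck width at least 1 are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_gate(z, w1, w2):
    """Excitation step: sigmoid(W2 @ relu(W1 @ z)); biases omitted as in the text."""
    hidden = np.maximum(w1 @ z, 0.0)   # first FC layer followed by ReLU
    return sigmoid(w2 @ hidden)        # second FC layer followed by sigmoid, in (0, 1)

def init_params(C, S, r=16, d=2, seed=0):
    """Random toy weights. r and d are the channel and sequence reduction values."""
    rng = np.random.default_rng(seed)
    c_mid = max(C // r, 1)             # assumption: never let the bottleneck reach 0
    s_mid = max(S // d, 1)
    return {
        "wc1": rng.standard_normal((c_mid, C)) * 0.1,
        "wc2": rng.standard_normal((C, c_mid)) * 0.1,
        "ws1": rng.standard_normal((s_mid, S)) * 0.1,
        "ws2": rng.standard_normal((S, s_mid)) * 0.1,
    }

def sequential_se(x, params):
    """x has the assumed layout (C, S, H, W); the output has the same shape."""
    # Part 1, channel SE: squeeze S, H, W to one value per channel, then rescale.
    z_c = x.mean(axis=(1, 2, 3))                       # shape (C,)
    s_c = se_gate(z_c, params["wc1"], params["wc2"])   # shape (C,)
    x_c = x * s_c[:, None, None, None]

    # Part 2, sequence SE: swap channel and sequence axes and repeat the process.
    x_t = x_c.transpose(1, 0, 2, 3)                    # shape (S, C, H, W)
    z_s = x_t.mean(axis=(1, 2, 3))                     # shape (S,)
    s_s = se_gate(z_s, params["ws1"], params["ws2"])   # shape (S,)
    x_t = x_t * s_s[:, None, None, None]

    # Transpose back to (C, S, H, W) and add the initial input.
    return x_t.transpose(1, 0, 2, 3) + x

rng = np.random.default_rng(0)
x = rng.random((64, 8, 7, 7)) + 0.1    # toy positive features: C=64, S=8, H=W=7
y = sequential_se(x, init_params(C=64, S=8))
print(y.shape)
```

Because both gates lie strictly between 0 and 1, each output value in this sketch is the input plus a down-weighted copy of itself, which makes the rescaling easy to inspect on toy data.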

Self-Attention (SA) Module
Self-attention is a module that calculates the response as a weighted sum of the features at all positions. The main idea of self-attention is to help convolutions capture long-range, multi-level dependencies throughout the image domain. A network equipped with a self-attention module can relate small details at each position to fine details in distant areas of the image [20][21][22].
Our task is to extend the SA idea to a 3D CNN; more concretely, we implemented self-attention for multiple frames (a sequence). Unlike [23], where the SA module is applied to a single image, we examine a multi-frame module for video frames. Since we apply the SA block to every image in a sequence (16 or 64 frames), the benefit of the SA module is greater than in the single-frame case, and the overall performance is higher.
Most videos in UCF-101 and HMDB51 involve interactions between humans and items. Previous methods focused only on a single action [23], but a sequence cannot always represent the interaction exactly. We implemented self-attention for sequential actions to better understand the interactions of humans and items, since the environment around humans is also an important part of defining actions. Since we consider a 3D CNN with multiple frames, the SA module helps to use all images to make a better prediction based on the action aspects in each image, and helps us use this knowledge to connect all frames logically. Figure 4 shows our self-attention module for a 3D CNN. Compared to the basic self-attention module, we concatenate these operations sequentially and embed them in our SA module for the 3D CNN, as shown in Figure 1.
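As a rough illustration of self-attention applied to every frame of a sequence, the following NumPy sketch computes, for each spatial position, a weighted sum of the features at all positions of the same frame. It is a simplified sketch rather than the module of Figure 4: the query, key, and value projections `wq`, `wk`, `wv`, the residual scale `gamma`, the (C, S, H, W) layout, and the function names are assumptions introduced for the example.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def frame_self_attention(frame, wq, wk, wv, gamma=1.0):
    """frame: (C, H, W). Returns an array of the same shape."""
    C, H, W = frame.shape
    flat = frame.reshape(C, H * W)            # (C, N) with N = H * W positions
    q = wq @ flat                             # queries, (Ck, N)
    k = wk @ flat                             # keys,    (Ck, N)
    v = wv @ flat                             # values,  (C, N)
    attn = softmax(q.T @ k, axis=-1)          # (N, N); row i holds weights over all positions
    out = v @ attn.T                          # out[:, i] = sum_j attn[i, j] * v[:, j]
    return (gamma * out + flat).reshape(C, H, W)

def sequence_self_attention(x, wq, wk, wv, gamma=1.0):
    """x: (C, S, H, W). Applies the same attention block to each of the S frames."""
    frames = [frame_self_attention(x[:, s], wq, wk, wv, gamma)
              for s in range(x.shape[1])]
    return np.stack(frames, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 5, 5))         # toy features: C=4, S=3, H=W=5
wq = rng.standard_normal((2, 4))
wk = rng.standard_normal((2, 4))
wv = np.eye(4)
y = sequence_self_attention(x, wq, wk, wv)
print(y.shape)
```

With `gamma = 0` the block reduces to the identity, so in this sketch the attention branch can be switched off without changing the rest of the network.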

Dataset and Training Configuration
We adopted the UCF-101 [24] and HMDB51 [25] datasets for evaluating the performance of our model. The UCF-101 dataset contains 13,320 video clips from 101 classes of human actions. Both datasets include three training/testing splits (70% and 30%, respectively). UCF-101 consists of unconstrained videos downloaded from YouTube with challenges such as poor lighting, cluttered backgrounds, and severe camera movement. To remove non-action frames, the videos were temporally trimmed. The average duration of each video is about seven seconds. The HMDB51 dataset contains 6766 videos from 51 human action classes. Similar to UCF-101, the videos were trimmed, here to an average length of three seconds. The training/testing split was the same as for UCF-101. The main differences between the two datasets are the number of classes and action instances, the dynamic backgrounds, and the camera movements.
Details of data preprocessing are as follows. We conducted experiments with the same data augmentation for 16 frames (16f-clip) and 64 frames (64f-clip). In training, we randomly sampled a 112 × 112 crop from a random clip of the given length and applied random horizontal flipping, which includes reversing the horizontal axis in the case of flow input. Since ResNext-101 achieved state-of-the-art performance on both the UCF-101 and HMDB51 datasets, we chose it as our main architecture. We trained with mini-batch stochastic gradient descent with a momentum of 0.9. The initial learning rate was 0.1 for the 16f-clip and 0.01 for the 64f-clip. Finally, we used the top-1 mean accuracy as the evaluation metric.
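The sampling and augmentation steps above can be sketched as follows. This is an illustrative sketch only: the (T, H, W, 3) video layout, the function names, and the flip probability of 0.5 are assumptions, and the momentum update shown is one common formulation of SGD with momentum rather than a confirmed detail of our training code.

```python
import numpy as np

def sample_training_clip(video, clip_len=16, crop=112, rng=None):
    """video: (T, H, W, 3) array with T >= clip_len and H, W >= crop.

    Returns a randomly positioned, randomly flipped clip of shape
    (clip_len, crop, crop, 3).
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H, W, _ = video.shape
    t0 = rng.integers(0, T - clip_len + 1)    # random temporal start (upper bound exclusive)
    y0 = rng.integers(0, H - crop + 1)        # random top edge of the crop
    x0 = rng.integers(0, W - crop + 1)        # random left edge of the crop
    clip = video[t0:t0 + clip_len, y0:y0 + crop, x0:x0 + crop]
    if rng.random() < 0.5:                    # assumed flip probability
        clip = clip[:, :, ::-1]               # reverse the width (horizontal) axis
    return clip

def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update in the classical velocity form."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

video = np.zeros((40, 128, 171, 3), dtype=np.uint8)   # toy stand-in for a decoded video
clip = sample_training_clip(video, clip_len=16, rng=np.random.default_rng(0))
print(clip.shape)
```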
To assess the proposed techniques, we conducted experiments on two distinct architectures: ResNext-101 and ResNet-18. Both architectures have four blocks. Compared to ResNet-18, ResNext-101 (using the 32 × 4d setting) adds an additional (cardinality) dimension to the blocks of the base network, which makes it more powerful at extracting convolutional feature maps [12]. We expanded the SE module of image-based models into video-based models to consider the number of frames. After the first, second, third, and fourth blocks, the number of frames in a sequence was 8, 4, 2, and 1, respectively (see Figure 1). Similar to the SE approach, we expanded the SA module of the 2D CNN to a 3D CNN to consider the number of frames (see Figure 1).
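The frame counts quoted above, and what they imply for the sequence SE bottleneck, can be checked with a few lines of Python. The sketch assumes that the temporal length is halved once per block, which reproduces the 8, 4, 2, 1 sequence for a 16-frame clip; the helper names and the guard that keeps the bottleneck width at least 1 are hypothetical.

```python
def frames_after_blocks(clip_len, num_blocks=4):
    """Temporal length after each block, assuming one halving per block."""
    lengths = []
    s = clip_len
    for _ in range(num_blocks):
        s = max(s // 2, 1)
        lengths.append(s)
    return lengths

def se_bottleneck_widths(lengths, d=2):
    """Hidden width of the sequence SE block for reduction value d (at least 1)."""
    return [max(s // d, 1) for s in lengths]

lengths_16f = frames_after_blocks(16)
print(lengths_16f)
print(se_bottleneck_widths(lengths_16f, d=2))
```

Under these assumptions, the sequence dimension after the fourth block of a 16f-clip has length 1, so the sequence SE gate there reduces to a single scalar per clip.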

Experimental Results
In this section, we provide experimental results and compare them with earlier state-of-the-art 3D CNN methods. Note that before training, the networks were initialized from models pretrained on the Kinetics400 dataset. In Tables 1 and 2, we see higher performance than in previous work in which SE and SA modules were not applied. The only difference between Tables 1 and 2 is the number of video frames used for training. As SE and SA modules were added, the performance improvements were 0.5% and 0.4%, respectively, on UCF-101 in Table 1. In addition, we can see the same performance increment on HMDB51 in Table 1. Most importantly, the aggregation of SE and SA achieved an additional 0.8% performance improvement. The performance tendency is consistent for 64-frame clips. We also performed experiments on ResNet-18 to test model generalization. Results with the ResNet-18 architecture are in Table 3. Note that the synergy effect of using both SE and SA is much higher than in the previous experiments (2.8% on UCF-101 and 2.6% on HMDB51). We can conclude that our approach yields a larger improvement for the shallower model than for the more complex one. All the obtained results indicate that both squeeze-and-excitation and self-attention were successfully implemented and worked well in the 3D CNN. These observations are compatible with our assumption that the self-attention layers are useful in capturing structural data and long-distance dependence.

The black values in precision represent how many times each label was predicted during the testing period. In other words, precision demonstrates qualitative results, while recall demonstrates quantitative results. The overall accuracies are 95.06% and 74.06% for the UCF-101 and HMDB51 datasets, respectively. The corresponding balanced accuracies are 95.03% and 74.05%.
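For reference, overall accuracy, balanced accuracy, and per-class precision and recall can all be derived from a confusion matrix, as in the following NumPy sketch. The definitions are the standard ones (balanced accuracy taken as the mean per-class recall); the function name and the toy 2 × 2 matrix are hypothetical and are not results from our experiments.

```python
import numpy as np

def classification_metrics(conf):
    """conf[i, j] = number of samples with true label i that were predicted as j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                        # correctly classified samples per class
    support = conf.sum(axis=1)                # number of true samples per class
    predicted = conf.sum(axis=0)              # how many times each label was predicted
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    accuracy = tp.sum() / conf.sum()
    balanced_accuracy = recall[support > 0].mean()   # mean recall over classes present
    return accuracy, balanced_accuracy, precision, recall

# Toy, imbalanced two-class example (not data from the paper).
toy = [[9, 1],
       [2, 2]]
acc, bal, prec, rec = classification_metrics(toy)
print(acc, bal, prec, rec)
```

When the classes are imbalanced, as in the toy matrix, the overall and balanced accuracies differ; nearly identical values, as reported above, suggest that performance is spread evenly across classes.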
According to the results of Figures 5 and 6, the model has almost the same discriminative ability for all classes (low variance), which means that our methods are effective at identifying action changes for all actions. In addition, we provide the normalized confusion matrices for all classes on the UCF-101 and HMDB51 datasets. From Figure 7 on UCF-101, we can see that our model performs very well on most categories. Categories that involve no interaction with items, such as 'Body Weight Squats', 'Handstand Pushups', and 'Jumping Jacks', give the best accuracies. However, our model misclassifies some samples from 'Walking', 'Diving', 'Golf Swing', and 'Soccer Penalty' because of similar action characteristics between actions. As a result, we found that the background context information is very important when analyzing action behaviors.

Figure 7.

Still, we can see misclassifications for several pairs of labels. This is due to the difficulty of identifying complex action behaviors such as basketball, catching, jumping, and throwing. It is usually difficult to understand such actions because they are composed of many different sub-actions. The comparison of our method with other state-of-the-art methods is given in Table 4. The accuracies for ResNext-101 (fourth and fifth rows) are our own training results obtained by using this paper's configuration with their default parameter settings. Except for [4,6], all methods are 3D CNN models fine-tuned from Kinetics-pretrained weights. The results of this experiment are essential in determining whether the squeeze-and-excitation and self-attention methods are useful for 3D CNN action recognition. Our method achieved state-of-the-art performance among methods using only RGB sequence frames. For example, for a 64-frame clip, the performance of the prior ResNext on UCF-101 was 95.2%, compared to 95.6% for ResNext with SE and SA. For ResNet-101, we can see a performance improvement of 2.6%.

Table 4. Top-1 accuracies on UCF101 and HMDB51 compared with the state-of-the-art methods.

Discussion of SE and SA Modules
From the results, we can conclude that both SE and SA work well for 3D CNN action recognition; however, the SE module performs better than the SA module in all cases except the 64f-clip with the ResNext-101 architecture. The reason the SE module works better overall is that we consider both channel and sequence for SE but only the channel for SA, and the sequence and channel concatenation did not yield the meaningful results we expected.

Integration Strategy of the SE Module
Table 5 presents an ablation study that analyzes the influence of the SE module one stage at a time. More specifically, we added the SE module after a specific block and checked which placement has more impact on analyzing action behavior, i.e., after block1, block2, block3, or after each layer. As expected, the performance was higher with each subsequent layer. The reason is that as the number of layers increases, the amount of information (for both channels and sequences) increases as well. We noted that SE blocks produce performance advantages when they are implemented in each of these architecture stages. The gains produced by SE blocks at various stages are complementary in the sense that they can be effectively combined to further improve the network performance. The results are given in Table 5.

Computation Inference Time of SE and SA Modules
Table 6 shows which module (SE or SA) is heavier and which configuration provides a good balance between inference time and network complexity. As Table 6 shows, the SA module increases the inference time of our network more than the SE module. For this reason, we removed the SA module after the first and second blocks and kept it only after the third and fourth blocks, since the SA module is more effective after the third and fourth layers than after the first and second. The result (ResNext-101 + SE + SA **) shows that the inference time decreased considerably, while the performance decreased only slightly. This means that SA after only the third and fourth blocks can provide almost the same performance with much less inference time. Future work can therefore apply the SA module after only the third and fourth layers to save time at about the same performance.

Table 6. Top-1 accuracy and complexity time for 16f-clips on HMDB51. All the streams are fine-tuned from Kinetics400 pretrained models. * means SA implemented after each layer, ** means SA implemented only after the third and fourth layers.

Qualitative Results
The attention map is a method to analyze the implicit attention of a CNN. We used class activation maps [27] to make our findings more comprehensive. The class activation map indicates the discriminative image regions used by the CNN to identify an action. We show results for the public datasets (see Figure 9) and for randomly chosen images obtained by web crawling (see Figure 10). From the results, we can see that the main focus is on the action, and there is comparably less attention to other aspects of the image. Since the idea of SA is to learn from the entire picture, the network starts learning every detail around the action, and the final decision on action recognition is based on the aggregation of the action and the environment. From the qualitative results, we can conclude that the network's ability to identify the action region increases. All of these results indicate that our trained model can focus on important and meaningful actions.
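A class activation map can be sketched in a few lines of NumPy. The sketch assumes the classic formulation in which global average pooling feeds a linear classifier, so that the map for a class is the classifier-weighted sum of the last convolutional feature maps; whether [27] uses exactly this variant is not stated here. The (C, S, H, W) layout, the clipping of negative responses, the normalization to [0, 1], and the function name are assumptions made for visualization purposes.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """features: (C, S, H, W) from the last conv block; fc_weights: (num_classes, C).

    Returns one map per frame, shape (S, H, W), scaled to [0, 1].
    """
    # Weighted sum over channels using the classifier weights of the chosen class.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = np.maximum(cam, 0.0)                # keep positive evidence only (assumed choice)
    peak = cam.max()
    return cam / peak if peak > 0 else cam    # normalize for display

# Toy example: two channels, one frame, a 3 x 3 spatial grid.
feats = np.zeros((2, 1, 3, 3))
feats[0, 0, 1, 2] = 5.0                       # channel 0 fires at row 1, column 2
feats[1, 0, 0, 0] = 7.0                       # channel 1 fires at row 0, column 0
weights = np.array([[1.0, 0.0],               # class 0 relies on channel 0
                    [0.0, 1.0]])              # class 1 relies on channel 1
cam = class_activation_map(feats, weights, class_idx=0)
print(np.unravel_index(cam.argmax(), cam.shape))
```

In practice the per-frame maps would be upsampled to the input resolution and overlaid on the RGB frames, as in Figures 9 and 10.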
Figure 9. Class activation maps on red-green-blue (RGB) frames. After implementing SE and self-attention (SA), we can see that our activation map's active region increased a bit and also covered non-active areas: (a) results before and (b) after the implementation of our approach.

Figure 10. Qualitative analysis for real images. These samples do not belong to our dataset, and they were randomly chosen by web crawling. It can be seen that areas such as the human arm have a significant influence on behavior judgment.

Conclusions
In this paper, we proposed a novel approach to video action recognition, which uses the aggregation of squeeze-and-excitation and self-attention modules. Using these two modules together, we showed that dynamic changes across frames can be captured more accurately and efficiently with almost no additional computation cost. We also presented qualitative results alongside the quantitative ones so that the approach can be judged on both types of evidence. Extensive experiments demonstrated the effectiveness of our approach, which achieved state-of-the-art performance across different datasets and architectures.
Author Contributions: The work described in this article is the collaborative development of all authors. All authors contributed to the idea of data processing and designed the algorithm. F.A. and D.H.K. contributed to data measurement and analysis, and B.C.S. led the research direction. All authors participated in the writing of the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by an Inha University research grant.

Conflicts of Interest: The authors declare they have no conflicts of interest.

Appendix A
Table A1 provides the layer-by-layer description of our network, which corresponds to Figure 1b. This table includes every detail of the proposed network.