Symmetry
  • Article
  • Open Access

12 April 2021

Hi-EADN: Hierarchical Excitation Aggregation and Disentanglement Frameworks for Action Recognition Based on Videos

Department of Information Communication Engineering, Tongmyong University, Busan 48520, Korea
* Author to whom correspondence should be addressed.

Abstract

Most existing video action recognition methods rely mainly on high-level semantic information from convolutional neural networks (CNNs) but ignore the discrepancies between different information streams. Moreover, they do not normally consider both long-distance aggregation and short-range motion. To solve these problems, we propose hierarchical excitation aggregation and disentanglement networks (Hi-EADNs), which include a multiple frame excitation aggregation (MFEA) module and a feature squeeze-and-excitation hierarchical disentanglement (SEHD) module. MFEA models long-short range motion and calculates feature-level temporal differences. The SEHD module utilizes these differences to optimize the weights of each spatiotemporal feature and excite motion-sensitive channels. Moreover, without introducing additional parameters, the feature information is processed with a series of squeezes and excitations, and multiple temporal aggregations with neighbourhoods enhance the interaction of different motion frames. Extensive experimental results confirm the effectiveness of our proposed Hi-EADN method on the UCF101 and HMDB51 benchmark datasets, where the top-5 accuracy reaches 93.5% and 76.96%, respectively.

1. Introduction

Action recognition [1] is a fundamental problem in video processing and has drawn significant attention in the computer vision community, with applications such as autonomous driving and intelligent surveillance [2]. Early-stage approaches train classifiers based on the spatial-temporal nature of images; for instance, handcrafted features are fed into a support vector machine for classification [3]. In recent years, with the continuous development of deep learning technology, deep methods have been applied to video action recognition tasks, namely, learning deep features via convolutional neural networks (CNNs) [4,5], obtaining state-of-the-art performance on the HMDB51 and UCF101 action video datasets. These deep learning methods outperform traditional methods. However, they ignore important detailed information because these CNN-based methods [6,7,8] depend mainly on the discriminativeness of high-level semantic information, aggregating these features in the fully connected layer to achieve classification; that is, these features usually focus more on high-level semantic information and less on important detail information [9]. For example, ResNet [10] and Inception [11] are used to extract the high-level features for final classification, and their feature maps are small in size, so they cannot fully preserve the local details of the action.
To address the above problems, the middle layers of convolutional neural networks have been exploited for action recognition in video. Compared with high-level CNN features, middle-layer convolution features usually contain more detailed information [12,13,14]. However, these methods still have limitations, namely, how to effectively model the spatial-temporal structure with its significant variations and complexities and how to enhance the interaction with neighbourhoods. For these problems, most existing methods adopt two-stream CNN-based action recognition frameworks, which process images and optical flow with separate CNN branches, reducing the interaction between spatial and temporal features. Moreover, subtle motion and local discriminative information are ignored.
Thus, to address the above problems, we propose hierarchical excitation aggregation and disentanglement networks (Hi-EADNs), which integrate motion frame modelling into spatial-temporal information and calculate differences between adjacent frames to represent feature-level motion with a multiple frame excitation aggregation (MFEA) module. These difference features are then utilized to optimize weights, and the motion-sensitive information in the spatial-temporal features of frames is processed with a series of squeezes and excitations. Moreover, the SEHD module enhances the interaction of different motion frames and is forced to discover differentiated detail information, improving the ability to capture informative spatial-temporal features. However, as the network deepens, repeated use of a large number of local operations and redundant information makes our proposed framework difficult to optimize; thus, we hierarchically embed squeeze-and-excitation blocks into DenseNet [15]. When the spatial-temporal information passes through the SEHD module, these features complete multiple information exchanges with adjacent frames, further increasing the interaction of neighbours and reducing the use of redundant information, ultimately improving recognition accuracy for actions in video.
To summarize, the components of the proposed Hi-EADN method are complementary and cooperative; namely, they not only maximize the retention of important detailed information but also improve the modelling ability over short-long spatial-temporal ranges. The main contributions of our Hi-EADN method are as follows.
  • We propose a novel hierarchical excitation aggregation and disentanglement network (Hi-EADN) with a multiple frame excitation aggregation (MFEA) module to better model long-short range motion and calculate the feature-level spatial-temporal difference.
  • The squeeze-and-excitation hierarchical disentanglement (SEHD) module excites motion-sensitive channels by a series of squeeze-and-excitation operations and enhances the interactivity of different motion frames. Moreover, the spatial-temporal information with neighbourhoods is aggregated and efficiently enlarges the modelling of short-long ranges.
  • The two components of Hi-EADN are complementary and cooperate to realize multiple information exchanges with adjacent frames and increase the interaction of neighbours. Our proposed method outperforms other state-of-the-art methods on the UCF101 and HMDB51 benchmark datasets.
The rest of this paper is organized as follows. Section 2 introduces the related work, Section 3 details the proposed method, Section 4 reports and analyses the experimental results, and Section 5 gives the conclusions.

3. Our Proposed Method

In this section, we describe our Hi-EADN method, which aims to improve the interaction among neighbouring motion frames and reduce the use of redundant information, while reducing the discrepancies between different modalities (image features and optical flow) and within each modality, exciting motion patterns, and performing multiple spatial-temporal aggregations to build long-short range spatial-temporal relationships. To achieve this goal, we elaborate the basic knowledge for the two key components, namely, MFEA and SEHD. The framework of our proposed method is illustrated in Figure 1.
Figure 1. The framework of our proposed method for action recognition based on videos. MFEA indicates the multiple-frame excitation aggregation module, where $E_1^{TF}$ indicates the decoded representation of a sequence frame, and $E_2^{im}$ and $E_2^{of}$ indicate the encoding processes of the image and optical flows. $Conv_1$ indicates the convolution layer with a kernel size of $1 \times 1$. SA indicates the dual self-guided attention module. SEHD indicates the squeeze-and-excitation hierarchical disentanglement module. $\varsigma$ indicates the loss function. $a^{of}$ and $a^{im}$ indicate each optical and image flow, respectively. $\varsigma_{ce}$ indicates the multi-class cross-entropy loss. $\varsigma_{\kappa\tau}$ indicates the sharing of cross-modality information. $\varsigma_{trip}$ indicates the reconstruction loss, $\varsigma_{trip} = \varsigma_{rec}$.

3.1. Multiple Frames Excitation Aggregation (MFEA) Module

Most existing methods utilize motion representations for video-based action recognition [37]. Although these works achieve better action recognition accuracy, most of them represent motion in the form of optical flow and learn motion representations and spatial-temporal information independently of each other. Moreover, these methods typically adopt local convolution kernels to process neighbouring motion frames, so long-distance spatial-temporal modelling can only be performed in deep neural networks. However, the long-distance detail information is greatly weakened, and redundant information is repeatedly used as the network deepens. In contrast, in our multiple frame excitation aggregation (MFEA) component, the spatial-temporal information and the corresponding local convolution layers are aggregated into a group of subspaces; namely, each subspace is integrated into dense connectivity blocks [15] (the component is inspired by DenseNet), which accordingly reduces the use of redundant information and improves the long-range modelling ability.
Formally, the input feature can be represented as $X \in [N, T, C', H, W]$, where $H$ and $W$ are the spatial dimensions, $T$ and $N$ indicate the temporal dimension and batch size, respectively, and $C'$ indicates the fragments obtained by splitting the feature channel dimension, namely, $C' = \frac{1}{4}C$. The component consists of a series of local convolutions and spatial-temporal convolutions. Differently, we divide the local convolution into multiple subconvolutions, each of which takes one fragment as its input feature. The first fragment is kept as is, and the remaining three fragments are sequentially processed with one spatial subconvolution layer and another channel-wise temporal subconvolution layer. Finally, residual dense connections are added between neighbouring fragments, transforming the component into a hierarchical cascade with a parallel architecture. The process can be presented as:
$$X_i^0 = \begin{cases} X_i, & i = 1 \\ Conv_{sp}(Conv_{tem}(X_i)), & i = 2 \\ Conv_{sp}(Conv_{tem}(H_{de}([X_i, X_{i-1}^0]))), & i = 3, 4 \end{cases}$$
where $Conv_{sp}$ indicates the spatial convolution layers, $Conv_{tem}$ indicates the temporal convolution layers, and $H_{de}([\cdot])$ indicates the densely connected layers. In this component, different fragments adopt different convolution kernel sizes: the receptive field of the initial fragment $X_1$ is $1 \times 1$, i.e., a $1 \times 1$ convolution kernel, and the receptive fields of the other fragments are $3 \times 3$. Then, the information of each fragment is aggregated, one of which can be expressed as $f_1$. Finally, the aggregated motion information of the $t$-frame sequence is obtained by a simple concatenation strategy:
$$X^0 = [MP(X_1^0); MP(X_2^0); MP(X_3^0); MP(X_4^0)]$$
where $MP(\cdot)$ indicates the max pooling layer, $X_1^0$ indicates the input videos (including image, optical flow and sequence frames), and $X_i^0, i = 1, 2, 3, 4$ indicate the different sequence frames.
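For concreteness, the following is a minimal PyTorch-style sketch of Equations (1) and (2), written from the description above rather than from the authors' implementation; the class name, the kernel sizes, the substitution of a residual addition for the dense connection $H_{de}([\cdot])$, and the max-pooling configuration are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class MFEA(nn.Module):
    """Illustrative sketch of Equations (1)-(2): the channel dimension is split into
    four fragments of C' = C/4 channels; the first fragment is kept unchanged (a 1x1
    receptive field), the other three pass through a channel-wise temporal
    sub-convolution and a 3x3 spatial sub-convolution, and a residual addition between
    neighbouring fragments stands in for the dense connection H_de([.,.])."""

    def __init__(self, channels: int):
        super().__init__()
        cp = channels // 4                                     # C' = C/4 per fragment
        self.conv_tem = nn.ModuleList(
            [nn.Conv1d(cp, cp, 3, padding=1, groups=cp) for _ in range(3)]
        )
        self.conv_sp = nn.ModuleList(
            [nn.Conv2d(cp, cp, 3, padding=1) for _ in range(3)]
        )
        self.pool = nn.MaxPool2d(2)                            # MP(.) in Equation (2)

    def forward(self, x):                                      # x: [N, T, C, H, W], H and W even
        n, t, c, h, w = x.shape
        cp = c // 4
        frags = torch.chunk(x, 4, dim=2)                       # four C/4-channel fragments
        outs, prev = [frags[0]], None                          # X_1^0 = X_1
        for i, frag in enumerate(frags[1:]):
            if prev is not None:                               # neighbouring link for i = 3, 4
                frag = frag + prev
            # Conv_tem: channel-wise 1D convolution along the temporal axis T
            y = frag.permute(0, 3, 4, 2, 1).reshape(n * h * w, cp, t)
            y = self.conv_tem[i](y).reshape(n, h, w, cp, t).permute(0, 4, 3, 1, 2)
            # Conv_sp: 3x3 convolution over the spatial axes H and W
            y = self.conv_sp[i](y.reshape(n * t, cp, h, w)).reshape(n, t, cp, h, w)
            outs.append(y)
            prev = y
        # Equation (2): max-pool every fragment and concatenate along the channel axis
        pooled = [self.pool(o.reshape(n * t, cp, h, w)) for o in outs]
        return torch.cat(pooled, dim=1).reshape(n, t, c, h // 2, w // 2)
```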
In other words, the MFEA module not only effectively captures the interactivity between continuous actions in the video but also avoids the loss of information as the information flows through the network, which would otherwise lead to inaccurate recognition. Moreover, the output feature information contains spatial-temporal representations capturing different temporal ranges, which is superior to the local temporal representations obtained with a single local convolutional neural network.

3.2. Squeeze-and-Excitation Hierarchical Disentanglement (SEHD) Module

In this component, we couple a hierarchical disentanglement module with motion excitation by sharing the squeeze-and-excitation block [17,22] and the embedding layer of multi-head attention. This component enables both information encoders to extract motion detail information from RGB images and dynamic optical flows. Moreover, this feature learning and disentanglement process is conducive to capturing intra- and inter-modality co-occurrence information from the different image modalities (RGB and optical flow) and further enhances the interaction of neighbourhood motion frames.

Sharing the Squeeze-and-Excitation Layer

In the shared squeeze-and-excitation (SE) layer, different channels capture different motion information. Some channels are mainly used to obtain static information related to the motion scene, that is, to model the motion scene, while most of the channels are used to explore the dynamic information between consecutive action frames. These two kinds of channels mainly describe the temporal and spatial differences of the actions in the video. For action recognition based on videos, the SE layer enables the Hi-EADN framework to discover and then enhance motion-sensitive neighbourhood interactions and reduce the use of redundant information. The squeeze-and-excitation block [27] can improve the feature representation through channel interconnection and increase the sensitivity of the recognition framework to informative features. We adopt global average pooling to squeeze each channel into a single numeric value. Then, to further utilize the detailed information from the squeeze operation, we conduct an excitation operation to fully capture the channel-wise dependencies produced by the squeeze operation, and use this squeezed-and-excited feature information [27,30] as the input of a multi-head self-attention block to reconstruct information at different scales. The process is indicated as:
$$F_{squeeze}(\iota_\theta) = \frac{1}{H \times W \times C} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} \iota_\theta(i, j, c)$$
$$F_{excitation}(f_{squeeze}, W) = \delta(\varpi(W_1, \varphi(W_2, f_{squeeze})))$$
$$F_{Att} = MHSGA_s(F_{squeeze}(\iota_\theta); F_{excitation}(f_{squeeze}, W)), \quad s \in S = \{1, 2, 3, 4\}$$
where $MHSGA(\cdot;\cdot)$ indicates the dual self-guided attention module, and $\delta(\cdot)$, $\varphi(\cdot)$ and $\varpi(\cdot)$ indicate activation functions (ReLU). $s$ indicates the multi-scale features of the SE and the DenseNet backbone networks. $H$, $W$ and $C$ indicate the height, width and number of channels. $F_{excitation}$ indicates the feature information of the channel excitation layer, and $F_{squeeze}$ indicates the feature information of the channel squeeze layer.
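As an illustration of the squeeze-and-excitation step, the following minimal sketch follows the standard SE formulation [27]: a per-channel global average pooling squeeze followed by two small fully connected layers with ReLU and sigmoid activations that re-weight the channels; the class name and the reduction ratio r are assumptions.

```python
import torch.nn as nn


class SqueezeExcitation(nn.Module):
    """Standard squeeze-and-excitation block: GAP squeezes each channel to a single
    value, then the excitation MLP produces per-channel weights that rescale the
    input so that motion-sensitive channels are amplified."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # F_squeeze: GAP over H x W
        self.excite = nn.Sequential(                  # F_excitation: FC -> ReLU -> FC -> sigmoid
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                             # x: [N, C, H, W]
        n, c, _, _ = x.shape
        s = self.squeeze(x).view(n, c)                # channel descriptors
        e = self.excite(s).view(n, c, 1, 1)           # per-channel excitation weights
        return x * e
```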
The dual self-guided attention (SA) module computes the response at each position as a weighted sum of the features at all positions. The main idea of multi-head self-guided attention is to help convolutions capture long-range, full-level interconnections throughout the image domain. The SEHD module, implemented with the SA and SE multi-scale hierarchical modules, helps relate small details at each position to fine details in other areas of the image and across frames. The SEHD module can thus further improve the interaction of neighbouring frames or successive action sequences.
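A hypothetical stand-in for the dual self-guided attention (SA) idea is sketched below using standard multi-head self-attention over spatial positions, so every position becomes a weighted sum of all positions; the class name, the head count and the use of nn.MultiheadAttention are assumptions and do not reproduce the exact SA module.

```python
import torch.nn as nn


class SelfGuidedAttention(nn.Module):
    """Treats every spatial position of a feature map as a token and lets all
    positions attend to each other, capturing long-range interconnections throughout
    the image domain; a residual connection keeps the original signal."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()                            # channels must be divisible by heads
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                             # x: [N, C, H, W]
        n, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)            # [N, H*W, C] position tokens
        out, _ = self.attn(seq, seq, seq)             # full pairwise position interactions
        return out.transpose(1, 2).reshape(n, c, h, w) + x
```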

3.3. Hierarchical Integration Strategy

As illustrated in the hierarchical representation of Figure 1, our hierarchical representation learning (HR) module is coupled with multiple frame excitation aggregation (MFEA) by sharing squeeze-and-excitation hierarchical disentanglement (SEHD). This module enables both encoders to extract the common class attributes between motion frame sequences and successive actions. At the same time, this hierarchical representation process implicitly assists in separating intra- and cross-modality characteristics from successive motion frames and improves the quality of human action recognition. The representation processing is defined as follows.
$$F_{Total} = FC(F_{MFEA} \oplus F_{SEHD})$$
$$F_{SEHD} = Conv\left(\sum_{\mu=1}^{S} F_{Att}^{\mu}\right), \quad \mu \in S = \{1, 2, 3, 4\}$$
where $Conv(\cdot)$ indicates convolution layers with a kernel size of $1 \times 1$, $F_{Att}^{\mu}$ indicates the multi-scale feature information of the dual self-guided attention layers, and $\mu$ and $S$ indicate the scale. $F_{MFEA}$ indicates the fused feature information of the MFEA module, and $F_{SEHD}$ indicates the fused feature information of the SEHD module. $\oplus$ indicates concatenation.
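A minimal sketch of Equation (4) follows, assuming that the multi-scale attention maps share one spatial size and that the concatenated feature is globally average-pooled before the fully connected classifier; the class name and these shape choices are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HierarchicalFusion(nn.Module):
    """Sketch of Equation (4): the multi-scale attention outputs F_Att^mu (mu = 1..4)
    are summed and passed through a 1x1 convolution to form F_SEHD, which is then
    concatenated with F_MFEA along the channel axis and classified by an FC layer."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)   # Conv(.) with 1x1 kernel
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2 * channels, num_classes)

    def forward(self, f_mfea, f_att_list):   # f_mfea: [N, C, H, W], f_att_list: 4 x [N, C, H, W]
        f_sehd = self.conv1x1(torch.stack(f_att_list).sum(dim=0))     # F_SEHD = Conv(sum_mu F_Att^mu)
        fused = torch.cat([f_mfea, f_sehd], dim=1)                    # F_MFEA (+) F_SEHD
        return self.fc(self.gap(fused).flatten(1))                    # F_Total = FC(...)
```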
The hierarchical representation module further strengthens the interaction between successive actions, describes actions from different levels, and strengthens the characterization ability of the features. At the same time, different levels of information can form a complementary relationship, highlighting the differences between actions across classes and within classes.

3.4. Reconstruction of the Loss Function

To capture the best multi-scale feature information and improve the performance of our proposed Hi-EAD framework for video-based action recognition, we employ multiple losses in all network layers during training, namely, the reconstruction loss (over the multi-scale feature process) and the multi-class cross-entropy loss, where the cross-entropy loss $\varsigma_{ce}$ penalizes the difference between the network output labels and the original sample labels. For the reconstruction loss function $\varsigma_{rec}$, our primary strategy is to integrate a pair of cross-modality information streams by swapping the human actions of the two streams with the same id and the same frames, where the streams include the image and optical flow. Formally, this cross-modality reconstruction loss function is indicated as follows
$$\varsigma_{rec}^{im} = \mathbb{E}_{x_{im,of}^{i} \sim \tau_{img,of}(x_{im,of}^{i})}\left[\left\lVert x_{im} - \kappa(\tau_2, a_2^s, a_2^{im})\right\rVert_1\right]$$
where $\varsigma_{rec}^{im}$ indicates the reconstruction loss of the image flow, $\mathbb{E}_{x_{im,of}^{i} \sim \tau_{img,of}(x_{im,of}^{i})}$ indicates the encoded information of the image and optical flows, $x_{im}$ indicates the input information of the image flow, $\kappa(\tau_2, a_2^s, a_2^{im})$ indicates the cross-modality processing information, and $a^s$ and $a^{im}$ indicate the encoders' multi-scale features of the optical and image flows.
The interactor learns how to establish interactions and correlations between the same or similar human actions in the same successive frames from the cross-modality reconstruction loss function. To be clear, we only give the loss function for the image modality, $\varsigma_{rec}^{im}$; the reconstruction loss of the optical flow, $\varsigma_{rec}^{of}$, can be obtained by exchanging the corresponding parameters.
To improve the interaction quality, we propose multiple additional reconstruction losses to capture the multi-scale information. In addition to the loss functions for reconstructing the images and optical flow [20,25] of different modalities, we apply multi-head self-guided attention to refine the feature map. This multi-modality reconstruction loss function plays a key regularization role for the multi-scale refined feature map. Moreover, this loss encodes both the assumption that the relationship between short-range and opposite information should be preserved during the cross-modality reconstruction process and the assumption that interaction details and dependencies should be maintained during the same-modality reconstruction process. Thus, the overall reconstruction loss function is indicated as
$$\varsigma_{rec} = \alpha_1 \varsigma_{rec}^{im} + \alpha_2 \varsigma_{rec}^{of} + \alpha_3 \varsigma_{rec}^{df} + \alpha_4 \varsigma_{rec}^{ms}$$
where $\alpha_j, j \in \{1, 2, 3, 4\}$ indicates the importance factor of each reconstruction loss, $\varsigma_{rec}^{ms}$ indicates the refined multi-scale reconstruction loss function, and $\varsigma_{rec}^{df}$ indicates the reconstruction loss of the cross-modality information stream. In addition, $\varsigma_{rec}^{df}$ is formulated as
$$\varsigma_{rec}^{df} = \mathbb{E}_{x_{im,of}^{i} \sim \tau_{img,of}(x_{im,of}^{i})}\left[\left\lVert a_{im}^{s} - a_{of}^{s}\right\rVert_1\right] + \mathbb{E}_{x_{im,of}^{i} \sim \tau_{img,of}(x_{im,of}^{i})}\left[\left\lVert \tilde{a}_{im}^{s} - \tilde{a}_{of}^{s}\right\rVert_1\right]$$
where $\tilde{a}_{im}^{s}$ and $\tilde{a}_{of}^{s}$ indicate the feature information reconstructed by the cross-modality encoders of the image and optical flows, respectively, and $\lVert\cdot\rVert_1$ indicates the L1 norm.
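As a small illustration of this cross-modality term, assuming the per-scale encoder features are available as lists of tensors and approximating the expectation by a batch mean, a sketch could look as follows (the function and argument names are hypothetical):

```python
import torch


def recon_df_loss(a_im, a_of, a_im_rec, a_of_rec):
    """L1 distances between the multi-scale image/optical-flow encoder features and
    between their cross-modality reconstructions (the tilde terms), summed over scales."""
    loss = 0.0
    for s in range(len(a_im)):
        loss = loss + torch.mean(torch.abs(a_im[s] - a_of[s]))          # ||a_im^s - a_of^s||_1
        loss = loss + torch.mean(torch.abs(a_im_rec[s] - a_of_rec[s]))  # ||a~_im^s - a~_of^s||_1
    return loss
```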
Multi-class cross-entropy loss: given a set of training feature vectors with class labels $\{f_i, \Upsilon_i\}$, we use the multi-class cross-entropy loss [10,19,31] for human action recognition learning
$$\varsigma_{ce} = \mathbb{E}_{f \sim F, \Upsilon \sim \Upsilon}\left[-\log\left(p(\Upsilon \mid f)\right)\right]$$
where $p(\Upsilon \mid f)$ indicates the predicted probability that the feature vector $f$ belongs to class $\Upsilon$, and $\mathbb{E}_{f \sim F, \Upsilon \sim \Upsilon}[\cdot]$ indicates the expectation over the encoding process of video-based action recognition.
To further enhance the interaction of an action with its neighbouring successive frames and improve the ability to describe structural information at different levels, the spatial-temporal information of neighbourhoods is aggregated, which efficiently enlarges the short-long range modelling. The total loss function is indicated as
$$\varsigma_{Total} = \lambda_{rec}\,\varsigma_{rec} + \lambda_{mce}\,\varsigma_{ce}$$
where $\lambda_{rec}$ and $\lambda_{mce}$ indicate the importance factors of the corresponding losses. We train the proposed Hi-EAD framework with the total loss in an end-to-end manner and conduct iterative optimization.
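As a rough sketch of how the weighted terms could be combined in code, assuming the four reconstruction terms have already been computed, one might write the following; the weighting values shown are placeholders, not the settings used in the paper.

```python
import torch.nn as nn


def total_loss(rec_losses, logits, labels,
               alphas=(1.0, 1.0, 1.0, 1.0), lambda_rec=1.0, lambda_mce=1.0):
    """Combines the four weighted reconstruction terms and the multi-class
    cross-entropy into the overall training objective."""
    ce = nn.CrossEntropyLoss()(logits, labels)                 # multi-class cross-entropy
    rec = sum(a * l for a, l in zip(alphas, rec_losses))       # weighted reconstruction losses
    return lambda_rec * rec + lambda_mce * ce                  # total loss
```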
In summary, our proposed Hi-EAD framework for video-based action recognition first uses MFEA to perform static and dynamic modelling of continuous action frames and strengthen the interaction between action frames. At the same time, the SEHD module achieves information complementation between different levels of features, strengthens the differences between different action features, and refines the feature map with a multi-scale dual self-guided attention layer to reduce the use of redundant information, obtain effective channel and position information for each action in the video, and build the best dependency relationships among different frames or successive actions, which improves the accuracy of action recognition. The MFEA and SEHD process of our proposed Hi-EAD framework for video-based action recognition is summarized in Algorithm 1.
Algorithm 1: The recognition process of the proposed Hi-EAD framework for action based on videos
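For readers who prefer code, the following hypothetical sketch assembles the illustrative modules from the previous subsections (MFEA, SqueezeExcitation, SelfGuidedAttention, HierarchicalFusion) into one forward pass; it follows the textual description of the pipeline rather than the exact steps of Algorithm 1, and the clip-level averaging and the reuse of one feature map for all attention scales are simplifying assumptions.

```python
def hi_ead_forward(frames, mfea, se, sa_layers, fusion):
    """Hypothetical end-to-end pass: MFEA models the frame sequence, the SE block
    excites motion-sensitive channels, multi-scale self-guided attention refines the
    maps, and the hierarchical fusion head produces class scores."""
    n, t = frames.shape[:2]                       # frames: [N, T, C, H, W]
    x = mfea(frames)                              # spatio-temporal features [N, T, C, H', W']
    x = x.reshape(n * t, *x.shape[2:])            # fold time into the batch for the 2D blocks
    x = se(x)                                     # squeeze-and-excitation re-weighting
    f_att = [sa(x) for sa in sa_layers]           # multi-scale attention maps F_Att^mu
    logits = fusion(x, f_att)                     # F_Total = FC(F_MFEA (+) F_SEHD)
    return logits.reshape(n, t, -1).mean(dim=1)   # average frame scores for a clip-level prediction
```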

5. Conclusions

We propose a hierarchical excitation aggregation and disentanglement network (Hi-EAD) to realize action recognition in video. The framework can effectively capture the possible relations between actions and frames through the multiple frame excitation aggregation (MFEA) and feature squeeze-and-excitation hierarchical disentanglement (SEHD) modules. The experimental results show that our proposed Hi-EAD framework for video-based action recognition outperforms traditional CNN-based approaches on the UCF-101 and HMDB-51 benchmark datasets. Moreover, the effectiveness and reliability of our proposed Hi-EAD framework are verified for video-based action recognition tasks.
Future research will aim to improve the network structure and enhance feature information extraction. Specifically, we aim to design a simple and efficient semantic framework to accurately extract feature information. Network structure: design a more robust graph structure or use graph attention networks to reduce the loss of information flowing between nodes and layers. External feature information: design an effective feature extraction network, for example by improving the DenseNet structure (e.g., embedding hierarchical squeeze-and-excitation and self-attention), to retain the most discriminative detail information of different actions or frames.

Author Contributions

Z.H. designed the experiment to evaluate the performance and wrote the paper. E.-J.L. supervised the study and reviewed the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We would like to thank the anonymous reviewers for their comments and helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Hi-EAD    Hierarchical Excitation Aggregation and Disentanglement Networks
MFEA      Multiple Frames Excitation Aggregation
SEHD      Squeeze-and-Excitation Hierarchical Disentanglement
GAP       Global Average Pooling
SA        Multi-head Self-guided Attention

References

  1. Yang, C.; Xu, Y.; Shi, J.; Dai, B.; Zhou, B. Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 591–600.
  2. Saponara, S.; Greco, M.S.; Gini, F. Radar-on-chip/in-package in autonomous driving vehicles and intelligent transport systems: Opportunities and challenges. IEEE Signal Process. Mag. 2019, 36, 71–84.
  3. An, F.P. Human action recognition algorithm based on adaptive initialization of deep learning model parameters and support vector machine. IEEE Access 2018, 6, 59405–59421.
  4. Yang, H.; Yuan, C.; Li, B.; Du, Y.; Xing, J.; Hu, W.; Maybank, S.J. Asymmetric 3d convolutional neural networks for action recognition. Pattern Recognit. 2019, 85, 1–12.
  5. Chen, X.; Weng, J.; Lu, W.; Xu, J.; Weng, J. Deep manifold learning combined with convolutional neural networks for action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 3938–3952.
  6. Jing, C.; Wei, P.; Sun, H.; Zheng, N. Spatiotemporal neural networks for action recognition based on joint loss. Neural Comput. Appl. 2020, 32, 4293–4302.
  7. Li, J.; Liu, X.; Zhang, W.; Zhang, M.; Song, J.; Sebe, N. Spatio-temporal attention networks for action recognition and detection. IEEE Trans. Multimed. 2020, 22, 2990–3001.
  8. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010.
  9. Peng, Y.; Shu, T.; Lu, H. Weak integration of form and motion in two-stream CNNs for action recognition. J. Vis. 2020, 20, 615.
  10. Lin, Y.; Chi, W.; Sun, W.; Liu, S.; Fan, D. Human Action Recognition Algorithm Based on Improved ResNet and Skeletal Keypoints in Single Image. Math. Probl. Eng. 2020, 2020, 6954174.
  11. Bose, S.R.; Kumar, V.S. An Efficient Inception V2 based Deep Convolutional Neural Network for Real-Time Hand Action Recognition. IET Image Process. 2019, 14, 688–696.
  12. Li, W.; Feng, C.; Xiao, B.; Chen, Y. Binary Hashing CNN Features for Action Recognition. TIIS 2018, 12, 4412–4428.
  13. Rahman, S.; See, J.; Ho, C.C. Deep CNN object features for improved action recognition in low quality videos. Adv. Sci. Lett. 2017, 23, 11360–11364.
  14. Cherian, A.; Gould, S. Second-order Temporal Pooling for Action Recognition. Int. J. Comput. Vis. 2019, 127, 340–362.
  15. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
  16. Seemanthini, K.; Manjunath, S.S. Human Detection and Tracking using HOG for Action Recognition. Procedia Comput. Sci. 2018, 132, 1317–1326.
  17. Chen, W.Q.; Xiao, G.Q.; Tang, X.Q. An Action Recognition Model Based on the Bayesian Networks. Appl. Mech. Mater. 2014, 513, 886–1889.
  18. Tran, D.T.; Yamazoe, H.; Lee, J.H. Multi-scale affined-HOF and dimension selection for view-unconstrained action recognition. Appl. Intell. 2020, 50, 1–19.
  19. Wang, L.; Koniusz, P.; Huynh, D.Q. Hallucinating Bag-of-Words and Fisher Vector IDT terms for CNN-based Action Recognition. arXiv 2019, arXiv:1906.05910.
  20. Wang, L.; Zhi-Pan, W.U. A Comparative Review of Recent Kinect-based Action Recognition Algorithms. arXiv 2019, arXiv:1906.09955.
  21. Jagadeesh, B.; Patil, C.M. Video based action detection and recognition human using optical flow and SVM classifier. In Proceedings of the 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), Bangalore, India, 20–21 May 2016; IEEE: New York, NY, USA, 2016.
  22. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732. Available online: https://dl.acm.org/doi/10.1109/CVPR.2014.223 (accessed on 10 March 2021).
  23. Patil, G.G.; Banyal, R.K. Techniques of Deep Learning for Image Recognition. In Proceedings of the 2019 IEEE 5th International Conference for Convergence in Technology (I2CT), Pune, India, 29–31 March 2019; IEEE: New York, NY, USA, 2019.
  24. Kang, B.R.; Lee, H.; Park, K.; Ryu, H.; Kim, H.Y. BshapeNet: Object Detection and Instance Segmentation with Bounding Shape Masks. Pattern Recognit. Lett. 2020, 131, 449–455.
  25. Sungheetha, A.; Rajesh, S.R. Comparative Study: Statistical Approach and Deep Learning Method for Automatic Segmentation Methods for Lung CT Image Segmentation. J. Innov. Image Process. 2020, 2, 187–193.
  26. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv 2014, arXiv:1406.2199.
  27. Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 677–691.
  28. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
  29. Feichtenhofer, C.; Pinz, A.; Wildes, R.P. Spatiotemporal Residual Networks for Video Action Recognition. 2017. Available online: https://papers.nips.cc/paper/2016/file/3e7e0224018ab3cf51abb96464d518cd-Paper.pdf (accessed on 10 March 2021).
  30. Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4694–4702.
  31. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based Action Recognition with Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017.
  32. Liao, X.; He, L.; Yang, Z.; Zhang, C. Video-based Person Re-identification via 3D Convolutional Networks and Non-local Attention. In Asian Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2018.
  33. Kalfaoglu, M.E.; Kalkan, S.; Alatan, A. Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020.
  34. Anvarov, F.; Kim, D.H.; Song, B.C. Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention. Electronics 2020, 9, 147.
  35. Jalal, M.A.; Aftab, W.; Moore, R.K.; Mihaylova, L. Dual stream spatio-temporal motion fusion with self-attention for action recognition. In Proceedings of the 22nd International Conference on Information Fusion, Ottawa, ON, Canada, 2–5 July 2019.
  36. Purwanto, D.; Pramono, R.R.; Chen, Y.T.; Fang, W.H. Three-Stream Network with Bidirectional Self-Attention for Action Recognition in Extreme Low-Resolution Videos. IEEE Signal Process. Lett. 2019, 26, 1187–1191.
  37. Yu, T.; Guo, C.; Wang, L.; Gu, H.; Xiang, S.; Pan, C. Joint Spatial-Temporal Attention for Action Recognition. Pattern Recognit. Lett. 2018, 112, 226–233.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
