GAT-Based Bi-CARU with Adaptive Feature-Based Transformation for Video Summarisation
Abstract
1. Introduction
2. Related Works
- This work presents two CNN models for processing the visual and audio streams of a video. Both models employ a spatial CNN layer, built on the VGGreNet architecture, to handle the time-series input. A Spatial Attention Module (SAM) is then used to attend to and weight the extracted features, and a subnet is introduced to combine the resulting features.
- In the Graph Attention Network (GAT) approach, a comprehensive graph is constructed from the features produced by the Spatial Attention Module (SAM). An attention mechanism is then applied to the edges of the graph to extract the salient content from these features.
- An Adaptive Feature-based Transformation (AFT) is then introduced to refine the calculation of the attention weights, thereby increasing the overall effectiveness of the process. The AFT is integrated into a multi-headed approach, which facilitates the identification of events and enables a more comprehensive interpretation of the video content (an illustrative sketch of this attention stage follows this list).
- Bi-CARU is recommended for decoding and extracting contextual information through its context-adaptive gate. It also improves the accuracy of the similarity module by combining the forward and backward features of CARU, taking the hidden features and attention into account to generate detailed sentences for fine-grained prediction.
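To make the graph-attention idea above concrete, the following is a minimal PyTorch sketch of multi-head attention over the edges of a frame graph, with a gated linear map standing in for the Adaptive Feature-based Transformation. The layer sizes, head count, and the exact form of the transformation are illustrative assumptions, not the formulation used in this work.

```python
# Hedged sketch: multi-head graph attention over per-frame features, with a
# learnable gated transformation applied before edge scoring. The "adaptive"
# transform, head count and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameGraphAttention(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, heads: int = 4):
        super().__init__()
        self.heads, self.out_dim = heads, out_dim
        # Assumed stand-in for the AFT: a gated linear projection per node.
        self.transform = nn.Linear(in_dim, heads * out_dim)
        self.gate = nn.Linear(in_dim, heads * out_dim)
        # One attention vector per head, scoring (source, destination) features.
        self.attn = nn.Parameter(torch.empty(heads, 2 * out_dim))
        nn.init.xavier_uniform_(self.attn)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (N, in_dim) frame features, one node per sampled frame
        # adj: (N, N) binary adjacency (1 where an edge connects two frames)
        n = x.size(0)
        h = torch.tanh(self.transform(x)) * torch.sigmoid(self.gate(x))
        h = h.view(n, self.heads, self.out_dim)                          # (N, H, D)
        # Edge logits per head: LeakyReLU of the summed source/destination scores.
        src = (h * self.attn[:, : self.out_dim]).sum(-1)                 # (N, H)
        dst = (h * self.attn[:, self.out_dim :]).sum(-1)                 # (N, H)
        logits = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0), 0.2)  # (N, N, H)
        logits = logits.masked_fill(adj.unsqueeze(-1) == 0, float("-inf"))
        alpha = torch.softmax(logits, dim=1)          # normalise over neighbours
        out = torch.einsum("ijh,jhd->ihd", alpha, h)  # aggregate neighbour features
        return out.reshape(n, self.heads * self.out_dim)


if __name__ == "__main__":
    feats = torch.randn(16, 512)              # 16 frames, 512-d visual features
    adj = (torch.rand(16, 16) > 0.5).float()  # toy adjacency between frames
    adj.fill_diagonal_(1.0)
    layer = FrameGraphAttention(512, 64, heads=4)
    print(layer(feats, adj).shape)            # torch.Size([16, 256])
```

The gating keeps the node transformation input-dependent, which is the property the AFT is introduced for; any comparable learnable transform could be substituted without changing the attention step.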
3. Proposed Model
- Spatial Attention Module (SAM). This module takes the visual features as input to predict the spatial information of each frame. It determines the coordinates of the current frame, computes their relative spatial information, and performs consistent coordinate projections. By applying a 3D transformation algorithm, the model can focus on a consistent projective space and acquire the relevant regional information of the image.
- Graph Attention Transformer (GAT). The visual features of the nodes are transformed using the Adaptive Feature-based Transformation (AFT) mechanism, which enhances their representation in the model. The transformed features are then processed by a Bi-CARU network, which also employs a multi-headed approach for key-frame determination.
- Bi-CARU Layer. A bi-directional RNN architecture built on CARU is combined with the attention scores generated by the GAT. This Bi-CARU architecture enables simultaneous forward and backward feature searching over the input features (a hedged sketch of this fusion step follows this list).
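As referenced in the last item above, the following is a hedged sketch of how forward and backward recurrent passes over frame features can be fused and modulated by externally supplied attention scores (e.g. from the GAT stage) to yield per-frame importance values. A standard GRU cell stands in for CARU, so the snippet illustrates only the data flow, not the authors' context-adaptive gate.

```python
# Hedged sketch: bidirectional recurrent scoring of frames, modulated by
# externally supplied attention scores. nn.GRUCell is an assumed stand-in
# for the CARU cell; dimensions and the fusion rule are illustrative.
import torch
import torch.nn as nn


class BiRecurrentScorer(nn.Module):
    def __init__(self, feat_dim: int, hidden: int):
        super().__init__()
        self.fwd = nn.GRUCell(feat_dim, hidden)
        self.bwd = nn.GRUCell(feat_dim, hidden)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, frames: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
        # frames: (T, feat_dim) per-frame features; attn: (T,) attention scores
        T, hidden = frames.size(0), self.fwd.hidden_size
        hs_f, hs_b = [], []
        h = frames.new_zeros(1, hidden)
        for t in range(T):                  # forward pass over time
            h = self.fwd(frames[t : t + 1], h)
            hs_f.append(h)
        h = frames.new_zeros(1, hidden)
        for t in reversed(range(T)):        # backward pass over time
            h = self.bwd(frames[t : t + 1], h)
            hs_b.append(h)
        hs_b.reverse()                      # realign to forward time order
        fused = torch.cat([torch.cat(hs_f, dim=0), torch.cat(hs_b, dim=0)], dim=-1)
        # Per-frame importance, modulated by the supplied attention scores.
        return torch.sigmoid(self.score(fused)).squeeze(-1) * torch.softmax(attn, dim=0)


if __name__ == "__main__":
    frames = torch.randn(30, 256)      # 30 frames with 256-d features
    attn = torch.randn(30)             # toy per-frame scores from the attention stage
    scorer = BiRecurrentScorer(256, 128)
    print(scorer(frames, attn).shape)  # torch.Size([30])
```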
3.1. Spatial Attention Module (SAM)
3.2. Graph Attention Transformer (GAT)
3.3. Bi-CARU Decoder
4. Experimental Analysis
Feature Visualisation by Video Descriptions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- de Santana Correia, A.; Colombini, E.L. Attention, please! A survey of neural attention models in deep learning. Artif. Intell. Rev. 2022, 55, 6037–6124. [Google Scholar] [CrossRef]
- Ji, Z.; Xiong, K.; Pang, Y.; Li, X. Video Summarization With Attention-Based Encoder–Decoder Networks. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1709–1717. [Google Scholar] [CrossRef]
- Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
- Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
- Zhong, S.H.; Lin, J.; Lu, J.; Fares, A.; Ren, T. Deep Semantic and Attentive Network for Unsupervised Video Summarization. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–21. [Google Scholar] [CrossRef]
- Zhang, K.; Chao, W.L.; Sha, F.; Grauman, K. Video Summarization with Long Short-Term Memory. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 766–782. [Google Scholar] [CrossRef]
- Touati, R.; Mignotte, M.; Dahmane, M. Anomaly Feature Learning for Unsupervised Change Detection in Heterogeneous Images: A Deep Sparse Residual Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2020, 13, 588–600. [Google Scholar] [CrossRef]
- Shang, R.; Chang, J.; Jiao, L.; Xue, Y. Unsupervised feature selection based on self-representation sparse regression and local similarity preserving. Int. J. Mach. Learn. Cybern. 2017, 10, 757–770. [Google Scholar] [CrossRef]
- He, X.; Hua, Y.; Song, T.; Zhang, Z.; Xue, Z.; Ma, R.; Robertson, N.; Guan, H. Unsupervised Video Summarization with Attentive Conditional Generative Adversarial Networks. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), Nice, France, 21–25 October 2019. [Google Scholar] [CrossRef]
- Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Unsupervised Video Summarization via Attention-Driven Adversarial Learning. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 492–504. [Google Scholar] [CrossRef]
- Chandak, Y.; Theocharous, G.; Kostas, J.; Jordan, S.; Thomas, P.S. Learning Action Representations for Reinforcement Learning. arXiv 2019, arXiv:1902.00183. [Google Scholar]
- Hu, M.; Hu, R.; Wang, Z.; Xiong, Z.; Zhong, R. Spatiotemporal two-stream LSTM network for unsupervised video summarization. Multimed. Tools Appl. 2022, 81, 40489–40510. [Google Scholar] [CrossRef]
- Yuan, L.; Tay, F.E.H.; Li, P.; Feng, J. Unsupervised Video Summarization With Cycle-Consistent Adversarial LSTM Networks. IEEE Trans. Multimed. 2020, 22, 2711–2722. [Google Scholar] [CrossRef]
- Saini, P.; Kumar, K.; Kashid, S.; Saini, A.; Negi, A. Video summarization using deep learning techniques: A detailed analysis and investigation. Artif. Intell. Rev. 2023, 56, 12347–12385. [Google Scholar] [CrossRef] [PubMed]
- Tian, Y.; Yang, M.; Zhang, L.; Zhang, Z.; Liu, Y.; Xie, X.; Que, X.; Wang, W. View while Moving: Efficient Video Recognition in Long-untrimmed Videos. In Proceedings of the 31st ACM International Conference on Multimedia (MM ’23), Ottawa, ON, Canada, 29 October–3 November 2023. [Google Scholar] [CrossRef]
- Chami, I.; Ying, R.; Ré, C.; Leskovec, J. Hyperbolic Graph Convolutional Neural Networks. arXiv 2019, arXiv:1910.12933. [Google Scholar]
- Spinelli, I.; Scardapane, S.; Uncini, A. Adaptive Propagation Graph Convolutional Network. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4755–4760. [Google Scholar] [CrossRef] [PubMed]
- Wang, Y.; Chen, Z.; Jiang, H.; Song, S.; Han, Y.; Huang, G. Adaptive Focus for Efficient Video Recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar] [CrossRef]
- Liu, X.; Yan, M.; Deng, L.; Li, G.; Ye, X.; Fan, D. Sampling Methods for Efficient Training of Graph Convolutional Networks: A Survey. IEEE/CAA J. Autom. Sin. 2022, 9, 205–234. [Google Scholar] [CrossRef]
- Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. arXiv 2017, arXiv:1706.02216. [Google Scholar]
- Wang, X.; Ji, H.; Shi, C.; Wang, B.; Ye, Y.; Cui, P.; Yu, P.S. Heterogeneous Graph Attention Network. In Proceedings of the World Wide Web Conference (WWW ’19), San Francisco, CA, USA, 13–17 May 2019. [Google Scholar] [CrossRef]
- Brody, S.; Alon, U.; Yahav, E. How Attentive are Graph Attention Networks? arXiv 2021, arXiv:2105.14491. [Google Scholar]
- Bo, D.; Wang, X.; Shi, C.; Shen, H. Beyond Low-frequency Information in Graph Convolutional Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 3950–3957. [Google Scholar] [CrossRef]
- Khan, A.A.; Shao, J.; Ali, W.; Tumrani, S. Content-Aware Summarization of Broadcast Sports Videos: An Audio–Visual Feature Extraction Approach. Neural Process. Lett. 2020, 52, 1945–1968. [Google Scholar] [CrossRef]
- Mehta, N.; Murala, S. Image Super-Resolution With Content-Aware Feature Processing. IEEE Trans. Artif. Intell. 2024, 5, 179–191. [Google Scholar] [CrossRef]
- Naik, B.T.; Hashmi, M.F.; Bokde, N.D. A Comprehensive Review of Computer Vision in Sports: Open Issues, Future Trends and Research Directions. Appl. Sci. 2022, 12, 4429. [Google Scholar] [CrossRef]
- Nugroho, M.A.; Woo, S.; Lee, S.; Kim, C. Audio-Visual Glance Network for Efficient Video Recognition. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar] [CrossRef]
- Yasmin, G.; Chowdhury, S.; Nayak, J.; Das, P.; Das, A.K. Key moment extraction for designing an agglomerative clustering algorithm-based video summarization framework. Neural Comput. Appl. 2021, 35, 4881–4902. [Google Scholar] [CrossRef]
- Wang, D.; Guo, X.; Tian, Y.; Liu, J.; He, L.; Luo, X. TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognit. 2023, 136, 109259. [Google Scholar] [CrossRef]
- Xu, B.; Liang, H.; Liang, R. Video summarisation with visual and semantic cues. IET Image Process. 2020, 14, 3134–3142. [Google Scholar] [CrossRef]
- Wei, H.; Ni, B.; Yan, Y.; Yu, H.; Yang, X.; Yao, C. Video Summarization via Semantic Attended Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
- Jiang, H.; Mu, Y. Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022. [Google Scholar] [CrossRef]
- Im, S.K.; Chan, K.H. Context-Adaptive-Based Image Captioning by Bi-CARU. IEEE Access 2023, 11, 84934–84943. [Google Scholar] [CrossRef]
- Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic DETR: End-to-End Object Detection with Dynamic Attention. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar] [CrossRef]
- Zhang, C.L.; Wu, J.; Li, Y. ActionFormer: Localizing Moments of Actions with Transformers. In Computer Vision, Proceedings of the ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 492–510. [Google Scholar] [CrossRef]
- Zheng, Z.; Yang, L.; Wang, Y.; Zhang, M.; He, L.; Huang, G.; Li, F. Dynamic Spatial Focus for Efficient Compressed Video Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 695–708. [Google Scholar] [CrossRef]
- Lin, Z.; Zhao, Z.; Zhang, Z.; Zhang, Z.; Cai, D. Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction. IEEE Trans. Image Process. 2020, 29, 3750–3762. [Google Scholar] [CrossRef] [PubMed]
- Lin, J.; Zhong, S.h.; Fares, A. Deep hierarchical LSTM networks with attention for video summarization. Comput. Electr. Eng. 2022, 97, 107618. [Google Scholar] [CrossRef]
- Liu, Y.T.; Li, Y.J.; Wang, Y.C.F. Transforming Multi-concept Attention into Video Summarization. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 498–513. [Google Scholar] [CrossRef]
- Li, P.; Tang, C.; Xu, X. Video summarization with a graph convolutional attention network. Front. Inf. Technol. Electron. Eng. 2021, 22, 902–913. [Google Scholar] [CrossRef]
- Chan, K.H.; Im, S.K.; Ke, W. VGGreNet: A Light-Weight VGGNet with Reused Convolutional Set. In Proceedings of the 2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC), Leicester, UK, 7–10 December 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar] [CrossRef]
- Chan, K.H.; Pau, G.; Im, S.K. Chebyshev Pooling: An Alternative Layer for the Pooling of CNNs-Based Classifier. In Proceedings of the 2021 IEEE 4th International Conference on Computer and Communication Engineering Technology (CCET), Beijing, China, 13–15 August 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Iashin, V.; Rahtu, E. Multi-modal Dense Video Captioning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. TVSum: Summarizing web videos using titles. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015. [Google Scholar] [CrossRef]
- Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating Summaries from User Videos. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 505–520. [Google Scholar] [CrossRef]
- Zhong, W.; Xiong, H.; Yang, Z.; Zhang, T. Bi-directional long short-term memory architecture for person re-identification with modified triplet embedding. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1562–1566. [Google Scholar] [CrossRef]
- Rochan, M.; Ye, L.; Wang, Y. Video Summarization Using Fully Convolutional Sequence Networks. In Lecture Notes in Computer Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 358–374. [Google Scholar] [CrossRef]
- Zhou, K.; Qiao, Y.; Xiang, T. Deep Reinforcement Learning for Unsupervised Video Summarization With Diversity-Representativeness Reward. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
- Yan, L.; Wang, Q.; Cui, Y.; Feng, F.; Quan, X.; Zhang, X.; Liu, D. GL-RG: Global-Local Representation Granularity for Video Captioning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna, Austria, 23–29 July 2022. [Google Scholar] [CrossRef]
- Gao, Y.; Hou, X.; Suo, W.; Sun, M.; Ge, T.; Jiang, Y.; Wang, P. Dual-Level Decoupled Transformer for Video Captioning. In Proceedings of the 2022 International Conference on Multimedia Retrieval (ICMR ’22), Newark, NJ, USA, 27–30 June 2022. [Google Scholar] [CrossRef]
- Li, P.; Ye, Q.; Zhang, L.; Yuan, L.; Xu, X.; Shao, L. Exploring global diverse attention via pairwise temporal relation for video summarization. Pattern Recognit. 2021, 111, 107677. [Google Scholar] [CrossRef]
- Zhu, W.; Han, Y.; Lu, J.; Zhou, J. Relational Reasoning Over Spatial-Temporal Graphs for Video Summarization. IEEE Trans. Image Process. 2022, 31, 3017–3031. [Google Scholar] [CrossRef] [PubMed]
- Ramanishka, V.; Das, A.; Zhang, J.; Saenko, K. Top-Down Visual Saliency Guided by Captions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar] [CrossRef]
| Category | Method | TVSum | SumMe |
|---|---|---|---|
| Baseline | Bi-LSTM [48] | 47.6 | 54.2 |
| | DPP-LSTM [6] | 48.6 | 54.7 |
| | SUM-FCN [49] | 57.5 | 56.8 |
| | DR-DSN [50] | 52.1 | 58.1 |
| | SUM-DeepLab [49] | 58.8 | 59.4 |
| Advanced | GL-RG [51] | 54.5 | 55.5 |
| | [52] | 55.5 | 56.3 |
| | SUM-GDA [53] | 58.9 | 52.8 |
| | RR-STG [54] | 59.4 | 53.4 |
| Our Proposed | | 58.4 | 59.6 |
| Approach | Description |
|---|---|
| Human annotation | There were two wasps on the branch. Beehives carved from beeswax. Man standing in front of wooden greenhouse. Hive on a table with a couple of bees. |
| SUM-FCN [49] | A bee hanging from a branch. Beehive made with natural beeswax. Man standing outside wooden greenhouse. Close-up of bees in a hive on a table. |
| SUM-DeepLab [49] | A bee clinging to a branch. Structure of a beehive shaped with beeswax. Background of wooden greenhouse in natural environment. Beehive on wooden surface. |
| Proposed | A wasp perched on a branch in the tree. Detailed beehive shaped with beeswax. Man standing next to rustic wooden greenhouse structure. A bee in a hive on a wooden table. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).