Video Summarization Generation Network Based on Dynamic Graph Contrastive Learning and Feature Fusion
Abstract
1. Introduction
- We design an L-G Block that cross-stacks LSTM and GCN layers, combining the LSTM's strength in modeling temporal context with the GCN's strength in learning graph structure.
- We design a dynamic graph contrastive learning mechanism that improves graph representation learning by constructing the graph structure dynamically and applying graph contrastive learning to it.
- We design a feature fusion gate that fuses the shallow-level and deep-level features of each shot to capture the semantic information in the video more fully.
- Experiments on TVSum and SumMe verify the effectiveness of the proposed method and show that it outperforms most current state-of-the-art methods.
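The contributions above name three building blocks: a dynamically constructed graph over shots, GCN-style propagation on that graph, and a gate that fuses shallow and deep shot features. The following is a minimal NumPy sketch of what such components could look like, not the authors' implementation. Everything specific in it is an assumption made for illustration: the cosine-similarity top-k rule for building the graph, the normalized propagation step taken from Kipf and Welling's GCN, the sigmoid form of the gate, and all function names, shapes, and random weights. The LSTM half of the L-G Block and the contrastive loss are omitted.

```python
import numpy as np


def cosine_knn_graph(x, k):
    """Hypothetical dynamic graph: link each shot to its k most similar shots."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)          # never pick a shot as its own neighbour
    idx = np.argsort(-sim, axis=1)[:, :k]   # indices of the k largest similarities
    adj = np.zeros(sim.shape)
    rows = np.repeat(np.arange(x.shape[0]), k)
    adj[rows, idx.ravel()] = 1.0
    return np.maximum(adj, adj.T)           # symmetrize the adjacency matrix


def gcn_layer(x, adj, weight):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ x @ weight, 0.0)


def fusion_gate(shallow, deep, w_gate, b_gate):
    """Assumed gate form: g = sigmoid([shallow, deep] W + b), output = g*shallow + (1-g)*deep."""
    z = np.concatenate([shallow, deep], axis=1) @ w_gate + b_gate
    g = 1.0 / (1.0 + np.exp(-z))
    return g * shallow + (1.0 - g) * deep


# Toy usage: 6 shots with 8-dimensional features and random (untrained) weights.
rng = np.random.default_rng(0)
shots = rng.normal(size=(6, 8))
adj = cosine_knn_graph(shots, k=2)
deep = gcn_layer(shots, adj, rng.normal(size=(8, 8)))
fused = fusion_gate(shots, deep, rng.normal(size=(16, 8)), np.zeros(8))
print(adj.shape, deep.shape, fused.shape)
```

Because the gate value lies between 0 and 1, each fused feature is a convex combination of the corresponding shallow and deep values, so the gate can lean toward either representation per dimension. In a trained model the weights would be learned rather than sampled.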
2. Related Works
2.1. Video Summarization
2.2. Graph Neural Networks
2.3. Graph Contrastive Learning
3. Method
3.1. Feature Extraction
3.2. Video Encoder
3.3. Graph Contrastive Learning
3.4. Feature Fusion
3.5. Loss Function
4. Experiments
4.1. Dataset
4.2. Experiment Setting
4.3. Evaluation of Indicators
4.4. Comparative Experiments
4.5. Ablation Experiments
4.6. Visualization Result
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Saini, P.; Kumar, K.; Kashid, S.; Saini, A.; Negi, A. Video summarization using deep learning techniques: A detailed analysis and investigation. Artif. Intell. Rev. 2023, 56, 12347–12385.
- Xu, W.; Wang, R.; Guo, X.; Li, S.; Ma, Q.; Zhao, Y.; Guo, S.; Zhu, Z.; Yan, J. Mhscnet: A multimodal hierarchical shot-aware convolutional network for video summarization. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
- Wu, J.; Zhong, S.H.; Liu, Y. MvsGCN: A novel graph convolutional network for multi-video summarization. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 827–835.
- Meena, P.; Kumar, H.; Yadav, S.K. A review on video summarization techniques. Eng. Appl. Artif. Intell. 2023, 118, 105667.
- Zhao, B.; Li, X.; Lu, X. Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7405–7414.
- Zhao, B.; Li, X.; Lu, X. TTH-RNN: Tensor-train hierarchical recurrent neural network for video summarization. IEEE Trans. Ind. Electron. 2020, 68, 3629–3637.
- Haq, H.M.U.; Asif, M.; Ahmad, M.B. Video Summarization Techniques: A Review. Int. J. Sci. Technol. Res. 2020, 9, 146–153.
- Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81.
- Zhao, B.; Li, H.; Lu, X.; Li, X. Reconstructive sequence-graph network for video summarization. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2793–2801.
- Zhong, R.; Wang, R.; Zou, Y.; Hong, Z.; Hu, M. Graph attention networks adjusted bi-LSTM for video summarization. IEEE Signal Process. Lett. 2021, 28, 663–667.
- Zhu, W.; Han, Y.; Lu, J.; Zhou, J. Relational reasoning over spatial-temporal graphs for video summarization. IEEE Trans. Image Process. 2022, 31, 3017–3031.
- Yuan, L.; Tay, F.E.H.; Li, P.; Feng, J. Unsupervised video summarization with cycle-consistent adversarial LSTM networks. IEEE Trans. Multimed. 2019, 22, 2711–2722.
- Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. Proc. IEEE 2021, 109, 1838–1863.
- Sreeja, M.; Kovoor, B.C. A multi-stage deep adversarial network for video summarization with knowledge distillation. J. Ambient Intell. Humaniz. Comput. 2023, 14, 9823–9838.
- Xiao, S.; Zhao, Z.; Zhang, Z.; Guan, Z.; Cai, D. Query-biased self-attentive network for query-focused video summarization. IEEE Trans. Image Process. 2020, 29, 5889–5899.
- Lin, J.; Zhong, S.H.; Fares, A. Deep hierarchical LSTM networks with attention for video summarization. Comput. Electr. Eng. 2022, 97, 107618.
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
- Wu, J.; Zhong, S.H.; Liu, Y. Dynamic graph convolutional network for multi-video summarization. Pattern Recognit. 2020, 107, 107382.
- Veličković, P.; Fedus, W.; Hamilton, W.L.; Liò, P.; Bengio, Y.; Hjelm, R.D. Deep Graph Infomax. ICLR 2019, 2, 4.
- Zhu, Y.; Xu, Y.; Yu, F.; Liu, Q.; Wu, S.; Wang, L. Deep graph contrastive representation learning. arXiv 2020, arXiv:2006.04131.
- Li, W.; Qi, D.; Zhang, C.; Guo, J.; Yao, J. Video summarization based on mutual information and entropy sliding window method. Entropy 2020, 22, 1285.
- Potapov, D.; Douze, M.; Harchaoui, Z.; Schmid, C. Category-specific video summarization. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 540–555.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Yasir, M.A.; Ali, Y.H. Dynamic background subtraction in video surveillance using color-histogram and fuzzy c-means algorithm with cosine similarity. Int. J. Online Biomed. Eng. 2022, 18, 74–85.
- Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5179–5187.
- Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VII 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 505–520.
- Zhang, K.; Chao, W.L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 766–782.
- Zhou, K.; Qiao, Y.; Xiang, T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 202–211.
- Huang, C.; Wang, H. A novel key-frames selection framework for comprehensive video summarization. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 577–589.
- Li, H.; Ke, Q.; Gong, M.; Zhang, R. Video joint modelling based on hierarchical transformer for co-summarization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3904–3917.
- Liu, T.; Meng, Q.; Huang, J.J.; Vlontzos, A.; Rueckert, D.; Kainz, B. Video summarization through reinforcement learning with a 3D spatio-temporal u-net. IEEE Trans. Image Process. 2022, 31, 1573–1586.
- Puthige, I.; Hussain, T.; Gupta, S.; Agarwal, M. Attention over attention: An enhanced supervised video summarization approach. Procedia Comput. Sci. 2023, 218, 2359–2368.

| Method | TVSum | SumMe |
|---|---|---|
| vsLSTM [27] | 54.2 | 37.6 |
| DPP-LSTM [27] | 54.7 | 38.6 |
| DR-DSN [28] | 58.1 | 42.1 |
| SUM-GAN [29] | 56.3 | 41.7 |
| SF-CVS [30] | 58.0 | 46.0 |
| RSGN [9] | 60.1 | 45.0 |
| GATVS [10] | 59.1 | 51.5 |
| VJMHT [31] | 60.9 | 50.6 |
| 3DST-UNet [32] | 58.3 | 47.4 |
| AoA [33] | 57.4 | 46.0 |
| DGC-FNet (ours) | 60.3 | 51.5 |

| Model | TVSum | SumMe |
|---|---|---|
| CG_model | 59.9 | 51.2 |
| DG_model | 59.0 | 51.0 |
| SG_model | 60.3 | 51.1 |
| DGC-FNet | 60.3 | 51.5 |

| Model | TVSum | SumMe |
|---|---|---|
| CON_Fusion | 58.2 | 46.5 |
| CRO_Fusion | 57.3 | 44.4 |
| CAT_Fusion | 59.8 | 50.5 |
| DGC-FNet | 60.3 | 51.5 |

| Number of Stacked Layers | TVSum | SumMe |
|---|---|---|
| 1 | 59.5 | 51.4 |
| 2 | 60.3 | 51.5 |
| 3 | 59.8 | 51.1 |
| 4 | 59.8 | 51.1 |

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, J.; Wu, G.; Bi, X.; Cui, Y. Video Summarization Generation Network Based on Dynamic Graph Contrastive Learning and Feature Fusion. Electronics 2024, 13, 2039. https://doi.org/10.3390/electronics13112039