SUM-GAN-GEA: Video Summarization Using GAN with Gaussian Distribution and External Attention
Abstract
1. Introduction
- (1) We consider, for the first time, the distribution of frame interestingness in real videos and propose a Gaussian function as a prior on the interestingness of video frames, using it to learn frame importance scores and thereby better extract the interesting segments of a video (a minimal sketch follows this list).
- (2) We use a graph-based model to learn the feature relationships between video frames, guiding the summary generator to produce better global feature representations.
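As a minimal illustration of contribution (1), the sketch below builds a Gaussian curve over frame positions and uses it as a prior on frame importance scores. The peak at the video midpoint, the `sigma_ratio` hyper-parameter, and the coupling with SmoothL1Loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gaussian_prior(num_frames: int, sigma_ratio: float = 0.2) -> torch.Tensor:
    """Gaussian prior over frame positions (assumed to peak at the video midpoint).

    sigma_ratio is a hypothetical hyper-parameter: the standard deviation
    expressed as a fraction of the video length.
    """
    t = torch.arange(num_frames, dtype=torch.float32)
    mu = (num_frames - 1) / 2.0
    sigma = sigma_ratio * num_frames
    prior = torch.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2))
    return prior / prior.max()  # scale to [0, 1], like importance scores

def prior_regularizer(pred_scores: torch.Tensor) -> torch.Tensor:
    """Penalize predicted frame-importance scores (shape [num_frames]) that
    deviate from the Gaussian prior (illustrative use of SmoothL1Loss)."""
    prior = gaussian_prior(pred_scores.shape[0]).to(pred_scores.device)
    return F.smooth_l1_loss(pred_scores, prior)
```

For example, `prior_regularizer(scores)` could be added to the generator's loss with a small weight so that the predicted scores favour centrally located, presumably more interesting frames.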
2. Related Works
2.1. Video Summarization Based on LSTM Sequence Model
2.2. Complexity of Video Summarization
2.3. Gaussian for Video Summarization
3. The Proposed Approach
3.1. External Attention Mechanism
Pseudocode 1. Pseudocode of the structure depicted in Figure 2.

Notation:
- h: the hidden state of the encoder
- linear: linear layer producing the attention energy
- attn_en: attention energy vector
- *: Hadamard product
- enc_out: the original output of the encoder
- enc_out': the final output of the encoder
- query: the Query linear layer
- enc_out_q: the output of enc_out' after the Query linear layer
- mk, mv: linear memory units
- attn: attention vector
- softmax: softmax activation function
- torch: PyTorch, a common deep learning library
- out: the output of the external attention module

h = h.transpose(0, 1)                                    # shape = [batch_size, num_layers, hidden_size]
h = h.reshape(h.shape[0], -1)                            # shape = [batch_size, num_layers × hidden_size]
attn_en = linear(h)                                      # shape = [batch_size, hidden_size]
attn_en = attn_en.unsqueeze(0)                           # shape = [1, batch_size, hidden_size]
enc_out' = attn_en * enc_out                             # shape = [seq, batch_size, hidden_size]
enc_out_q = query(enc_out')                              # shape = [seq, batch_size, hidden_size]
attn = mk(enc_out_q)                                     # shape = [seq, batch_size, num_layers]
attn = softmax(attn, dim=1)                              # shape = [seq, batch_size, num_layers]
attn = attn / torch.sum(attn, dim=2, keepdim=True)       # shape = [seq, batch_size, num_layers]
out = mv(attn)                                           # shape = [seq, batch_size, num_layers]
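As a runnable counterpart to Pseudocode 1, the following PyTorch sketch wires an external attention module [17] onto the encoder output. The memory size `mem_size`, the layer names, and the normalization over the frame dimension follow the external-attention formulation of Guo et al. [17] rather than the paper's exact hyper-parameters, so treat it as an assumption-laden illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttention(nn.Module):
    """External attention with two linear memory units (mk, mv) [17], applied to
    the encoder output as in Pseudocode 1. mem_size and the layer names are
    illustrative assumptions, not the paper's exact configuration."""

    def __init__(self, num_layers: int, hidden_size: int, mem_size: int = 64):
        super().__init__()
        self.linear = nn.Linear(num_layers * hidden_size, hidden_size)  # attention energy
        self.query = nn.Linear(hidden_size, hidden_size)
        self.mk = nn.Linear(hidden_size, mem_size, bias=False)  # memory unit M_k
        self.mv = nn.Linear(mem_size, hidden_size, bias=False)  # memory unit M_v

    def forward(self, enc_out: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # enc_out: [seq, batch, hidden_size]; h: [num_layers, batch, hidden_size]
        batch_size = h.shape[1]
        h = h.transpose(0, 1).reshape(batch_size, -1)    # [batch, num_layers * hidden]
        attn_en = self.linear(h).unsqueeze(0)            # [1, batch, hidden]
        enc_out = attn_en * enc_out                      # Hadamard product, [seq, batch, hidden]
        attn = self.mk(self.query(enc_out))              # [seq, batch, mem_size]
        attn = F.softmax(attn, dim=0)                    # normalize over frames
        attn = attn / attn.sum(dim=-1, keepdim=True)     # double normalization [17]
        return self.mv(attn)                             # [seq, batch, hidden]


# Hypothetical usage with a 2-layer encoder of hidden size 512:
# ea = ExternalAttention(num_layers=2, hidden_size=512)
# out = ea(enc_out, h)   # enc_out: [seq, batch, 512], h: [2, batch, 512]
```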
3.2. Gaussian Distribution of Frames Importance Scores
3.3. SmoothL1Loss
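For reference, SmoothL1Loss (the Huber-style loss available in PyTorch) behaves quadratically for small residuals and linearly for large ones, making it less sensitive to outliers than L2 while remaining smooth near zero. The snippet below shows the standard PyTorch call with its default beta = 1.0; the tensor sizes and beta value are illustrative, not hyper-parameters taken from the paper.

```python
import torch
import torch.nn as nn

# SmoothL1Loss: 0.5 * (x - y)**2 / beta if |x - y| < beta, else |x - y| - 0.5 * beta
criterion = nn.SmoothL1Loss(beta=1.0)

pred = torch.rand(320)    # e.g., predicted frame importance scores
target = torch.rand(320)  # e.g., reconstruction or prior targets
loss = criterion(pred, target)
```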
4. Experiments
4.1. Experiment Datasets and Evaluation Metrics
4.2. Performance Evaluation
4.2.1. F-Score Data Distribution
4.2.2. Comparison with Unsupervised Methods
4.2.3. Evaluation Using a Single Ground-Truth Summarization
4.3. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. Proc. IEEE 2021, 109, 1838–1863. [Google Scholar] [CrossRef]
- Sreeja, M.; Kovoor, B.C. A multi-stage deep adversarial network for video summarization with knowledge distillation. J. Ambient. Intell. Humaniz. Comput. 2022, 1–16. [Google Scholar] [CrossRef]
- Agyeman, R.; Muhammad, R.; Choi, G.S. Soccer Video Summarization using Deep Learning. In Proceedings of the 2nd IEEE International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 28–30 March 2019; pp. 270–273. [Google Scholar]
- Thomas, S.S.; Gupta, S.; Subramanian, V.K. Smart Surveillance Based on Video Summarization. In Proceedings of the IEEE Region 10 Symposium on Technologies for Smart Cities (TENSYMP), IEEE Kerala Sect, Kochi, India, 14–16 July 2017. [Google Scholar]
- Almeida, J.; Leite, N.J.; Torres, R.d.S. VISON: Video Summarization for ONline applications. Pattern Recognit. Lett. 2012, 33, 397–409. [Google Scholar] [CrossRef]
- Nair, M.S.; Mohan, J. VSMCNN-dynamic summarization of videos using salient features from multi-CNN model. J. Ambient. Intell. Humaniz. Comput. 2022, 1–10. [Google Scholar] [CrossRef]
- Li, X.; Liu, Y.; Wang, K.; Wang, F.-Y. A recurrent attention and interaction model for pedestrian trajectory prediction. IEEE/CAA J. Autom. Sin. 2020, 7, 1361–1370. [Google Scholar] [CrossRef]
- Liu, S.; Xia, Y.; Shi, Z.; Yu, H.; Li, Z.; Lin, J. Deep learning in sheet metal bending with a novel theory-guided deep neural network. IEEE/CAA J. Autom. Sin. 2021, 8, 565–581. [Google Scholar] [CrossRef]
- Mansour, R.F.; Escorcia-Gutierrez, J.; Gamarra, M.; Villanueva, J.A.; Leal, N. Intelligent video anomaly detection and classification using faster RCNN with deep reinforcement learning model. Image Vis. Comput. 2021, 112, 104229. [Google Scholar] [CrossRef]
- Alotaibi, M.F.; Omri, M.; Abdel-Khalek, S.; Khalil, E.; Mansour, R.F. Computational Intelligence-Based Harmony Search Algorithm for Real-Time Object Detection and Tracking in Video Surveillance Systems. Mathematics 2022, 10, 733. [Google Scholar] [CrossRef]
- Yan, X.; Hu, S.; Mao, Y.; Ye, Y.; Yu, H. Deep multi-view learning methods: A review. Neurocomputing 2021, 448, 106–129. [Google Scholar] [CrossRef]
- Paviglianiti, A.; Randazzo, V.; Villata, S.; Cirrincione, G.; Pasero, E. A Comparison of Deep Learning Techniques for Arterial Blood Pressure Prediction. Cogn. Comput. 2021, 14, 1689–1710. [Google Scholar] [CrossRef]
- Goel, T.; Murugan, R.; Mirjalili, S.; Chakrabartty, D.K. Automatic screening of COVID-19 using an optimized generative adversarial network. Cogn. Comput. 2021, 1–16. [Google Scholar] [CrossRef] [PubMed]
- Ali, G.; Ali, T.; Irfan, M.; Draz, U.; Sohail, M.; Glowacz, A.; Sulowicz, M.; Mielnik, R.; Faheem, Z.B.; Martis, C. IoT Based Smart Parking System Using Deep Long Short Memory Network. Electronics 2020, 9, 1696. [Google Scholar] [CrossRef]
- Park, S.; Kim, H. FaceVAE: Generation of a 3D Geometric Object Using Variational Autoencoders. Electronics 2021, 10, 2792. [Google Scholar] [CrossRef]
- Yang, Z.; Yu, H.; Cao, S.; Xu, Q.; Yuan, D.; Zhang, H.; Jia, W.; Mao, Z.-H.; Sun, M. Human-Mimetic Estimation of Food Volume from a Single-View RGB Image Using an AI System. Electronics 2021, 10, 1556. [Google Scholar] [CrossRef]
- Guo, M.-H.; Liu, Z.-N.; Mu, T.-J.; Hu, S.-M. Beyond self-attention: External attention using two linear layers for visual tasks. arXiv 2021, arXiv:2105.02358. [Google Scholar] [CrossRef]
- Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Unsupervised video summarization via attention-driven adversarial learning. In International Conference on Multimedia Modeling; Springer: Cham, Switzerland, 2020; pp. 492–504. [Google Scholar]
- Apostolidis, E.; Metsai, A.I.; Adamantidou, E.; Mezaris, V.; Patras, I. A stepwise, label-based approach for improving the adversarial training in unsupervised video summarization. In Proceedings of the 1st International Workshop on AI for Smart TV Content Production, Access and Delivery, Nice, France, 21 October 2019; pp. 17–25. [Google Scholar]
- Zhao, B.; Li, H.; Lu, X.; Li, X. Reconstructive Sequence-Graph Network for Video Summarization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2793–2801. [Google Scholar] [CrossRef]
- Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 2016, 3, 47–57. [Google Scholar] [CrossRef]
- Jadon, S. A survey of loss functions for semantic segmentation. In Proceedings of the 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Via del Mar, Chile, 27–29 October 2020; pp. 1–7. [Google Scholar]
- Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
- Zhang, K.; Chao, W.-L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 766–782. [Google Scholar]
- Lebron Casas, L.; Koblents, E. Video summarization with LSTM and deep attention models. In International Conference on MultiMedia Modeling; Springer: Cham, Switzerland, 2019; pp. 67–79. [Google Scholar]
- Elfeki, M.; Borji, A. Video summarization via actionness ranking. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019; pp. 754–763. [Google Scholar]
- Satorras, V.G.; Rangapuram, S.S.; Januschowski, T. Multivariate time series forecasting with latent graph inference. arXiv 2022, arXiv:2203.03423. [Google Scholar]
- Mao, F.; Wu, X.; Xue, H.; Zhang, R. Hierarchical video frame sequence representation with deep convolutional graph network. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 262–270. [Google Scholar]
- Li, P.; Tang, C.; Xu, X. Video summarization with a graph convolutional attention network. Front. Inf. Technol. Electron. Eng. 2021, 22, 902–913. [Google Scholar] [CrossRef]
- Ou, S.-H.; Lee, C.-H.; Somayazulu, V.-S.; Chen, Y.-K.; Chien, S.-Y. Low complexity on-line video summarization with Gaussian mixture model based clustering. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 1260–1264. [Google Scholar] [CrossRef]
- Valdes, V.; Martinez, J.M. On-line video summarization based on signature-based junk and redundancy filtering. In Proceedings of the 2008 Ninth International Workshop on Image Analysis for Multimedia Interactive Services, Klagenfurt, Austria, 7–9 May 2008; pp. 88–91. [Google Scholar]
- Ma, M.; Mei, S.; Wan, S. Nonlinear Block Sparse Dictionary Selection for Video Summarization. J. Xi’an Jiaotong Univ. 2019, 53, 142–148. (In Chinese) [Google Scholar] [CrossRef]
- Jadon, S.; Jasim, M. Unsupervised video summarization framework using keyframe extraction and video skimming. In Proceedings of the 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, 30–31 October 2020; pp. 140–145. [Google Scholar]
- Laganière, R.; Bacco, R.; Hocevar, A.; Lambert, P.; Ionescu, B.E. Video summarization from spatio-temporal features. In Proceedings of the 2nd ACM Workshop on Video Summarization, TVS 2008, Vancouver, BC, Canada, 31 October 2008. [Google Scholar]
- Zhang, Y.; Wei, Z.; Zhao, Z.; Song, X.; Fu, L. A gaussian video summarization method using video frames similarity function. ICIC Express Lett. 2013, 7, 1997–2003. [Google Scholar]
- Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial LSTM networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 202–211. [Google Scholar]
- Jiang, J.; Zhang, X.-P. Gaussian mixture vector quantization-based video summarization using independent component analysis. In Proceedings of the 2010 IEEE International Workshop on Multimedia Signal Processing, Saint-Malo, France, 4–6 October 2010; pp. 443–448. [Google Scholar]
- Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of adam and beyond. arXiv 2019, arXiv:1904.09237. [Google Scholar]
- Gygli, M.; Grabner, H.; Riemenschneider, H.; Gool, L.V. Creating summaries from user videos. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 505–520. [Google Scholar]
- Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5179–5187. [Google Scholar]
- Kaufman, D.; Levi, G.; Hassner, T.; Wolf, L. Temporal tessellation: A unified approach for video analysis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 94–104. [Google Scholar]
- Zhou, K.; Qiao, Y.; Xiang, T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Rochan, M.; Wang, Y. Video summarization by learning from unpaired data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7902–7911. [Google Scholar]
- Yaliniz, G.; Ikizler-Cinbis, N. Unsupervised Video Summarization with Independently Recurrent Neural Networks. In Proceedings of the 2019 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 24–26 April 2019. [Google Scholar]
- Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. AC-SUM-GAN: Connecting actor-critic and generative adversarial networks for unsupervised video summarization. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3278–3292. [Google Scholar] [CrossRef]
- Liang, G.; Lv, Y.; Li, S.; Zhang, S.; Zhang, Y. Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network. arXiv 2021, arXiv:2105.11131. [Google Scholar]
- Jung, Y.; Cho, D.; Kim, D.; Woo, S.; Kweon, I.S. Discriminative feature learning for unsupervised video summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8537–8544. [Google Scholar]
- Li, P.; Ye, Q.; Zhang, L.; Yuan, L.; Xu, X.; Shao, L. Exploring global diverse attention via pairwise temporal relation for video summarization. Pattern Recognit. 2021, 111, 107677. [Google Scholar] [CrossRef]
- Wei, H.; Ni, B.; Yan, Y.; Yu, H.; Yang, X.; Yao, C. Video summarization via semantic attended networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Zhang, Y.; Kampffmeyer, M.; Zhao, X.; Tan, M. Dtr-gan: Dilated temporal relational adversarial network for video summarization. In Proceedings of the ACM Turing Celebration Conference-China, Chengdu, China, 17–19 May 2019; pp. 1–6. [Google Scholar]
- Ji, Z.; Xiong, K.; Pang, Y.; Li, X. Video summarization with attention-based encoder–decoder networks. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1709–1717. [Google Scholar] [CrossRef]
- Fu, T.-J.; Tai, S.-H.; Chen, H.-T. Attentive and adversarial learning for video summarization. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019; pp. 1579–1587. [Google Scholar]
- Yuan, L.; Tay, F.E.; Li, P.; Zhou, L.; Feng, J. Cycle-SUM: Cycle-consistent adversarial LSTM networks for unsupervised video summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 9143–9150. [Google Scholar]
Comparison with state-of-the-art unsupervised methods on SumMe and TVSum (F-score). A (+)/(−) sign marks a score higher/lower than that of SUM-GAN-GEA.

Methods | SumMe | TVSum | Average
---|---|---|---
Tessellation [41] | 41.4(−) | 64.1(+) | 52.8(−)
DR-DSN [42] | 41.4(−) | 57.6(−) | 49.5(−)
UnpairedVSN [43] | 47.5(−) | 55.6(−) | 51.6(−)
SUM-IndLU [44] | 51.9(−) | 61.5(+) | 56.7(−)
AC-SUM-GAN [45] | 50.8(−) | 60.6(−) | 55.7(−)
CAAN [46] | 50.8(−) | 59.6(−) | 55.2(−)
CSNet [47] | 51.3(−) | 58.8(−) | 55.1(−)
SUM-GDA [48] | 50.0(−) | 59.6(−) | 54.8(−)
SUM-GAN-sl [19] | 47.3(−) | 58.0(−) | 52.7(−)
SUM-GAN-AAE [18] | 48.9(−) | 58.3(−) | 53.6(−)
SUM-GAN-GEA (Ours) | 53.4 | 61.3 | 57.4
Evaluation using a single ground-truth summarization (F-score). A (−) sign marks a score lower than that of SUM-GAN-GEA.

Methods | SumMe | TVSum
---|---|---
* SUM-GAN [36] | 38.7(−) | 50.8(−)
* SUM-GANdpp [36] | 39.1(−) | 51.7(−)
SUM-GANsup [36] | 41.7(−) | 56.3(−)
SASUM [49] | 45.3(−) | 58.2(−)
DTR-GAN [50] | 44.6(−) | 59.1(−)
A-AVS [51] | 43.9(−) | 59.4(−)
M-AVS [51] | 44.4(−) | 61.0(−)
AALVS [52] | 46.2(−) | 63.6(−)
* Cycle-SUM [53] | 41.9(−) | 57.6(−)
* SUM-GAN-sl [19] | 46.8(−) | 65.3(−)
* SUM-GAN-AAE [18] | 56.9(−) | 63.9(−)
* SUM-GAN-GEA (Ours) | 65.9 | 68.2
Ablation study on SumMe and TVSum: F-score and time (s) for each combination of SmoothL1Loss, external attention, and the Gaussian distribution.

Experiments | SmoothL1Loss | External Attention | Gaussian Distribution | SumMe F-Score / Time (s) | TVSum F-Score / Time (s)
---|---|---|---|---|---
SUM-GAN-AAE | | | | 48.9 / 45.2 | 58.3 / 145.8
Exp1 | √ | | | 50.1 / 45.4 | 60.3 / 151.4
Exp2 | | √ | | 50.5 / 37.8 | 59.9 / 123.8
Exp3 | | | √ | 53.4 / 44.6 | 61.2 / 145.4
Exp4 | √ | √ | | 51.0 / 37.0 | 60.5 / 118.8
Exp5 | √ | | √ | 53.4 / 46.0 | 61.2 / 146.8
Exp6 | | √ | √ | 53.4 / 34.6 | 61.1 / 113
Exp7 (SUM-GAN-GEA) | √ | √ | √ | 53.4 / 34.0 | 61.3 / 112.6
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yu, Q.; Yu, H.; Wang, Y.; Pham, T.D. SUM-GAN-GEA: Video Summarization Using GAN with Gaussian Distribution and External Attention. Electronics 2022, 11, 3523. https://doi.org/10.3390/electronics11213523