A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
Abstract
1. Introduction
- We propose MiLoRA-ViSum, a novel framework that introduces a mixture of LoRA experts to achieve an advanced dual temporal–spatial adaptation mechanism tailored specifically for video summarization tasks. By integrating specialized LoRA experts for temporal and spatial layers, our approach ensures efficient and precise feature capture across video data (a minimal sketch of this mechanism follows the contribution list).
- We demonstrate significant methodological advancements by leveraging the mixture of LoRA experts to dynamically optimize both temporal attention layers and spatial convolutional layers within the Video-LLaMA backbone. This dual adaptation mechanism enhances the ability to model complex temporal–spatial dependencies while maintaining computational efficiency.
- We provide a comprehensive empirical evaluation of MiLoRA-ViSum on two widely recognized video summarization datasets, VideoXum and ActivityNet, conducting a thorough comparison with existing state-of-the-art models. Our results indicate that MiLoRA-ViSum achieves competitive performance while reducing the number of trainable parameters significantly, underscoring its scalability and practicality for real-world applications.
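To make the dual adaptation mechanism concrete, the following is a minimal PyTorch sketch of a mixture-of-LoRA-experts adapter wrapping a frozen linear layer. The expert count, rank, scaling factor, and token-wise softmax gating are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiLoRALayer(nn.Module):
    """Mixture-of-LoRA-experts adapter around a frozen linear layer (sketch).

    Each expert is a low-rank (A, B) pair; a learned gate mixes the experts
    per token. Hyperparameters here are illustrative assumptions.
    """
    def __init__(self, base: nn.Linear, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # the pre-trained weights stay frozen
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)  # down-projections
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))        # up-projections, zero-init
        self.gate = nn.Linear(d_in, num_experts)                            # token-wise gating
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in)
        weights = F.softmax(self.gate(x), dim=-1)              # (B, S, E) expert mixture weights
        h = torch.einsum("bsi,eir->bser", x, self.A)           # per-expert down-projection
        h = torch.einsum("bser,ero->bseo", h, self.B)          # per-expert up-projection
        delta = torch.einsum("bse,bseo->bso", weights, h)      # gate-weighted sum over experts
        return self.base(x) + self.scaling * delta
```

Under this design, only the expert matrices and the gate are trainable, which is what keeps the trainable-parameter fraction low; separate expert pools would be attached to the temporal attention layers and the spatial layers.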
2. Related Work
2.1. Video Summarization
2.2. LoRA in Vision-Language Models
3. Method
3.1. MiLoRA-ViSum Model Architecture
Algorithm 1: MiLoRA-ViSum
3.2. Integrated Temporal–Spatial Adaptation and Optimization for Video Summarization
3.3. Optimization and Loss Function
3.4. Model Integration and Training Strategy
- Pre-training on a Large-Scale Dataset: The base Video-LLaMA model is pre-trained on a large-scale video dataset to learn general spatio-temporal representations. This step initializes the model with robust feature extraction capabilities.
- Expert Specialization and Fine-Tuning: After integrating MiLoRA, the mixture of experts is fine-tuned on specific video summarization datasets, such as VideoXum and ActivityNet. During this stage, the gating functions are optimized to dynamically activate the most relevant experts, ensuring efficient adaptation to diverse video content.
- Regularization and Early Stopping: Regularization terms are applied to prevent overfitting, and early stopping is employed based on the validation loss. This ensures that the model generalizes well to unseen data (a minimal training-loop sketch follows this list).
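A minimal sketch of this fine-tuning stage is given below. It assumes a HuggingFace-style model whose forward pass returns a `.loss`, and that only the adapter and gate parameters remain trainable; the loss formulation, optimizer, and hyperparameters are assumptions, not the paper's exact recipe.

```python
import torch

def finetune(model, train_loader, val_loader, epochs: int = 20,
             lr: float = 1e-4, weight_decay: float = 0.01, patience: int = 3):
    """Illustrative fine-tuning loop: updates only trainable (adapter/gate)
    parameters, applies L2 regularization via weight decay, and early-stops
    on validation loss."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
    best_val, stale = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            loss = model(**batch).loss      # assumes an HF-style output with .loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(**b).loss.item() for b in val_loader) / len(val_loader)
        if val_loss < best_val:             # keep the best checkpoint
            best_val, stale = val_loss, 0
            torch.save(model.state_dict(), "best.pt")
        else:
            stale += 1
            if stale >= patience:           # early stopping on validation loss
                break
```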
4. Experimental Setup
4.1. Baseline Models
4.2. Experimental Environment and Datasets
4.3. Training Procedure and Parameter Optimization
4.4. Evaluation Metrics and Validation
5. Results and Analysis
5.1. Overall Performance and Comparison
5.2. Comparison with State-of-the-Art (SOTA) and Scalability
5.3. Independent Analysis: Impact of Mixture of LoRA Experts on Temporal and Spatial Adaptations
Experimental Setup
- Temporal-only MiLoRA: MiLoRA experts are applied exclusively to temporal attention layers.
- Spatial-only MiLoRA: MiLoRA experts are applied exclusively to spatial convolutional layers.
- Combined MiLoRA: MiLoRA experts are applied jointly to both temporal and spatial layers (a configuration sketch follows this list).
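One plausible way to realize these three configurations is to restrict which modules receive the `MiLoRALayer` adapter from the earlier sketch. The module-name patterns below are hypothetical (actual Video-LLaMA layer names may differ), and for simplicity the sketch wraps only linear projections; spatial convolutions proper would need a `Conv2d` variant of the adapter.

```python
import re
import torch.nn as nn

# Hypothetical module-name patterns; real Video-LLaMA layer names may differ.
CONFIGS = {
    "temporal_only": [r"temporal_attn\.(q|k|v|o)_proj"],
    "spatial_only":  [r"spatial_proj"],
    "combined":      [r"temporal_attn\.(q|k|v|o)_proj", r"spatial_proj"],
}

def apply_milora(model: nn.Module, config: str, **lora_kwargs) -> nn.Module:
    """Wrap every nn.Linear whose qualified name matches the chosen
    configuration's patterns with a MiLoRALayer (see the earlier sketch)."""
    patterns = [re.compile(p + "$") for p in CONFIGS[config]]
    # Collect targets first so the module tree is not mutated mid-iteration.
    targets = [(name, mod) for name, mod in model.named_modules()
               if isinstance(mod, nn.Linear) and any(p.search(name) for p in patterns)]
    for name, linear in targets:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, MiLoRALayer(linear, **lora_kwargs))
    return model
```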
5.4. Comparison with Related Works
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Saini, P.; Kumar, K.; Kashid, S.; Saini, A.; Negi, A. Video summarization using deep learning techniques: A detailed analysis and investigation. Artif. Intell. Rev. 2023, 56, 12347–12385.
2. He, B.; Wang, J.; Qiu, J.; Bui, T.; Shrivastava, A.; Wang, Z. Align and attend: Multimodal summarization with dual contrastive losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14867–14878.
3. Jangra, A.; Mukherjee, S.; Jatowt, A.; Saha, S.; Hasanuzzaman, M. A survey on multi-modal summarization. ACM Comput. Surv. 2023, 55, 1–36.
4. Elfeki, M.; Wang, L.; Borji, A. Multi-stream dynamic video summarization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 339–349.
5. Zhang, H.; Li, X.; Bing, L. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv 2023, arXiv:2306.02858.
6. Lin, J.; Hua, H.; Chen, M.; Li, Y.; Hsiao, J.; Ho, C.; Luo, J. VideoXum: Cross-modal visual and textural summarization of videos. IEEE Trans. Multimed. 2023, 26, 5548–5560.
7. Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970.
8. Rennard, V.; Shang, G.; Hunter, J.; Vazirgiannis, M. Abstractive meeting summarization: A survey. Trans. Assoc. Comput. Linguist. 2023, 11, 861–884.
9. Potapov, D.; Douze, M.; Harchaoui, Z.; Schmid, C. Category-specific video summarization. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI. Springer: Cham, Switzerland, 2014; pp. 540–555.
10. Yang, L.; Zheng, Z.; Han, Y.; Song, S.; Huang, G.; Li, F. OStr-DARTS: Differentiable Neural Architecture Search Based on Operation Strength. IEEE Trans. Cybern. 2024, 54, 6559–6572.
11. Selva, J.; Johansen, A.S.; Escalera, S.; Nasrollahi, K.; Moeslund, T.B.; Clapés, A. Video transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12922–12943.
12. Yang, L.; Jiang, H.; Cai, R.; Wang, Y.; Song, S.; Huang, G.; Tian, Q. CondenseNet v2: Sparse feature reactivation for deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3569–3578.
13. Zhang, K.; Chao, W.L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII. Springer: Cham, Switzerland, 2016; pp. 766–782.
14. Zheng, Z.; Yang, L.; Wang, Y.; Zhang, M.; He, L.; Huang, G.; Li, F. Dynamic spatial focus for efficient compressed video action recognition. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 695–708.
15. Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. Proc. IEEE 2021, 109, 1838–1863.
16. Ma, Y.F.; Lu, L.; Zhang, H.J.; Li, M. A user attention model for video summarization. In Proceedings of the Tenth ACM International Conference on Multimedia, New York, NY, USA, 1–6 December 2002; pp. 533–542.
17. Haq, H.B.U.; Asif, M.; Ahmad, M.B. Video summarization techniques: A review. Int. J. Sci. Technol. Res. 2020, 9, 146–153.
18. Chu, W.S.; Song, Y.; Jaimes, A. Video co-summarization: Video summarization by visual co-occurrence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3584–3592.
19. Zanella, M.; Ben Ayed, I. Low-Rank Few-Shot Adaptation of Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1593–1603.
20. Lu, H.; Zhao, C.; Xue, J.; Yao, L.; Moore, K.; Gong, D. Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA. arXiv 2024, arXiv:2412.01004.
21. Gou, Y.; Liu, Z.; Chen, K.; Hong, L.; Xu, H.; Li, A.; Yeung, D.Y.; Kwok, J.T.; Zhang, Y. Mixture of cluster-conditional LoRA experts for vision-language instruction tuning. arXiv 2023, arXiv:2312.12379.
22. Elgendy, H.; Sharshar, A.; Aboeitta, A.; Ashraf, Y.; Guizani, M. GeoLLaVA: Efficient fine-tuned vision-language models for temporal change detection in remote sensing. arXiv 2024, arXiv:2410.19552.
23. Jiang, Z.; Meng, R.; Yang, X.; Yavuz, S.; Zhou, Y.; Chen, W. VLM2Vec: Training vision-language models for massive multimodal embedding tasks. arXiv 2024, arXiv:2410.05160.
24. Chen, S.; Gu, J.; Han, Z.; Ma, Y.; Torr, P.; Tresp, V. Benchmarking robustness of adaptation methods on pre-trained vision-language models. Adv. Neural Inf. Process. Syst. 2024, 36, 51758–51777.
25. Laurençon, H.; Tronchon, L.; Cord, M.; Sanh, V. What matters when building vision-language models? arXiv 2024, arXiv:2405.02246.
26. Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. Adv. Neural Inf. Process. Syst. 2024, 36, 61501–61513.
27. Chen, J.; Lv, Z.; Wu, S.; Lin, K.Q.; Song, C.; Gao, D.; Liu, J.W.; Gao, Z.; Mao, D.; Shou, M.Z. VideoLLM-online: Online Video Large Language Model for Streaming Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18407–18418.
28. Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
29. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv 2019, arXiv:1904.09675.
30. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72.
31. Post, M. A call for clarity in reporting BLEU scores. arXiv 2018, arXiv:1804.08771.
32. Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, San Diego, CA, USA, 24–27 March 2002; pp. 138–145.
33. Alam, M.J.; Hossain, I.; Puppala, S.; Talukder, S. Advancements in Multimodal Social Media Post Summarization: Integrating GPT-4 for Enhanced Understanding. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2–4 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1934–1940.
34. Son, J.; Park, J.; Kim, K. CSTA: CNN-based Spatiotemporal Attention for Video Summarization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18847–18856.
| Model | Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | METEOR | SacreBLEU | NIST |
|---|---|---|---|---|---|---|---|---|
| Video-LLaMA (Baseline) | VideoXum | 47.32 | 28.75 | 45.61 | 0.876 | 0.322 | 20.54 | 7.10 |
| Alam et al. [33] | VideoXum | 51.50 | 32.10 | 49.50 | 0.894 | 0.353 | 24.12 | 8.24 |
| MiLoRA-ViSum | VideoXum | 52.86 | 33.95 | 50.19 | 0.909 | 0.362 | 25.26 | 8.93 |
| Video-LLaMA (Baseline) | ActivityNet | 46.10 | 27.65 | 44.80 | 0.872 | 0.310 | 19.87 | 6.88 |
| Alam et al. [33] | ActivityNet | 50.00 | 31.00 | 48.50 | 0.890 | 0.348 | 23.56 | 8.10 |
| MiLoRA-ViSum | ActivityNet | 51.72 | 32.14 | 48.62 | 0.911 | 0.352 | 24.29 | 8.42 |
| Model | Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | METEOR | SacreBLEU | NIST |
|---|---|---|---|---|---|---|---|---|
| He et al. [2] | VideoXum | 50.12 | 31.05 | 48.02 | 0.888 | 0.344 | 24.21 | 8.12 |
| Son et al. [34] | ActivityNet | 49.50 | 30.10 | 47.80 | 0.882 | 0.336 | 22.34 | 7.86 |
| MiLoRA-ViSum | VideoXum | 52.86 | 33.95 | 50.19 | 0.909 | 0.362 | 25.26 | 8.93 |
| MiLoRA-ViSum | ActivityNet | 51.72 | 32.14 | 48.62 | 0.911 | 0.352 | 24.29 | 8.42 |
| Model | Dataset | Training Time (Hours) | Inference Latency (ms) | Trainable Params (%) |
|---|---|---|---|---|
| He et al. [2] | VideoXum | 55 | 130 | 80 |
| Son et al. [34] | ActivityNet | 50 | 120 | 60 |
| MiLoRA-ViSum | VideoXum | 42 | 113 | 18 |
| MiLoRA-ViSum | ActivityNet | 42 | 112 | 17 |
| Configuration | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | METEOR | SacreBLEU | NIST |
|---|---|---|---|---|---|---|---|
| Temporal-only MiLoRA | 48.95 | 29.45 | 46.70 | 0.884 | 0.340 | 22.89 | 7.68 |
| Spatial-only MiLoRA | 49.12 | 29.78 | 47.05 | 0.885 | 0.341 | 23.14 | 7.81 |
| Combined MiLoRA | 52.86 | 33.95 | 50.19 | 0.911 | 0.362 | 25.26 | 8.93 |
| Configuration | Training Time (Hours) | Inference Latency (ms) | Trainable Params (%) |
|---|---|---|---|
| Temporal-only MiLoRA | 35 | 105 | 10 |
| Spatial-only MiLoRA | 36 | 107 | 10 |
| Combined MiLoRA | 42 | 113 | 18 |