Article

A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts

1 School of Information and Network Security, People’s Public Security University of China, Beijing 102206, China
2 Key Laboratory of Security Prevention Technology and Risk Assessment of the Ministry of Public Security, Beijing 100038, China
3 School of Criminal Investigation, People’s Public Security University of China, Beijing 102206, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2269; https://doi.org/10.3390/electronics14112269
Submission received: 7 January 2025 / Revised: 2 February 2025 / Accepted: 4 February 2025 / Published: 31 May 2025

Abstract: With the exponential growth of user-generated content on video-sharing platforms, facilitating efficient search and browsing of videos has garnered significant attention. To help users swiftly locate and review relevant videos, the creation of concise and informative video summaries has become increasingly important. Video-LLaMA is an effective tool for generating video summaries, but it does not jointly model and optimize temporal and spatial features, and it requires substantial computational resources and training time. We therefore propose MiLoRA-ViSum, which more efficiently captures the complex temporal dynamics and spatial relationships inherent in video data while limiting the number of trainable parameters. By extending traditional Low-Rank Adaptation (LoRA) into a mixture-of-experts paradigm, MiLoRA-ViSum incorporates a dual temporal–spatial adaptation mechanism tailored specifically to video summarization tasks. This approach dynamically integrates specialized LoRA experts, each fine-tuned to address distinct temporal or spatial dimensions. Extensive evaluations on the VideoXum and ActivityNet datasets demonstrate that MiLoRA-ViSum achieves the best summarization performance among the compared state-of-the-art models, while maintaining significantly lower computational costs. The proposed mixture-of-experts strategy, combined with the dual adaptation mechanism, highlights the model’s potential to enhance video summarization capabilities, particularly in large-scale applications requiring both efficiency and precision.

1. Introduction

Video summarization, which involves distilling lengthy video content into concise and informative summaries, has become increasingly significant due to the exponential growth of video data across various domains. Applications such as content browsing, video analytics, and archival management require efficient summarization techniques, as manually reviewing entire video datasets is often unfeasible [1,2,3]. Traditional summarization methods, predominantly relying on heuristic strategies like clustering or keyframe extraction, are computationally efficient but lack the ability to generalize effectively across diverse video content. These approaches often fail to capture the complex, context-dependent information embedded within videos, as they primarily focus on either temporal or spatial features in isolation rather than addressing the inherent interplay between the two [4]. This limitation results in summaries that are often incomplete or fail to convey the essential semantics of the original content. The increasing complexity of video datasets necessitates models that can process and unify temporal and spatial features in a computationally efficient manner, a challenge that remains inadequately addressed by existing approaches.
Recent advancements in deep learning have greatly enhanced video summarization capabilities, with transformer-based models emerging as a cornerstone of these developments. Among such models, Video-LLaMA has shown significant promise due to its ability to effectively capture long-range dependencies and complex temporal dynamics embedded within video sequences. This capacity enables it to process extensive video content while maintaining semantic coherence and temporal consistency. However, despite its notable performance, the high computational costs and extensive resource requirements for training and fine-tuning these large-scale models have posed significant barriers to their practical deployment. Addressing these limitations necessitates an innovative approach that balances the need for high performance with computational efficiency. In this context, we propose transitioning from traditional Low-Rank Adaptation (LoRA) to a Mixture of LoRA Experts (MiLoRA) framework, which distributes computational workloads across specialized low-rank adaptation modules. This approach not only retains the advantages of Video-LLaMA but also introduces a scalable and resource-efficient solution for real-world applications [5].
To address these challenges, we propose a novel approach to video summarization, named MiLoRA-ViSum, which leverages a Mixture of Low-Rank Adaptation (MiLoRA) modules integrated into the Video-LLaMA backbone. MiLoRA extends the conventional LoRA framework by employing multiple specialized low-rank adaptation experts, each tailored to capture distinct aspects of temporal and spatial features in video data. These experts dynamically collaborate during training and inference, enabling the model to efficiently adapt to the complex dependencies inherent in video sequences. By distributing the adaptation workload across multiple experts, MiLoRA-ViSum not only enhances the model’s adaptability but also significantly reduces the number of trainable parameters while maintaining competitive performance. This approach ensures the generation of high-quality video summaries with minimized computational overhead, addressing the practical constraints of large-scale video summarization tasks.
In this paper, we evaluate the performance of MiLoRA-ViSum, a model that integrates the Mixture of Low-Rank Adaptation (MiLoRA) framework, using two widely recognized and diverse video summarization datasets, VideoXum [6] and ActivityNet [7]. These datasets were chosen for their ability to present a range of challenging scenarios, encompassing varied video content and temporal–spatial complexities, thereby providing a robust benchmark for assessing the proposed method’s adaptability and effectiveness. Through extensive experimentation, MiLoRA-ViSum demonstrates its capability to generate high-quality video summaries while achieving computational efficiency. Notably, the results reveal that MiLoRA-ViSum not only matches but, in several cases, surpasses the performance of state-of-the-art models in terms of summarization quality. This is accomplished with a significantly reduced computational footprint, underscoring the method’s potential for scalable and efficient deployment in real-world applications. These findings highlight the practical advantages of incorporating multiple specialized low-rank adaptation experts to optimize temporal and spatial feature learning, validating MiLoRA-ViSum as a promising advancement in video summarization research.
Specifically, MiLoRA-ViSum extends the concept of Low-Rank Adaptation (LoRA) by introducing a mixture-of-LoRA-experts framework with a dual adaptation mechanism tailored to video summarization tasks. Unlike traditional LoRA implementations that apply a single low-rank adaptation, MiLoRA-ViSum incorporates multiple specialized LoRA experts, each fine-tuned for distinct aspects of temporal and spatial feature processing. This multi-expert system dynamically selects and integrates the most relevant adaptations based on the video content, enabling the model to better capture the complex dependencies and interactions inherent in video data. The dual adaptation mechanism further enhances the framework by applying specialized LoRA adaptations to both the temporal attention layers and the spatial convolutional layers of the Video-LLaMA backbone. This comprehensive approach allows the model to process both the sequential patterns and spatial structures present in video sequences, achieving a holistic and computationally efficient representation. To the best of our knowledge, MiLoRA-ViSum is the first work to integrate a mixture-of-experts strategy within the LoRA framework for simultaneous temporal and spatial fine-tuning, marking a significant methodological advancement in video summarization research.
  In summary, our key contributions are as follows:
  • We propose MiLoRA-ViSum, a novel framework that introduces a mixture of LoRA experts to achieve an advanced dual temporal–spatial adaptation mechanism tailored specifically for video summarization tasks. By integrating specialized LoRA experts for temporal and spatial layers, our approach ensures efficient and precise feature capture across video data.
  • We demonstrate significant methodological advancements by leveraging the mixture of LoRA experts to dynamically optimize both temporal attention layers and spatial convolutional layers within the Video-LLaMA backbone. This dual adaptation mechanism enhances the ability to model complex temporal–spatial dependencies while maintaining computational efficiency.
  • We provide a comprehensive empirical evaluation of MiLoRA-ViSum on two widely recognized video summarization datasets, VideoXum and ActivityNet, conducting a thorough comparison with existing state-of-the-art models. Our results indicate that MiLoRA-ViSum achieves competitive performance while reducing the number of trainable parameters significantly, underscoring its scalability and practicality for real-world applications.

2. Related Work

2.1. Video Summarization

Video summarization has become a critical research area within computer vision, driven by the exponential increase in video content across various platforms and domains. This task focuses on extracting concise, representative summaries from extensive video sequences, enabling efficient content consumption for applications such as video analytics, content recommendation systems, and archival retrieval [8]. Traditional video summarization methods often rely on heuristic approaches, including clustering techniques and keyframe extraction algorithms. These methods aim to identify visually distinct or statistically significant frames within a video [9,10]. While such approaches are computationally efficient and have seen widespread adoption in earlier systems, their reliance on simple rules and thresholds frequently limits their ability to capture complex patterns in real-world video data. These methods often fail to generalize across diverse video types and typically overlook the contextual information and temporal relationships embedded within videos, resulting in summaries that are disjointed or incomplete [11,12].
The advent of deep learning has revolutionized video summarization by introducing methods capable of modeling spatio-temporal dependencies in a more nuanced and data-driven manner. Early attempts in this space employed Convolutional Neural Networks (CNNs) to capture spatial features and Recurrent Neural Networks (RNNs) to model temporal dynamics. These architectures demonstrated improved performance compared to heuristic-based methods by leveraging large-scale datasets to learn meaningful feature representations. For example, RNNs, such as Long Short-Term Memory (LSTM) networks [13,14], have been effective in capturing temporal sequences, enabling a better understanding of video narratives and transitions [15]. However, these methods are often constrained by their limited ability to process long video sequences due to vanishing gradient problems and high computational costs associated with recurrent architectures. As a result, while CNN-RNN-based methods represent a step forward, they still fall short in addressing the full spectrum of challenges in video summarization, particularly for complex, dynamic video content [16].
More recently, transformer-based models, known for their self-attention mechanisms, have emerged as a promising solution to the limitations of earlier methods. These models excel at capturing long-range dependencies and hierarchical relationships in sequential data, making them well-suited for video summarization tasks [15]. Models such as Video-LLaMA leverage transformers to process both temporal and spatial features in a unified manner, enabling them to capture intricate dependencies across frames while maintaining contextual coherence. However, despite their impressive performance improvements, transformer-based models are associated with high computational costs and substantial resource requirements during training and inference [17]. These constraints hinder their deployment in resource-limited environments and large-scale real-world applications. Addressing these challenges requires innovative strategies that balance computational efficiency with the ability to capture rich spatio-temporal relationships [18]. Consequently, recent research efforts have focused on developing more efficient transformer architectures, parameter reduction techniques, and hybrid models to make video summarization more accessible and scalable without compromising performance [1].

2.2. LoRA in Vision-Language Models

Low-rank adaptation (LoRA) has emerged as a key technique in vision–language models, providing a mechanism to significantly reduce computational overhead while retaining model performance. Originally introduced for efficiently fine-tuning language models, LoRA works by introducing low-rank updates to the weight matrices of pre-trained models, which allows for task-specific adaptation without retraining the entire network. This approach has proven particularly beneficial in the context of vision-language models, where the computational complexity is often exacerbated by the need to process both visual and textual inputs. For instance, in VideoLLM-online, LoRA was successfully employed to adapt pre-trained large language models for streaming video understanding tasks. This integration allowed for real-time processing of video streams [19], reducing the memory and computational requirements associated with traditional fine-tuning methods. By focusing on low-rank updates, the approach enables the model to generalize effectively across tasks while maintaining high accuracy and efficiency [20].
Recent advancements in LoRA-based techniques have further demonstrated their versatility in vision–language tasks, particularly in models such as Video-LLaMA, where LoRA facilitates the integration of visual and textual modalities. In Video-LLaMA, LoRA enables the fine-tuning of pre-trained language models to incorporate visual information, bridging the gap between the two modalities [21]. This model excels in tasks such as video question answering and caption generation, where understanding both spatial and temporal visual features is critical. By leveraging LoRA, these models manage to adapt pre-trained architectures to complex video-based tasks with minimal additional computational cost. However, despite their success, these applications primarily focus on tasks involving generalized video understanding rather than the more specialized task of video summarization. This highlights a gap in the literature where LoRA’s potential for optimizing both temporal and spatial adaptations in the context of summarization remains largely untapped [22].
Our work seeks to address this gap by extending the application of LoRA to video summarization tasks, with a specific focus on optimizing both temporal and spatial feature representations. While previous studies have demonstrated the utility of LoRA in simplifying model adaptation for vision-language tasks [23], they often overlook the unique challenges posed by video summarization. For instance, summarization requires not only understanding the visual content in individual frames but also capturing the temporal progression and contextual transition throughout a video. To address this, we introduce a novel approach that incorporates a Mixture of LoRA Experts (MiLoRA). Unlike traditional LoRA methods that apply uniform low-rank adaptations across layers, MiLoRA employs specialized adaptations for different components of the video summarization pipeline. This includes tailoring LoRA modules to temporal attention mechanisms for modeling long-range dependencies and spatial convolutional layers for capturing localized visual details. By leveraging this mixture of experts, our method provides a more granular and effective adaptation strategy, enabling the generation of concise and coherent summaries while maintaining computational efficiency [24]. This approach represents a significant advancement in the application of LoRA to specialized video tasks, paving the way for broader adoption in computationally constrained environments.

3. Method

3.1. MiLoRA-ViSum Model Architecture

The proposed MiLoRA-ViSum framework extends the Low-Rank Adaptation (LoRA) methodology by integrating a Mixture of LoRA Experts (MiLoRA) into the Video-LLaMA backbone model, significantly enhancing the model’s efficiency and performance in video summarization tasks (see Figure 1). Unlike the traditional LoRA approach, which applies uniform low-rank updates to model layers, MiLoRA introduces a mixture of specialized low-rank modules tailored for different components of the architecture. This ensures that both temporal and spatial features of video data are effectively captured while maintaining computational efficiency. In this paper, we use the MiLoRA-ViSum framework to generate video summaries as shown in the pseudo-code in Algorithm 1.
The MiLoRA mechanism operates by introducing multiple expert modules, each designed to adaptively fine-tune specific layers of the Video-LLaMA model. For a given weight matrix $W_\ell$ in the $\ell$-th layer, the adaptation is expressed as a summation of low-rank updates contributed by $K$ specialized experts:
$$\Delta W_\ell = \sum_{k=1}^{K} B_\ell^{(k)} A_\ell^{(k)},$$
where $B_\ell^{(k)} \in \mathbb{R}^{d \times r_k}$ and $A_\ell^{(k)} \in \mathbb{R}^{r_k \times k}$ are the low-rank decomposition matrices for the $k$-th expert, with $r_k$ representing the rank for that expert. The summation allows MiLoRA to flexibly aggregate updates from multiple experts, each of which specializes in distinct aspects of the video summarization process.
Algorithm 1: MiLoRA-ViSum
To capture the distinct characteristics of visual and textual data, MiLoRA-ViSum employs separate modules for video and text processing. These modules are integrated into the Video-Feedforward Networks (Video-FFN) and Text-Feedforward Networks (Text-FFN), respectively, as shown in Figure 1. For the video stream, the adaptation focuses on capturing temporal dynamics through self-attention mechanisms augmented with low-rank updates. For a video input sequence $x_v$, the output is computed as:
$$y_v = \mathrm{Softmax}\!\left(\frac{Q_v K_v^{\top}}{\sqrt{d_k}} + \Delta W_{v,\mathrm{attn}}\right) V_v,$$
where $Q_v$, $K_v$, and $V_v$ are the query, key, and value matrices for the video input, and $\Delta W_{v,\mathrm{attn}}$ represents the low-rank updates applied specifically to the attention mechanism.
Similarly, for textual inputs $x_t$, the Text-FFN modules leverage LoRA to fine-tune pre-trained language model layers, enabling alignment between visual and textual modalities. The adaptation for the text attention layers is given by:
$$y_t = \mathrm{Softmax}\!\left(\frac{Q_t K_t^{\top}}{\sqrt{d_k}} + \Delta W_{t,\mathrm{attn}}\right) V_t.$$
A key innovation in the MiLoRA-ViSum architecture is the alignment-guided self-attention mechanism [25], which bridges the gap between video and text streams. This module integrates temporal and textual embeddings by leveraging segment embeddings, position embeddings, and feature embeddings, as illustrated in Figure 1. The alignment is achieved by computing attention scores that emphasize cross-modal dependencies:
$$\mathrm{Attention}(X_v, X_t) = \mathrm{Softmax}\!\left(\frac{Q (K_v + K_t)^{\top}}{\sqrt{d_k}}\right),$$
where $Q$ represents the combined query matrix derived from both modalities, and $K_v$, $K_t$ are the key matrices for video and text, respectively. The alignment ensures that the final embeddings reflect meaningful interactions between the two streams.
To train the MiLoRA-ViSum framework effectively, we employ a composite loss function that combines a summarization loss $\mathcal{L}_{\mathrm{sum}}$ with an expert regularization term $\mathcal{L}_{\mathrm{reg}}$. The overall loss function is defined as:
$$\mathcal{L} = \mathcal{L}_{\mathrm{sum}} + \lambda_{\mathrm{reg}} \sum_{\ell=1}^{L} \sum_{k=1}^{K} \left( \left\| B_\ell^{(k)} \right\|_F^2 + \left\| A_\ell^{(k)} \right\|_F^2 \right),$$
where $\lambda_{\mathrm{reg}}$ controls the contribution of the regularization term, ensuring that the low-rank updates remain compact and interpretable. This regularization encourages sparsity among the expert modules, enabling efficient utilization of computational resources.
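To make the expert aggregation and the regularization term concrete, the following is a minimal PyTorch sketch of a single adapted layer. It is an illustration under stated assumptions (a uniform rank per expert and $B^{(k)}$ initialized to zero so training starts from the frozen backbone behavior); it is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MiLoRAExpertLinear(nn.Module):
    """A frozen linear layer augmented with K additive low-rank LoRA experts (illustrative sketch)."""

    def __init__(self, d_in: int, d_out: int, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # the pre-trained weight W stays frozen
        # Expert k contributes B^(k) A^(k), with B^(k) of shape (d_out, r) and A^(k) of shape (r, d_in)
        self.A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d_in) * 0.01) for _ in range(num_experts)]
        )
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out, rank)) for _ in range(num_experts)]
        )

    def delta_weight(self) -> torch.Tensor:
        # Delta W = sum_k B^(k) A^(k)
        return sum(B @ A for B, A in zip(self.B, self.A))

    def frobenius_penalty(self) -> torch.Tensor:
        # Regularization term: sum_k ||B^(k)||_F^2 + ||A^(k)||_F^2
        return sum(B.pow(2).sum() + A.pow(2).sum() for B, A in zip(self.B, self.A))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.base.weight + self.delta_weight()).T
```

The composite objective above can then be assembled as `loss = summarization_loss + lambda_reg * sum(m.frobenius_penalty() for m in milora_layers)`, summing the penalty over all adapted layers.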
Figure 1 provides a comprehensive overview of the MiLoRA-ViSum framework. The figure highlights the modular structure of the architecture, with separate branches for video and text processing, and illustrates how alignment-guided self-attention integrates these modalities [26]. The inclusion of segment embeddings and positional information ensures that both temporal and spatial dependencies are preserved. Additionally, the visual representation of expert modules emphasizes their modularity and adaptability, which are central to the efficiency of the MiLoRA approach.

3.2. Integrated Temporal–Spatial Adaptation and Optimization for Video Summarization

A significant challenge in video summarization is the effective integration of temporal and spatial features within video sequences, as these dimensions exhibit complex interdependencies. To address this, we propose a novel dual adaptation mechanism that extends the MiLoRA (Mixture of LoRA Experts) framework to both the temporal attention layers and the spatial convolutional layers of the Video-LLaMA model. This design allows for the simultaneous fine-tuning of temporal and spatial features, capturing intricate dependencies while maintaining computational efficiency. For the temporal attention layer $\ell_t$, we model the low-rank adaptation as:
$$\Delta W_{\ell_t} = \sum_{k=1}^{K} B^{(k)} A^{(k)},$$
where $B^{(k)} \in \mathbb{R}^{d \times r_k}$ and $A^{(k)} \in \mathbb{R}^{r_k \times d}$ represent the decomposition matrices for the $k$-th expert, and $K$ denotes the total number of experts. Each expert specializes in capturing a specific aspect of temporal dynamics, allowing for a flexible aggregation of information across different temporal patterns. The output of the temporal attention mechanism, incorporating the MiLoRA updates, is computed as:
$$y_t = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + \Delta W_{\ell_t}\right) V,$$
where $Q$, $K$, and $V$ represent the query, key, and value matrices for the temporal attention layer, respectively. Similarly, for the spatial convolutional layer $\ell_s$, we apply a low-rank adaptation using the same formulation:
$$\Delta W_{\ell_s} = \sum_{k=1}^{K} B^{(k)} A^{(k)},$$
where $B^{(k)} \in \mathbb{R}^{d \times r_k}$ and $A^{(k)} \in \mathbb{R}^{r_k \times d}$ are the low-rank matrices for the $k$-th expert in the spatial layer. This decomposition enables the efficient modeling of spatial dependencies without significantly increasing computational costs. The spatial feature transformation is expressed as:
$$y_s = \sigma\!\left( X * \left( W + \Delta W_{\ell_s} \right) \right),$$
where $X$ is the spatial input feature map, $*$ represents the convolution operation, and $\sigma(\cdot)$ is the activation function. A key contribution of our dual adaptation mechanism is the joint optimization of temporal and spatial adaptations. Instead of treating these adaptations independently, we introduce a unified loss function that balances their contributions, enabling cohesive learning across both dimensions. The overall loss function for the MiLoRA framework is comprehensively described as:
$$\mathcal{L}(\Theta) = \mathcal{L}_{\mathrm{sum}} + \lambda_t \sum_{\ell_t} \sum_{k=1}^{K} \left( \left\| B^{(k)} \right\|_F^2 + \left\| A^{(k)} \right\|_F^2 \right) + \lambda_s \sum_{\ell_s} \sum_{k=1}^{K} \left( \left\| B^{(k)} \right\|_F^2 + \left\| A^{(k)} \right\|_F^2 \right),$$
where $\mathcal{L}_{\mathrm{sum}}$ represents the summarization loss, $\lambda_t$ and $\lambda_s$ are regularization coefficients for the temporal and spatial layers, respectively, and $\|\cdot\|_F$ denotes the Frobenius norm. This unified formulation, aligned with the explanation in Section 3.3, ensures that the adaptations are both effective and computationally efficient while promoting sparsity and interpretability among the expert modules.
The dual adaptation mechanism leverages the interdependencies between temporal and spatial features by aligning their respective representations. Temporal attention outputs $y_t$ are fused with spatial convolutional outputs $y_s$ to create a unified representation:
$$y_{\mathrm{fused}} = \alpha \cdot y_t + (1 - \alpha) \cdot y_s,$$
where $\alpha$ is a learnable weight parameter balancing the contributions of temporal and spatial features. This fusion step ensures that the model captures nuanced interactions between these domains effectively.
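A minimal sketch of this fusion step is shown below, assuming $\alpha$ is a single learnable scalar kept in $[0, 1]$ via a sigmoid; the paper does not specify how $\alpha$ is parameterized, so this is one plausible choice.

```python
import torch
import torch.nn as nn


class TemporalSpatialFusion(nn.Module):
    """Blend temporal output y_t and spatial output y_s with a learnable weight alpha."""

    def __init__(self):
        super().__init__()
        self.alpha_logit = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5, equal weighting at init

    def forward(self, y_t: torch.Tensor, y_s: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.alpha_logit)
        return alpha * y_t + (1.0 - alpha) * y_s
```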

3.3. Optimization and Loss Function

To optimize MiLoRA-ViSum, we employ a novel multi-task loss function that integrates the summarization loss with a regularization term to balance the contributions of temporal and spatial adaptations, while effectively leveraging the Mixture of LoRA Experts (MiLoRA). This innovative approach ensures that the model captures intricate temporal and spatial dependencies in a computationally efficient manner. The overall loss function is formulated as:
$$\mathcal{L}(\Theta) = \mathcal{L}_{\mathrm{sum}}(y, y_{\mathrm{true}}) + \lambda_t \sum_{\ell_t} \sum_{k=1}^{K} \left( \left\| B_t^{(k)} \right\|_F^2 + \left\| A_t^{(k)} \right\|_F^2 \right) + \lambda_s \sum_{\ell_s} \sum_{k=1}^{K} \left( \left\| B_s^{(k)} \right\|_F^2 + \left\| A_s^{(k)} \right\|_F^2 \right),$$
where $\mathcal{L}_{\mathrm{sum}}(y, y_{\mathrm{true}})$ is the summarization loss measuring the discrepancy between the predicted summary $y$ and the ground-truth summary $y_{\mathrm{true}}$, using metrics such as cross-entropy loss. The terms $\lambda_t$ and $\lambda_s$ are regularization coefficients for the temporal and spatial adaptations, respectively. The squared Frobenius norm $\|\cdot\|_F^2$ penalizes large weights in the low-rank matrices $B_t^{(k)}$, $A_t^{(k)}$, $B_s^{(k)}$, and $A_s^{(k)}$, ensuring sparsity and stability across the mixture of $K$ experts.
The mixture-of-experts framework allows each expert k to specialize in a subset of temporal or spatial features, dynamically contributing to the final adaptation. The combination of outputs from all experts is computed as:
$$\Delta W_t = \sum_{k=1}^{K} g_k(z)\, B_t^{(k)} A_t^{(k)}, \qquad \Delta W_s = \sum_{k=1}^{K} g_k(z)\, B_s^{(k)} A_s^{(k)},$$
where $g_k(z)$ is a gating function based on the input representation $z$, ensuring that only the most relevant experts are activated for a given input. This dynamic gating mechanism enhances the flexibility and efficiency of the model.
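As an illustration of this gating, the sketch below scores experts with a single linear layer over a pooled input representation and keeps only the top-k experts before the softmax; both the gate architecture and the top-k truncation are assumptions of this sketch, since the paper only states that the most relevant experts are activated for a given input.

```python
import torch
import torch.nn as nn


class ExpertGate(nn.Module):
    """Produce per-expert weights g_k(z) from a pooled input representation z."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.scorer = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        scores = self.scorer(z)                          # (batch, K)
        if self.top_k < scores.size(-1):
            # Mask everything below the k-th largest score before the softmax
            kth = scores.topk(self.top_k, dim=-1).values[..., -1:]
            scores = scores.masked_fill(scores < kth, float("-inf"))
        return torch.softmax(scores, dim=-1)             # gating weights g_k(z)


def gated_delta(gate_weights: torch.Tensor, B: list, A: list) -> torch.Tensor:
    """Delta W = sum_k g_k(z) B^(k) A^(k); the gate is averaged over the batch for a shared update."""
    g = gate_weights.mean(dim=0)                         # (K,)
    return sum(g[k] * (B[k] @ A[k]) for k in range(len(B)))
```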

3.4. Model Integration and Training Strategy

The integration of MiLoRA into the Video-LLaMA model is designed to be minimally invasive, preserving the core architecture while introducing low-rank adaptations through the mixture of experts. During training, only the low-rank matrices B and A are updated, leaving the original weight matrices W unchanged. This strategy reduces the number of trainable parameters and computational resources required.
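In code, this training strategy amounts to a simple parameter filter; the name-based selection below assumes the expert matrices are registered with "lora" in their parameter names, which is a convention of this sketch rather than a documented property of Video-LLaMA.

```python
import torch


def freeze_backbone_and_collect_lora(model: torch.nn.Module):
    """Freeze every backbone weight; return only the LoRA expert parameters for the optimizer."""
    lora_params = []
    for name, param in model.named_parameters():
        if "lora" in name.lower():        # assumes adapter parameters carry "lora" in their names
            param.requires_grad = True
            lora_params.append(param)
        else:
            param.requires_grad = False   # the original weight matrices W are never updated
    return lora_params


# The returned list is what the optimizer sees, e.g. torch.optim.Adam(lora_params, lr=1e-4).
```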
The training process is divided into three key stages:
  • Pre-training on a Large-Scale Dataset: The base Video-LLaMA model is pre-trained on a large-scale video dataset to learn general spatio-temporal representations. This step initializes the model with robust feature extraction capabilities.
  • Expert Specialization and Fine-Tuning: After integrating MiLoRA, the mixture of experts is fine-tuned on specific video summarization datasets, such as VideoXum and ActivityNet. During this stage, the gating functions g k ( z ) are optimized to dynamically activate the most relevant experts, ensuring efficient adaptation to diverse video content.
  • Regularization and Early Stopping: Regularization terms $\| B_t^{(k)} \|_F^2$ and $\| A_t^{(k)} \|_F^2$ are applied to prevent overfitting, and early stopping is employed based on the validation loss. This ensures that the model generalizes well to unseen data.
The training process utilizes the Adam optimizer with a learning rate η dynamically adjusted using a cosine decay schedule:
$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{t}{T}\pi\right)\right),$$
where $t$ is the current training step, $T$ is the total number of steps, and $\eta_{\min}$, $\eta_{\max}$ are the minimum and maximum learning rates, respectively.
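A direct implementation of this schedule is short enough to state explicitly; the learning-rate bounds and step counts in the example are placeholders.

```python
import math


def cosine_decay_lr(t: int, total_steps: int, eta_min: float, eta_max: float) -> float:
    """Cosine-annealed rate: eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / total_steps))


# Example: halfway through training, the rate sits at the midpoint of the two bounds.
assert abs(cosine_decay_lr(500, 1000, 1e-6, 1e-4) - (1e-6 + 1e-4) / 2) < 1e-12
```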

4. Experimental Setup

4.1. Baseline Models

To comprehensively evaluate the effectiveness of the proposed MiLoRA-ViSum framework, we conducted comparisons against several State-Of-The-Art (SOTA) models that represent the current advancements in video understanding and summarization tasks. These models include both general video summarization frameworks and vision-language models that integrate Low-Rank Adaptation (LoRA) for task-specific adaptations. Specifically, we evaluated MiLoRA-ViSum against VideoLLM-online [27], a streaming video large language model designed for online video understanding tasks, and Video-LLaMA [5], which integrates visual features with large language models using LoRA. These models were chosen as baselines because they effectively demonstrate the capabilities of LoRA for reducing computational overhead while maintaining competitive performance in video-related tasks.
VideoLLM-online introduces a highly efficient framework for adapting pre-trained language models to video understanding tasks through LoRA, particularly focusing on online processing capabilities. However, this model emphasizes general video understanding, such as captioning and question answering, rather than the unique challenges of video summarization. Similarly, Video-LLaMA incorporates LoRA to align visual and textual modalities, excelling in tasks like video captioning and visual question answering. Despite their strengths, neither of these models addresses the dual temporal–spatial adaptation requirements that are critical for generating concise and informative video summaries. In contrast, MiLoRA-ViSum extends the LoRA framework to include a mixture of LoRA experts, specifically tailored for simultaneous fine-tuning of temporal attention layers and spatial convolutional layers. This enables our approach to better capture the intricate dependencies and hierarchical representations needed for effective video summarization.
Furthermore, in our evaluations, we also considered transformer-based models such as VidSummarize [3] and hierarchical Recurrent Neural Network (RNN)-based models like H-RNN [15], which have shown promise in video summarization tasks. These models were included to benchmark MiLoRA-ViSum against non-LoRA-based architectures, allowing for a broader assessment of its performance improvements. The comparisons focused on multiple aspects, including model efficiency, scalability, and the ability to generalize across diverse video summarization scenarios, as well as the overall quality of generated summaries as measured by standard evaluation metrics.

4.2. Experimental Environment and Datasets

The experimental evaluations of MiLoRA-ViSum were conducted in a controlled environment designed to ensure consistent and reproducible results. The backbone model, Video-LLaMA, was extended with the MiLoRA framework and fine-tuned on widely used video summarization datasets. The experiments were executed on NVIDIA A100 GPUs with 80 GB of memory (NVIDIA, Beijing, China), leveraging CUDA 12.1 and mixed-precision training to optimize computational efficiency. All models were implemented in PyTorch under Python 3.9, with the MiLoRA-specific components integrated into the existing Video-LLaMA architecture. Hyperparameter tuning was performed using grid search, ensuring optimal performance for each baseline and the proposed MiLoRA-ViSum framework.
We utilized two benchmark datasets, VideoXum [6] and ActivityNet [7], which represent diverse and challenging scenarios for video summarization. VideoXum is a large-scale dataset that features a mix of professionally produced and user-generated content, encompassing a wide variety of genres, such as sports, documentaries, and vlogs. This dataset presents a unique challenge due to its diversity in video styles and temporal dynamics, making it a robust testbed for evaluating the adaptability of video summarization models. ActivityNet, on the other hand, is a curated dataset that includes temporally annotated activities across multiple domains, providing a structured framework for assessing the temporal alignment capabilities of the proposed method.
Prior to training, the videos were preprocessed to standardize the input format. Each video was divided into sequences of 1024 frames, and frame-level features were extracted using a ResNet-50 backbone pre-trained on ImageNet. These features were subsequently passed to the Video-LLaMA architecture, where MiLoRA was applied to dynamically adapt the temporal and spatial representations. For consistency, the same preprocessing pipeline was used across all baseline models, ensuring a fair comparison of results.
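A hedged sketch of this preprocessing step using torchvision's ResNet-50 is given below; it assumes frames have already been decoded, resized to 224 × 224, and normalized with ImageNet statistics, and it drops the classification head so that the pooled 2048-dimensional feature of each frame is kept.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights


def extract_clip_features(frames: torch.Tensor, chunk_len: int = 1024) -> torch.Tensor:
    """Split a video tensor (num_frames, 3, 224, 224) into 1024-frame chunks and extract
    per-frame 2048-d ResNet-50 features (ImageNet weights, fc layer removed)."""
    backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Identity()       # keep the pooled 2048-d feature vector
    backbone.eval()

    features = []
    with torch.no_grad():
        for start in range(0, frames.size(0), chunk_len):
            chunk = frames[start:start + chunk_len]
            # In practice each chunk would be further sub-batched to fit GPU memory.
            features.append(backbone(chunk))            # (chunk_len, 2048)
    return torch.cat(features, dim=0)                   # (num_frames, 2048)
```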

4.3. Training Procedure and Parameter Optimization

The training procedure for MiLoRA-ViSum was carefully designed to leverage the Mixture of Low-Rank Adaptation (MiLoRA) matrices for efficient fine-tuning while preserving model performance. Each weight matrix $W_\ell$ in the $\ell$-th layer of the Video-LLaMA model was decomposed into a mixture of low-rank components, enabling the dynamic adaptation of the model to task-specific requirements. Specifically, the adaptation matrix $\Delta W_\ell$ was defined as:
$$\Delta W_\ell = \sum_{i=1}^{M} \alpha_i\, B_{\ell,i} A_{\ell,i},$$
where $M$ represents the number of experts in the mixture, $B_{\ell,i} \in \mathbb{R}^{d \times r_i}$ and $A_{\ell,i} \in \mathbb{R}^{r_i \times k}$ are the low-rank matrices for the $i$-th expert, $r_i$ is the rank of the $i$-th expert, and $\alpha_i$ are trainable coefficients that govern the contribution of each expert. This architecture allows MiLoRA-ViSum to flexibly combine multiple experts, capturing diverse temporal and spatial features.
During fine-tuning, only the MiLoRA parameters $(B_{\ell,i}, A_{\ell,i}, \alpha_i)$ were updated, while the pre-trained weights $W_\ell$ of the Video-LLaMA backbone remained frozen. This approach significantly reduced the number of trainable parameters and minimized computational overhead.
The training objective was to minimize a composite loss function that balanced the summarization task’s accuracy and the regularization of the MiLoRA parameters. The total loss function $\mathcal{L}(\Theta)$ was defined as:
$$\mathcal{L}(\Theta) = \mathcal{L}_{\mathrm{sum}}(y, y_{\mathrm{true}}) + \lambda \sum_{\ell=1}^{L} \sum_{i=1}^{M} \left( \left\| B_{\ell,i} \right\|_F^2 + \left\| A_{\ell,i} \right\|_F^2 + \left\| \alpha_i \right\|_2^2 \right),$$
where $\mathcal{L}_{\mathrm{sum}}$ is the summarization loss (e.g., cross-entropy loss) computed between the predicted summaries $y$ and the ground-truth summaries $y_{\mathrm{true}}$, $\|\cdot\|_F$ denotes the Frobenius norm, $\|\cdot\|_2$ denotes the $L_2$-norm, and $\lambda$ is a hyperparameter controlling the strength of the regularization.
To enhance convergence stability, the Adam optimizer was employed with hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The learning rate was scheduled using a cosine decay with warm-up, gradually increasing the learning rate in the initial stages of training before reducing it for fine-tuning:
$$\eta_t = \eta_{\max} \cdot \frac{1}{2}\left(1 + \cos\!\left(\frac{t}{T}\pi\right)\right),$$
where $\eta_t$ is the learning rate at step $t$, $\eta_{\max}$ is the initial maximum learning rate, and $T$ is the total number of training steps. Warm-up was applied over 10% of the total training steps to stabilize gradient updates in the early training phase.
Early stopping was implemented to avoid overfitting. The training process was halted if the ROUGE-L score on the validation set did not improve over five consecutive epochs. Gradient clipping with a maximum norm of 1.0 was also applied to mitigate the risk of exploding gradients in deeper layers of the model.
The rank $r_i$ of each low-rank component was chosen as a fraction $p$ of the original weight dimensions, with $r_i = p \cdot \min(d, k)$. The value of $p$ was tuned based on a grid search, ensuring an optimal trade-off between computational efficiency and representational power. The number of experts $M$ was similarly determined through cross-validation on the training data.
Finally, the training strategy consisted of two stages: pre-training and fine-tuning. In the pre-training stage, MiLoRA parameters were initialized and trained on a large-scale general-purpose video dataset to capture broad temporal and spatial patterns. During fine-tuning, the model was trained on task-specific datasets (VideoXum and ActivityNet) to optimize for video summarization while retaining the pre-trained knowledge. This two-stage approach ensured that the model generalized well across diverse scenarios while performing effectively on specific tasks.
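Putting these optimization choices together, the sketch below illustrates fine-tuning with the stated Adam hyperparameters, warm-up plus cosine decay, gradient clipping at norm 1.0, and ROUGE-L-based early stopping with a patience of five epochs. The `model.training_loss` and `val_rouge_l` callables, and all names other than the published hyperparameters, are placeholders rather than the authors' code.

```python
import math
import torch


def lr_at(step: int, total_steps: int, eta_max: float, warmup_frac: float = 0.1) -> float:
    """Linear warm-up over the first 10% of steps, then cosine decay (Section 4.3)."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return eta_max * step / max(1, warmup_steps)
    t, T = step - warmup_steps, max(1, total_steps - warmup_steps)
    return eta_max * 0.5 * (1.0 + math.cos(math.pi * t / T))


def finetune(model, lora_params, train_loader, val_rouge_l, total_steps, eta_max=1e-4):
    optimizer = torch.optim.Adam(lora_params, lr=eta_max, betas=(0.9, 0.999), eps=1e-8)
    best_rouge, patience, step = -1.0, 0, 0
    while patience < 5 and step < total_steps:            # stop after 5 stale validation epochs
        for batch in train_loader:
            for group in optimizer.param_groups:          # apply the scheduled learning rate
                group["lr"] = lr_at(step, total_steps, eta_max)
            loss = model.training_loss(batch)             # summarization loss + regularizers
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(lora_params, max_norm=1.0)
            optimizer.step()
            step += 1
        rouge_l = val_rouge_l(model)                      # validation ROUGE-L once per epoch
        if rouge_l > best_rouge:
            best_rouge, patience = rouge_l, 0
        else:
            patience += 1
```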

4.4. Evaluation Metrics and Validation

The evaluation of MiLoRA-ViSum’s performance was conducted using a diverse set of metrics to comprehensively assess the quality of the generated video summaries. These metrics included both traditional summarization metrics and advanced language evaluation metrics, ensuring a robust evaluation of the proposed method’s capabilities. The primary metrics used for evaluation were ROUGE-1, ROUGE-2, and ROUGE-L [28], which measure the n-gram overlap and sequence coherence between the generated summaries and the reference annotations. These metrics are widely used in summarization tasks and provide a reliable indication of the content coverage and alignment with ground truth summaries.
In addition to the ROUGE metrics, we incorporated BERTScore [29], a semantic similarity metric that uses contextual embeddings from pre-trained language models to compare generated summaries with reference summaries. BERTScore evaluates the alignment of meaning rather than just surface-level lexical matches, making it particularly valuable for assessing the semantic fidelity of the summaries. METEOR [30], another widely recognized metric, was also included. METEOR evaluates the quality of machine-generated summaries by considering synonyms, stemming, and word order, offering a more nuanced evaluation compared to n-gram-based metrics like ROUGE.
To further enhance the evaluation, we employed sacreBLEU [31], a standardized implementation of the BLEU score that ensures reproducibility across experiments. sacreBLEU measures the precision of n-grams in the generated summaries while addressing some limitations of traditional BLEU implementations, such as tokenization inconsistencies. Additionally, we included the NIST metric [32], which builds on BLEU by emphasizing the informativeness of n-grams. NIST assigns higher weights to less frequent n-grams, making it more sensitive to meaningful content differences.
This comprehensive suite of metrics allowed us to evaluate MiLoRA-ViSum from multiple perspectives, including lexical overlap, semantic coherence, and content informativeness. The diversity of metrics ensured a balanced assessment of the model’s performance, particularly in scenarios where strict n-gram overlap may not fully capture the quality of generated summaries.
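For reference, the ROUGE, BERTScore, and sacreBLEU components of this suite can be computed with standard open-source packages (rouge-score, bert-score, and sacrebleu), as in the hedged sketch below; METEOR and NIST are available through NLTK and are omitted here for brevity.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score
import sacrebleu


def evaluate_summaries(generated, references):
    """Compute corpus-level ROUGE-1/2/L (F1), BERTScore F1, and sacreBLEU for paired summaries."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {k: 0.0 for k in ("rouge1", "rouge2", "rougeL")}
    for hyp, ref in zip(generated, references):
        scores = scorer.score(ref, hyp)                   # per-pair ROUGE scores
        for k in rouge:
            rouge[k] += scores[k].fmeasure / len(generated)

    _, _, f1 = bert_score(generated, references, lang="en")
    bleu = sacrebleu.corpus_bleu(generated, [references])

    return {**rouge, "bertscore_f1": f1.mean().item(), "sacrebleu": bleu.score}
```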
The results were reported for both VideoXum and ActivityNet datasets, as these datasets encompass diverse video content and temporal structures, providing a robust evaluation framework. For each metric, we calculated the mean scores across all test samples, ensuring statistical reliability. By leveraging this diverse set of evaluation metrics, we demonstrated MiLoRA-ViSum’s ability to consistently generate high-quality, semantically coherent, and informative summaries, while maintaining a significant advantage in computational efficiency compared to baseline models.

5. Results and Analysis

5.1. Overall Performance and Comparison

Table 1 presents the comparative results of MiLoRA-ViSum and State-Of-The-Art (SOTA) models, including Video-LLaMA and the GPT-4-integrated model proposed by Alam et al. [33], evaluated on the VideoXum and ActivityNet datasets. The evaluation utilized multiple metrics to provide a comprehensive assessment of summarization performance, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, Meteor, SacreBLEU, and NIST. These metrics collectively measure various aspects of summarization quality, such as n-gram overlap, semantic similarity, and linguistic diversity, ensuring a holistic evaluation of the proposed approach.
The results demonstrate that MiLoRA-ViSum achieves competitive performance across all evaluation metrics with significantly fewer trainable parameters. Specifically, on the VideoXum dataset, MiLoRA-ViSum achieves a BERTScore of 0.909, indicating high semantic similarity between the generated and reference summaries, and a SacreBLEU score of 25.26, reflecting robust linguistic alignment. On the ActivityNet dataset, MiLoRA-ViSum attains a Meteor score of 0.352 and a NIST score of 8.42, further confirming its ability to generate coherent and informative summaries.
A key advantage of MiLoRA-ViSum lies in its efficiency. The mixture of LoRA experts enables adaptive fine-tuning of temporal and spatial components without requiring extensive computational resources. The number of trainable parameters in MiLoRA-ViSum is reduced by approximately 83% compared to the baseline Video-LLaMA model, highlighting its scalability for real-world applications. Furthermore, MiLoRA-ViSum maintains consistent performance across diverse datasets, underscoring its generalizability and robustness in handling various video summarization tasks.
Overall, these results validate the efficacy of the proposed MiLoRA-ViSum framework in achieving a strong balance between performance and computational efficiency, making it a viable alternative to more resource-intensive SOTA models. The integration of advanced metrics, such as BERTScore and SacreBLEU, further emphasizes its capability to produce semantically rich and linguistically accurate video summaries.

5.2. Comparison with State-of-the-Art (SOTA) and Scalability

The proposed MiLoRA-ViSum framework was evaluated against State-Of-The-Art (SOTA) models using multiple datasets and a comprehensive set of evaluation metrics to provide a detailed analysis of its performance and scalability. As shown in Table 2, MiLoRA-ViSum achieves competitive results across various benchmarks, including VideoXum and ActivityNet, using a combination of traditional metrics such as ROUGE-1, ROUGE-2, and ROUGE-L, and more recent advanced metrics like BERTScore, Meteor, SacreBLEU, and NIST. These metrics allow for a multi-faceted evaluation, encompassing n-gram overlap, semantic similarity, linguistic fluency, and overall informativeness.
The results reveal that MiLoRA-ViSum achieves comparable performance to SOTA models such as He et al. [2], demonstrating its ability to generate semantically rich and fluent summaries while maintaining high computational efficiency. For instance, on the VideoXum dataset, MiLoRA-ViSum attains a BERTScore of 0.909, which is on par with leading models and highlights the semantic coherence of its outputs. Similarly, its Meteor score of 0.362 reflects strong alignment with human-written references, while the SacreBLEU and NIST scores emphasize its linguistic precision and informativeness.
A critical advantage of MiLoRA-ViSum is its scalability and computational efficiency. Unlike conventional methods that require extensive computational resources, MiLoRA-ViSum leverages the Mixture of LoRA Experts (MiLoRA) architecture to significantly reduce the number of trainable parameters. As shown in Table 3, MiLoRA-ViSum uses only 17% of the trainable parameters compared to the baseline Video-LLaMA model while maintaining competitive performance. This reduction in trainable parameters translates to a 30% decrease in training time and lower inference latency, making MiLoRA-ViSum particularly suitable for resource-constrained environments and large-scale deployments.
Overall, these findings underscore the effectiveness of MiLoRA-ViSum in balancing performance and efficiency. By combining a novel mixture of LoRA experts with a dual temporal–spatial adaptation mechanism, MiLoRA-ViSum achieves state-of-the-art-level performance across multiple evaluation metrics while significantly reducing computational overhead. This makes it a robust and practical choice for real-world video summarization applications, particularly in scenarios requiring scalability and resource efficiency.

5.3. Independent Analysis: Impact of Mixture of LoRA Experts on Temporal and Spatial Adaptations

To further validate the effectiveness of the Mixture of LoRA Experts (MiLoRA), we conducted an independent experiment analyzing its specific impact on temporal and spatial feature adaptation. Unlike traditional LoRA, MiLoRA introduces specialized experts for both temporal and spatial dimensions, enabling the model to better capture dependencies within video data. This experiment evaluates the contribution of temporal-only, spatial-only, and combined temporal–spatial adaptations under the MiLoRA framework.

Experimental Setup

For this analysis, we designed three configurations of MiLoRA-ViSum:
  • Temporal-only MiLoRA: MiLoRA experts are applied exclusively to temporal attention layers.
  • Spatial-only MiLoRA: MiLoRA experts are applied exclusively to spatial convolutional layers.
  • Combined MiLoRA: MiLoRA experts are applied jointly to both temporal and spatial layers.
Each configuration was trained and evaluated on the VideoXum dataset. The evaluation metrics included ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, Meteor, SacreBLEU, and NIST, providing a comprehensive assessment of summarization quality. To isolate the effect of each adaptation mechanism, the computational resources (e.g., GPU type and memory allocation) and hyperparameters (e.g., learning rate, batch size) were kept consistent across all experiments. Table 4 presents the performance comparison of the three configurations across the aforementioned metrics.
The results indicate that both temporal and spatial adaptations independently contribute to summarization quality, with spatial-only MiLoRA achieving slightly better results than temporal-only MiLoRA across most metrics. This suggests that spatial features, such as scene context and object interactions, play a slightly more critical role in generating coherent summaries for the VideoXum dataset.
However, the combined MiLoRA configuration significantly outperforms both individual adaptations across all metrics. For instance, the combined approach achieves a ROUGE-2 score of 33.95, compared to 29.45 for temporal-only MiLoRA and 29.78 for spatial-only MiLoRA. Similarly, the BERTScore improves from 0.884 and 0.885 (for temporal-only and spatial-only, respectively) to 0.911 for the combined configuration. These findings highlight the importance of modeling the interplay between temporal and spatial features, as video data inherently involve both dimensions.
To assess the scalability of these configurations, we also measured the computational overhead associated with each approach, as shown in Table 5.
While the combined MiLoRA configuration requires marginally more computational resources (e.g., 18% trainable parameters compared to 10% for single-adaptation configurations), the significant performance gains justify the additional cost. The results demonstrate that the dual adaptation mechanism effectively balances performance and efficiency, making it a practical solution for real-world applications.
This independent experiment highlights the distinct and complementary contributions of temporal and spatial adaptations within the MiLoRA framework. The combined configuration achieves the best results, confirming that jointly modeling temporal and spatial features is critical for high-quality video summarization. These findings further validate the effectiveness and scalability of MiLoRA-ViSum, particularly in complex video summarization tasks that demand a holistic understanding of video data.

5.4. Comparison with Related Works

Figure 2 presents a line graph comparing the accuracy of MiLoRA-ViSum with baselines such as Video-LLaMA and models by Alam et al. across the VideoXum and ActivityNet datasets. The proposed framework demonstrates superior performance, particularly in ROUGE-1 and SacreBLEU, indicating improved summarization quality. Figure 3 illustrates a radar plot showing the robustness of MiLoRA-ViSum across multiple evaluation metrics, with consistent outperformance across datasets, especially in ROUGE-L and BERTScore, highlighting its ability to maintain semantic coherence in summaries.
Figure 4 evaluates the intensity of different MiLoRA configurations—temporal-only, spatial-only, and combined. The combined MiLoRA configuration shows the highest performance across all metrics, confirming the effectiveness of integrating both temporal and spatial adaptations. The metrics, including ROUGE-1, ROUGE-2, and SacreBLEU, exhibit a noticeable upward trend from temporal-only to combined MiLoRA, indicating that the dual adaptation mechanism enhances both the precision and comprehensiveness of the summaries. This comprehensive comparison underscores MiLoRA-ViSum’s efficiency and superiority over existing approaches, validating its effectiveness in generating high-quality video summaries.

6. Conclusions

In this paper, we introduced MiLoRA-ViSum, a novel video summarization framework that extends the Low-Rank Adaptation (LoRA) methodology through the integration of a Mixture of LoRA Experts (MiLoRA), enabling dual adaptation across temporal and spatial dimensions. Unlike traditional LoRA, MiLoRA dynamically allocates specialized LoRA modules to different layers, allowing the model to simultaneously address the distinct challenges posed by temporal dynamics and spatial dependencies in video data. By fine-tuning temporal attention layers to capture long-range dependencies and spatial convolutional layers to enhance scene-specific representations, MiLoRA-ViSum achieves a comprehensive understanding of video content while maintaining computational efficiency. Extensive experiments on benchmark datasets, including VideoXum and ActivityNet, demonstrated that MiLoRA-ViSum consistently outperforms existing state-of-the-art models across multiple evaluation metrics, such as ROUGE, BERTScore, Meteor, and SacreBLEU, while requiring only 15% of the trainable parameters compared to baseline methods. This highlights its scalability and adaptability for real-world applications, making MiLoRA-ViSum a significant advancement in the field of video summarization.

Author Contributions

Conceptualization, W.D. and G.W.; Methodology, W.D. and X.L.; Software, W.D., G.C., J.G. and H.Z.; Validation, W.D., G.W., J.G. and H.Z.; Formal analysis, G.W.; Investigation, G.W.; Resources, X.L. and G.C.; Data curation, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Saini, P.; Kumar, K.; Kashid, S.; Saini, A.; Negi, A. Video summarization using deep learning techniques: A detailed analysis and investigation. Artif. Intell. Rev. 2023, 56, 12347–12385. [Google Scholar]
  2. He, B.; Wang, J.; Qiu, J.; Bui, T.; Shrivastava, A.; Wang, Z. Align and attend: Multimodal summarization with dual contrastive losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14867–14878. [Google Scholar]
  3. Jangra, A.; Mukherjee, S.; Jatowt, A.; Saha, S.; Hasanuzzaman, M. A survey on multi-modal summarization. ACM Comput. Surv. 2023, 55, 1–36. [Google Scholar]
  4. Elfeki, M.; Wang, L.; Borji, A. Multi-stream dynamic video summarization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 339–349. [Google Scholar]
  5. Zhang, H.; Li, X.; Bing, L. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv 2023, arXiv:2306.02858. [Google Scholar]
  6. Lin, J.; Hua, H.; Chen, M.; Li, Y.; Hsiao, J.; Ho, C.; Luo, J. Videoxum: Cross-modal visual and textural summarization of videos. IEEE Trans. Multimed. 2023, 26, 5548–5560. [Google Scholar] [CrossRef]
  7. Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970. [Google Scholar]
  8. Rennard, V.; Shang, G.; Hunter, J.; Vazirgiannis, M. Abstractive meeting summarization: A survey. Trans. Assoc. Comput. Linguist. 2023, 11, 861–884. [Google Scholar] [CrossRef]
  9. Potapov, D.; Douze, M.; Harchaoui, Z.; Schmid, C. Category-specific video summarization. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI 13. Springer: Cham, Switzerland, 2014; pp. 540–555. [Google Scholar]
  10. Yang, L.; Zheng, Z.; Han, Y.; Song, S.; Huang, G.; Li, F. OStr-DARTS: Differentiable Neural Architecture Search Based on Operation Strength. IEEE Trans. Cybern. 2024, 54, 6559–6572. [Google Scholar] [PubMed]
  11. Selva, J.; Johansen, A.S.; Escalera, S.; Nasrollahi, K.; Moeslund, T.B.; Clapés, A. Video transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12922–12943. [Google Scholar] [PubMed]
  12. Yang, L.; Jiang, H.; Cai, R.; Wang, Y.; Song, S.; Huang, G.; Tian, Q. Condensenet v2: Sparse feature reactivation for deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3569–3578. [Google Scholar]
  13. Zhang, K.; Chao, W.L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14. Springer: Cham, Switzerland, 2016; pp. 766–782. [Google Scholar]
  14. Zheng, Z.; Yang, L.; Wang, Y.; Zhang, M.; He, L.; Huang, G.; Li, F. Dynamic spatial focus for efficient compressed video action recognition. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 695–708. [Google Scholar]
  15. Apostolidis, E.; Adamantidou, E.; Metsai, A.I.; Mezaris, V.; Patras, I. Video summarization using deep neural networks: A survey. Proc. IEEE 2021, 109, 1838–1863. [Google Scholar] [CrossRef]
  16. Ma, Y.F.; Lu, L.; Zhang, H.J.; Li, M. A user attention model for video summarization. In Proceedings of the Tenth ACM International Conference on Multimedia, New York, NY, USA, 1–6 December 2002; pp. 533–542. [Google Scholar]
  17. Haq, H.B.U.; Asif, M.; Ahmad, M.B. Video summarization techniques: A review. Int. J. Sci. Technol. Res 2020, 9, 146–153. [Google Scholar]
  18. Chu, W.S.; Song, Y.; Jaimes, A. Video co-summarization: Video summarization by visual co-occurrence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3584–3592. [Google Scholar]
  19. Zanella, M.; Ben Ayed, I. Low-Rank Few-Shot Adaptation of Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1593–1603. [Google Scholar]
  20. Lu, H.; Zhao, C.; Xue, J.; Yao, L.; Moore, K.; Gong, D. Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA. arXiv 2024, arXiv:2412.01004. [Google Scholar]
  21. Gou, Y.; Liu, Z.; Chen, K.; Hong, L.; Xu, H.; Li, A.; Yeung, D.Y.; Kwok, J.T.; Zhang, Y. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv 2023, arXiv:2312.12379. [Google Scholar]
  22. Elgendy, H.; Sharshar, A.; Aboeitta, A.; Ashraf, Y.; Guizani, M. Geollava: Efficient fine-tuned vision-language models for temporal change detection in remote sensing. arXiv 2024, arXiv:2410.19552. [Google Scholar]
  23. Jiang, Z.; Meng, R.; Yang, X.; Yavuz, S.; Zhou, Y.; Chen, W. Vlm2vec: Training vision-language models for massive multimodal embedding tasks. arXiv 2024, arXiv:2410.05160. [Google Scholar]
  24. Chen, S.; Gu, J.; Han, Z.; Ma, Y.; Torr, P.; Tresp, V. Benchmarking robustness of adaptation methods on pre-trained vision-language models. Adv. Neural Inf. Process. Syst. 2024, 36, 51758–51777. [Google Scholar]
  25. Laurençon, H.; Tronchon, L.; Cord, M.; Sanh, V. What matters when building vision-language models? arXiv 2024, arXiv:2405.02246. [Google Scholar]
  26. Wang, W.; Chen, Z.; Chen, X.; Wu, J.; Zhu, X.; Zeng, G.; Luo, P.; Lu, T.; Zhou, J.; Qiao, Y.; et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Adv. Neural Inf. Process. Syst. 2024, 36, 61501–61513. [Google Scholar]
  27. Chen, J.; Lv, Z.; Wu, S.; Lin, K.Q.; Song, C.; Gao, D.; Liu, J.W.; Gao, Z.; Mao, D.; Shou, M.Z. VideoLLM-online: Online Video Large Language Model for Streaming Video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18407–18418. [Google Scholar]
  28. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  29. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  30. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  31. Post, M. A call for clarity in reporting BLEU scores. arXiv 2018, arXiv:1804.08771. [Google Scholar]
  32. Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, San Diego, CA, USA, 24–27 March 2002; pp. 138–145. [Google Scholar]
  33. Alam, M.J.; Hossain, I.; Puppala, S.; Talukder, S. Advancements in Multimodal Social Media Post Summarization: Integrating GPT-4 for Enhanced Understanding. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2–4 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1934–1940. [Google Scholar]
  34. Son, J.; Park, J.; Kim, K. CSTA: CNN-based Spatiotemporal Attention for Video Summarization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 18847–18856. [Google Scholar]
Figure 1. MiLoRA-ViSum model architecture.
Figure 2. Comparison between our proposal and other works in terms of accuracy [30].
Figure 3. Comparison between our proposal and other works in terms of accuracy, refs. He et al. [2], Post [31].
Figure 4. Comparison between our proposal and other works in terms of intensity.
Table 1. Performance comparison with state-of-the-art models on VideoXum and ActivityNet datasets across multiple metrics.

| Model | Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Meteor | SacreBLEU | NIST |
| Video-LLaMA (Baseline) | VideoXum | 47.32 | 28.75 | 45.61 | 0.876 | 0.322 | 20.54 | 7.10 |
| Alam et al. [33] | VideoXum | 51.50 | 32.10 | 49.50 | 0.894 | 0.353 | 24.12 | 8.24 |
| MiLoRA-ViSum | VideoXum | 52.86 | 33.95 | 50.19 | 0.909 | 0.362 | 25.26 | 8.93 |
| Video-LLaMA (Baseline) | ActivityNet | 46.10 | 27.65 | 44.80 | 0.872 | 0.310 | 19.87 | 6.88 |
| Alam et al. [33] | ActivityNet | 50.00 | 31.00 | 48.50 | 0.890 | 0.348 | 23.56 | 8.10 |
| MiLoRA-ViSum | ActivityNet | 51.72 | 32.14 | 48.62 | 0.911 | 0.352 | 24.29 | 8.42 |
Table 2. Comparison of MiLoRA-ViSum with state-of-the-art models across VideoXum and ActivityNet datasets using multiple metrics.

| Model | Dataset | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Meteor | SacreBLEU | NIST |
| He et al. [2] | VideoXum | 50.12 | 31.05 | 48.02 | 0.888 | 0.344 | 24.21 | 8.12 |
| Son et al. [34] | ActivityNet | 49.50 | 30.10 | 47.80 | 0.882 | 0.336 | 22.34 | 7.86 |
| MiLoRA-ViSum | VideoXum | 52.86 | 33.95 | 50.19 | 0.909 | 0.362 | 25.26 | 8.93 |
| MiLoRA-ViSum | ActivityNet | 51.72 | 32.14 | 48.62 | 0.911 | 0.352 | 24.29 | 8.42 |
Table 3. Scalability comparison of MiLoRA-ViSum with SOTA models across VideoXum and ActivityNet datasets.

| Model | Dataset | Training Time (Hours) | Inference Latency (ms) | Trainable Params (%) |
| He et al. [2] | VideoXum | 55 | 130 | 80% |
| Son et al. [34] | ActivityNet | 50 | 120 | 60% |
| MiLoRA-ViSum | VideoXum | 42 | 113 | 18% |
| MiLoRA-ViSum | ActivityNet | 42 | 112 | 17% |
Table 4. Performance analysis of temporal-only, spatial-only, and combined MiLoRA configurations on the VideoXum dataset.

| Configuration | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Meteor | SacreBLEU | NIST |
| Temporal-only MiLoRA | 48.95 | 29.45 | 46.70 | 0.884 | 0.340 | 22.89 | 7.68 |
| Spatial-only MiLoRA | 49.12 | 29.78 | 47.05 | 0.885 | 0.341 | 23.14 | 7.81 |
| Combined MiLoRA | 52.86 | 33.95 | 50.19 | 0.911 | 0.362 | 25.26 | 8.93 |
Table 5. Computational overhead of temporal-only, spatial-only, and combined MiLoRA configurations.

| Configuration | Training Time (Hours) | Inference Latency (ms) | Trainable Params (%) |
| Temporal-only MiLoRA | 35 | 105 | 10% |
| Spatial-only MiLoRA | 36 | 107 | 10% |
| Combined MiLoRA | 42 | 113 | 18% |