This section presents a comprehensive experimental evaluation of the proposed framework. We describe the benchmark datasets (
Section 4.1) and experimental setup (
Section 4.2), then report an ablation study on architectural components (
Section 4.3), hyperparameter tuning results (
Section 4.4), quantitative comparison with state-of-the-art methods (
Section 4.5), qualitative analysis (
Section 4.6), and sensitivity analysis on focal loss and learning rate (
Section 4.7).
4.2. Experimental Setting
We follow the standard evaluation protocol using 5-fold cross-validation on the SumMe and TVSum datasets under the canonical setting. In all experiments, the model is trained for 300 epochs in each fold, and the final performance is reported as the average over the five folds.
The input sampling rate of every 15th frame (approximately 2 FPS) follows the standard preprocessing protocol established in the foundational video summarization benchmarks [
12,
13,
14] and adopted by all compared methods [
16,
17,
18,
19,
21,
22,
23,
24,
25,
26,
28,
34,
35,
36], ensuring that the evaluation conditions are directly comparable across approaches. This fixed rate captures one frame per half-second, which is sufficient to represent visually distinct moments while avoiding redundant near-duplicate frames.
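The sampling arithmetic can be illustrated with a short sketch. It assumes a native frame rate of 30 FPS, which the text implies but does not state explicitly; the function names are illustrative only.

```python
def sampled_frame_indices(num_frames: int, step: int = 15) -> list[int]:
    """Indices kept when every `step`-th frame is retained."""
    return list(range(0, num_frames, step))


def effective_fps(native_fps: float, step: int = 15) -> float:
    """Frame rate after subsampling every `step`-th frame."""
    return native_fps / step


# Assumption: a 10-second clip recorded at 30 FPS (300 frames).
indices = sampled_frame_indices(300)
print(len(indices), effective_fps(30.0))
```

Under this assumption, consecutive sampled frames are half a second apart, which matches the description above.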
All experiments are implemented in PyTorch 2.10.0 and executed on Google Colab using an NVIDIA L4 GPU. The model is trained end-to-end using the Adam optimizer with a learning rate of and a weight decay of . To address the severe class imbalance inherent in video summarization, we employ a weighted focal loss with focusing parameter γ = 2.0 for frame-level importance prediction.
Details regarding module-wise design choices and hyperparameter selection for the Temporal U-Net and the shot-based hierarchical transformer are provided in their respective subsections.
4.3. Module Selection
We first perform an ablation study to assess the impact of each major architectural component in the proposed framework.
Table 2 reports the results for different combinations of (a) using a pretrained Vision Transformer (ViT) encoder versus non-pretrained features, (b) employing the proposed Temporal U-Net backbone versus a simpler global-attention (GA) baseline, and (c) adopting a hierarchical transformer scoring head versus a plain MLP head. All experiments are conducted on the SumMe dataset under the canonical setting with a fixed summary length of 20%.
To ensure a fair comparison, all ablation experiments are carried out under the same training and architectural configuration. Specifically, we use a learning rate of with regularization of and a focal loss focusing parameter γ = 2.0. The Temporal U-Net is configured with four levels, a base channel width of 256, a dropout rate of 0.2, and multi-head temporal attention enabled with four attention heads. Feature fusion across U-Net scales is also enabled. For the shot-based scoring head, we employ two attention heads, three transformer layers at both the shot and frame levels, a feed-forward dimension of 1024, and a dropout rate of 0.2. Apart from the module being ablated, all other components are kept identical across experiments.
From
Table 2, we observe that incorporating the pretrained ViT encoder consistently improves performance, highlighting the importance of strong spatial representations. Replacing the GA backbone with the proposed Temporal U-Net leads to a substantial gain in F1-score, increasing performance from approximately 51% to around 65%, demonstrating the effectiveness of multi-scale temporal modeling. Furthermore, when using the U-Net backbone, the hierarchical transformer head slightly outperforms the MLP-based scoring head (65.02% vs. 64.54%), while maintaining comparable diversity. Overall, the combination of a pretrained ViT encoder, Temporal U-Net backbone, and hierarchical transformer scoring head achieves the best performance and is therefore adopted in all subsequent experiments.
To verify that the architectural integration is synergistic rather than merely additive, we quantify the interaction effect in
Table 3. Using the base configuration (No ViT + GA + MLP = 51.17% F1) as reference, the marginal gains of adding each component independently are computed. If the combination were purely additive, the full model should achieve approximately 54.36%. The observed performance of 65.02% exceeds this additive prediction by 10.66 percentage points, providing strong evidence of super-additive synergy. This interaction arises because the ViT’s global contextualization is a necessary precondition for the U-Net to achieve its full potential (ViT + U-Net + MLP = 64.54% vs. No ViT + U-Net + MLP = 54.36%, a gain of 10.18 percentage points), and the hierarchical head further leverages the enriched multi-scale features.
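The two gaps quoted in this paragraph can be checked with a few lines of Python. This is a minimal sketch that uses only the F1 values stated in the text; the constant and function names are illustrative and not part of the original implementation.

```python
# F1-scores (%) quoted in the text for the SumMe canonical setting.
ADDITIVE_PREDICTION = 54.36  # additive prediction as stated in the text
FULL_MODEL = 65.02           # ViT + U-Net + hierarchical head
VIT_UNET_MLP = 64.54         # ViT + U-Net + MLP
NO_VIT_UNET_MLP = 54.36      # No ViT + U-Net + MLP


def gap_pp(observed: float, reference: float) -> float:
    """Difference between two F1-scores in percentage points."""
    return round(observed - reference, 2)


# Gap between the observed full model and the additive prediction.
synergy = gap_pp(FULL_MODEL, ADDITIVE_PREDICTION)
# Gain from adding the pretrained ViT to the U-Net + MLP variant.
vit_gain = gap_pp(VIT_UNET_MLP, NO_VIT_UNET_MLP)
print(synergy, vit_gain)
```

Both differences are absolute gaps between F1-scores, so they are expressed in percentage points rather than as relative percentages.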
4.4. Hyperparameter Tuning
All hyperparameter tuning experiments are conducted on the SumMe dataset under the canonical setting with a fixed summary length of 20%. We focus on tuning the most influential components of the proposed architecture, namely the Temporal U-Net backbone and the shot-based hierarchical transformer scoring head, while keeping all other components fixed.
Table 4 reports the top-performing configurations for the Temporal U-Net. We vary the base channel width, dropout rate, use of temporal attention, number of attention heads, and whether multi-scale feature fusion is enabled. The best-performing configuration employs 128 base channels, a dropout rate of 0.4, temporal attention with two heads, and no feature fusion, achieving an F1-score of 67.98%. Several nearby configurations yield comparable performance (ranging from 65.8% to 67.1%), indicating that the proposed U-Net design is robust to moderate variations in architectural choices.
To justify the choice of encoder depth, both
and
were explored during the hyperparameter search. As shown in
Table 4,
configurations consistently outperform
configurations. The best
setting achieves 67.98% F1, while the best
setting reaches only 64.71%—a gap of 3.27 percentage points. Across all tested configurations, the median F1 for
is 64.69% compared to 59.74% for
, and 70% of
configurations exceed 64.5% F1 versus only 22% for
. This consistent degradation at
is attributable to excessive temporal compression at the bottleneck: for the shortest videos (
),
reduces the bottleneck to only four temporal tokens, which is insufficient for meaningful self-attention. The choice of
therefore provides the optimal balance between temporal receptive field coverage (spanning
the original temporal window) and information preservation.
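The bottleneck argument can be made concrete with a small sketch. It assumes that each encoder stage halves the temporal resolution (stride-2 downsampling with rounding up) and that the shortest input has 64 frames, as in the profiled sequence lengths reported later in this section. Both points are assumptions, and the function name is hypothetical.

```python
def bottleneck_length(num_frames: int, num_downsamplings: int) -> int:
    """Temporal tokens left after repeated stride-2 downsampling (assumed)."""
    length = num_frames
    for _ in range(num_downsamplings):
        length = (length + 1) // 2  # halve, rounding up for odd lengths
    return length


# Compare bottleneck sizes for short, medium, and long inputs.
for depth in (2, 3, 4):
    sizes = [bottleneck_length(t, depth) for t in (64, 293, 1294)]
    print(depth, sizes)
```

Under these assumptions, four halvings of a 64-frame input leave four tokens at the bottleneck, whereas three halvings leave eight.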
Table 5 presents the tuning results for the shot-based hierarchical transformer head. We explore different numbers of attention heads, depths of the shot-level and frame-level transformer encoders, feed-forward network dimensions, and dropout rates. The optimal configuration uses four attention heads, one shot-level transformer layer, two frame-level transformer layers, a feed-forward dimension of 1024, and a dropout rate of 0.4, again achieving an F1-score of 67.98%. Configurations with fewer heads or reduced depth generally lead to a gradual performance degradation, with F1-scores decreasing to approximately 65%.
Overall, these results demonstrate that the selected hyperparameter settings for both the Temporal U-Net and the hierarchical transformer head are near-optimal under the canonical evaluation protocol and provide a strong balance between model capacity and generalization performance.
4.5. Quantitative Analysis
The relationship between F1-score and diversity as a function of summary length is shown in
Figure 5. When the summary is very short (for example, 5–10%), the generated summaries show comparatively high diversity but fail to cover the key content, which leads to a low F1-score. As the summary length increases, the F1-score gradually improves, and diversity also gradually increases. The best trade-off between coverage and diversity is found at a summary length of around 20%, where the F1-score reaches its maximum while diversity remains sufficiently high. This empirical finding justifies our choice of a 20% summary length for all subsequent experiments and represents an important design contribution of the present study.
It is worth noting that the fixed-budget evaluation protocol adopted in this study is a deliberate design choice to ensure fair and direct comparison with all prior methods, which universally evaluate under fixed budgets. While adaptive per-video budgets may appear more realistic for deployment, implementing such a mechanism introduces a circular dependency: a learned budget predictor must be trained against oracle budgets derived from the same human annotations used for importance scoring, and inter-annotator variability in summary length means that a single correct per-video budget does not exist. The systematic length analysis presented in
Figure 5 already addresses the core concern by identifying 20% as the consistently optimal operating point, providing evidence-based guidance without the additional complexity of an adaptive module.
We further analyze the computational efficiency of the proposed model. The total number of parameters and the inference costs for videos of minimum, median, and maximum length from the SumMe and TVSum datasets are summarized in
Table 6. The model has 105.5 M parameters, and its computational cost scales linearly with the video length.
Figure 6 shows how the computational complexity measured in giga floating point operations (GFLOPs) grows linearly with the number of input frames
T, which reflects the design of the temporal backbone and transformer modules. The computational complexity was profiled using
fvcore.nn.FlopCountAnalysis (Facebook Research), which computes the total floating-point operations for a single forward pass. Inference latency and FPS were measured using PyTorch CUDA event timing (
torch.cuda.Event with synchronization) on an NVIDIA L4 GPU in evaluation mode. GFLOPs were computed for six sequence lengths (T = 64, 167, 293, 380, 649, 1294) corresponding to the min/median/max video lengths from SumMe and TVSum. The plot was rendered using
Matplotlib 3.10.9.
As a concrete example from the SumMe dataset, a short video of 64 frames requires 11.8 GFLOPs and is processed at about 84 frames per second (FPS). In comparison, a much longer video of 649 frames requires 119.5 GFLOPs and is processed at approximately 47 FPS. A similar linear trend is observed on the TVSum dataset, where even the longest videos (1294 frames) are processed at more than 21 FPS. These results indicate that the proposed model is computationally efficient enough for practical video summarization scenarios.
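The linear-scaling claim can be sanity-checked from the two SumMe data points quoted above. The sketch below uses only those reported numbers; the helper names are illustrative, and the FPS values are the approximate figures given in the text.

```python
# (frames, GFLOPs, approximate FPS) for the two SumMe examples in the text.
EXAMPLES = [(64, 11.8, 84.0), (649, 119.5, 47.0)]


def gflops_per_frame(frames: int, gflops: float) -> float:
    """Average cost of one input frame in GFLOPs."""
    return gflops / frames


def seconds_per_video(frames: int, fps: float) -> float:
    """Wall-clock time to process the whole sequence at the given FPS."""
    return frames / fps


for frames, gflops, fps in EXAMPLES:
    cost = gflops_per_frame(frames, gflops)
    duration = seconds_per_video(frames, fps)
    print(f"T={frames}: {cost:.3f} GFLOPs/frame, {duration:.1f} s per video")
```

A nearly constant per-frame cost across the two sequence lengths is what linear growth in GFLOPs predicts.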
Figure 7 presents the mean training and validation loss curves averaged over 5-fold cross-validation for both the SumMe and TVSum datasets under the canonical, augmented, and transfer settings. In all cases, the training loss decreases rapidly during the initial epochs, followed by a gradual convergence, indicating efficient optimization of the proposed architecture.
Under the canonical configuration, convergence is stable on both datasets: after the initial phase of training, the validation loss closely follows the training loss. This pattern indicates that the model learns discriminative representations of frame importance without overfitting. Under the augmented setup, a slight increase in validation loss is observed in the first few epochs, reflecting the increased variation introduced by data augmentation; however, the loss curves stabilize over time, indicating improved generalization. Similarly, in the transfer configuration, the loss trajectories converge smoothly after a preliminary adaptation phase, which supports the model’s ability to transfer learned representations between datasets. Training and validation losses were logged at every epoch during the PyTorch training loop. The metric plotted is the weighted focal loss (Equation (
20)) with
γ = 2.0 and adaptive class weighting (Equation (
19)). Loss values were averaged across all videos per epoch, then across all five folds. Plots were generated using
Matplotlib 3.10.9.
Table 7 gives a quantitative comparison of the proposed approach with recent supervised video summarization methods under the canonical five-fold cross-validation protocol. The proposed model achieves F1-scores of 65.8% on SumMe and 79.92% on TVSum, exceeding the best previously reported results of 58.4% and 67.5% by 7.4 and 12.42 percentage points, respectively. This margin indicates the effectiveness of the proposed model.
In order to evaluate generalization,
Table 8 presents performance under the more demanding protocol consisting of canonical, augmented, and transfer splits. On both datasets, the proposed method maintains good performance, achieving its highest scores in the canonical setting while remaining competitive under the augmented and transfer settings. These results indicate that the proposed architecture sets a new state of the art in the conventional evaluation setting and generalizes well across training and testing settings.
In addition to relevance measured by F1-score, we compare the visual diversity of the generated summaries with state-of-the-art supervised video summarization methods. Diversity evaluates the non-redundancy of selected frames and reflects how well a method captures varied visual content rather than repeatedly selecting similar segments. We follow the standard evaluation protocol and report average diversity scores computed using cosine-based feature dissimilarity among selected frames.
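A cosine-based diversity score of this kind can be sketched as the mean pairwise cosine dissimilarity among the feature vectors of the selected frames. The exact formula used in the evaluation is not given in this section, so the definition below is an assumption for illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def diversity(features):
    """Mean pairwise (1 - cosine similarity) over the selected frames."""
    pairs = list(combinations(features, 2))
    if not pairs:
        return 0.0  # fewer than two frames: no pairs to compare
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy features: two orthogonal frames are maximally dissimilar.
print(diversity([[1.0, 0.0], [0.0, 1.0]]))
```

Under this definition, a summary made of near-duplicate frames scores close to zero, while visually distinct frames push the score towards one.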
Table 9 presents the diversity comparison on the SumMe and TVSum datasets under the canonical setting. The proposed HybridHiT-UNet consistently achieves competitive diversity compared to existing methods, indicating its ability to balance relevance and non-redundancy. This behavior can be attributed to the multi-scale temporal modeling of the Temporal U-Net and the explicit incorporation of shot-level context through the hierarchical transformer head, which discourages redundant frame selection across temporally adjacent segments.
4.6. Qualitative Analysis
We also qualitatively evaluate the behavior of the proposed model by visually comparing the summary generated by the model to the summary annotated by the users. In
Figure 8 and
Figure 9, we consider two representative SumMe videos (Videos 13 and 15) and, for each video, show the top 20 frames selected by the model alongside the frames selected by the human annotators.
The model-selected frames closely match the user-selected frames in both videos, with strong correspondence around salient semantic and visual events. In Video 13, both the model and the users highlight extended interaction scenes and visually informative close-ups; in Video 15, the summaries consistently focus on the main interaction segments and the recurring interaction patterns. A few minor discrepancies arise, mainly from subjective user preferences or variations in temporal boundaries, but the two summaries are semantically consistent overall. This visual agreement suggests that the proposed hierarchical modeling captures the underlying narrative structure preferred by human annotators.
To better understand the temporal alignment,
Figure 10 and
Figure 11 show the mean user-summary indicator against the binary model-predicted indicator over the timeline of the original video. These plots show substantial overlap between the model-selected and user-selected segments, especially around high-importance regions where the users agree. The model captures the most commonly selected intervals, which suggests that it approximates collective human judgment rather than overfitting to any individual annotation.
Figure 8 and
Figure 9 were produced by running the trained HybridHiT-UNet model (best checkpoint from 5-fold cross-validation, canonical setting, 20% budget) and extracting the top 20 frames by importance score after knapsack-based shot selection. User-annotated frames were extracted from SumMe ground-truth annotations. Frame images were extracted using
OpenCV 4.13.0 and arranged using
Matplotlib 3.10.9.
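Knapsack-based shot selection under a 20% budget can be sketched with a standard 0/1 knapsack dynamic program. This is an illustrative version, not the authors' code: the shot scores, shot lengths, and function name below are hypothetical, and how shot-level scores are derived from frame-level importance is not specified here.

```python
def knapsack_select(scores, lengths, budget):
    """Pick shots maximizing total score with total length <= budget (0/1 knapsack)."""
    n = len(scores)
    # best[i][c]: best total score using the first i shots within capacity c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score, length = scores[i - 1], lengths[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip shot i-1
            if length <= c and best[i - 1][c - length] + score > best[i][c]:
                best[i][c] = best[i - 1][c - length] + score  # option 2: take it
    # Backtrack to recover which shots were taken.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


# Hypothetical shots: importance scores and lengths in frames.
shot_scores = [0.9, 0.2, 0.7, 0.4]
shot_lengths = [30, 40, 20, 10]
budget = (20 * sum(shot_lengths)) // 100  # 20% of the video length
print(knapsack_select(shot_scores, shot_lengths, budget))
```

Integer arithmetic is used for the budget so that the capacity is an exact frame count.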
Figure 10 and
Figure 11 overlay the mean user summary indicator (averaged across all annotators) against the binary model-predicted summary mask. Plots were generated using
Matplotlib 3.10.9.
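The two curves in these plots, and the overlap between a predicted mask and a single user summary, can be sketched as follows. The code assumes binary per-frame indicators and the usual precision/recall-based F1; the paper's own evaluation equations are not reproduced in this section, so this is an approximation with hypothetical function names.

```python
def mean_user_indicator(user_summaries):
    """Per-frame fraction of annotators who selected that frame."""
    num_users = len(user_summaries)
    return [sum(frame) / num_users for frame in zip(*user_summaries)]


def f1_overlap(predicted, reference):
    """F1-score between two binary per-frame masks of equal length."""
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted)
    recall = overlap / sum(reference)
    return 2 * precision * recall / (precision + recall)


# Toy example: three annotators, six frames.
users = [[1, 1, 0, 0, 1, 0],
         [1, 0, 0, 0, 1, 0],
         [1, 1, 0, 1, 0, 0]]
model_mask = [1, 1, 0, 0, 0, 0]
print(mean_user_indicator(users))
print([f1_overlap(model_mask, u) for u in users])
```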
Overall, the qualitative results corroborate the quantitative findings by showing that the proposed method not only achieves high F1-scores but also produces summaries that are visually and semantically consistent with human expectations. The strong alignment across both frame-level selections and temporal importance patterns provides further evidence of the effectiveness of the proposed HybridHiT-UNet framework.
4.7. Ablation on Focal Loss and Learning Rate
We conduct a focused ablation study to analyze the impact of the focal loss focusing parameter and the learning rate on summarization performance under the SumMe canonical setting with a fixed summary length of 20%.
Figure 12 illustrates the effect of varying γ in the focal loss. When γ is small, the model behaves similarly to standard weighted cross-entropy, resulting in suboptimal performance. Increasing γ improves the model’s ability to emphasize hard and informative frames, leading to a steady increase in F1-score. The best performance is achieved at γ = 2.0, after which the F1-score drops for larger values, indicating over-suppression of easy samples. Based on this observation, we fix γ = 2.0 in all experiments.
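The role of the focusing parameter can be illustrated with the standard binary focal loss, which scales the weighted cross-entropy term by (1 - p_t) raised to the focusing parameter (commonly written γ). The sketch below follows that textbook form and may differ in detail from the paper's Equation (20); the class weights and function name are hypothetical.

```python
import math


def weighted_focal_loss(p, y, gamma=2.0, pos_weight=1.0, neg_weight=1.0):
    """Standard binary focal loss for one frame: -w_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    w_t = pos_weight if y == 1 else neg_weight
    return -w_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Easy positive (p = 0.9) versus hard positive (p = 0.3) for several gammas.
for gamma in (0.0, 1.0, 2.0, 4.0):
    easy = weighted_focal_loss(0.9, 1, gamma=gamma, pos_weight=3.0)
    hard = weighted_focal_loss(0.3, 1, gamma=gamma, pos_weight=3.0)
    print(gamma, easy / hard)
```

With the focusing parameter set to zero, the loss reduces to weighted cross-entropy. As it grows, the contribution of easy frames shrinks relative to hard ones, which is consistent with the over-suppression of easy samples described above for large values.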
Figure 13 shows the sensitivity of the model to the learning rate. Very small learning rates lead to slow convergence and inferior performance, while excessively large learning rates cause unstable optimization and occasional sharp performance degradation. The best and most stable F1-scores are obtained around a learning rate of
, which we therefore adopt as the default setting. Overall, these results confirm that both focal loss
γ and the learning rate play a critical role in stable optimization and in achieving strong summarization performance.
Figure 12 was produced by training the full model on SumMe canonical for each
, keeping all other hyperparameters fixed (lr =
, 20% budget, 300 epochs). Best F1 per
γ was recorded using the standard evaluation protocol (Equations (
21) and (
22)).
Figure 13 follows the same methodology but varies the learning rate from
to
with
γ = 2.0 fixed. Both plots were generated using
Matplotlib 3.10.9.