HybridHiT-UNet: Multi-Scale Temporal U-Net with Hierarchical Shot-Aware Transformers for Video Summarization

Sakib, Saadman; Mahmud, Tanjim; Andersson, Karl; Deb, Kaushik

doi:10.3390/make8050135


by Saadman Sakib ¹, Tanjim Mahmud ², Karl Andersson ³ and Kaushik Deb ¹,*

¹ Department of Computer Science and Engineering, Chittagong University of Engineering & Technology (CUET), Chattogram 4349, Bangladesh
² Department of Computer Science and Engineering, Rangamati Science and Technology University, Rangamati 4500, Bangladesh
³ Cybersecurity Laboratory, Luleå University of Technology, 931 87 Skellefteå, Sweden
* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(5), 135; https://doi.org/10.3390/make8050135

Submission received: 25 March 2026 / Revised: 10 May 2026 / Accepted: 13 May 2026 / Published: 20 May 2026

(This article belongs to the Section Learning)


Abstract

Video summarization aims to produce a short yet informative summary of a long video while reducing redundancy. Most transformer-based methods operate at a single temporal scale or ignore shot-level structure, which limits temporal coherence and cross-dataset generalization. To fill these gaps, we present HybridHiT-UNet, a supervised framework that combines three complementary parts: a pretrained Vision Transformer encoder that provides spatially rich frame representations, a multi-scale 1D Temporal U-Net backbone that models those representations hierarchically over time, and a shot-aware hierarchical transformer scoring head that brings inter-shot context into importance prediction. Frame-level scores are summed into shot-level utilities, shots are selected by knapsack optimization under a fixed-length budget, and a weighted focal loss is used to address extreme class imbalance. Extensive experiments on four benchmarks (SumMe, TVSum, OVP, and YouTube) under canonical, augmented, and transfer protocols show that HybridHiT-UNet achieves F1-scores of 65.80% on SumMe and 79.92% on TVSum, higher than existing methods, while maintaining diversity scores of 64.98% and 48.68%, respectively. A systematic study further shows that a 20% summary budget yields a consistently better coverage-diversity trade-off than the conventional 15% budget, which provides evidence-based guidance on the choice of summary length.

Keywords: video summarization; hierarchical transformer; temporal U-Net; SumMe; TVSum


1. Introduction

Online video has become one of the dominant media modalities in the world. The YouTube platform receives more than twenty million video files every day [1], and platforms such as TikTok register billions of views in the same timeframe. As a result, consumption of online video has surged, with the average person spending around seventeen hours per week on video content [2]. This increase is due, in large part, to the increased accessibility of recording and distribution technologies. Contemporary smartphones with sophisticated camera systems are ubiquitous, surveillance infrastructure is expanding in both the public and private sectors, and modern online platforms allow footage to be disseminated worldwide immediately. Together, these factors have produced a large, constantly growing body of video data that no individual can explore comprehensively.

This growth creates a practical difficulty. Most video recordings are long, but the bulk of their duration consists of material of limited interest. Segments of substantive relevance represent a small fraction of the whole temporal span. Manually inspecting entire recordings to isolate these salient moments is neither scalable nor a good use of human labor. An automated system that can recognize and extract the salient segments of a video would therefore be highly useful; this goal is the crux of video summarization. A video summarization system takes a long recording as input and generates a shorter representation that retains the most important content while discarding the rest. The output can be a set of representative still images, called a static summary, or a series of short video clips in chronological order, called a dynamic summary. In either case, the viewer can grasp the substance of the original recording in a fraction of the time it takes to play it in its entirety. The applications of summarization are widespread. Security analysts use summarization to scan large amounts of surveillance footage for anomalies [3,4,5]. Broadcasters create sports highlights and promotional clips. Researchers and educators need access to shortened versions of long recorded sessions. Beyond these direct applications, summarization is useful for related tasks such as video saliency estimation and content analysis [6,7,8,9] and video synopsis generation [10,11]. Reducing the amount of data that has to be stored and transmitted also has substantive infrastructural benefits.

Research on video summarization has changed significantly over the past 20 years. Initial approaches were unsupervised and relied on hand-crafted visual descriptors such as color histograms or edge statistics to select perceptually distinct frames. While effective at capturing surface-level visual change, these methods lacked knowledge of what human observers would actually consider important. This limitation triggered a move towards supervised techniques, in which models are trained against reference summaries annotated by human raters [12,13]. Deep learning then transformed the methodology. Recurrent neural networks, Long Short-Term Memory (LSTM) networks in particular, became the standard tool for modeling temporal dynamics and predicting frame-level importance scores [14]. Although they work in many situations, these models have limitations: sequential processing hinders the learning of long-range dependencies, gradient degradation limits memory over longer sequences, and the visual content of the frames themselves is often underused despite its importance in determining which moments belong in a summary.

Attention mechanisms and transformer architectures have since emerged as the prevailing paradigm. Self-attention enables global pairwise interactions among all frames in a sequence, facilitating the modeling of long-range temporal relationships. Vision Transformers and hybrid architectures have introduced new possibilities for jointly addressing spatial and temporal reasoning [15,16,17]. Nonetheless, identifying an optimal architecture for video summarization remains an open problem. This is especially true under single-view, fully supervised conditions, where the interaction between design decisions and cross-dataset generalization is not yet well understood.

Based on this, the following two research questions guide the present study.

RQ1: Can a single architecture that combines spatial feature encoding with a pretrained Vision Transformer, multi-scale temporal modeling with a 1D Temporal U-Net backbone, and shot-level hierarchical reasoning with a transformer scoring head outperform state-of-the-art single-scale and flat-attention supervised video summarization methods, achieving statistically better F1-scores and competitive diversity scores under the canonical, augmented, and transfer protocols on SumMe, TVSum, OVP, and YouTube with 5-fold cross-validation?
RQ2: When the summary length budget is varied from 5% to 30% of the original video length, is there a budget that consistently achieves a better trade-off between F1-score and diversity than the traditionally used 15%, as evaluated under the canonical setting across all four benchmark datasets?

These open questions motivate the present work. We propose an architecture that unites three components: a Vision Transformer (ViT) encoder for frame-level spatial representation, a U-Net-inspired temporal backbone for multi-scale temporal encoding, and a hierarchical shot-based transformer scoring head for importance prediction. In addition, we examine the effect of summary length on performance. This factor is frequently overlooked because most evaluation protocols adopt a fixed 15% length budget by convention. Our experiments reveal that a 20% budget yields consistently better trade-offs across all tested datasets. The contributions of this paper are as follows:

We propose a supervised video summarization framework consisting of a Vision Transformer (ViT) encoder, a temporal U-Net backbone, and a hierarchical transformer-based scoring head. The model is evaluated on four benchmark datasets (SumMe, TVSum, OVP, and YouTube) under canonical, augmented, and transfer protocols. In the canonical setting, the framework achieves frame-level F1-scores of 65.80% on SumMe and 79.92% on TVSum, with diversity scores of 64.98% and 48.68%, respectively. These results support the framework’s ability to capture both relevance and variety of content.
We perform an empirical analysis of summary length, varying it from 5% to 30% of the original video length. A budget of 20% is consistently more effective than the conventional 15% budget. This finding offers practitioners empirically based guidance on the relationship between summary length and output quality.

It is worth emphasizing the interdisciplinary nature of the proposed framework. From computer vision, we adopt a Vision Transformer pretrained on large-scale image data to extract spatially rich frame representations. From signal processing, the Temporal U-Net borrows the multi-scale encoder–decoder paradigm, enabling the model to analyze temporal structure at multiple resolutions—analogous to wavelet-based multi-resolution analysis. From natural language processing and sequence modeling, we employ transformer-based self-attention mechanisms that capture long-range temporal dependencies without the sequential bottleneck of recurrent architectures. Finally, from multimedia analysis and cognitive science, the shot-level hierarchical design reflects the human tendency to mentally segment continuous video into discrete episodes and evaluate their relative importance.

The rest of this paper is structured as follows. Section 2 surveys previous efforts on video summarization that are based on traditional methods as well as deep learning methods. Section 3 outlines the architecture proposed and the training process. Section 4 presents the experimental results, which include the results as compared with existing methods and the summary length analysis. Section 5 concludes the paper and includes directions for future research.

To provide a high-level overview of the research workflow underlying this study, Figure 1 presents a flowchart that traces the progression from problem identification through to the final conclusions.

2. Related Work

This section reviews the existing literature on video summarization, organized to trace the evolution of the field and motivate the design choices of the proposed framework. We first describe the two main output paradigms—static and dynamic summarization—in Section 2.1. We then review unsupervised methods (Section 2.2) followed by supervised approaches (Section 2.3), further distinguished by their core modeling mechanism: recurrent networks (Section 2.3.1), attention-based models (Section 2.3.2), transformer architectures (Section 2.3.3), and shot-based hierarchical methods (Section 2.3.4). Finally, Section 2.4 identifies the gaps that directly motivate our proposed approach.

2.1. Video Summarization Approaches

Video summarization approaches differ in the structure of the output they generate. A static summarization approach outputs a collection of key images selected from the video. These images capture key content and do not depend on the continuity of the content or on the audio [18]. A dynamic summarization approach outputs short video segments (shots) from the video being summarized. These segments preserve the temporal flow of the video and usually retain its audio [18]. The choice between the two depends on the application. An approach suited to security personnel monitoring camera footage may not be appropriate for summarizing a sports event or producing a movie trailer.

2.2. Unsupervised Video Summarization

Unsupervised methods produce video summaries without human-annotated reference summaries. Such approaches have to infer the importance of individual frames from the intrinsic properties of the video. Early investigations used clustering of frame-level descriptors or dictionary learning to select frames that were representative and visually distinct. More recent work has combined deep feature extraction with self-organizing maps to show that functional keyframe summaries can be produced without access to any labeled training data [19].

Adversarial training provides an alternative learning signal. In this paradigm, a summarizer network is responsible for frame selection, and a discriminator network tries to tell the original video apart from a reconstruction built only from the frames in the summary. The summarizer is penalized if the discriminator succeeds, and as a result, it learns to choose frames that are faithful to the source content [20]. Subsequent work replaced the recurrent layers with temporal convolutions and self-attention modules and improved both computational efficiency and summary quality [21]. Another line of research modeled videos as graphs and used external memory attention to capture long-range interactions. One exemplar method parameterized frame interestingness using a Gaussian prior and coupled it with an efficient external attention module. This approach achieved competitive results on standard benchmarks without any human labels [22]. Unsupervised techniques remain useful where annotations are scarce or very expensive to obtain, but they usually cannot capture the sophisticated, context-specific judgments learned by supervised models.

2.3. Supervised Video Summarization

Supervised approaches use training data in which frames are annotated by human subjects with importance scores or binary keyshot labels. This gives the model direct insight into what viewers find important. The standard deep learning formulation treats summarization as a sequence labeling problem: an LSTM-based network processes the video features frame by frame to generate a per-frame importance estimate [14]. This formulation was later extended. One extension treats summarization as a temporal interest detection problem, using anchor-based approaches, which classify fixed-length segments, and anchor-free approaches, which regress importance at any position [23]. Temporal coherence within these frameworks ensures that shots are grouped together, which yields summaries closer to those a human would generate. One major drawback of supervised approaches is that they require expensive annotations. In addition, models trained on a single dataset risk overfitting. Techniques such as data augmentation, cross-dataset training, and domain adaptation are employed to mitigate these problems.

2.3.1. LSTM- and RNN-Based Architectures

Long Short-Term Memory networks were among the first deep learning architectures employed in video summarization, because they naturally process data sequentially. In the simplest form, each frame is encoded individually, and a scalar importance score is generated at each time step. A well-known limitation of this form is the bottleneck effect: access to earlier frames is lost as the gradient signal degrades. A hierarchical variant, in which a lower-level LSTM processes short video clips and passes their compressed representations to a higher-level LSTM, partially compensates for this bottleneck [14]. Other improvements, such as attention layers and reweighted loss functions that compensate for the imbalance between key frames and non-key frames, also improve performance [18]. However, even these improvements cannot overcome the fundamentally sequential nature of such networks.

2.3.2. Attention-Based Models

Attention mechanisms overcome this major limitation of recurrent structures. In a recurrent structure, two distant frames can interact only through all the intervening frames. In a self-attention mechanism, relevance is computed for all pairs of frames in a single step, so there is no limit on the distance over which one frame can attend to another. Encoder–decoder structures that employ attention in each layer produce summaries that are more comprehensive in content and more similar to human-produced summaries [23]. Graph-based methods take a different route: they first apply graph attention to model the spatial dependencies within each frame and then combine these with temporal dependencies through cross-attention [24]. Channel attention has also been used in convolutional encoder–decoder structures to direct the model towards more discriminative feature channels and sharpen its output [25]. These innovations have raised the accuracy of summarization models in both the spatial and temporal domains well beyond what recurrent structures could manage.
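The all-pairs property of self-attention can be illustrated with a minimal NumPy sketch. It is not taken from any of the cited methods: it assumes a single attention head, and the projection matrices `w_q`, `w_k`, and `w_v` are hypothetical and randomly initialised rather than learned.

```python
import numpy as np

def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a frame sequence.

    frames: (T, D) array of frame features.
    w_q, w_k, w_v: (D, d) projection matrices (hypothetical, random here).
    Returns the (T, d) attended features and the (T, T) attention weights.
    """
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])         # (T, T): every frame pair scored in one step
    scores -= scores.max(axis=1, keepdims=True)    # subtract row max for a stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row is a distribution over all frames
    return weights @ v, weights

rng = np.random.default_rng(0)
T, D, d = 6, 8, 4
frames = rng.normal(size=(T, D))
out, attn = self_attention(frames, *(rng.normal(size=(D, d)) for _ in range(3)))
```

Entry `attn[0, T - 1]` is the weight the first frame places directly on the last one. No intermediate frame is involved, which is the contrast with recurrent processing drawn above.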

2.3.3. Transformer-Based Approaches

Transformers generalize the attention mechanism into a full sequence modeling approach, and their application to video summarization has accelerated significantly. One model uses two parallel attention mechanisms: the first operates over the whole video, and the second operates within each frame to model spatial saliency [16]. A second approach uses multi-scale spatial features learned by a ViT backbone together with a deep pyramidal refinement module [26]. A third approach is based on the U-Net structure and uses hierarchical compression and reconstruction of the video [17]. The common conclusion across these approaches is that transformers produce good benchmark results with a relatively low number of parameters, which supports their effectiveness for video summarization. The preliminary version of the present study introduced a lightweight transformer approach for video summarization [27]. The present study extends it with multi-scale temporal encoding, hierarchical shot-level transformers, and a wide range of experiments across datasets and evaluation protocols.

2.3.4. Shot-Based Hierarchical Transformer Methods

A growing body of research argues that treating a video as a flat sequence of frames loses its compositional structure. Shot-based hierarchical transformers address this concern. First, the video is divided into shots, each a semantically coherent set of consecutive frames. A two-tier attention mechanism is then applied: a frame-level transformer operates inside each shot to detect the frames that are most salient relative to their local neighbors, and a shot-level transformer captures wider temporal patterns and inter-shot dependencies [28]. This decomposition reflects how human viewers process video: they mentally segment a recording into separate episodes, evaluate the importance of each episode, and later recall salient events from the episodes they deem important. The Shot-Based Hierarchical Transformer Video Summarization (SHTVS) method, which is based on this principle, has been shown to produce summaries with better coherence and narrative completeness than non-hierarchical baselines. Furthermore, it scales better to long recordings, since each shot-level attention window covers a manageable number of tokens. As videos grow longer and structurally more complex, shot-aware hierarchical architectures are becoming leading candidates for scalable and semantically faithful summarization.

2.4. Gaps and Motivation

Despite these achievements, existing methods remain deficient in several ways. Recurrent models are still constrained by limited temporal context and serial computation. Single-resolution transformers and flat attention mechanisms alleviate the temporal bottleneck only partially. Most of these architectures encode time at a single granularity and therefore miss the multi-scale temporal structure of real-world video. A few recent proposals include hierarchical processing or external memory modules, but very few unify spatial, temporal, and structural cues in a coherent pipeline.

Our work directly addresses this gap. We combine a Vision Transformer (ViT) encoder that generates spatially rich frame-level features, a U-Net backbone that captures temporal patterns at multiple resolutions, and a shot-level hierarchical transformer head that accounts for the compositional organization of the video. Each component is selected to overcome limitations inherent in the others. The result is a cohesive and scalable framework designed to generate summaries that are both semantically informative and visually diverse.

It is important to emphasize that the proposed integration is not a simple modular stacking of pre-trained, off-the-shelf components. Only the ViT encoder is pretrained; the Temporal U-Net backbone and the hierarchical shot-based transformer scoring head are custom-designed architectures built specifically for video summarization. The 1D Temporal U-Net adapts the encoder–decoder paradigm into a temporal architecture with integrated self-attention at coarser scales, while the shot-based hierarchical transformer scoring head introduces a novel broadcast-and-refine mechanism at two complementary levels. Both components were extensively tuned through systematic hyperparameter searches, optimizing depth, channel width, attention heads, dropout, and fusion strategies. The resulting integration creates synergistic information flows rather than additive contributions. The ViT encoder pre-conditions the feature space with global self-attention, so that downstream components operate on representations that are already globally contextualized. The Temporal U-Net then applies multi-resolution encoding to these pre-conditioned features, capturing temporal patterns at multiple scales simultaneously through its encoder–decoder structure with skip connections. Critically, the hierarchical shot-based transformer scoring head receives multi-scale features and applies a two-tier attention mechanism whose shot-level output is broadcast back to the frame level, creating a feedback-like enrichment. This broadcast-and-refine mechanism ensures that each frame’s importance prediction is jointly conditioned on its local multi-scale temporal context from the U-Net, its global narrative position from the shot-level transformer, and fine-grained frame dependencies from the frame-level transformer. This three-way coupling cannot be replicated by any single component or by a flat concatenation of independently computed features.

3. Proposed Method

This section presents the complete architecture and training procedure of the proposed HybridHiT-UNet framework. We begin with an overview of the full pipeline (Section 3.1), then describe each component in detail: input feature construction (Section 3.2), the pretrained ViT block (Section 3.3), the Temporal U-Net backbone (Section 3.4), the hierarchical shot-based transformer scoring head (Section 3.5), knapsack-based summary generation (Section 3.6), the training objective (Section 3.7), evaluation metrics (Section 3.8), and the training procedure (Section 3.9).

3.1. Overview of the Architecture

Figure 2 illustrates the complete pipeline of the proposed HybridHiT-UNet framework. The process begins at the leftmost block, where the input video is temporally downsampled to 2 frames per second by selecting every 15th frame, producing a sequence of T frames. Each downsampled frame is passed through a pretrained GoogLeNet backbone (pool5 layer), yielding a 1024-dimensional feature vector per frame. These features are linearly projected to 768 dimensions and augmented with sinusoidal positional encodings before being fed into a pretrained ViT-Base encoder, which applies multi-head self-attention across the entire temporal sequence to capture global frame-to-frame dependencies. The ViT output (T × 768) is passed to the Temporal U-Net backbone (detailed in Figure 3), which applies hierarchical 1D convolutions and self-attention at progressively coarser temporal resolutions, followed by a symmetric decoder that reconstructs the original resolution via skip connections. The resulting multi-scale temporal features (T × 768) are processed by the hierarchical shot-based transformer scoring head (detailed in Figure 4): frame features are pooled into shot-level representations (K × 768), encoded by a shot-level transformer, broadcast back to frame-level features, and refined by a frame-level transformer. An MLP scoring head produces a scalar importance score per frame (T × 1). Finally, frame-level scores are converted to shot-level scores, and a 0–1 knapsack optimization selects the optimal subset of shots under the summary length budget (20%). During training, we apply a weighted focal loss on the predicted frame scores to handle the severe imbalance between “important” and “unimportant” frames. We evaluate summaries using the standard F1-score against human references [29] and a diversity metric that measures the pairwise dissimilarity of selected frames [30].
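The final selection step can be sketched as a textbook dynamic-programming 0–1 knapsack in plain Python. This is an illustrative sketch, not the authors' implementation: shot utilities are taken as the sum of frame scores (as stated in the abstract), while the frame scores, the shot boundaries, and the function name `knapsack_select` are hypothetical.

```python
def knapsack_select(shot_lengths, shot_scores, budget):
    """0-1 knapsack: pick shots maximising total score with total length <= budget."""
    n = len(shot_lengths)
    # best[i][c] = best total score using only the first i shots with capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        length, score = shot_lengths[i - 1], shot_scores[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]                      # option 1: skip shot i-1
            if length <= c:                                  # option 2: take it if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - length] + score)
    # Backtrack through the table to recover the chosen shots
    selected, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            selected.append(i - 1)
            c -= shot_lengths[i - 1]
    return sorted(selected)

# Hypothetical per-frame importance scores for a 20-frame video
frame_scores = ([0.9, 0.8] + [0.1, 0.2, 0.1] + [0.6, 0.7]
                + [0.3] * 5 + [0.2, 0.1] + [0.1] * 6)
# Hypothetical shot boundaries as half-open [start, end) frame ranges
shot_bounds = [(0, 2), (2, 5), (5, 7), (7, 12), (12, 14), (14, 20)]
shot_lengths = [end - start for start, end in shot_bounds]
shot_scores = [sum(frame_scores[start:end]) for start, end in shot_bounds]
budget = (20 * len(frame_scores)) // 100     # 20% of 20 frames = 4 frames
selected = knapsack_select(shot_lengths, shot_scores, budget)  # [0, 2]
```

In this example the two short high-scoring shots (indices 0 and 2) exactly fill the 4-frame budget, whereas the five-frame shot 3 cannot be selected at all despite its higher utility than shot 2.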

3.2. Input Feature Construction and Temporal Alignment

We sample the video at a fixed rate (every 15th frame) and detect shot boundaries using a standard algorithm (Kernel Temporal Segmentation). This yields K shots, where shot k spans a consecutive set of frames. Each shot k is represented by an aggregated feature $s_{k} \in \mathbb{R}^{D}$, obtained by mean-pooling the features of its constituent frames using a pretrained CNN (e.g., GoogLeNet) and appending a binary “change-point” indicator at the shot boundary. Concurrently, each individual frame i (frame index $i = 1, \dots, T$) is encoded by a Vision Transformer into a D-dimensional feature $f_{i}$. We project all features (frame and shot) to a common dimension $D = 768$ to serve as inputs to the temporal models. To fuse frame and shot representations, we note that frame i belongs to some shot $k(i)$; later, we will broadcast the shot feature $s_{k(i)}$ to frame i. In summary, the input to our model consists of a sequence of frame features $\{f_{i}\}_{i=1}^{T}$ and shot features $\{s_{k}\}_{k=1}^{K}$ that are aligned in time via the shot assignments. These form the basis for the subsequent encoding stages.
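The alignment between frames and shots can be illustrated with a small NumPy sketch. It assumes shot boundaries are given as half-open frame ranges (a hypothetical stand-in for the Kernel Temporal Segmentation output) and omits the change-point indicator and the projection to 768 dimensions; the helper names are illustrative only.

```python
import numpy as np

def pool_shots(frame_feats, shot_bounds):
    """Mean-pool frame features of shape (T, D) into shot features of shape (K, D)."""
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in shot_bounds])

def shot_assignment(num_frames, shot_bounds):
    """Return the array k(i): the index of the shot containing each frame i."""
    k_of_i = np.empty(num_frames, dtype=int)
    for k, (s, e) in enumerate(shot_bounds):
        k_of_i[s:e] = k
    return k_of_i

T, D = 10, 4                                   # toy sizes; the paper uses D = 768
frame_feats = np.arange(T * D, dtype=float).reshape(T, D)
shot_bounds = [(0, 3), (3, 7), (7, 10)]        # hypothetical boundaries, K = 3
shot_feats = pool_shots(frame_feats, shot_bounds)   # s_k, shape (K, D)
k_of_i = shot_assignment(T, shot_bounds)            # k(i), shape (T,)
broadcast = shot_feats[k_of_i]                      # s_{k(i)} for every frame, shape (T, D)
```

Indexing `shot_feats` with `k_of_i` is what broadcasting a shot feature back to its frames amounts to: every frame in a shot receives the same shot-level vector.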

3.3. Pretrained ViT Block

Each video frame is first represented by a 1024-dimensional visual descriptor extracted using a pretrained GoogLeNet backbone (pool5 layer), following standard practice in supervised video summarization benchmarks. Rather than operating directly on raw images, our model employs a pretrained Vision Transformer (ViT-Base) as a feature-level transformer to model global interactions across the temporal feature sequence.

Specifically, the sequence of GoogLeNet features is linearly projected to the ViT embedding dimension (768) and augmented with sinusoidal positional encodings to preserve temporal order. The resulting feature tokens are then processed by the transformer blocks of a ViT-Base model pretrained on ImageNet-1K, instantiated using the timm library. In our implementation, only the transformer encoder layers and the final normalization layer are reused, while the original image patch embedding and classification head are discarded. Let the features from the GoogLeNet pool5 layer be represented as $x_{t} \in \mathbb{R}^{1024}$. The linear projection to ViT embedding space is represented by Equation (1).

$$z_{t} = W_{p} x_{t} + b_{p}, \quad W_{p} \in \mathbb{R}^{768 \times 1024} \tag{1}$$

Intuitively, Equation (1) performs a learnable linear transformation that maps each 1024-dimensional GoogLeNet feature into the 768-dimensional ViT embedding space. The weight matrix $W_{p}$ learns to project the CNN features into a representation compatible with the pretrained transformer, while the bias $b_{p}$ provides an offset for distributional adjustment.

Sinusoidal positional encoding is applied to the linear projection before passing through the encoding layers, as given by Equation (2).

$$z_{t} = z_{t} + \mathrm{PosEnc}(t) \tag{2}$$

The sinusoidal positional encoding injects temporal position information into each frame’s feature. Without this encoding, the transformer would treat the input as a set rather than a sequence, losing all temporal ordering. The sinusoidal formulation uses sine and cosine functions at different frequencies, enabling the model to attend to relative temporal positions.
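Equations (1) and (2) can be sketched together in NumPy. The text does not spell out the exact form of PosEnc, so the sketch assumes the common sine/cosine formulation with base 10000. The weights `w_p` and `b_p` are random stand-ins for learned parameters, and `x` is random data standing in for GoogLeNet pool5 features.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim):
    """Sine/cosine positional encoding of shape (num_positions, dim); dim must be even.

    Assumes the widely used base-10000 frequency schedule.
    """
    positions = np.arange(num_positions)[:, None]                    # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)     # (dim / 2,)
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(positions * freqs)   # even channels carry sines
    enc[:, 1::2] = np.cos(positions * freqs)   # odd channels carry cosines
    return enc

rng = np.random.default_rng(0)
T = 5
x = rng.normal(size=(T, 1024))                   # stand-in for pool5 features x_t
w_p = rng.normal(scale=0.02, size=(768, 1024))   # W_p (random here, learned in practice)
b_p = np.zeros(768)                              # b_p
z = x @ w_p.T + b_p                              # Equation (1) for all T frames at once
pos_enc = sinusoidal_encoding(T, 768)
z = z + pos_enc                                  # Equation (2)
```

Each row of `pos_enc` is distinct, so two frames with identical visual features still yield different tokens when they occur at different times.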

This design allows the pretrained ViT to act as a powerful global self-attention encoder over temporal feature tokens, capturing long-range dependencies across frames while leveraging knowledge learned from large-scale image data. The representation of the temporal token can be expressed by $Z = [z_{1}, \dots, z_{T}] \in \mathbb{R}^{T \times 768}$, which will be passed to the ViT blocks and can be represented using Equation (3).

$$F = \mathrm{ViTBlocks}(Z) \tag{3}$$

The ViT output provides temporally enriched frame representations, which are subsequently fed into the Temporal U-Net backbone for multi-scale temporal modeling.

3.4. Temporal U-Net Backbone

To capture temporal dependencies at multiple resolutions, we employ a 1D Temporal U-Net backbone operating on the sequence of frame-level features produced by the pretrained ViT block, as illustrated in Figure 3. The input (T × 768) passes through a linear projection reducing channels to 128. The encoder has three levels: Encoder Level 0 applies 1D convolution (128 × T) with GELU activation, then downsamples by 2 (128 × T/2). Encoder Level 1 increases channels to 256 (256 × T/2), applies temporal self-attention, then downsamples (256 × T/4). Encoder Level 2 increases to 512 (512 × T/4) with self-attention. The Bottleneck (512 × T/4) applies convolutions and self-attention for global temporal aggregation. The decoder symmetrically reconstructs resolution: each level upsamples via transposed convolution, concatenates with the encoder feature map through skip connections, and refines with 1D convolution. A final 1 × 1 convolution restores the output to T × 768.
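The encoder-side shapes listed above can be traced with a short plain-Python sketch. It only does bookkeeping: it assumes T is divisible by 4 (how other lengths are padded or rounded is not specified here), and the function name and its `downsample` flags are hypothetical.

```python
def unet_encoder_shapes(num_frames, channels=(128, 256, 512),
                        downsample=(True, True, False)):
    """Trace (stage, channels, temporal length) through the encoder and bottleneck.

    The flags mirror the description in the text: levels 0 and 1 halve the
    temporal length after their convolution, level 2 does not.
    """
    shapes = []
    length = num_frames
    for level, (num_channels, down) in enumerate(zip(channels, downsample)):
        shapes.append((f"encoder level {level}", num_channels, length))
        if down:
            length //= 2    # stride-2 downsampling halves the temporal length
    shapes.append(("bottleneck", channels[-1], length))
    return shapes

for stage, num_channels, length in unet_encoder_shapes(120):
    print(f"{stage}: {num_channels} x {length}")
```

For a one-minute clip sampled at 2 frames per second (T = 120), the trace gives 128 × 120, 256 × 60, 512 × 30, and a 512 × 30 bottleneck, matching the T, T/2, and T/4 resolutions in the description.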

The Temporal U-Net follows an encoder–decoder architecture with skip connections, enabling the model to jointly learn fine-grained local motion patterns and long-range temporal context. Let the ViT-encoded frame features be denoted in Equation (4).

F^{(0)} = [f_{1}, f_{2}, \dots, f_{T}] \in R^{T \times D}

(4)

where T is the number of temporally downsampled frames and

D = 768

is the unified feature dimension. This sequence serves as the input to the Temporal U-Net encoder.

3.4.1. Encoder

The encoder consists of

L = 3

hierarchical levels, each progressively reducing the temporal resolution while increasing the representational capacity. At encoder level

l \in {0, \dots, L - 1}

, the input feature map

F^{(l)} \in R^{T_{l} \times C_{l}}

is first processed by a temporal convolution block followed by a nonlinear activation in Equation (5):

{\tilde{F}}^{(l)} = σ (W^{(l)} * F^{(l)} + b^{(l)})

(5)

where ∗ denotes a 1D convolution along the temporal axis,

W^{(l)}

and

b^{(l)}

are learnable convolutional weights and biases, and

σ (\cdot)

denotes a nonlinear activation function (GELU in our implementation). The resulting feature map is then temporally downsampled with a strided convolution (Equation (6)):

F^{(l + 1)} = Downsample ({\tilde{F}}^{(l)})

(6)

which reduces the temporal length such that

T_{l + 1} = \frac{1}{2} T_{l}

. This hierarchical encoding increases the effective temporal receptive field at deeper layers, allowing the network to capture long-range temporal structure.

Together, Equations (5) and (6) implement a single encoder level. Equation (5) applies a 1D convolution along the temporal axis to extract local temporal patterns, followed by GELU nonlinearity for modeling complex dynamics. Equation (6) halves the temporal resolution via strided convolution, doubling the effective receptive field. After L = 3 levels, the deepest features span 8× the original temporal window, capturing both rapid local changes and slow global trends.
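A single encoder level can be sketched as follows. This is a simplified, single-channel illustration of Equations (5) and (6), whereas the actual backbone uses 128 to 512 channels; the kernel values, the zero padding, and the helper names are assumptions made only for this example.

```python
import math

def gelu(x):
    # Exact GELU expressed with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def conv1d(seq, kernel, bias=0.0, stride=1):
    """Zero-padded 1D cross-correlation over time (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    out = []
    for start in range(0, len(seq), stride):
        window = padded[start:start + len(kernel)]
        out.append(sum(w * x for w, x in zip(kernel, window)) + bias)
    return out

def encoder_level(seq, conv_kernel, down_kernel):
    activated = [gelu(v) for v in conv1d(seq, conv_kernel)]  # Equation (5)
    return conv1d(activated, down_kernel, stride=2)          # Equation (6)

signal = [0.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]  # toy input, T = 8
level_out = encoder_level(signal, [0.25, 0.5, 0.25], [0.5, 0.5, 0.0])
```

With a stride of 2, the output has half as many time steps as the input, which is the halving stated after Equation (6).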

3.4.2. Temporal Self-Attention

While temporal convolutions handle local motion dynamics well, their receptive fields are too limited to model extended temporal relationships. Consequently, temporal self-attention modules are added at selected encoder stages, typically those at coarser temporal scales. For a feature map

F \in R^{T \times D}

, the self-attention mechanism computes pairwise temporal correlations as in Equation (7):

A = Softmax (\frac{F F^{⊤}}{\sqrt{D}})

(7)

where

A \in R^{T \times T}

is the attention matrix. The attended representation is then obtained by Equation (8).

F_{att} = A F

(8)

Finally, the attention-weighted representation is combined with the original feature map through a residual connection and then passed through layer normalization, which stabilizes optimization and supports robust propagation of information across distant temporal positions.

Equations (7) and (8) implement temporal self-attention. Equation (7) computes a

T \times T

attention matrix whose entries represent pairwise frame relevance, scaled by the square root of the feature dimension so that the softmax does not saturate and gradients stay well-behaved. Equation (8) produces the attended representation as a weighted average over all frames. The key advantage over convolution is that any frame can directly attend to any other regardless of temporal distance, which is critical for capturing long-range dependencies such as recurring events or thematic continuity.
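The two equations can be traced on a toy input with the sketch below. It follows Equations (7) and (8) literally, without learned query, key, or value projections, and omits the residual connection and layer normalization; the function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_self_attention(F):
    """F is a list of T feature vectors, each of dimension D."""
    T, D = len(F), len(F[0])
    # Scaled pairwise dot products: F F^T / sqrt(D).
    scores = [[sum(a * b for a, b in zip(fi, fj)) / math.sqrt(D) for fj in F]
              for fi in F]
    A = [softmax(row) for row in scores]                    # Equation (7)
    F_att = [[sum(A[i][j] * F[j][d] for j in range(T)) for d in range(D)]
             for i in range(T)]                             # Equation (8)
    return A, F_att

F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy input: T = 3, D = 2
A, F_att = temporal_self_attention(F)
```

Each row of `A` sums to one, so every attended vector in `F_att` is a convex combination of the input frames.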

3.4.3. Bottleneck

At the deepest level of the encoder, where the temporal resolution reaches its minimum, a bottleneck block consisting of a series of temporal convolutions and self-attention layers is used. This block aggregates temporal context globally and produces a compact representation of the entire video sequence at the coarsest scale.

3.4.4. Decoder

The decoder mirrors the encoder, progressively restoring the temporal resolution back to T. At each decoder stage, the feature map is upsampled with a transposed convolution as in Equation (9) and then concatenated with the corresponding encoder feature map through a skip connection. A temporal convolution block then refines the fused features, allowing the decoder to recover fine-grained temporal details while retaining the global context learned at the deeper layers.

G^{(l)} = Upsample (G^{(l + 1)})

(9)

3.4.5. Output

After the final decoding stage, a

1 \times 1

temporal convolution projects the decoder output to the target feature dimension, yielding the final Temporal U-Net representation:

U \in R^{T \times D}

. This output encodes frame-level features enriched by multi-scale temporal context and serves as the input to the subsequent hierarchical shot-based transformer scoring head.

Overall, the Temporal U-Net backbone effectively combines local temporal modeling through convolutions with global dependency modeling through self-attention, enabling robust and context-aware frame representations for video summarization.

3.5. Shot-Level Hierarchical Transformer Scoring Head

While the Temporal U-Net backbone produces temporally enriched frame-level features, it does not explicitly exploit the higher-level structure of videos in terms of shots. To incorporate semantic and structural context, we introduce a hierarchical shot-based transformer scoring head, illustrated in Figure 4. This module models temporal dependencies at two complementary levels: (i) the shot level, capturing inter-shot relationships and global narrative structure, and (ii) the frame level, refining fine-grained importance predictions within each shot. Frame features from the U-Net (T × 768) are assigned to shots via KTS-based boundary detection. Mean pooling produces K shot-level features (K × 768). A Shot-Level Transformer Encoder (1 layer, 4 heads, FF dim 1024) models inter-shot relationships. Encoded shot representations are broadcast back to frame-level resolution via additive fusion (T × 768). A Frame-Level Transformer Encoder (2 layers, 4 heads) refines frame-level importance. An MLP Scoring Head with sigmoid produces per-frame importance scores (T × 1).

3.5.1. Shot-Level Feature Construction

Let the output of the Temporal U-Net backbone be denoted as

U = [u_{1}, u_{2}, \dots, u_{T}] \in R^{T \times D}

, where

u_{i}

is the multi-scale temporal representation of frame i. Based on the shot segmentation described in Section 3.2, each frame i belongs to a shot indexed by

k (i) \in {1, \dots, K}

.

We construct a shot-level representation for each shot k by aggregating the frame features belonging to that shot using mean pooling with Equation (10).

s_{k} = \frac{1}{| I_{k} |} \sum_{i \in I_{k}} u_{i}

(10)

where

I_{k} = {i ∣ k (i) = k}

denotes the set of frame indices within shot k. If a shot contains no frames due to boundary effects, we substitute the global mean feature to ensure numerical stability.

Mean pooling in Equation (10) compresses all frame-level features within a shot into a single representative vector, capturing the overall visual and temporal characteristics of each segment for modeling inter-shot relationships in the subsequent transformer encoder.
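The pooling step, including the fallback for empty shots, can be sketched as follows. Shot indices are 0-based here for convenience, and the function and variable names are hypothetical.

```python
def shot_mean_pool(frames, shot_of_frame, num_shots):
    """frames: T vectors; shot_of_frame[i] is the 0-based shot index of frame i."""
    dim = len(frames[0])
    global_mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    shots = []
    for k in range(num_shots):
        members = [f for f, s in zip(frames, shot_of_frame) if s == k]
        if not members:
            shots.append(list(global_mean))  # fallback for an empty shot
        else:
            shots.append([sum(f[d] for f in members) / len(members)
                          for d in range(dim)])  # Equation (10)
    return shots

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
shot_of_frame = [0, 0, 1, 1]  # toy segmentation; shot 2 is empty
shots = shot_mean_pool(frames, shot_of_frame, num_shots=3)
```

In this toy case the third shot has no frames, so it receives the global mean feature rather than an undefined average.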

3.5.2. Shot-Level Transformer Encoding

The sequence of shot embeddings

S = [s_{1}, s_{2}, \dots, s_{K}] \in R^{K \times D}

is passed through a transformer encoder to model inter-shot dependencies and global video structure. This transformer captures relationships between shots, such as recurring events or long-range semantic transitions, which are difficult to infer from frame-level processing alone. The shot-level transformer output is given by Equation (11).

S^{'} = {Transformer}_{shot} (S)

(11)

where

S^{'} \in R^{K \times D}

contains context-aware shot representations.

3.5.3. Broadcasting Shot Context to Frames

To propagate shot-level context back to individual frames, the encoded shot representation

s_{k (i)}^{'}

corresponding to frame i’s shot is broadcast and added to the frame feature (Equation (12)):

{\tilde{u}}_{i} = u_{i} + s_{k (i)}^{'}

(12)

yielding a shot-aware frame representation

{\tilde{u}}_{i}

. This fusion allows each frame to inherit semantic context from its enclosing shot, enabling coherent importance estimation within longer temporal segments.
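As a small illustration of Equation (12), the sketch below adds each encoded shot vector to every frame of that shot. The values standing in for the encoded shots are arbitrary, and the names are hypothetical.

```python
def broadcast_shot_context(frames, encoded_shots, shot_of_frame):
    """Return u~_i = u_i + s'_{k(i)} for every frame i."""
    return [[u + s for u, s in zip(frame, encoded_shots[k])]
            for frame, k in zip(frames, shot_of_frame)]

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
encoded_shots = [[10.0, 20.0], [100.0, 200.0]]  # arbitrary stand-in for S'
shot_of_frame = [0, 0, 1]                        # 0-based shot index per frame
fused = broadcast_shot_context(frames, encoded_shots, shot_of_frame)
```

Frames that share a shot receive the same additive context, while their individual features remain distinguishable.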

3.5.4. Frame-Level Transformer Refinement

The shot-aware frame features

\tilde{U} = [{\tilde{u}}_{1}, {\tilde{u}}_{2}, \dots, {\tilde{u}}_{T}]

are then processed by a second transformer encoder operating at the frame level (Equation (13)):

H = {Transformer}_{frame} (\tilde{U})

(13)

where

H \in R^{T \times D}

. This stage refines frame-level representations by modeling fine-grained temporal dependencies, while being informed by the global shot-level context injected in the previous step.

3.5.5. Frame Importance Prediction

Finally, a lightweight multilayer perceptron (MLP) followed by a sigmoid activation is applied to each frame representation

h_{i}

to produce a scalar importance score (Equation (14)):

{\hat{y}}_{i} = σ (w^{⊤} h_{i} + b)

(14)

where

{\hat{y}}_{i} \in [0, 1]

denotes the predicted importance of frame i. These frame-level importance scores form the basis for the subsequent key-shot selection and summary generation process.

Overall, the hierarchical shot-based transformer head enables the model to jointly reason about global video structure and local frame saliency. By explicitly separating shot-level and frame-level modeling, the proposed design improves temporal coherence and semantic consistency in the generated video summaries.

3.6. Knapsack-Based Summary Generation

The hierarchical transformer scoring head produces a frame-level importance score

{\hat{y}}_{i} \in [0, 1]

for each temporally downsampled frame

i = 1, \dots, T

. However, a dynamic video summary is ultimately required to consist of temporally coherent segments (shots) whose total duration does not exceed a predefined length budget. To enforce this constraint while maximizing the overall informativeness of the summary, we adopt a knapsack-based shot selection strategy, which has become a standard practice in supervised video summarization benchmarks.

3.6.1. Frame-to-Shot Score Aggregation

Given the predicted frame importance scores

{{\hat{y}}_{i}}_{i = 1}^{T}

and the shot segmentation obtained during preprocessing, we first convert frame-level scores into shot-level scores. Let shot k span the set of original frame indices

[a_{k}, b_{k}]

. The importance score of shot k is computed as the average importance of its constituent frames (Equation (15)):

v_{k} = \frac{1}{b_{k} - a_{k} + 1} \sum_{i = a_{k}}^{b_{k}} {\hat{y}}_{i}

(15)

To improve numerical stability and ensure integer-valued utilities, the shot scores are scaled by a constant factor (multiplied by 1000) before knapsack optimization, following prior work [31].
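Equation (15) and the subsequent scaling can be sketched as below. The sketch assumes that the frame scores are already indexed on the same timeline as the shot boundaries and that scaled values are rounded to the nearest integer; neither detail is specified in the text, and the names are hypothetical.

```python
def shot_utilities(frame_scores, shot_bounds, scale=1000):
    """shot_bounds: inclusive (a_k, b_k) frame index pairs, 0-based here."""
    utilities = []
    for a, b in shot_bounds:
        mean_score = sum(frame_scores[a:b + 1]) / (b - a + 1)  # Equation (15)
        utilities.append(int(round(mean_score * scale)))       # assumed rounding
    return utilities

frame_scores = [0.2, 0.4, 0.9, 0.7, 0.8, 0.1]
shot_bounds = [(0, 1), (2, 4), (5, 5)]
utilities = shot_utilities(frame_scores, shot_bounds)
```

Averaging rather than summing keeps the utility independent of shot duration, so the duration enters the optimization only through the length constraint.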

3.6.2. Length Constraint and Budget Definition

Let

n_{k}

denote the duration (in number of frames) of shot k, and let N be the total number of frames in the original video. We define a summary length budget as a fixed proportion

α

of the video length (Equation (16)):

B = ⌊ α \cdot N ⌋

(16)

where

α = 0.20

in our experiments, a value chosen for its higher F1-score, as discussed in Section 4.5. The goal is to select a subset of shots whose total length does not exceed B.

3.6.3. Knapsack Formulation

The summary generation problem can now be formulated as a 0–1 knapsack optimization (Equation (17)):

\max_{z} \sum_{k = 1}^{K} z_{k} v_{k} \quad \text{s.t.} \quad \sum_{k = 1}^{K} z_{k} n_{k} \leq B, \quad z_{k} \in {0, 1}

(17)

where

z_{k} = 1

indicates that shot k is selected for the summary and

z_{k} = 0

otherwise. This formulation explicitly balances summary informativeness and compactness under a strict duration constraint. Each shot has a value (importance) and weight (duration), and the goal is to maximize total value without exceeding the capacity (summary budget). The dynamic programming solution runs in O(KB) time and guarantees globally optimal shot selection.

3.6.4. Optimal Shot Selection

We solve the knapsack problem using dynamic programming to obtain the optimal set of shots that maximizes the total importance score while satisfying the length budget. The selected shots are then concatenated in their original temporal order to form the final video summary. This approach guarantees globally optimal shot selection under the given constraint and avoids heuristic thresholding of frame scores.
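A minimal sketch of the dynamic program for Equation (17), together with the budget of Equation (16), is given below. The shot utilities and durations are hypothetical values chosen for illustration, and the table-based backtracking is one common way to recover the selected set rather than a description of the authors' code.

```python
import math

def knapsack_select(values, lengths, budget):
    """Return the sorted indices of the shots chosen by 0-1 knapsack."""
    K = len(values)
    best = [[0] * (budget + 1) for _ in range(K + 1)]
    for k in range(1, K + 1):
        v, n = values[k - 1], lengths[k - 1]
        for cap in range(budget + 1):
            best[k][cap] = best[k - 1][cap]                 # skip shot k
            if n <= cap and best[k - 1][cap - n] + v > best[k][cap]:
                best[k][cap] = best[k - 1][cap - n] + v     # take shot k
    chosen, cap = [], budget
    for k in range(K, 0, -1):                               # backtrack
        if best[k][cap] != best[k - 1][cap]:
            chosen.append(k - 1)
            cap -= lengths[k - 1]
    return sorted(chosen)

values = [300, 800, 100, 650]   # hypothetical scaled shot utilities v_k
lengths = [40, 30, 10, 25]      # hypothetical shot durations n_k (frames)
total_frames = 300              # N
budget = math.floor(0.20 * total_frames)   # Equation (16) with alpha = 0.20
selected = knapsack_select(values, lengths, budget)
```

For these values the budget is 60 frames and the program selects shots 1 and 3 (0-based), whose combined length of 55 frames fits the budget. The table has (K + 1) × (B + 1) entries, which matches the O(KB) running time stated above.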

3.6.5. Final Summary Construction

Once the optimal shot indices are obtained, all frames belonging to the selected shots are marked as summary frames, yielding a binary summary mask over the original video timeline. This mask is subsequently used for evaluation against human-annotated summaries.

3.7. Training Objective and Weighted Focal Loss

The goal of training is to learn a frame-level importance predictor that assigns high scores to frames included in human-annotated summaries while suppressing redundant or uninformative content. This is formulated as a binary classification problem at the frame level, where each frame is labeled as either important or unimportant based on ground-truth summaries.

3.7.1. Ground-Truth Target Construction

For each training video, we are given a binary ground-truth summary mask over the original frame sequence, indicating whether each frame belongs to the reference summary. Since the model operates on temporally downsampled frames, we convert the frame-level ground truth into a binary target vector aligned with the downsampled timeline. Specifically, a downsampled frame index is labeled positive if at least one of the original frames it represents is marked as important in the ground truth. This yields a binary target vector

y = [y_{1}, y_{2}, \dots, y_{T}], y_{i} \in {0, 1}

, where T denotes the number of downsampled frames.
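The label conversion can be sketched as follows, assuming that each downsampled frame represents a block of 15 consecutive original frames (the sampling step given in Section 4.2). The exact window alignment is not specified in the text, so the blocking used here is an assumption, and the function name is hypothetical.

```python
def downsample_labels(frame_labels, step=15):
    """Mark a downsampled frame positive if any original frame it covers is positive."""
    return [1 if any(frame_labels[start:start + step]) else 0
            for start in range(0, len(frame_labels), step)]

# Toy mask over N = 55 original frames: positives at indices 29 to 44.
frame_labels = [0] * 29 + [1] * 16 + [0] * 10
targets = downsample_labels(frame_labels)
```

The second block is labeled positive even though only its last frame is important, which illustrates the "at least one" rule described above.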

3.7.2. Class Imbalance in Video Summarization

Video summarization datasets exhibit severe class imbalance, as only a small fraction of frames typically belong to the summary. Directly applying standard binary cross-entropy loss often biases the model toward predicting low importance scores. To address this issue, we employ a weighted focal loss that dynamically down-weights easy negatives while emphasizing hard and rare positive examples [32].

3.7.3. Weighted Focal Loss Formulation

Let

{\hat{y}}_{i} \in [0, 1]

denote the predicted importance score for frame i, and let

y_{i} \in {0, 1}

be the corresponding ground-truth label. We define the focal loss for frame i as Equation (18).

L_{i} = - w_{i} {(1 - p_{i})}^{γ} log (p_{i})

(18)

where

p_{i} = \begin{cases} {\hat{y}}_{i}, & \text{if } y_{i} = 1, \\ 1 - {\hat{y}}_{i}, & \text{if } y_{i} = 0, \end{cases}

and

γ = 2.0

is the focusing parameter that controls the degree to which easy examples are down-weighted.

3.7.4. Adaptive Class Weighting

The weight

w_{i}

is selected based on the class of frame i:

w_{i} = \begin{cases} w_{pos}, & \text{if } y_{i} = 1, \\ w_{neg}, & \text{if } y_{i} = 0, \end{cases}

where the positive and negative weights are computed adaptively for each video using median frequency balancing (Equation (19)):

w_{pos} = \frac{median (P_{pos}, P_{neg})}{P_{pos}}, w_{neg} = \frac{median (P_{pos}, P_{neg})}{P_{neg}}

(19)

with

P_{pos}

and

P_{neg}

denoting the proportions of positive and negative frames, respectively, in the downsampled sequence.

The focal loss modulating factor

{(1 - p_{i})}^{γ}

dynamically down-weights well-classified frames: with

γ = 2.0

, the loss for a confident prediction

(p_{i} = 0.9)

is reduced by 100× compared to standard cross-entropy, while misclassified frames are barely affected. This is critical because only 15–20% of frames are labeled important; without focal loss, the model converges to predicting all frames as unimportant. The class weight

w_{i}

(Equation (19)) provides additional static rebalancing based on class proportions.

3.7.5. Overall Training Objective

The final training loss for a video is obtained by averaging the focal loss over all downsampled frames (Equation (20)):

L_{imp} = \frac{1}{T} \sum_{i = 1}^{T} L_{i}

(20)

This loss encourages the model to assign high importance scores to summary frames while suppressing redundant content, even under extreme class imbalance.
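Equations (18) to (20) can be combined into a single per-video computation, as in the sketch below. The clamping constant `eps` is an assumption added to avoid taking the logarithm of zero and is not part of the formulation; the function name is hypothetical.

```python
import math

def weighted_focal_loss(preds, targets, gamma=2.0, eps=1e-7):
    """Mean weighted focal loss over one video (Equations (18)-(20))."""
    T = len(targets)
    p_pos = sum(targets) / T
    p_neg = 1.0 - p_pos
    med = (p_pos + p_neg) / 2.0        # the median of two values is their mean
    w_pos = med / max(p_pos, eps)      # Equation (19)
    w_neg = med / max(p_neg, eps)
    total = 0.0
    for y_hat, y in zip(preds, targets):
        p = y_hat if y == 1 else 1.0 - y_hat
        p = min(max(p, eps), 1.0)      # assumed clamp to avoid log(0)
        w = w_pos if y == 1 else w_neg
        total += -w * (1.0 - p) ** gamma * math.log(p)   # Equation (18)
    return total / T                                      # Equation (20)

preds = [0.9, 0.2, 0.1, 0.6, 0.3]
targets = [1, 0, 0, 0, 0]
loss = weighted_focal_loss(preds, targets)
modulating_factor = (1.0 - 0.9) ** 2.0   # factor applied when p_i = 0.9
```

The modulating factor evaluates to roughly 0.01, which is the 100× reduction relative to cross-entropy mentioned above for a confident prediction.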

Overall, the combination of hierarchical modeling and weighted focal loss enables robust learning of frame importance distributions, leading to more accurate and balanced video summaries.

3.8. Evaluation Metrics

We evaluate the quality of the generated video summaries using two complementary metrics that are standard in supervised video summarization benchmarks: F1-score and diversity. The F1-score measures agreement with human-annotated reference summaries, while the diversity metric assesses the visual variety of the selected content. Together, these metrics capture both relevance and non-redundancy in the generated summaries.

3.8.1. F1-Score

Let

s \in {0, 1}^{N}

denote the binary summary mask produced by the model for a video with N original frames, where

s_{i} = 1

indicates that frame i is selected in the summary. Let

{s^{(u)}}_{u = 1}^{U}

denote the set of binary reference summaries provided by U human annotators. For each user summary

s^{(u)}

, we compute the precision and recall as in Equation (21):

{Precision}^{(u)} = \frac{| s \cap s^{(u)} |}{| s |}, {Recall}^{(u)} = \frac{| s \cap s^{(u)} |}{| s^{(u)} |}

(21)

where

| \cdot |

denotes the number of selected frames. The F1-score with respect to user u is then given by Equation (22).

F 1^{(u)} = \frac{2 \cdot {Precision}^{(u)} \cdot {Recall}^{(u)}}{{Precision}^{(u)} + {Recall}^{(u)}}

(22)

Following the standard evaluation protocol, the final F1-score for a video is obtained by aggregating the per-user scores over all U annotators (conventionally the maximum for SumMe and the average for TVSum).
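The per-user computation in Equations (21) and (22) can be sketched as follows. The masks are toy examples, the zero-overlap case is mapped to an F1-score of zero by assumption, and the aggregation across users is deliberately left to the caller.

```python
def f1_against_user(machine, user):
    """machine, user: equal-length binary masks over the N original frames."""
    overlap = sum(m and u for m, u in zip(machine, user))
    if overlap == 0:
        return 0.0                      # assumed convention when nothing overlaps
    precision = overlap / sum(machine)  # Equation (21)
    recall = overlap / sum(user)
    return 2 * precision * recall / (precision + recall)  # Equation (22)

machine = [1, 1, 0, 0, 1, 0, 0, 0]
users = [[1, 0, 0, 0, 1, 1, 0, 0],
         [0, 0, 1, 1, 0, 0, 0, 0]]
per_user = [f1_against_user(machine, u) for u in users]
```

Here the first annotator shares two of three selected frames with the model, giving precision and recall of 2/3 each, while the second annotator shares none.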

3.8.2. Diversity

While the F1-score evaluates relevance with respect to human annotations, it does not explicitly penalize redundancy within the selected summary. To measure the visual diversity of the summary, we compute a diversity score based on pairwise feature similarity among selected frames.

Let

X = {x_{i} \in R^{D}}

denote the set of frame-level features, and let

S = {i ∣ s_{i} = 1}

be the set of indices of selected frames. The diversity score is defined as the average pairwise cosine dissimilarity (one minus cosine similarity) among selected frames (Equation (23)):

Div = \frac{1}{| S | (| S | - 1)} \sum_{i \in S} \sum_{j \in S, j \neq i} (1 - \frac{x_{i}^{⊤} x_{j}}{\| x_{i} \| \| x_{j} \|})

(23)

Lower pairwise similarity yields a larger score and therefore indicates higher diversity, implying that the selected frames capture a wider range of visual content. For consistency with prior work, we report the diversity score directly as computed in Equation (23).
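Equation (23) can be evaluated directly, as in the sketch below. It assumes non-zero feature vectors and returns zero when fewer than two frames are selected; both conventions are assumptions for this example, and the function name is hypothetical.

```python
import math

def diversity(features, mask):
    """Mean pairwise cosine dissimilarity over frames with mask == 1."""
    chosen = [x for x, s in zip(features, mask) if s == 1]
    n = len(chosen)
    if n < 2:
        return 0.0  # assumed convention: no pairs to compare

    def cosine(a, b):
        dot = sum(p * q for p, q in zip(a, b))
        return dot / (math.sqrt(sum(p * p for p in a)) *
                      math.sqrt(sum(q * q for q in b)))

    total = sum(1.0 - cosine(chosen[i], chosen[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))  # Equation (23)

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [3.0, 3.0]]
mask = [1, 1, 1, 0]
score = diversity(features, mask)
```

Two of the three selected frames are identical and the third is orthogonal to them, so four of the six ordered pairs contribute a dissimilarity of one and the score is 2/3.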

3.9. Training and Validation Procedure

Algorithm 1 presents the pseudocode for model training and validation. In each training iteration, we sample a video, compute its frame scores through the described network, and update the model using the focal loss. During validation, we predict frame scores for each video, perform shot segmentation and knapsack-based selection to generate the summary, and then compute the F1-score and diversity metrics by comparing against the ground truth. The entire model (except the ViT, which is initialized from a pretrained checkpoint) is trained end-to-end. We stop training when the validation F1-score converges or starts to decrease (early stopping), and report the test results using the best model.

Algorithm 1 Training and Validation Procedure for the Proposed Video Summarization Model

Require: Training set $D_{train}$, validation set $D_{val}$; summary proportion $α$; maximum epochs E; learning rate $η$; focal loss parameter $γ$.
Ensure: Trained model parameters $Θ$.

1: Initialize model parameters $Θ$ (ViT, Temporal U-Net, hierarchical transformer head)
2: Initialize optimizer (Adam) and learning rate scheduler
3: for epoch $= 1$ to E do
4:   Training Phase
5:   Set model to training mode
6:   for each video V in $D_{train}$ do
7:     Extract downsampled (2 FPS) frame features ${x_{t}}_{t = 1}^{T}$ (GoogLeNet Pool5)
8:     Project features and encode using pretrained ViT: $F \leftarrow ViTBlocks (Z)$
9:     Pass $F$ through Temporal U-Net to obtain multi-scale features $U$
10:     Build shot indices ${k (t)}$ from change points
11:     Compute shot features via mean pooling (Equation (10))
12:     Encode shot features using shot-level transformer
13:     Broadcast shot context to frame features (Equation (12))
14:     Refine frame features using frame-level transformer
15:     Predict frame importance scores ${{\hat{y}}_{t}}_{t = 1}^{T}$
16:     Construct binary ground-truth targets ${y_{t}}_{t = 1}^{T}$
17:     Compute weighted focal loss $L_{imp}$ (Equation (20))
18:     Update model parameters $Θ$ via backpropagation
19:   end for
20:   Validation Phase
21:   Set model to evaluation mode
22:   for each video V in $D_{val}$ do
23:     Predict frame importance scores ${{\hat{y}}_{t}}$
24:     Convert frame scores to shot scores
25:     Select summary shots using knapsack optimization (Equation (17))
26:     Generate binary summary mask $s$
27:     Compute F1-score against user summaries (Equation (22))
28:     Compute diversity score (Equation (23))
29:   end for
30:   Update learning rate based on validation loss and F1-score
31:   Save model checkpoint if validation F1-score improves
32: end for
33: return Trained model parameters $Θ$

In summary, our proposed methodology combines a ViT-based frame encoder for powerful spatial representations, a U-Net temporal backbone for multi-scale temporal feature learning with self-attention-based context aggregation, and a hierarchical shot-based transformer scoring head that exploits shot-wise structure to enhance frame importance prediction. This end-to-end trainable architecture is designed to predict accurate importance scores while maintaining diversity and coverage of the video content, as evidenced by the use of the knapsack optimization and the evaluation metrics described. The next section presents experimental results demonstrating the effectiveness of this approach.

4. Experiments

This section presents a comprehensive experimental evaluation of the proposed framework. We describe the benchmark datasets (Section 4.1) and experimental setup (Section 4.2), then report an ablation study on architectural components (Section 4.3), hyperparameter tuning results (Section 4.4), quantitative comparison with state-of-the-art methods (Section 4.5), qualitative analysis (Section 4.6), and sensitivity analysis on focal loss and learning rate (Section 4.7).

4.1. Dataset

We evaluate our model on four standard video summarization benchmarks: SumMe [13], TVSum [33], OVP [12], and YouTube [12]. These datasets cover diverse genres (e.g., sports, documentaries, user-generated content) and vary in length and number of human annotations. Table 1 summarizes the statistics of each dataset, including video count, content type, duration range, and number of annotators per video.

4.2. Experimental Setting

We follow the standard evaluation protocol using 5-fold cross-validation on the SumMe and TVSum datasets under the canonical setting. In all experiments, the model is trained for 300 epochs in each fold, and the final performance is reported as the average over the five folds.

The input sampling rate of every 15th frame (approximately 2 FPS) follows the standard preprocessing protocol established in the foundational video summarization benchmarks [12,13,14] and adopted by all compared methods [16,17,18,19,21,22,23,24,25,26,28,34,35,36], ensuring that the evaluation conditions are directly comparable across approaches. This fixed rate captures one frame per half-second, which is sufficient to represent visually distinct moments while avoiding redundant near-duplicate frames.

All experiments are implemented in PyTorch 2.10.0 and executed on Google Colab using an NVIDIA L4 GPU. The model is trained end-to-end using the Adam optimizer with a learning rate of

1.66 \times 10^{- 4}

and an

L_{2}

weight decay of

8.376 \times 10^{- 4}

. To address the severe class imbalance inherent in video summarization, we employ a weighted focal loss with a focusing parameter

γ = 2

for frame-level importance prediction.

Details regarding module-wise design choices and hyperparameter selection for the Temporal U-Net and the shot-based hierarchical transformer are provided in their respective subsections.

4.3. Module Selection

We first perform an ablation study to assess the impact of each major architectural component in the proposed framework. Table 2 reports the results for different combinations of (a) using a pretrained Vision Transformer (ViT) encoder versus non-pretrained features, (b) employing the proposed Temporal U-Net backbone versus a simpler global-attention (GA) baseline, and (c) adopting a hierarchical transformer scoring head versus a plain MLP head. All experiments are conducted on the SumMe dataset under the canonical setting with a fixed summary length of 20%.

To ensure a fair comparison, all ablation experiments are carried out under the same training and architectural configuration. Specifically, we use a learning rate of

5 \times 10^{- 5}

with

L_{2}

regularization of

1 \times 10^{- 5}

and a focal loss focusing parameter

γ = 1

. The Temporal U-Net is configured with four levels, a base channel width of 256, dropout rate of 0.2, and multi-head temporal attention enabled with four attention heads. Feature fusion across U-Net scales is also enabled. For the shot-based scoring head, we employ two attention heads, 3 transformer layers at both the shot and frame levels, a feed-forward dimension of 1024, and a dropout rate of 0.2. Apart from the module being ablated, all other components are kept identical across experiments.

From Table 2, we observe that incorporating the pretrained ViT encoder consistently improves performance, highlighting the importance of strong spatial representations. Replacing the GA backbone with the proposed Temporal U-Net leads to a substantial gain in F1-score, increasing performance from approximately 51% to around 65%, demonstrating the effectiveness of multi-scale temporal modeling. Furthermore, when using the U-Net backbone, the hierarchical transformer head slightly outperforms the MLP-based scoring head (65.02% vs. 64.54%), while maintaining comparable diversity. Overall, the combination of a pretrained ViT encoder, Temporal U-Net backbone, and hierarchical transformer scoring head achieves the best performance and is therefore adopted in all subsequent experiments.

To verify that the architectural integration is synergistic rather than merely additive, we quantify the interaction effect in Table 3. Using the base configuration (No ViT + GA + MLP = 51.17% F1) as reference, the marginal gains of adding each component independently are computed. If the combination were purely additive, the full model should achieve approximately 54.36%. The observed performance of 65.02% exceeds this additive prediction by 10.66 percentage points, providing strong evidence of super-additive synergy. This interaction arises because the ViT’s global contextualization is a necessary precondition for the U-Net to achieve its full potential (ViT + U-Net + MLP = 64.54% vs. No ViT + U-Net + MLP = 54.36%, an amplification of 10.18 percentage points), and the hierarchical head further leverages the enriched multi-scale features.

4.4. Hyperparameter Tuning

All hyperparameter tuning experiments are conducted on the SumMe dataset under the canonical setting with a fixed summary length of 20%. We focus on tuning the most influential components of the proposed architecture, namely the Temporal U-Net backbone and the shot-based hierarchical transformer scoring head, while keeping all other components fixed.

Table 4 reports the top-performing configurations for the Temporal U-Net. We vary the base channel width, dropout rate, use of temporal attention, number of attention heads, and whether multi-scale feature fusion is enabled. The best-performing configuration employs 128 base channels, a dropout rate of 0.4, temporal attention with 2 heads, and no feature fusion, achieving an F1-score of 67.98%. Several nearby configurations yield comparable performance (ranging from 67.1% to 65.8%), indicating that the proposed U-Net design is robust to moderate variations in architectural choices.

To justify the choice of encoder depth, both

L = 3

and

L = 4

were explored during the hyperparameter search. As shown in Table 4,

L = 3

configurations consistently outperform

L = 4

configurations. The best

L = 3

setting achieves 67.98% F1, while the best

L = 4

setting reaches only 64.71%—a gap of 3.27 percentage points. Across all tested configurations, the median F1 for

L = 3

is 64.69% compared to 59.74% for

L = 4

, and 70% of

L = 3

configurations exceed 64.5% F1 versus only 22% for

L = 4

. This consistent degradation at

L = 4

is attributable to excessive temporal compression at the bottleneck: for the shortest videos (

T = 64

),

L = 4

reduces the bottleneck to only four temporal tokens, which is insufficient for meaningful self-attention. The choice of

L = 3

therefore provides the optimal balance between temporal receptive field coverage (spanning

2^{3} = 8 \times

the original temporal window) and information preservation.
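The bottleneck lengths quoted above follow from repeated halving, as the short sketch below shows. It assumes one halving per encoder level, as in Equation (6), and uses integer division for lengths that are not powers of two, which is an assumption; the function name is hypothetical.

```python
def bottleneck_tokens(num_frames, levels):
    """Temporal length after `levels` successive halvings (T_{l+1} = T_l / 2)."""
    length = num_frames
    for _ in range(levels):
        length //= 2  # assumed floor division for odd lengths
    return length

shortest = 64  # shortest sequence length cited in the text
tokens_l3 = bottleneck_tokens(shortest, 3)
tokens_l4 = bottleneck_tokens(shortest, 4)
```

For the shortest video, three levels leave eight temporal tokens at the bottleneck, whereas four levels leave only four.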

Table 5 presents the tuning results for the shot-based hierarchical transformer head. We explore different numbers of attention heads, depths of the shot-level and frame-level transformer encoders, feed-forward network dimensions, and dropout rates. The optimal configuration uses four attention heads, one shot-level transformer layer, two frame-level transformer layers, a feed-forward dimension of 1024, and a dropout rate of 0.4, again achieving an F1-score of 67.98%. Configurations with fewer heads or reduced depth generally lead to a gradual performance degradation, with F1-scores decreasing to approximately 65%.

Overall, these results demonstrate that the selected hyperparameter settings for both the Temporal U-Net and the hierarchical transformer head are near-optimal under the canonical evaluation protocol and provide a strong balance between model capacity and generalization performance.

4.5. Quantitative Analysis

The relationship between F1-score and diversity as a function of summary length is shown in Figure 5. When the summary is very short (for example, 5–10%), the generated summaries show comparatively high diversity but fail to reflect the key content, which leads to a low F1-score. As the summary length increases, the F1-score gradually improves, and diversity also gradually increases with the added length. The best trade-off between coverage and diversity is found at around 20% summary length, where the F1-score reaches its maximum while diversity remains high. This empirical finding justifies our choice of a 20% summary length for all subsequent experiments and represents an important design contribution of the present study.

It is worth noting that the fixed-budget evaluation protocol adopted in this study is a deliberate design choice to ensure fair and direct comparison with all prior methods, which universally evaluate under fixed budgets. While adaptive per-video budgets may appear more realistic for deployment, implementing such a mechanism introduces a circular dependency: a learned budget predictor must be trained against oracle budgets derived from the same human annotations used for importance scoring, and inter-annotator variability in summary length means that a single correct per-video budget does not exist. The systematic length analysis presented in Figure 5 already addresses the core concern by identifying 20% as the consistently optimal operating point, providing evidence-based guidance without the additional complexity of an adaptive module.

We further examine the computational efficiency of the proposed model. Table 6 summarizes the total number of parameters and the inference cost for videos of minimum, median, and maximum length from the SumMe and TVSum datasets. The model has 105.5 M parameters, and its computational cost scales linearly with video length. Figure 6 shows how the computational complexity, measured in giga floating-point operations (GFLOPs), grows linearly with the number of input frames T, which reflects the design of the temporal backbone and transformer modules. The computational complexity was profiled using fvcore.nn.FlopCountAnalysis (Facebook Research), which computes the total floating-point operations for a single forward pass. Inference latency and FPS were measured using PyTorch CUDA event timing (torch.cuda.Event with synchronization) on an NVIDIA L4 GPU in evaluation mode. GFLOPs were computed for six sequence lengths (T = 64, 167, 293, 380, 649, 1294) corresponding to the min/median/max video lengths from SumMe and TVSum. The plot was rendered using Matplotlib 3.10.9.

As a concrete example, on the SumMe dataset, a short video of 64 frames requires 11.8 GFLOPs and runs at about 84 frames per second (FPS) at inference. In comparison, a much longer video of 649 frames requires 119.5 GFLOPs and runs at a lower speed of approximately 47 FPS. A similar linear trend appears on the TVSum dataset, where even the longest videos (1294 frames) are processed at more than 21 FPS. These results show that the proposed model is computationally efficient and can therefore be used in practical video summarization scenarios.
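
The linear scaling claim can be checked directly from the numbers in Table 6. The sketch below fits an ordinary least-squares line to GFLOPs versus T and recomputes throughput from the latency column. The helper name `least_squares_line` is hypothetical, and only values copied from Table 6 are used.

```python
# Values copied from Table 6: (dataset, frames T, GFLOPs, latency in ms, reported FPS)
TABLE6 = [
    ("SumMe", 64, 11.789, 11.863, 84.29),
    ("SumMe", 293, 53.955, 12.122, 82.50),
    ("SumMe", 649, 119.513, 21.441, 46.64),
    ("TVSum", 167, 30.747, 11.874, 84.22),
    ("TVSum", 380, 69.981, 12.032, 83.11),
    ("TVSum", 1294, 238.290, 46.694, 21.42),
]


def least_squares_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


frames = [row[1] for row in TABLE6]
gflops = [row[2] for row in TABLE6]
slope, intercept = least_squares_line(frames, gflops)

# GFLOPs per frame should be nearly constant if the cost is linear in T.
per_frame = [g / t for t, g in zip(frames, gflops)]

# Forward passes per second implied by the latency column.
passes_per_second = [1000.0 / row[3] for row in TABLE6]

print(f"slope = {slope:.5f} GFLOPs per frame, intercept = {intercept:.4f}")
for row, ratio, pps in zip(TABLE6, per_frame, passes_per_second):
    print(f"{row[0]:5s} T={row[1]:4d}  GFLOPs/T={ratio:.5f}  1000/latency={pps:.2f}  reported FPS={row[4]}")
```

The per-frame cost stays close to 0.184 GFLOPs for all six lengths, which is consistent with the linear trend in Figure 6. The reported FPS column coincides with 1000/latency to within rounding, which suggests that it counts full-sequence forward passes per second; this reading is an inference from the tabulated numbers rather than something stated in the text.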

Figure 7 presents the mean training and validation loss curves averaged over 5-fold cross-validation for both the SumMe and TVSum datasets under the canonical, augmented, and transfer settings. In all cases, the training loss decreases rapidly during the initial epochs, followed by a gradual convergence, indicating efficient optimization of the proposed architecture.

Within the canonical configuration, convergence is stable across both datasets: the validation loss follows the training loss after the initial phase of training. This pattern indicates that the model learns discriminative representations of frame importance without overfitting. Under the augmented setup, a slight increase in validation loss is observed in the first few epochs, reflecting the increased variation introduced by data augmentation; however, the loss curves stabilize over time, indicating improved generalization. Analogously, in the transfer configuration, the loss trajectories converge smoothly after the preliminary adaptation phase, which supports the model’s ability to transfer learned representations between datasets. Training and validation losses were logged at every epoch during the PyTorch training loop. The metric plotted is the weighted focal loss (Equation (20)) with $γ = 2.0$ and adaptive class weighting (Equation (19)). Loss values were averaged across all videos per epoch, then across all five folds. Plots were generated using Matplotlib 3.10.9.

Table 7 gives a quantitative comparison of the proposed approach with recent supervised video summarization methods under the canonical five-fold cross-validation protocol. The resulting F1-scores of 65.8% on SumMe and 79.92% on TVSum exceed the best previous results in the table, 58.4% on SumMe and 67.5% on TVSum, by 7.4 and 12.42 percentage points, respectively. This margin indicates the effectiveness of the proposed model.

To evaluate generalization, Table 8 reports performance under the canonical, augmented, and transfer splits. Across both datasets, the proposed method maintains good performance, achieving the highest scores in the canonical regime and competitive results under the augmented and transfer conditions. These results indicate that the proposed architecture sets a new state of the art in the conventional evaluation setting and generalizes well across different training and testing settings.

In addition to relevance measured by F1-score, we compare the visual diversity of the generated summaries with state-of-the-art supervised video summarization methods. Diversity evaluates the non-redundancy of selected frames and reflects how well a method captures varied visual content rather than repeatedly selecting similar segments. We follow the standard evaluation protocol and report average diversity scores computed using cosine-based feature dissimilarity among selected frames.
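
As an illustration of a cosine-based diversity score, the sketch below averages the pairwise cosine dissimilarity over all selected frames. It assumes that diversity is the mean of 1 − cosine similarity over unordered frame pairs; the exact normalization used in this study may differ, and the function names and toy feature vectors are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def diversity(features: Sequence[Sequence[float]]) -> float:
    """Mean pairwise cosine dissimilarity among the selected frame features."""
    pairs = list(combinations(range(len(features)), 2))
    if not pairs:
        return 0.0  # fewer than two frames: no pairwise dissimilarity to measure
    total = sum(1.0 - cosine_similarity(features[i], features[j]) for i, j in pairs)
    return total / len(pairs)


# Hypothetical features for three selected frames; frames 0 and 2 are identical.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_diversity = diversity(toy_features)
print(f"diversity = {toy_diversity:.4f}")
```

In the toy case, two of the three pairs are orthogonal and one pair is identical, so the score is 2/3. A summary that repeatedly selects near-duplicate frames drives the score towards zero.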

Table 9 presents the diversity comparison on the SumMe and TVSum datasets under the canonical setting. The proposed HybridHiT-UNet consistently achieves competitive diversity compared to existing methods, indicating its ability to balance relevance and non-redundancy. This improvement can be attributed to the multi-scale temporal modeling of the Temporal U-Net and the explicit incorporation of shot-level context through the hierarchical transformer head, which discourages redundant frame selection across temporally adjacent segments.

4.6. Qualitative Analysis

We also qualitatively evaluate the behavior of the proposed model by visually comparing the model-generated summaries with the user-annotated summaries. Figure 8 and Figure 9 show a representative subset from SumMe (Videos 13 and 15); for each video, the top 20 frames selected by the model are compared with the human annotators’ selection.

The model-selected frames closely match the user-selected frames in both videos, with strong correspondence around salient semantic and visual events. In Video 13, both the model and the users highlight extended interaction scenes and visually informative close-ups; in Video 15, both summaries focus on the main interaction segments and the recurring interaction patterns. A few minor discrepancies remain, mainly because of subjective user preferences or variations in temporal boundaries, but the two summaries are semantically consistent overall. This visual agreement indicates that the proposed hierarchical modeling captures the underlying narrative structure preferred by human annotators.

To better understand the temporal alignment, Figure 10 and Figure 11 show the mean indicator of the users’ summary versus the binary model-predicted indicator over time on the original video. These plots indicate significant overlap between the model-selected segments and the user-selected segments, especially in the vicinity of high-importance regions where there is a consensus among the users. The model is able to capture the most common intervals, which is a step towards approximating collective human judgment without overfitting to each individual annotation.
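
The overlap between a binary model mask and the user annotations can be quantified with a frame-level F1-score. The sketch below assumes the usual precision and recall over overlapping frames. Benchmarks in this area commonly take the maximum over annotators for SumMe and the average for TVSum; whether this study follows that convention is an assumption here, and the function names and toy masks are hypothetical.

```python
from typing import Sequence


def f1_overlap(pred: Sequence[int], user: Sequence[int]) -> float:
    """F1-score between two binary frame masks of equal length."""
    if len(pred) != len(user):
        raise ValueError("masks must have the same number of frames")
    overlap = sum(p * u for p, u in zip(pred, user))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(user)
    return 2.0 * precision * recall / (precision + recall)


def f1_against_users(pred: Sequence[int], users: Sequence[Sequence[int]],
                     reduce: str = "max") -> float:
    """Aggregate per-annotator F1 with 'max' or 'avg' reduction."""
    scores = [f1_overlap(pred, user) for user in users]
    if reduce == "max":
        return max(scores)
    if reduce == "avg":
        return sum(scores) / len(scores)
    raise ValueError("reduce must be 'max' or 'avg'")


# Hypothetical 10-frame video with two annotators.
model_mask = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
user_masks = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
]
f1_max = f1_against_users(model_mask, user_masks, reduce="max")
f1_avg = f1_against_users(model_mask, user_masks, reduce="avg")
print(f"max F1 = {f1_max:.4f}, mean F1 = {f1_avg:.4f}")
```

The first annotator overlaps the model on two of three frames (F1 = 2/3) and the second on one frame (F1 = 0.4), which shows how the choice of reduction changes the reported score for the same prediction.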

Figure 8 and Figure 9 were produced by running the trained HybridHiT-UNet model (best checkpoint from 5-fold cross-validation, canonical setting, 20% budget) and extracting the top 20 frames by importance score after knapsack-based shot selection. User-annotated frames were extracted from SumMe ground-truth annotations. Frame images were extracted using OpenCV 4.13.0 and arranged using Matplotlib 3.10.9. Figure 10 and Figure 11 overlay the mean user summary indicator (averaged across all annotators) against the binary model-predicted summary mask. Plots were generated using Matplotlib 3.10.9.

Overall, the qualitative results corroborate the quantitative findings by showing that the proposed method not only achieves high F1-scores but also produces summaries that are visually and semantically consistent with human expectations. The strong alignment across both frame-level selections and temporal importance patterns provides further evidence of the effectiveness of the proposed HybridHiT-UNet framework.

4.7. Ablation on Focal Loss $γ$ and Learning Rate

We conduct a focused ablation study to analyze the impact of the focal loss focusing parameter $γ$ and the learning rate on summarization performance under the SumMe canonical setting with a fixed summary length of 20%.

Figure 12 illustrates the effect of varying $γ$ in the focal loss. When $γ$ is small (e.g., $γ = 0.5$), the model behaves similarly to standard weighted cross-entropy, resulting in suboptimal performance. Increasing $γ$ improves the model’s ability to emphasize hard and informative frames, leading to a steady increase in F1-score. The best performance is achieved at $γ = 2.0$, after which the F1-score drops for larger values, indicating over-suppression of easy samples. Based on this observation, we fix $γ = 2.0$ in all experiments.
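
To make the role of $γ$ concrete, the sketch below evaluates a per-frame binary focal loss in the standard form of Lin et al. [32], $-α_t (1 - p_t)^{γ} \log p_t$, on one easy and one hard positive frame. It is an illustration only: the function name `focal_loss`, the fixed class weight `alpha_pos`, and the two example probabilities are assumptions, and the adaptive class weighting of Equation (19) is not modeled.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha_pos: float = 0.5) -> float:
    """Per-frame binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    eps = 1e-7
    p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
    p_t = min(max(p_t, eps), 1.0 - eps)   # clamp to keep log() finite
    alpha_t = alpha_pos if y == 1 else 1.0 - alpha_pos
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


GAMMAS = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # grid used for Figure 12

# Ratio of the loss on an easy positive (p = 0.9) to the loss on a hard positive (p = 0.3).
easy_to_hard = {g: focal_loss(0.9, 1, g) / focal_loss(0.3, 1, g) for g in GAMMAS}

for g, ratio in easy_to_hard.items():
    print(f"gamma={g:.1f}  easy/hard loss ratio={ratio:.5f}")
```

With $γ = 0$ the modulating factor equals 1 and the expression reduces to weighted cross-entropy. As $γ$ grows, the easy frame contributes a rapidly shrinking share of the loss, which is the over-suppression effect described above for large values.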

Figure 13 shows the sensitivity of the model to the learning rate. Very small learning rates lead to slow convergence and inferior performance, while excessively large learning rates cause unstable optimization and occasional sharp performance degradation. The best and most stable F1-scores are obtained around a learning rate of $1.66 \times 10^{-4}$, which we therefore adopt as the default setting. Overall, these results confirm that both the focal loss parameter $γ$ and the learning rate play a critical role in stable optimization and in achieving strong summarization performance.

Figure 12 was produced by training the full model on SumMe canonical for each $γ \in \{0.5, 1.0, 1.5, 2.0, 2.5, 3.0\}$, keeping all other hyperparameters fixed (lr = $1.66 \times 10^{-4}$, 20% budget, 300 epochs). The best F1 per $γ$ was recorded using the standard evaluation protocol (Equations (21) and (22)). Figure 13 follows the same methodology but varies the learning rate from $10^{-5}$ to $5 \times 10^{-4}$ with $γ = 2.0$ fixed. Both plots were generated using Matplotlib 3.10.9.

5. Conclusions

In this work, we proposed HybridHiT-UNet, a supervised video summarization framework that integrates a Vision Transformer-based frame encoder, a Temporal U-Net backbone for multi-scale temporal modeling, and a hierarchical shot-based transformer scoring head. By jointly leveraging spatial semantics, multi-resolution temporal context, and shot-level structural information, the proposed model effectively captures both local saliency and global narrative structure in videos.

The two research questions posed in this study map directly onto the experimental results. For RQ1, extensive experiments on four benchmark datasets show that the unified architecture outperforms all existing state-of-the-art methods under the canonical evaluation protocol, with F1-scores of 65.80% on SumMe and 79.92% on TVSum and diversity scores of 64.98% and 48.68%, respectively. The ablation study confirms that all components make a measurable contribution to overall performance, and the interaction analysis indicates that their combined effect is super-additive (+10.66 percentage points above the additive prediction). For RQ2, our systematic analysis of summary length shows that a 20% summary budget consistently offers a better trade-off between relevance and diversity than the more traditional 15%, which provides evidence-based guidance to practitioners.

From an interdisciplinary perspective, the success of HybridHiT-UNet stems from integrating ideas from several fields. The Vision Transformer provides spatially rich frame representations through global self-attention. The Temporal U-Net adopts the multi-scale encoder–decoder paradigm from signal processing, enabling multi-resolution analysis of temporal patterns. The transformer-based scoring head uses sequence modeling to capture long-range temporal dependencies. The hierarchical, shot-level design reflects a cognitively inspired segmentation of continuous video into discrete episodes, as perceived by human viewers. This convergence is not merely a collection of technical components but a unified structure that connects low-level visual features with high-level semantic structure, with broader implications for video retrieval, surveillance analytics, and multimodal content understanding.

Although these findings are encouraging, several limitations should be noted. The model relies on training examples with human-annotated importance labels, which are costly to acquire. The model has 105.5 million parameters, which can limit deployment on resource-constrained platforms. The original frame representations are based on a fixed GoogLeNet backbone and may not capture all semantically relevant visual features.

These limitations suggest several avenues for future work. The framework can be extended to unsupervised or semi-supervised settings, eliminating the need for expensive annotations. Richer summarization may be enabled by incorporating multimodal data, such as audio, transcripts, or textual metadata. A particularly promising direction is to explore Vision-Language Models to produce semantically richer, user-adaptive summaries. Research on model compression could improve efficiency without compromising performance. We believe that the future of more intelligent and human-oriented video summarization systems lies in combining hierarchical temporal modeling with the emerging capabilities of multimodal reasoning.

Author Contributions

Conceptualization, S.S. and K.D.; methodology, S.S.; software, S.S.; formal analysis, S.S.; investigation, S.S. and K.D.; validation, S.S. and K.D.; visualization, S.S.; writing (original draft preparation), S.S.; writing (review and editing), T.M., K.A. and K.D.; supervision, K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets analyzed during the current study are publicly available and cited in the manuscript (e.g., SumMe [13], TVSum [33], OVP [12], and YouTube [12]).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. YouTube for Press. Available online: https://blog.youtube/press/ (accessed on 23 December 2025).
2. Sowery, K. People on Average Watch 17 Hours of Online Video Content Per Week. Startups Magazine, 2023. Available online: https://startupsmagazine.co.uk/article-people-average-watch-17-hours-online-video-content-week (accessed on 23 December 2025).
3. Sakib, S.; Sen, A.; Deb, K. A Transfer Learning Approach to Recognize Pedestrian Attributes. In Applied Intelligence for Industry 4.0; Chapman and Hall/CRC: Boca Raton, FL, USA, 2023; pp. 145–161.
4. Sakib, S.; Deb, K.; Dhar, P.K.; Kwon, O.J. A framework for pedestrian attribute recognition using deep learning. Appl. Sci. 2022, 12, 622.
5. Hossain, S.; Deb, K.; Sakib, S.; Sarker, I.H. A hybrid deep learning framework for daily living human activity recognition with cluster-based video summarization. Multimed. Tools Appl. 2025, 84, 6219–6272.
6. Cong, R.; Lei, J.; Fu, H.; Cheng, M.M.; Lin, W.; Huang, Q. Review of visual saliency detection with comprehensive information. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2941–2959.
7. Huang, C.R.; Chang, Y.J.; Yang, Z.X.; Lin, Y.Y. Video saliency map detection by dominant camera motion removal. IEEE Trans. Circuits Syst. Video Technol. 2014, 24, 1336–1349.
8. Xiao, S.; Zhao, Z.; Zhang, Z.; Guan, Z.; Cai, D. Query-biased self-attentive network for query-focused video summarization. IEEE Trans. Image Process. 2020, 29, 5889–5899.
9. Hu, Y.; Liu, M.; Su, X.; Gao, Z.; Nie, L. Video moment localization via deep cross-modal hashing. IEEE Trans. Image Process. 2021, 30, 4667–4677.
10. Pritch, Y.; Rav-Acha, A.; Peleg, S. Nonchronological video synopsis and indexing. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1971–1984.
11. Huang, C.R.; Chung, P.C.J.; Yang, D.K.; Chen, H.C.; Huang, G.J. Maximum a posteriori probability estimation for online surveillance video synopsis. IEEE Trans. Circuits Syst. Video Technol. 2014, 24, 1417–1429.
12. De Avila, S.E.F.; Lopes, A.P.B.; da Luz, A., Jr.; de Albuquerque Araújo, A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognit. Lett. 2011, 32, 56–68.
13. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In Proceedings of the European Conference on Computer Vision; Springer: Zurich, Switzerland, 2014; pp. 505–520.
14. Zhang, K.; Chao, W.L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Proceedings of the European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2016; pp. 766–782.
15. Rochan, M.; Ye, L.; Wang, Y. Video summarization using fully convolutional sequence networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 347–363.
16. Hsu, T.C.; Liao, Y.S.; Huang, C.R. Video summarization with spatiotemporal vision transformer. IEEE Trans. Image Process. 2023, 32, 3013–3026.
17. Chen, Y.; Guo, B.; Shen, Y.; Zhou, R.; Lu, W.; Wang, W.; Wen, X.; Suo, X. Video summarization with u-shaped transformer. Appl. Intell. 2022, 52, 17864–17880.
18. Lin, J.; Zhong, S.H.; Fares, A. Deep hierarchical LSTM networks with attention for video summarization. Comput. Electr. Eng. 2022, 97, 107618.
19. Kashid, S.; Awasthi, L.K.; Berwal, K.; Saini, P. STVS: Spatio-temporal feature fusion for video summarization. IEEE Multimed. 2024, 31, 88–97.
20. Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 202–211.
21. Liang, G.; Lv, Y.; Li, S.; Zhang, S.; Zhang, Y. Video summarization with a convolutional attentive adversarial network. Pattern Recognit. 2022, 131, 108840.
22. Yu, Q.; Yu, H.; Wang, Y.; Pham, T.D. SUM-GAN-GEA: Video summarization using GAN with gaussian distribution and external attention. Electronics 2022, 11, 3523.
23. Zhu, W.; Lu, J.; Li, J.; Zhou, J. Dsnet: A flexible detect-to-summarize network for video summarization. IEEE Trans. Image Process. 2020, 30, 948–962.
24. Teng, X.; Gui, X.; Xu, P.; Tong, J.; An, J.; Liu, Y.; Jiang, H. A Hierarchical Spatial–Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning. Sensors 2022, 22, 8275.
25. Alharbi, F.; Habib, S.; Albattah, W.; Jan, Z.; Alanazi, M.D.; Islam, M. Effective video summarization using channel attention-assisted encoder–decoder framework. Symmetry 2024, 16, 680.
26. Khan, H.; Hussain, T.; Khan, S.U.; Khan, Z.A.; Baik, S.W. Deep multi-scale pyramidal features network for supervised video summarization. Expert Syst. Appl. 2024, 237, 121288.
27. Sakib, S.; Palit, R.; Das, D.; Mahmud, T.; Deb, K. A Lightweight Transformer-Based Encoder-Decoder Model for Video Summarization. In Proceedings of the International Conference on Data Science, AI and Applications; Springer: Cham, Switzerland, 2025; pp. 357–372.
28. An, Y.; Zhao, S. SHTVS: Shot-level based Hierarchical Transformer for Video Summarization. In Proceedings of the 2022 5th International Conference on Image and Graphics Processing; Association for Computing Machinery: New York, NY, USA, 2022; pp. 268–274.
29. Otani, M.; Nakashima, Y.; Rahtu, E.; Heikkila, J. Rethinking the evaluation of video summaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7596–7604.
30. Barbakos, S.; Antoniadis, C.; Potamianos, G.; Setti, G. Unsupervised Transcript-assisted Video Summarization and Highlight Detection. arXiv 2025, arXiv:2505.23268.
31. Fajtl, J.; Sokeh, H.S.; Argyriou, V.; Monekosso, D.; Remagnino, P. Summarizing videos with attention. In Proceedings of the Asian Conference on Computer Vision; Springer: Cham, Switzerland, 2018; pp. 39–54.
32. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
33. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5179–5187.
34. Ghauri, J.A.; Hakimov, S.; Ewerth, R. Supervised video summarization via multiple feature sets with parallel attention. arXiv 2021, arXiv:2104.11530.
35. Wang, S.; Zhang, J. MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment. arXiv 2025, arXiv:2506.10430.
36. Zhao, C.; Wang, C.; Song, Z.; Hu, G.; Chen, H.; Zhai, X. Cap2Sum: Learning to summarize videos by generating captions. arXiv 2024, arXiv:2408.12800.
37. Zhou, K.; Qiao, Y.; Xiang, T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.

Figure 1. Research workflow underlying this study. The process flows from problem identification (Section 1) through literature gap analysis (Section 2), architecture design (Section 3), experimental evaluation (Section 4), and findings to conclusions (Section 5). Dashed red arrows indicate how each identified gap directly informs a specific architectural component or experimental analysis. The experimental evaluation phase encompasses module ablation and synergy analysis (Tables 2 and 3), hyperparameter tuning (Tables 4 and 5), state-of-the-art comparison (Tables 7 and 8), and qualitative and summary length analysis (Figures 5 and 8–13).

Figure 2. Pipeline of the proposed summarization model. We extract frame features via a pretrained ViT, process them with a Temporal U-Net, and then apply a hierarchical Transformer to compute frame importance scores. The highest-scoring shots are selected by knapsack to form the summary.

Figure 3. Temporal U-Net backbone (multi-scale 1D convolution and self-attention).

Figure 4. Hierarchical shot-based transformer scoring head. Shot-level features are first encoded to capture inter-shot context, then broadcast to frame-level features, followed by a frame-level transformer to produce final importance scores.

Figure 5. Trade-off between F1-score and diversity as a function of summary length. The optimal balance is achieved at approximately 20% of the original video duration.

Figure 6. GFLOPs as a function of sequence length T for SumMe and TVSum. The linear trend indicates scalable computational complexity.

Figure 7. Mean training and validation loss curves over 5-fold cross-validation for SumMe (top row) and TVSum (bottom row) under canonical, augmented, and transfer settings.

Figure 8. Selected frames for Video 13 (SumMe): (top) model-generated summary, (bottom) user-annotated summary.

Figure 9. Selected frames for Video 15 (SumMe): (top) model-generated summary, (bottom) user-annotated summary.

Figure 10. Temporal alignment between mean user summary and model-predicted summary for Video 13.

Figure 11. Temporal alignment between mean user summary and model-predicted summary for Video 15.

Figure 12. Effect of focal loss focusing parameter $γ$ on F1-score. The best performance is achieved at $γ = 2.0$.

Figure 13. Effect of learning rate on F1-score. Stable and optimal performance is obtained around $1.66 \times 10^{-4}$.

Table 1. Overview of the benchmark datasets used for experimental evaluation.

Dataset	#Videos	Content	Duration (min)	#Annotators
SumMe [13]	25	Holidays, sports, events	1–6	15–18
TVSum [33]	50	News, how-to, documentaries, user-generated	2–10	20
OVP [12]	50	Documentary, educational, historical, lecture	1–4	5
YouTube [12]	39	Cartoons, sports, TV shows, commercials, home videos	1–10	5

Table 2. Ablation of model components (SumMe, 20% summary). GA = global-attention backbone.

ViT	Backbone	Score Head	F1 (%)	Diversity (%)
✓	GA	Hier	51.17	58.82
×	GA	Hier	51.17	58.82
✓	GA	MLP	51.17	58.82
×	GA	MLP	51.17	58.82
✓	U-Net	Hier	65.02	60.59
×	U-Net	Hier	54.99	60.51
✓	U-Net	MLP	64.54	61.28
×	U-Net	MLP	54.36	58.68

Table 3. Interaction analysis of architectural components on SumMe (canonical). Marginal $Δ$ is the gain over the base configuration. Synergy = observed full-model F1 − additive prediction.

Analysis	ViT	Backbone	Score Head	F1 (%)	Marginal $Δ$
Base	×	GA	MLP	51.17	—
+ViT only	✓	GA	MLP	51.17	+0.00
+U-Net only	×	U-Net	MLP	54.36	+3.19
+Hier only	×	GA	Hier	51.17	+0.00
Additive prediction				54.36
ViT + U-Net	✓	U-Net	MLP	64.54	+10.18
Full model	✓	U-Net	Hier	65.02
Synergy (Full − Additive)				+10.66
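
The additive prediction and synergy rows of Table 3 follow from simple arithmetic on the other rows. The sketch below reproduces them using only values copied from the table; the variable names are illustrative.

```python
# F1-scores (%) copied from Table 3.
base_f1 = 51.17                      # no ViT, GA backbone, MLP head
single_component_f1 = {
    "ViT": 51.17,                    # +ViT only
    "U-Net": 54.36,                  # +U-Net only
    "Hier": 51.17,                   # +Hier only
}
full_model_f1 = 65.02

# Marginal gain of each component when added alone to the base configuration.
marginals = {name: round(f1 - base_f1, 2) for name, f1 in single_component_f1.items()}

# Additive prediction: base plus the sum of the individual marginal gains.
additive_prediction = round(base_f1 + sum(marginals.values()), 2)

# Synergy: how far the observed full model exceeds the additive prediction.
synergy = round(full_model_f1 - additive_prediction, 2)

print(marginals)
print(f"additive prediction = {additive_prediction:.2f}, synergy = {synergy:+.2f}")
```

Only the U-Net contributes a non-zero marginal gain in isolation (+3.19), so the additive prediction equals the U-Net-only score of 54.36, and the remaining 10.66 points appear only when the components are combined.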

Table 4. U-Net hyperparameter tuning results (SumMe).

Depth	Channels	Dropout	Attn	Heads	Fusion	F1 (%)
3	128	0.4	Yes	2	No	67.98
3	128	0.4	Yes	2	Yes	67.13
3	128	0.2	No	2	Yes	65.82
3	128	0.4	No	2	Yes	65.82
3	256	0.3	No	4	No	65.72
3	128	0.0	No	4	No	65.03
3	256	0.3	No	2	Yes	65.02
3	256	0.4	Yes	2	Yes	65.01
3	128	0.3	Yes	2	No	64.99
3	128	0.3	No	4	No	64.99
3	128	0.1	No	4	Yes	64.90
3	256	0.4	No	4	Yes	64.90
4	128	0.2	No	4	Yes	64.71
4	128	0.1	No	4	No	64.66
4	128	0.4	No	2	Yes	62.99
4	256	0.0	Yes	4	No	62.79
4	256	0.2	No	4	Yes	59.74
4	128	0.2	No	2	No	59.66

Table 5. Transformer head hyperparameter tuning (SumMe).

Heads	Shot Layers	Frame Layers	FF Dim	Dropout	F1 (%)
4	1	2	1024	0.4	67.98
4	1	4	512	0.1	65.82
4	1	3	512	0.4	65.82
4	1	2	512	0.1	65.80
4	3	4	1024	0.4	65.77
2	1	2	512	0.1	65.72
4	3	3	1024	0.4	65.41
4	2	3	1024	0.4	65.41
4	2	4	1024	0.4	65.03
2	1	1	512	0.1	65.02
4	1	2	512	0.2	65.01
4	2	4	1024	0.2	64.99

Table 6. Model complexity and inference efficiency on SumMe and TVSum for different video lengths.

Dataset	Total Params	T (Frames)	GFLOPs	Latency (ms)	FPS
SumMe	105,546,753	Min (64)	11.789	11.863	84.29
SumMe	105,546,753	Median (293)	53.955	12.122	82.50
SumMe	105,546,753	Max (649)	119.513	21.441	46.64
TVSum	105,546,753	Min (167)	30.747	11.874	84.22
TVSum	105,546,753	Median (380)	69.981	12.032	83.11
TVSum	105,546,753	Max (1294)	238.290	46.694	21.42

Table 7. Comparison with SOTA on SumMe and TVSum (canonical 5-fold cross-validation).

Method	SumMe F1 (%)	TVSum F1 (%)
DHAVS [18]	45.6	60.8
DSNET [23]	51.2	61.9
MSVA [34]	54.5	67.5
CAAN [21]	50.8	59.6
SUM-GAN-GEA [22]	53.4	61.3
Uformer [17]	53.9	63.0
DBLSTM [24]	58.4	65.3
SHTVS [28]	52.3	61.4
MPFN [26]	51.9	62.4
STVT [16]	55.1	67.1
STVS [19]	53.6	61.9
SAVS-Net [25]	51.8	61.5
MF2Summ [35]	53.1	63.3
HybridHiT-UNet (Ours)	65.8	79.92

Table 8. Comparison with SOTA under canonical (C), augmented (A), and transfer (T) splits on SumMe and TVSum. Ranking is based on SumMe canonical performance.

Method	Rank	SumMe (C)	SumMe (A)	SumMe (T)	TVSum (C)	TVSum (A)	TVSum (T)
DHAVS [18]	14	45.6	46.5	43.5	60.8	61.2	57.5
DSNET [23]	12	51.2	53.3	47.6	61.9	62.2	58.0
MSVA [34]	4	54.5	–	–	67.5	–	–
CAAN [21]	13	50.8	50.9	46.5	59.6	59.8	57.8
SUM-GAN-GEA [22]	7	53.4	–	–	61.3	–	–
Uformer [17]	5	53.9	53.4	47.1	63.0	64.0	59.4
DBLSTM [24]	2	58.4	58.4	60.0	65.3	67.4	66.2
SHTVS [28]	9	52.3	55.4	45.1	61.4	62.1	59.4
MPFN [26]	10	51.9	–	–	62.4	–	–
STVT [16]	3	55.1	55.9	48.2	67.1	67.7	59.9
STVS [19]	6	53.6	–	–	61.9	–	–
SAVS-Net [25]	11	51.8	–	–	61.5	–	–
Cap2Sum [36]	–	–	59.2	56.4	–	69.7	66.2
MF2Summ [35]	8	53.1	–	–	63.3	–	–
HybridHiT-UNet (Ours)	1	65.8	55.1	48.29	79.92	70.60	65.94

Table 9. Diversity comparison (%) on SumMe and TVSum under the canonical setting. Higher is better.

Method	SumMe Diversity (%)	TVSum Diversity (%)
dpp-LSTM [14]	59.1	46.3
DR-DSN [37]	59.4	46.4
DSNET_ab [23]	64.2	47.6
Uformer [17]	66.5	48.1
HybridHiT-UNet (Ours)	64.98	48.68

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sakib, S.; Mahmud, T.; Andersson, K.; Deb, K. HybridHiT-UNet: Multi-Scale Temporal U-Net with Hierarchical Shot-Aware Transformers for Video Summarization. Mach. Learn. Knowl. Extr. 2026, 8, 135. https://doi.org/10.3390/make8050135
