Article

Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization †

1 Intelligence and Sensing Lab, Osaka University, Suita 565-0871, Japan
2 CyberAgent, Inc., Tokyo 150-0042, Japan
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023.
J. Imaging 2024, 10(9), 229; https://doi.org/10.3390/jimaging10090229
Submission received: 19 July 2024 / Revised: 21 August 2024 / Accepted: 12 September 2024 / Published: 14 September 2024
(This article belongs to the Special Issue Deep Learning in Computer Vision)

Abstract

Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Past efforts have invariably involved training summarization models with annotated summaries or heuristic objectives. In this work, we reveal that features pre-trained on image-level tasks contain rich semantic information that can be readily leveraged to quantify frame-level importance for zero-shot video summarization. Leveraging pre-trained features and contrastive learning, we propose three metrics featuring a desirable keyframe: local dissimilarity, global consistency, and uniqueness. We show that these metrics effectively capture the diversity and representativeness of frames, the criteria commonly used for the unsupervised generation of video summaries, and demonstrate competitive or better performance than past methods while requiring no training. We further propose a contrastive learning-based pre-training strategy on unlabeled videos to enhance the quality of the proposed metrics and, thus, improve the evaluated performance on the public benchmarks TVSum and SumMe.

1. Introduction

In an era where video data are booming at an unprecedented pace, the importance of making the video browsing process more efficient has never been greater. Video summarization facilitates efficient browsing by creating a concise synopsis of the raw video and has been a popular research topic for many years. The rapid development of deep learning has significantly promoted the efficacy of video summarization tools [1]. Supervised approaches [2,3,4,5] leverage the temporal modeling power of LSTM (long short-term memory) [6] or self-attention mechanisms [7] and train them with annotated summaries. Unsupervised methods [8,9,10,11,12,13,14] apply heuristic training objectives such as diversity and representativeness to enforce a diverse selection of keyframes that are representative of the essential contents of videos.
Past unsupervised approaches have trained summarization models to produce diverse and representative summaries by optimizing feature similarity-based loss/reward functions. Many works on visual representation learning have revealed that vision models pre-trained on supervised or self-supervised tasks contain rich semantic signals, facilitating zero-shot transfer in tasks such as classification [15,16], semantic segmentation [17], and object detection [18]. In this work, we propose leveraging the rich semantics encoded in pre-trained visual features to achieve zero-shot video summarization that outperforms previous heavily trained approaches, and we further propose self-supervised pre-training to enhance the zero-shot performance.
Specifically, we first define local dissimilarity and global consistency as two desirable criteria for localizing keyframe candidates. Inspired by the diversity objective, if a frame is distant from its nearest neighbors in the feature space, it encodes information that rarely appears in other frames. As a result, including such frames in the summary contributes to the diversity of its content. Such frames are considered decent keyframe candidates because they exhibit high local dissimilarity, a term that borrows the notion of locality in the feature space from [19]. However, merely selecting frames based on dissimilarity may wrongly incorporate noisy frames that are not indicative of the video storyline. Therefore, we constrain the keyframes to be aligned with the video storyline by guaranteeing their high semantic similarity with the global cluster of video frames, i.e., they are representative of (or globally consistent with) the video theme. Overall, the selected keyframes should exhibit a decent level of local dissimilarity to increase the content diversity of the summary while reflecting the global video gist.
In contrast to previous works that required training to enforce the designed criteria, we directly quantify the proposed criteria into frame-level importance scores by utilizing contrastive losses for visual representation learning, i.e., alignment and uniformity losses [20]. The alignment loss calculates the distance between semantically similar samples, such as augmented versions of an input image, and minimizes this distance to ensure similarity between these positive samples in a contrastive learning setting. In our case, we directly apply the alignment loss to quantify the local dissimilarity metric. Uniformity loss is employed to regularize the overall distribution of features, with higher values indicating closely clustered features. This characteristic makes it well-suited for assessing the semantic consistency across a group of frames. To leverage this, we adapt the uniformity loss to evaluate the consistency between an individual frame and the entire set of video frames, which serves as a proxy for the global video storyline. These two losses can then be utilized for self-supervised contrastive refinement of the features, where contrastive learning is applied to optimize feature distances, ultimately enhancing the accuracy of the calculated frame importance scores.
Nonetheless, background frames may feature dynamic content that changes frequently, making them distinct from even the most similar frames and resulting in local dissimilarity. At the same time, these frames might contain background elements that are common across a majority of the video frames, contributing to global consistency. For example, in a video of a car accident, street scenes are likely to appear consistently. Although these frames might differ due to moving objects, they remain generally consistent with most frames, on average, due to the shared background context. We propose mitigating the chances of selecting such frames by exploiting the observation that such background frames tend to appear in many different videos with diverse topics and, thus, are not unique to their associated videos, e.g., street scenes in videos about car accidents, parades, city tours, etc. Specifically, we propose a uniqueness filter to quantify the uniqueness of frames, formulated by leveraging cross-video contrastive learning. An illustration of the difference between the proposed method and previous methods is provided in Figure 1.
Leveraging rich semantic information encoded in pre-trained visual features, we, for the first time, propose tackling training-free zero-shot video summarization together with self-supervised pre-training to enhance the zero-shot transfer. Inspired by contrastive loss components [20], we achieve zero-shot summarization by quantifying frame importance with three metrics: local dissimilarity, global consistency, and uniqueness. The proposed method achieves better or competitive performance compared to previous methods while being training-free. Moreover, we introduce self-supervised contrastive refinement using unlabeled videos from YouTube-8M [21] to refine the feature distribution, which aids in training the proposed uniqueness filter and further enhances performance. Finally, compared to our conference paper [22], we include results of current SOTA methods [23,24], offer more insightful analyses of the pros and cons of our proposed methods, and conduct more comprehensive ablation studies on various crucial aspects. The code to reproduce all the experiments is available at https://github.com/pangzss/pytorch-CTVSUM (accessed on 18 July 2024).

2. Related Work

Early applications of video summarization focused on sports videos [25,26,27] for event detection and highlight compilation. Later, video summarization was explored in other domains such as instructional videos [28,29,30,31], movies [32,33], and general user videos [34]. Thanks to the excellent generalization capabilities of deep neural networks and their features, the focus of video summarization research has shifted toward developing general-purpose summarization models for a diverse range of video domains.
As an initial step toward deep learning-based supervised video summarization, Zhang et al. [2] utilized a long short-term memory (LSTM) for modeling temporal information when trained with human-annotated summaries, which sparked a series of subsequent works based on LSTM [3,35,36,37,38]. The rise of Transformer [7] inspired a suite of methods leveraging self-attention mechanisms for video summarization [4,5,10,39,40,41,42,43]. Some works have explored spatiotemporal information by jointly using RNNs and convolutional neural networks (CNNs) [44,45,46] or used graph convolution networks [47,48]. Video summarization leveraging multi-modal signals has also performed impressively [23,24,49].
Deep learning-based unsupervised methods mainly exploit two heuristics: diversity and representativeness. For diversity, some works [8,9,11,48] have utilized a diversity loss derived from a repelling regularizer [50], guaranteeing dissimilarities between selected keyframes. It has also been formulated as a reward function optimized via policy gradient methods, as seen in [12,51,52]. Similarly, representativeness can be guaranteed by reconstruction loss [8,10,11,13,53] or reconstruction-based reward functions [12,51,52].
Unlike previous works, we tackle training-free zero-shot video summarization and propose a pre-training strategy for better zero-shot transfer. Specifically, we directly calculate frame importance by leveraging contrastive loss terms formulated in [20] to quantify diversity and representativeness. With features from a vision backbone pre-trained on supervised image classification tasks [54] and without any further training, the proposed contrastive loss-based criteria can already well-capture the frame contribution to the diversity and representativeness of the summary. The proposed self-supervised contrastive refinement can further boost the performance and leverage unlabeled videos for zero-shot transfer to test videos.

3. Preliminaries

Given the centrality of contrastive learning to our approach, we first introduce the relevant preliminaries, with a focus on instance discrimination as outlined in [55].

3.1. Instance Discrimination via the InfoNCE Loss

Contrastive learning [56] has become a cornerstone of self-supervised image representation learning and has attracted increasing attention over the years. It has been continuously refined to produce representations with exceptional transferability [19,20,53,55,57,58,59,60]. Formally, given a set of $N$ images $\mathcal{D} = \{I_n\}_{n=1}^{N}$, contrastive representation learning aims to learn an encoder $f_\theta$ such that the resulting features $f_\theta(I_n)$ can be readily leveraged by downstream vision tasks. A theoretically founded loss function with favorable empirical behaviors is the InfoNCE loss [58]:

$$\mathcal{L}_{\text{InfoNCE}} = -\sum_{I \in \mathcal{D}} \log \frac{e^{f_\theta(I) \cdot f_\theta(I')/\tau}}{\sum_{J \in \mathcal{D}(I)} e^{f_\theta(I) \cdot f_\theta(J)/\tau}},$$

where $I'$ is a positive sample for $I$, usually obtained through data augmentation, and $\mathcal{D}(I)$ includes $I'$ as well as all negative samples, e.g., any other images. The operator "$\cdot$" denotes the inner product, and $\tau$ is a temperature parameter. Therefore, the loss pulls the features of an instance closer to those of its augmented views while repelling them from the features of other instances, thus performing instance discrimination.
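To make Equation (1) concrete, the following is a minimal PyTorch sketch of a common batched InfoNCE computation, assuming each anchor's positive is its own augmented view and all other samples in the batch act as negatives; the batch size, feature dimension, and temperature below are illustrative, not the values used in our experiments.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    # anchors, positives: (N, D) L2-normalized features; positives[i] is the augmented view of anchors[i].
    logits = anchors @ positives.t() / tau   # pairwise inner products scaled by temperature
    targets = torch.arange(anchors.size(0))  # the true positive lies on the diagonal
    return F.cross_entropy(logits, targets)  # averages -log softmax over the batch

# Toy usage with random unit-norm features.
x = F.normalize(torch.randn(8, 128), dim=1)
x_aug = F.normalize(x + 0.05 * torch.randn_like(x), dim=1)
loss = info_nce(x, x_aug)
```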

3.2. Contrastive Learning via Alignment and Uniformity

When normalized onto the unit hypersphere, the features learned through contrastive learning that yield strong downstream performance exhibit two notable properties. First, semantically related features tend to cluster closely on the sphere, regardless of specific details. Second, the overall information of the features is largely preserved, resulting in a joint distribution that approximates a uniform distribution [57,58,59]. Wang et al. [20] termed these two properties as alignment and uniformity.
The alignment metric computes the distance between positive pairs [20]:

$$\mathcal{L}_{\text{align}}(\theta, \alpha) = \mathbb{E}_{(I, I') \sim p_{\text{pos}}}\left[\, \| f_\theta(I) - f_\theta(I') \|_2^{\alpha} \,\right],$$

where $\alpha > 0$ and $p_{\text{pos}}$ is the distribution of positive pairs. The uniformity is defined as the average pairwise Gaussian potential over the features, as follows:

$$\mathcal{L}_{\text{uniform}}(\theta, \beta) = \log \mathbb{E}_{I, J \overset{\text{i.i.d.}}{\sim} p_{\text{data}}}\left[\, e^{-\beta \| f_\theta(I) - f_\theta(J) \|_2^2} \,\right].$$

Here, $p_{\text{data}}$ is typically approximated by the empirical data distribution, and $\beta$ is commonly set to 2, as recommended by [20]. This metric promotes the overall feature distribution on the unit hypersphere to approximate a uniform distribution and can also directly quantify the uniformity of feature distributions [61]. Additionally, Equation (3) approximates the logarithm of the denominator in Equation (1) when the number of negative samples approaches infinity [20]. As demonstrated in [20], jointly minimizing Equations (2) and (3) leads to better alignment and uniformity of the features, meaning they become locally clustered and globally uniform [61].
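For reference, Equations (2) and (3) can be computed over a batch of L2-normalized features in a few lines of PyTorch, closely following the reference formulation of [20]; the defaults below assume the common choices α = 2 and β = 2.

```python
import torch

def align_loss(x: torch.Tensor, y: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    # Equation (2): expected distance between positive pairs (x_i, y_i); features are L2-normalized.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x: torch.Tensor, beta: float = 2.0) -> torch.Tensor:
    # Equation (3): log of the average pairwise Gaussian potential over the batch.
    return torch.pdist(x, p=2).pow(2).mul(-beta).exp().mean().log()
```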
In this paper, we employ Equation (2) to calculate the distance or dissimilarity between semantically similar video frame features, which helps measure frame importance based on local dissimilarity. We then apply a modified version of Equation (3) to assess the proximity between a specific frame and the overall information of the corresponding video, thereby estimating their semantic consistency. Additionally, by leveraging these two loss functions, we learn a nonlinear projection of the pre-trained features to enhance the local alignment and global uniformity of the projected features.

4. Proposed Method

We first define two metrics to quantify frame importance by leveraging rich semantic information in pre-trained features: local dissimilarity and global consistency. To guarantee that the metrics encode the diversity and representativeness of the summary, we conduct self-supervised contrastive refinement of the features, where an extra metric called uniqueness is defined to further strengthen the keyframes’ quality. We provide a conceptual illustration of our approach in Figure 2.

4.1. Local Dissimilarity

Inspired by the diversity objective, we consider frames likely to result in a diverse summary as those conveying diverse information even when compared to their nearest neighbors. Formally, given a video $\mathcal{V}$, we first extract deep features using an ImageNet [62] pre-trained vision backbone, e.g., GoogleNet [54], denoted as $F$, such that $F(\mathcal{V}) = \{\mathbf{x}_t\}_{t=1}^{T}$, where $\mathbf{x}_t$ represents the deep feature of the $t$-th frame in $\mathcal{V}$, and $T$ is the total number of frames in $\mathcal{V}$. Each feature is L2-normalized such that $\|\mathbf{x}_t\|_2 = 1$.

To define local dissimilarity for frames in $\mathcal{V}$, we first use cosine similarity to retrieve for each frame $\mathbf{x}_t$ a set $\mathcal{N}_t$ of the top $K = aT$ neighbors, where $a$ is a hyperparameter and $K$ is rounded to the nearest integer. The local dissimilarity metric for $\mathbf{x}_t$ is an empirical approximation of Equation (2), defined as the local alignment loss:

$$\mathcal{L}_{\text{align}}(\mathbf{x}_t) = \frac{1}{|\mathcal{N}_t|} \sum_{\mathbf{x}' \in \mathcal{N}_t} \| \mathbf{x}_t - \mathbf{x}' \|_2^2,$$

which measures the distance/dissimilarity between $\mathbf{x}_t$ and its semantic neighbors.

The larger the value of $\mathcal{L}_{\text{align}}(\mathbf{x}_t)$, the more dissimilar $\mathbf{x}_t$ is from its neighbors. Therefore, if a frame is distant from even its closest neighbors in the semantic space, the frames within its local neighborhood are likely to contain diverse information, making the frame a strong keyframe candidate. Consequently, $\mathcal{L}_{\text{align}}(\mathbf{x}_t)$ can be directly used as the importance score for $\mathbf{x}_t$ after appropriate scaling.
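As a concrete illustration, the local dissimilarity of every frame can be computed from the L2-normalized pre-trained features with a few tensor operations; the sketch below assumes a feature matrix of shape (T, D) and an illustrative neighborhood ratio a = 0.1.

```python
import torch

def local_dissimilarity(feats: torch.Tensor, a: float = 0.1) -> torch.Tensor:
    # feats: (T, D) L2-normalized frame features; returns one score per frame (Equation (4)).
    T = feats.size(0)
    K = max(1, round(a * T))
    sim = feats @ feats.t()                 # cosine similarity (unit-norm features)
    sim.fill_diagonal_(-float("inf"))       # exclude the frame itself
    nn_idx = sim.topk(K, dim=1).indices     # indices of the top-K neighbors N_t
    neighbors = feats[nn_idx]               # (T, K, D)
    sq_dist = (feats.unsqueeze(1) - neighbors).pow(2).sum(-1)  # ||x_t - x'||_2^2
    return sq_dist.mean(dim=1)              # average over the neighborhood
```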

4.2. Global Consistency

$\mathcal{N}_t$ may contain semantically irrelevant frames if $\mathbf{x}_t$ has very few meaningful semantic neighbors in the video. Therefore, merely using Equation (4) for frame-wise importance scores is insufficient. Inspired by the reconstruction-based representativeness objective [8], we define another metric, called global consistency, to quantify how consistent a frame is with the video gist via a modified uniformity loss based on Equation (3):

$$\mathcal{L}_{\text{uniform}}(\mathbf{x}_t) = \log \frac{1}{T-1} \sum_{\substack{\mathbf{x}' \in F(\mathcal{V}) \\ \mathbf{x}' \neq \mathbf{x}_t}} e^{-2 \| \mathbf{x}_t - \mathbf{x}' \|_2^2}.$$

$\mathcal{L}_{\text{uniform}}(\mathbf{x}_t)$ measures the proximity between $\mathbf{x}_t$ and the remaining frames, bearing similarity to the reconstruction- and K-medoid-based objectives in [8,12]. However, it obviates the need to train an autoencoder [8] or a policy network [12] by directly leveraging the rich semantics in pre-trained features.
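Equation (5) admits an equally direct implementation; the sketch below again assumes a (T, D) matrix of L2-normalized frame features.

```python
import torch

def global_consistency(feats: torch.Tensor) -> torch.Tensor:
    # feats: (T, D) L2-normalized frame features; returns the per-frame value of Equation (5).
    T = feats.size(0)
    sq_dist = torch.cdist(feats, feats, p=2).pow(2)    # ||x_t - x'||_2^2 for all pairs
    potential = torch.exp(-2.0 * sq_dist)
    potential.fill_diagonal_(0.0)                      # drop the x' = x_t term
    return torch.log(potential.sum(dim=1) / (T - 1))   # higher value = more globally consistent
```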

4.3. Contrastive Refinement

Equations (4) and (5) are computed using deep features pre-trained on image classification tasks, which may not inherently exhibit the local alignment and global uniformity described in Section 3.2. To address similar challenges, Hamilton et al. [17] proposed contrastively refining self-supervised vision transformer features [15] for unsupervised semantic segmentation. They achieve this by freezing the feature extractor (to improve efficiency) and training only a lightweight projector. Following this approach, we also avoid fine-tuning the heavy feature extractor—in our case, GoogleNet—and instead train only a lightweight projection head.
Formally, given the features $F(\mathcal{V})$ from the frozen backbone for a video, we feed them into a learnable module to obtain $\mathbf{z}_t = G_\theta(\mathbf{x}_t)$, where $\mathbf{z}_t$ is L2-normalized (we omit the L2-normalization operator for notational simplicity). The nearest neighbors in $\mathcal{N}_t$ for each frame are still determined using the pre-trained features $\{\mathbf{x}_t\}_{t=1}^{T}$. Similar to [19,63], we also observe collapsed training when directly using the learnable features for nearest neighbor retrieval, so we stick to the frozen features.

With the learnable features, the alignment loss (local dissimilarity) and uniformity loss (global consistency) become (we slightly abuse the notation of $\mathcal{L}$ to represent losses both before and after the transformation by $G_\theta$):

$$\mathcal{L}_{\text{align}}(\mathbf{z}_t; \theta) = \frac{1}{|\mathcal{N}_t|} \sum_{\mathbf{z}' \in \mathcal{N}_t} \| \mathbf{z}_t - \mathbf{z}' \|_2^2,$$

$$\mathcal{L}_{\text{uniform}}(\mathbf{z}_t; \theta) = \log \frac{1}{T-1} \sum_{\substack{\mathbf{z}' \in G_\theta(F(\mathcal{V})) \\ \mathbf{z}' \neq \mathbf{z}_t}} e^{-2 \| \mathbf{z}_t - \mathbf{z}' \|_2^2}.$$

The joint loss function is as follows:

$$\mathcal{L}(\mathbf{z}_t; \theta) = \mathcal{L}_{\text{align}}(\mathbf{z}_t; \theta) + \lambda_1 \mathcal{L}_{\text{uniform}}(\mathbf{z}_t; \theta),$$

where $\lambda_1$ is a hyperparameter balancing the two loss terms.
During the contrastive refinement, $\mathcal{L}_{\text{align}}$ and $\mathcal{L}_{\text{uniform}}$ mutually resist each other for frames that have semantically meaningful nearest neighbors and are consistent with the video gist. Specifically, when a nontrivial number of frames beyond $\mathcal{N}_t$ also share similar semantic information with the anchor $\mathbf{z}_t$, these frames function as "hard negatives" that prevent $\mathcal{L}_{\text{align}}$ from being easily minimized [19,61]. Therefore, only frames with moderate local dissimilarity and global consistency will have balanced values for the two losses, whereas the other frames tend to have extreme values compared to those before the refinement.
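The refinement step can be sketched as follows: a small learnable projector $G_\theta$ on top of the frozen backbone is trained with the joint loss of Equation (8). The projector below is a plain MLP used purely for illustration (the actual module in our implementation is a lightweight Transformer encoder), and the λ1 value is an arbitrary placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    # Illustrative stand-in for G_theta; the paper's module is a lightweight Transformer encoder.
    def __init__(self, in_dim: int = 1024, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # L2-normalized z_t

def refinement_loss(z: torch.Tensor, nn_idx: torch.Tensor, lam1: float = 0.05) -> torch.Tensor:
    # z: (T, D') projected features; nn_idx: (T, K) neighbor indices taken from the *frozen* features.
    T = z.size(0)
    align = (z.unsqueeze(1) - z[nn_idx]).pow(2).sum(-1).mean()   # Equation (6), averaged over t
    sq_dist = torch.cdist(z, z).pow(2)
    potential = torch.exp(-2.0 * sq_dist)
    potential.fill_diagonal_(0.0)
    uniform = torch.log(potential.sum(dim=1) / (T - 1)).mean()   # Equation (7), averaged over t
    return align + lam1 * uniform                                # Equation (8)
```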

4.4. The Uniqueness Filter

The two metrics defined above fail to account for the fact that locally dissimilar yet globally consistent frames can often be background frames with complex content that is related to most of the frames in the video. For example, dynamic cityscapes might frequently appear in videos recorded in urban settings.
To address this, we propose filtering out such frames by leveraging a common characteristic: they tend to appear in many different videos that do not necessarily share a common theme or context. For instance, city views might be present in videos about car accidents, city tours, or parades, while scenes featuring people moving around can appear across various contexts. Consequently, these frames are not unique to their respective videos. This concept has been similarly explored in weakly-supervised action localization research [64,65,66], where a single class prototype vector is used to capture all background frames. However, our approach aims to identify background frames in an unsupervised manner. Additionally, rather than relying on a single prototype, which can be too restrictive [67], we treat each frame as a potential background prototype. By identifying frames that are highly activated across random videos, we develop a metric to determine the “background-ness” of a frame.
To design a filter for eliminating such frames, we introduce an extra loss to Equation (8) that taps into cross-video samples. For computational efficiency, we aggregate the frame features in a video $\mathcal{V}_k$ with $T_k$ frames into segments of equal length $m$. The learnable features $\mathbf{z}$ in each segment are average-pooled and L2-normalized to obtain segment features $\mathcal{S}_k = \{\mathbf{s}_l\}_{l=1}^{|\mathcal{S}_k|}$ with $|\mathcal{S}_k| = T_k / m$. To measure the proximity of a frame to frames from a randomly sampled batch of videos $\mathcal{B}$ (represented as segment features), including $\mathcal{S}_k$, we again leverage Equation (3) to define the uniqueness loss for $\mathbf{z}_t \in \mathcal{V}_k$ as follows:

$$\mathcal{L}_{\text{unique}}(\mathbf{z}_t; \theta) = \log \frac{1}{A} \sum_{\mathcal{S} \in \mathcal{B} \setminus \mathcal{S}_k} \sum_{\mathbf{s}' \in \mathcal{S}} e^{-2 \| \mathbf{z}_t - \mathbf{s}' \|_2^2},$$

where $A = \sum_{\mathcal{S} \in \mathcal{B} \setminus \mathcal{S}_k} |\mathcal{S}|$ is the normalization factor. A large value of $\mathcal{L}_{\text{unique}}$ means that $\mathbf{z}_t$ has nontrivial similarity with segments from randomly gathered videos, indicating that it is likely to be a background frame. When jointly optimized with Equation (8), Equation (9) is easy to minimize for unique frames, for which most segments $\mathbf{s}'$ are semantically irrelevant and can be safely repelled. This is not the case for background frames with semantically similar $\mathbf{s}'$, as the local alignment loss keeps strengthening the closeness of semantically similar features.

As computing Equation (9) requires random videos, it is not straightforward to convert Equation (9) into importance scores after training. To address this, we train a model $H_{\hat{\theta}}$ whose last layer is a sigmoid unit to mimic $1 - \bar{\mathcal{L}}_{\text{unique}}(\mathbf{z}_t; \theta)$, where $\bar{\mathcal{L}}_{\text{unique}}(\mathbf{z}_t; \theta)$ is $\mathcal{L}_{\text{unique}}(\mathbf{z}_t; \theta)$ scaled to $[0, 1]$ over $t$. Denoting $y_t = 1 - \mathrm{sg}(\bar{\mathcal{L}}_{\text{unique}}(\mathbf{z}_t; \theta))$ and $r_t = H_{\hat{\theta}}(\mathrm{sg}(\mathbf{z}_t))$, where "sg" stands for stop gradient, we define the loss for training the model as follows:

$$\mathcal{L}_{\text{filter}}(\mathbf{z}_t; \hat{\theta}) = -\left[\, y_t \log r_t + (1 - y_t) \log (1 - r_t) \,\right].$$
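A sketch of the uniqueness loss and the filter objective is given below; it assumes that the segment features of the other videos in the batch have already been pooled and L2-normalized, and it uses min-max scaling over t for the stop-gradient target, mirroring the definitions above.

```python
import torch
import torch.nn.functional as F

def uniqueness_loss(z: torch.Tensor, other_segments: torch.Tensor) -> torch.Tensor:
    # z: (T, D) refined frame features of one video; other_segments: (A, D) segment features
    # pooled from the other videos in the batch. Returns the per-frame value of Equation (9).
    sq_dist = torch.cdist(z, other_segments).pow(2)          # ||z_t - s'||_2^2
    return torch.log(torch.exp(-2.0 * sq_dist).mean(dim=1))

def filter_loss(r: torch.Tensor, l_unique: torch.Tensor) -> torch.Tensor:
    # r: (T,) sigmoid outputs of the uniqueness filter H; l_unique: (T,) values of Equation (9).
    scaled = (l_unique - l_unique.min()) / (l_unique.max() - l_unique.min() + 1e-8)
    target = (1.0 - scaled).detach()                         # y_t with stop gradient
    return F.binary_cross_entropy(r, target)                 # binary cross-entropy of the filter loss
```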

4.5. The Full Loss and Importance Scores

With all the components, the loss for each frame in a video is as follows:
$$\mathcal{L}(\mathbf{z}_t; \theta, \hat{\theta}) = \mathcal{L}_{\text{align}}(\mathbf{z}_t; \theta) + \lambda_1 \mathcal{L}_{\text{uniform}}(\mathbf{z}_t; \theta) + \lambda_2 \mathcal{L}_{\text{unique}}(\mathbf{z}_t; \theta) + \lambda_3 \mathcal{L}_{\text{filter}}(\mathbf{z}_t; \hat{\theta}),$$

where we fix both $\lambda_2$ and $\lambda_3$ to 0.1 and only tune $\lambda_1$.

Scaling the local dissimilarity, global consistency, and uniqueness scores to $[0, 1]$ over $t$, the frame-level importance score is defined as follows:

$$p_t = \bar{\mathcal{L}}_{\text{align}}(\mathbf{z}_t; \theta) \cdot \bar{\mathcal{L}}_{\text{uniform}}(\mathbf{z}_t; \theta) \cdot \bar{H}_{\hat{\theta}}(\mathbf{z}_t) + \epsilon,$$
which ensures that the importance scores are high only when all three terms have significant magnitude. The parameter ϵ is included to prevent zero values in the importance scores, which helps stabilize the knapsack algorithm used to generate the final summaries. Since these scores are derived from three independent metrics, they may lack the temporal smoothness typically provided by methods like RNNs [2] or attention networks [5]. To address this, we apply Gaussian smoothing to the scores within each video, aligning our method with previous work that emphasizes the importance of temporal smoothness in score generation.
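Putting the pieces together, the final per-frame scores can be computed as in the sketch below; the min-max scaling, the ε value, and the Gaussian smoothing width are illustrative choices rather than the exact settings in our released code.

```python
import torch
from scipy.ndimage import gaussian_filter1d

def min_max(x: torch.Tensor) -> torch.Tensor:
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def importance_scores(l_align, l_uniform, h_unique, eps: float = 1e-4, sigma: float = 2.0):
    # l_align, l_uniform, h_unique: (T,) per-frame metrics; returns temporally smoothed scores p_t.
    p = min_max(l_align) * min_max(l_uniform) * min_max(h_unique) + eps
    return torch.from_numpy(gaussian_filter1d(p.numpy(), sigma=sigma))
```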

5. Experiments

5.1. Datasets and Settings

Datasets. In line with previous studies, we evaluate our method on two benchmarks: TVSum [31] and SumMe [34]. TVSum consists of 50 YouTube videos, each annotated by 20 individuals who provide importance scores for every two-second shot. SumMe includes 25 videos, each with 15 to 18 reference binary summaries. Following the protocol established by [2], we use the OVP (50 videos) and YouTube (39 videos) datasets [68] to augment both TVSum and SumMe. Additionally, to assess whether our self-supervised approach can benefit from a larger video dataset, we randomly selected approximately 10,000 videos from the YouTube-8M dataset [21], which contains 3862 video classes with highly diverse content.
Evaluation Setting. Following prior work, we evaluate our model’s performance using five-fold cross-validation, where the dataset (either TVSum or SumMe) is randomly divided into five splits. The reported results are the average across these five splits. In the canonical setting (C), training is performed only on the original splits of the two evaluation datasets. In the augmented setting (A), we expand the training set in each fold with three additional datasets (e.g., SumMe, YouTube, and OVP when evaluating on TVSum). In the transfer setting (T), all videos from TVSum (or SumMe) are reserved for testing, while the other three datasets are used for training. Additionally, we introduce a new transfer setting where training is exclusively conducted on the collected YouTube-8M videos, and evaluation is performed on TVSum or SumMe. This setting is intended to assess whether our model can benefit from a larger volume of data.

5.2. Evaluation Metrics

F1 score. Denoting $\mathcal{A}$ as the set of frames in a ground-truth summary and $\mathcal{B}$ as the set of frames in the corresponding generated summary, we calculate precision and recall as follows:

$$\text{Precision} = \frac{|\mathcal{A} \cap \mathcal{B}|}{|\mathcal{B}|}, \qquad \text{Recall} = \frac{|\mathcal{A} \cap \mathcal{B}|}{|\mathcal{A}|},$$

with which we can calculate the F1 score as follows:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
We follow [2] to deal with multiple ground-truth summaries and to convert importance scores into summaries.
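As a reference, the F1 score for a single reference summary can be computed on frame index sets as follows; handling multiple reference summaries (e.g., aggregating over references as in [2]) is omitted here.

```python
def f1_score(pred_frames: set, gt_frames: set) -> float:
    # pred_frames: frames in the generated summary; gt_frames: frames in one reference summary.
    overlap = len(pred_frames & gt_frames)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_frames)
    recall = overlap / len(gt_frames)
    return 2 * precision * recall / (precision + recall)
```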
Rank correlation coefficients. Recently, Otani et al. [69] highlighted that F1 scores can be unreliable and may yield relatively high values even for randomly generated summaries. To address this issue, they proposed using rank correlation coefficients, specifically Kendall’s τ [70] and Spearman’s ρ [71], to evaluate the correlation between predicted and ground-truth importance scores. For each video, we first compute the coefficient value between the predicted importance scores and the scores provided by each annotator, then average these values across all annotators for that video. The final results are obtained by averaging the correlation coefficients across all videos.
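The per-video evaluation can be sketched with SciPy as follows, assuming predicted scores of shape (T,) and annotator scores of shape (num_annotators, T).

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def rank_correlations(pred: np.ndarray, annot: np.ndarray) -> tuple:
    # Correlate the predicted scores with each annotator's scores, then average over annotators.
    taus = [kendalltau(pred, a)[0] for a in annot]
    rhos = [spearmanr(pred, a)[0] for a in annot]
    return float(np.mean(taus)), float(np.mean(rhos))
```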

5.3. Summary Generation

We follow previous work to convert importance scores to key shots. Specifically, we use the KTS algorithm [72] to segment videos into temporally consecutive shots and then average the importance scores within each shot to compute the shot-level importance scores. The final key shots are chosen to maximize the total score while guaranteeing that the summary length does not surpass 15 % of the video length. The maximization is conducted by solving the knapsack problem based on dynamic programming [31]. Otani et al. [69] pointed out that using average frame importance scores as shot-level scores will drastically increase the F1 score for the TVSum dataset, and they recommended using the sum of scores to alleviate the problem. However, F1 scores reported by previous works mostly rely on averaging importance scores for shot-level scores. We also report our F1 scores in the same way as they did but focus on analyzing the rank correlation values for comparison and analysis.
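For completeness, the shot selection step can be sketched as a standard 0/1 knapsack solved by dynamic programming; shot lengths and the budget are given in frames, and the score aggregation (average vs. sum) is left to the caller.

```python
def select_shots(shot_scores, shot_lengths, budget):
    # Maximize the total shot score subject to a total length budget (e.g., 15% of the video length).
    n = len(shot_scores)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]  # dp[i][c]: best score with first i shots, capacity c
    for i in range(1, n + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if shot_lengths[i - 1] <= c:
                cand = dp[i - 1][c - shot_lengths[i - 1]] + shot_scores[i - 1]
                if cand > dp[i][c]:
                    dp[i][c] = cand
    selected, c = [], budget                           # backtrack to recover the chosen shots
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= shot_lengths[i - 1]
    return sorted(selected)
```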

5.4. Implementation Details

We follow prior studies by using GoogleNet [54] pre-trained features as the default for standard experiments. For experiments involving YouTube-8M videos, we utilize the quantized Inception-V3 [73] features provided by the dataset [21]. Both types of features are pre-trained on ImageNet [62]. The contrastive refinement module appended to the feature backbone is a lightweight Transformer encoder [7], and so is the uniqueness filter.
Following [9], we standardize each video to an equal length by randomly sub-sampling longer videos and applying nearest-neighbor interpolation to shorter videos. Similar to [9], we did not observe much difference when using different lengths, so we fixed the frame count at 200.
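A sketch of this standardization, assuming a (T, D) feature matrix and a target length of 200 frames:

```python
import numpy as np

def standardize_length(feats: np.ndarray, target_len: int = 200) -> np.ndarray:
    # Random sub-sampling for longer videos; nearest-neighbor (index-repetition) upsampling for shorter ones.
    T = feats.shape[0]
    if T >= target_len:
        idx = np.sort(np.random.choice(T, size=target_len, replace=False))
    else:
        idx = np.round(np.linspace(0, T - 1, target_len)).astype(int)
    return feats[idx]
```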
The model appended to the feature backbone for contrastive refinement is a stack of Transformer encoders with multi-head attention modules [7]. There are two training scenarios: (1) training with TVSum [31], SumMe [34], YouTube, and OVP [68], divided into the canonical, augmented, and transfer settings; (2) training with a subset of videos from the YouTube-8M dataset [21]. We refer to training in the first scenario as standard and in the second as YT8M. In both scenarios, the pre-trained features are first projected into 128 dimensions using a learnable fully connected layer. The projected features are then fed into the Transformer encoders. The model architecture and associated optimization details are outlined in Table 1. Training on the 10,000 YouTube-8M videos takes approximately 6 min for 40 epochs on a single NVIDIA RTX A6000.
We tune two hyperparameters: the ratio $a$, which determines the size of the nearest neighbor set $\mathcal{N}_t$, and the coefficient $\lambda_1$, which controls the balance between the alignment and uniformity losses.

5.5. Quantitative Results

In this section, we compare our results with previous work and conduct the ablation study for different components of our method.
Training-free zero-shot performance. As shown in Table 2 and Table 3, $\bar{\mathcal{L}}^*_{\text{align}}$ and $\bar{\mathcal{L}}^*_{\text{uniform}}$, directly computed using GoogleNet [54] pre-trained features, achieve performance superior to most methods in terms of $\tau$, $\rho$, and the F1 score. Notably, the correlation coefficients $\tau$ and $\rho$ surpass those of supervised methods, e.g., (0.1345, 0.1776) vs. dppLSTM's (0.0298, 0.0385) and SumGraph's (0.094, 0.138) for TVSum. Although DR-DSN2000 has slightly better $\tau$ and $\rho$ for TVSum, it only reaches that performance after 2000 epochs of training, whereas our results are obtained directly with simple computations using the same pre-trained features as those used by DR-DSN.
More training videos are needed for the contrastive refinement. For the results in Table 2 and Table 3, the maximum number of training videos is only 159, coming from the SumMe augmented setting. For the canonical setting, the training set size is 40 videos for TVSum and 20 for SumMe. Without being exposed to many videos, the model tends to overfit specific videos and cannot generalize well. This is similar to the observation in contrastive representation learning, where a larger amount of data, whether from a larger dataset or obtained through data augmentation, helps the model generalize better [15,60]. Therefore, the contrastive refinement results in Table 2 and Table 3 hardly outperform those computed using the pre-trained features.
Contrastive refinement on YouTube-8M videos and transfer to TVSum. The model generalizes better to the test videos when sufficient training videos are given, as shown by the results for TVSum in Table 4. After the contrastive refinement, the results with only $\bar{\mathcal{L}}^*_{\text{align}}$ improve from (0.0595, 0.0779) to (0.0911, 0.1196) for $\tau$ and $\rho$. We can also observe an improvement over $\bar{\mathcal{L}}^*_{\text{align}}$ & $\bar{\mathcal{L}}^*_{\text{uniform}}$ brought by the contrastive refinement.
Contrastive refinement on YouTube-8M videos and transfer to SumMe. The reference summaries in SumMe are binary scores, and summary lengths are constrained to be within 15 % of the video lengths. Therefore, the majority of the reference summary receives exactly zero scores. The contrastive refinement may still enhance the confidence scores for these regions, which receive zero scores from annotators due to the 15 % constraint. This can ultimately reduce the average correlation with the reference summaries, as seen in Table 4.
Suppose that the predicted scores are refined to have sufficiently high confidence for regions with nonzero annotated scores; in this case, they are likely to be selected by the knapsack algorithm used to compute the F1 scores. Therefore, we consider scores that achieve both high F1 and high correlations to be of high quality, as the former tends to overlook the overall correlations between the predicted and annotated scores [69], while the latter focuses on their overall ranked correlations but places less emphasis on prediction confidence. This analysis may explain why the contrastive refinement for L ¯ align * improves the F1 score but decreases the correlations.
The effect of $\bar{\mathcal{L}}_{\text{align}}$. As can be observed in Table 2, Table 3 and Table 4, solely using $\bar{\mathcal{L}}_{\text{align}}$ can already quantify the frame importance well. This indicates that $\bar{\mathcal{L}}_{\text{align}}$ successfully selects frames with diverse semantic information, which are indeed essential for a desirable summary. Since we assume that diverse frames form the foundation of a good summary, we consistently include $\bar{\mathcal{L}}_{\text{align}}$ in further ablations.
The effect of $\bar{\mathcal{L}}_{\text{uniform}}$. $\bar{\mathcal{L}}_{\text{uniform}}$ measures how consistent a frame is with the context of the whole video, thus helping remove frames with diverse content that is hardly related to the video theme. Table 2 and Table 4 show that incorporating $\bar{\mathcal{L}}_{\text{uniform}}$ helps improve the quality of the frame importance for TVSum. We now discuss why $\bar{\mathcal{L}}_{\text{uniform}}$ hurts SumMe performance.
Compared to TVSum videos, many SumMe videos already contain consistent frames due to their slowly evolving content. Such slowly evolving features can be visualized in the T-SNE plots in Figure 3. For videos with such consistent content, $\bar{\mathcal{L}}_{\text{uniform}}$ tends to be high for most of the frames. We show the normalized histogram of $\mathcal{L}^*_{\text{uniform}}$ for both TVSum and SumMe videos in Figure 4. As can be observed, SumMe videos have distinctly higher $\mathcal{L}^*_{\text{uniform}}$ than TVSum videos. Consequently, for videos with monotonous content, most of the frames share a similar visual cue, such as the background, and the frames most likely to be keyframes are those with abrupt or novel content. Therefore, the global consistency metric $\bar{\mathcal{L}}^*_{\text{uniform}}$ is not discriminative enough to be sufficiently helpful and may suppress the importance of frames with novel content. As a result, the other two metrics, local dissimilarity and uniqueness, play the main role in determining keyframes in such videos, as shown in Table 2, Table 3 and Table 4.
The effect of the uniqueness filter $\bar{H}_{\hat{\theta}}$. As shown in Table 2 and Table 3, although $\bar{H}_{\hat{\theta}}$ works well for TVSum videos, it hardly brings any benefits to the SumMe videos. The good performance of the uniqueness filter for TVSum may be due to the relatively straightforward nature of the background frames in TVSum, which are easily identified even when the filter is trained on only a few videos. Therefore, we suppose that $\bar{H}_{\hat{\theta}}$ needs to be trained on more videos to filter out more challenging background frames such that it can generalize to a wider range of videos. This is validated by the $\bar{\mathcal{L}}_{\text{align}}$ & $\bar{H}_{\hat{\theta}}$ results in Table 4, which show both decent F1 scores and correlation coefficients for TVSum and SumMe. The TVSum performance can be further boosted when $\bar{\mathcal{L}}_{\text{uniform}}$ is incorporated.
Comparison with DR-DSN [12] on YouTube-8M. As per Table 2, DR-DSN is the only unsupervised method that matches our performance in terms of τ and ρ and has an official implementation available. We trained DR-DSN on our dataset of YouTube-8M videos to compare it against our method. As shown in Table 4, DR-DSN has difficulty generalizing to the evaluation videos.
Ablations over $\lambda_1$ and $a$. As shown in Figure 5, when $\bar{\mathcal{L}}_{\text{align}}$ & $\bar{H}_{\hat{\theta}}$ is used to produce importance scores, a larger $a$ makes the TVSum performance unstable in terms of both F1 and correlation coefficients, although the SumMe performance is relatively more stable with respect to $a$. We hypothesize that when $a$ becomes larger, the nearest neighbor set becomes noisier, diminishing the effectiveness of both the alignment loss during training and the local dissimilarity metric (post-training alignment loss) used for generating importance scores, due to the inclusion of semantically irrelevant neighbors. For $\lambda_1$, smaller values generally perform better when $a$ has a reasonable value, as larger values of $\lambda_1$ tend to make the uniformity loss suppress the alignment loss. Conversely, a too-small $\lambda_1$ makes the alignment loss suppress the uniformity loss, as we observed unstable training when further decreasing $\lambda_1$. Figure 6 shows the corresponding analysis of the interaction between $\lambda_1$ and $a$ when $\bar{\mathcal{L}}_{\text{align}}$ & $\bar{H}_{\hat{\theta}}$ & $\bar{\mathcal{L}}_{\text{uniform}}$ is used to produce importance scores, with trends similar to those in Figure 5. However, the performance was improved for TVSum but degraded for SumMe due to incorporating $\bar{\mathcal{L}}_{\text{uniform}}$.
Ablation on model sizes. Table 5 shows the ablation results for different sizes of the Transformer encoder [7], where the number of layers and the number of attention heads are varied. We also compare the results with those obtained from DR-DSN [12] trained on the same collected YouTube-8M videos, as DR-DSN has the best $\tau$ and $\rho$ among past unsupervised methods and is the only one with a publicly available official implementation. As can be observed, the model performance is generally stable with respect to the model size, and we choose 4L8H. Moreover, DR-DSN has difficulty generalizing well to the test videos when trained on the YouTube-8M videos.
Comparing the effects of different pre-trained features. As our method can directly compute importance scores from pre-trained features, it is also essential for it to work with different kinds of pre-trained features. To verify this, we computed and evaluated the importance scores generated with 2D supervised features, 3D supervised features, and 2D self-supervised features in Table 6. Different 2D features, whether supervised or self-supervised, all deliver decent results, and the differences between them are minor. The conclusion that $\bar{\mathcal{L}}_{\text{uniform}}$ helps TVSum but hurts SumMe also holds for most of the features. Based on this, we conclude that as long as the features contain decent semantic information learned from supervision or self-supervision, they suffice to compute the importance scores efficiently. The performance of these features when transferred to other downstream image tasks does not necessarily carry over to our method for video summarization, as the latter only requires reliable semantic information (quantified as dot products) to calculate heuristic metrics for video frames.
Notably, our method does not perform optimally with 3D supervised video features. This outcome is anticipated because these 3D features are trained to encode information based on video-level labels, thus capturing less detailed semantic information in individual frames, which is crucial for our method. Still, such 3D features contain part of the holistic information of the associated video and may be a good vehicle for video summarization, which can benefit from such information.

5.6. Qualitative Results

We show the effect of the local dissimilarity ($\bar{\mathcal{L}}_{\text{align}}$), the global consistency ($\bar{\mathcal{L}}_{\text{uniform}}$), and the uniqueness scores generated by the uniqueness filter $\bar{H}_{\hat{\theta}}$ in Figure 7. We visualize and discuss the effects in pairs, i.e., $\bar{\mathcal{L}}_{\text{align}}$ & $\bar{\mathcal{L}}_{\text{uniform}}$ and $\bar{\mathcal{L}}_{\text{align}}$ & $\bar{H}_{\hat{\theta}}$. In the upper half of Figure 7, the green bar selects a frame with high local dissimilarity but low global consistency: a title frame with a disparate appearance that hardly conveys any valuable information about the video. While the black bar selects a frame related to the main content of the video (an interview), it has semantic neighbors with almost the same look and is less likely to contain diverse semantics. The red bar selects a frame with moderate local dissimilarity and global consistency. This frame, along with its semantic neighbors, conveys diverse information, for example, the car with or without people surrounding it. Moreover, it is highly relevant to the overall video context: an interview at a car company.
For the lower half of Figure 7, the green bar selects a frame with information noticeably different from its neighbors, e.g., the sea occupies different proportions of the scene. However, such a frame can appear in any video with water scenes, rendering it not unique to its video. Hence, its uniqueness score is low. The black bar selects a frame with an object specific to this video in the center, but its local semantic neighborhood hardly conveys diverse information. The red bar selects a frame with both high local dissimilarity and high uniqueness, which is a frame related to the gist of the video: the St. Maarten landing.

6. Conclusions

We make the first attempt to approach training-free, zero-shot video summarization by leveraging pre-trained deep features. We utilize contrastive learning to propose three metrics, local dissimilarity, global consistency, and uniqueness, to generate frame importance scores. The proposed metrics directly enable the creation of summaries with quality that is better than or competitive with previous supervised or unsupervised methods requiring extensive training. Moreover, we propose contrastive pre-training on unlabeled videos to further boost the quality of the proposed metrics, the effectiveness of which has been verified by extensive experiments. It would be interesting to explore multi-modal zero-shot video summarization in future work.

Author Contributions

Conceptualization, Z.P.; formal analysis, Z.P.; funding acquisition, Y.N.; investigation, Z.P. and M.O.; methodology, Z.P.; project administration, Y.N. and H.N.; resources, Y.N. and H.N.; software, Z.P.; supervision, Y.N., M.O. and H.N.; validation, Z.P., Y.N. and M.O.; visualization, Z.P.; writing—original draft, Z.P.; writing—review and editing, Z.P., Y.N., M.O. and H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by JST CREST grant no. JPMJCR20D3.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

TVSum: https://github.com/yalesong/tvsum, accessed on 18 July 2024. SumMe: https://paperswithcode.com/dataset/summe, accessed on 18 July 2024. YouTube-8M: https://research.google.com/youtube8m/, accessed on 18 July 2024.

Conflicts of Interest

The author M.O. was employed by the company CyberAgent, Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Otani, M.; Song, Y.; Wang, Y. Video summarization overview. Found. Trends® Comput. Graph. Vis. 2022, 13, 284–335. [Google Scholar] [CrossRef]
  2. Zhang, K.; Chao, W.L.; Sha, F.; Grauman, K. Video summarization with long short-term memory. In Proceedings of the European Conference on Computer Vision, ECCV, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  3. Zhang, K.; Grauman, K.; Sha, F. Retrospective encoders for video summarization. In Proceedings of the European Conference on Computer Vision, ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
  4. Fu, T.J.; Tai, S.H.; Chen, H.T. Attentive and adversarial learning for video summarization. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, WACV, Waikoloa Village, HI, USA, 7–11 January 2019. [Google Scholar]
  5. Fajtl, J.; Sokeh, H.S.; Argyriou, V.; Monekosso, D.; Remagnino, P. Summarizing videos with attention. In Proceedings of the Asian Conference on Computer Vision, ACCV, Perth, Australia, 2–6 December 2018. [Google Scholar]
  6. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems, NeurIPS, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  8. Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised video summarization with adversarial LSTM networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  9. Rochan, M.; Ye, L.; Wang, Y. Video summarization using fully convolutional sequence networks. In Proceedings of the European Conference on Computer Vision, ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
  10. Liu, Y.T.; Li, Y.J.; Yang, F.E.; Chen, S.F.; Wang, Y.C.F. Learning hierarchical self-attention for video summarization. In Proceedings of the IEEE International Conference on Image Processing, ICIP, Taipei, Taiwan, 22–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3377–3381. [Google Scholar]
  11. Rochan, M.; Wang, Y. Video summarization by learning from unpaired data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  12. Zhou, K.; Qiao, Y.; Xiang, T. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the Conference on Artificial Intelligence, AAAI, New Orleans, LA, USA, 26–28 March 2018. [Google Scholar]
  13. Jung, Y.; Cho, D.; Kim, D.; Woo, S.; Kweon, I.S. Discriminative feature learning for unsupervised video summarization. In Proceedings of the Conference on Artificial Intelligence, AAAI, Honolulu, HI, USA, 27–28 January 2019. [Google Scholar]
  14. Jung, Y.; Cho, D.; Woo, S.; Kweon, I.S. Global-and-Local Relative Position Embedding for Unsupervised Video Summarization. In Proceedings of the European Conference on Computer Vision, ECCV, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  15. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  16. Zhou, J.; Wei, C.; Wang, H.; Shen, W.; Xie, C.; Yuille, A.; Kong, T. ibot: Image bert pre-training with online tokenizer. arXiv 2021, arXiv:2111.07832. [Google Scholar]
  17. Hamilton, M.; Zhang, Z.; Hariharan, B.; Snavely, N.; Freeman, W.T. Unsupervised Semantic Segmentation by Distilling Feature Correspondences. arXiv 2022, arXiv:2203.08414. [Google Scholar]
  18. Wang, X.; Girdhar, R.; Yu, S.X.; Misra, I. Cut and learn for unsupervised object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 3124–3134. [Google Scholar]
  19. Zhuang, C.; Zhai, A.L.; Yamins, D. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  20. Wang, T.; Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the International Conference on Machine Learning, ICML, Virtual, 13–18 July 2020. [Google Scholar]
  21. Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. Youtube-8m: A large-scale video classification benchmark. arXiv 2016, arXiv:1609.08675. [Google Scholar]
  22. Pang, Z.; Nakashima, Y.; Otani, M.; Nagahara, H. Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV, Waikoloa, HI, USA, 2–7 January 2023. [Google Scholar]
  23. Narasimhan, M.; Rohrbach, A.; Darrell, T. CLIP-It! language-guided video summarization. In Proceedings of the Annual Conference on Neural Information Processing Systems, NeurIPS, Virtual, 6–14 December 2021. [Google Scholar]
  24. He, B.; Wang, J.; Qiu, J.; Bui, T.; Shrivastava, A.; Wang, Z. Align and attend: Multimodal summarization with dual contrastive losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14867–14878. [Google Scholar]
  25. Takahashi, Y.; Nitta, N.; Babaguchi, N. Video summarization for large sports video archives. In Proceedings of the 2005 IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 6–8 July 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 1170–1173. [Google Scholar]
  26. Tjondronegoro, D.; Chen, Y.P.P.; Pham, B. Highlights for more complete sports video summarization. IEEE Multimed. 2004, 11, 22–37. [Google Scholar] [CrossRef]
  27. Li, B.; Pan, H.; Sezan, I. A general framework for sports video summarization with its application to soccer. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), Hong Kong, China, 6–10 April 2003; IEEE: Piscataway, NJ, USA, 2003; Volume 3, pp. III–169. [Google Scholar]
  28. Choudary, C.; Liu, T. Summarization of visual content in instructional videos. IEEE Trans. Multimed. 2007, 9, 1443–1455. [Google Scholar] [CrossRef]
  29. Liu, T.; Kender, J.R. Rule-based semantic summarization of instructional videos. In Proceedings of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002; IEEE: Piscataway, NJ, USA, 2002; Volume 1, p. I. [Google Scholar]
  30. Liu, T.; Choudary, C. Content extraction and summarization of instructional videos. In Proceedings of the 2006 International Conference on Image Processing, Atlanta, GA, USA, 8–11 October 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 149–152. [Google Scholar]
  31. Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. TVSum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  32. Sang, J.; Xu, C. Character-based movie summarization. In Proceedings of the 18th ACM international Conference on Multimedia, Firenze, Italy, 25–26 October 2010; pp. 855–858. [Google Scholar]
  33. Tsai, C.M.; Kang, L.W.; Lin, C.W.; Lin, W. Scene-based movie summarization via role-community networks. IEEE Trans. Circuits Syst. Video Technol. 2013, 23, 1927–1940. [Google Scholar] [CrossRef]
  34. Gygli, M.; Grabner, H.; Riemenschneider, H.; Van Gool, L. Creating summaries from user videos. In Proceedings of the European Conference on Computer Vision, ECCV, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  35. Zhao, B.; Li, X.; Lu, X. Hierarchical recurrent neural network for video summarization. In Proceedings of the ACM International Conference on Multimedia, ACM MM, Mountain View, CA, USA, 23–27 October 2017. [Google Scholar]
  36. Zhao, B.; Li, X.; Lu, X. HSA-RNN: Hierarchical structure-adaptive RNN for video summarization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  37. Feng, L.; Li, Z.; Kuang, Z.; Zhang, W. Extractive video summarizer with memory augmented neural networks. In Proceedings of the ACM International Conference on Multimedia, ACM MM, Seoul, Republic of Korea, 22–26 October 2018. [Google Scholar]
  38. Wang, J.; Wang, W.; Wang, Z.; Wang, L.; Feng, D.; Tan, T. Stacked memory network for video summarization. In Proceedings of the ACM International Conference on Multimedia, ACM MM, Nice, France, 21–25 October 2019. [Google Scholar]
  39. Casas, L.L.; Koblents, E. Video Summarization with LSTM and Deep Attention Models. In Proceedings of the International Conference on MultiMedia Modeling, MMM, Thessaloniki, Greece, 8–11 January 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 67–79. [Google Scholar]
  40. Ji, Z.; Xiong, K.; Pang, Y.; Li, X. Video summarization with attention-based encoder–decoder networks. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1709–1717. [Google Scholar] [CrossRef]
  41. Ji, Z.; Jiao, F.; Pang, Y.; Shao, L. Deep attentive and semantic preserving video summarization. Neurocomputing 2020, 405, 200–207. [Google Scholar] [CrossRef]
  42. Liu, Y.T.; Li, Y.J.; Wang, Y.C.F. Transforming multi-concept attention into video summarization. In Proceedings of the Asian Conference on Computer Vision, ACCV, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  43. Lin, J.; Zhong, S.h. Bi-Directional Self-Attention with Relative Positional Encoding for Video Summarization. In Proceedings of the IEEE 32nd International Conference on Tools with Artificial Intelligence, ICTAI, Baltimore, MD, USA, 9–11 November 2020. [Google Scholar]
  44. Yuan, Y.; Li, H.; Wang, Q. Spatiotemporal modeling for video summarization using convolutional recurrent neural network. IEEE Access 2019, 7, 64676–64685. [Google Scholar] [CrossRef]
  45. Chu, W.T.; Liu, Y.H. Spatiotemporal Modeling and Label Distribution Learning for Video Summarization. In Proceedings of the IEEE 21st International Workshop on Multimedia Signal Processing, MMSP, Kuala Lumpur, Malaysia, 27–29 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  46. Elfeki, M.; Borji, A. Video summarization via actionness ranking. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, WACV, Waikoloa Village, HI, USA, 7–11 January 2019. [Google Scholar]
  47. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  48. Park, J.; Lee, J.; Kim, I.J.; Sohn, K. SumGraph: Video Summarization via Recursive Graph Modeling. In Proceedings of the European Conference on Computer Vision, ECCV, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  49. Otani, M.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Yokoya, N. Video summarization using deep semantic features. In Proceedings of the Asian Conference on Computer Vision, ACCV, Taipei, Taiwan, 20–24 November 2016. [Google Scholar]
  50. Zhao, J.; Mathieu, M.; LeCun, Y. Energy-based generative adversarial network. arXiv 2016, arXiv:1609.03126. [Google Scholar]
  51. Chen, Y.; Tao, L.; Wang, X.; Yamasaki, T. Weakly supervised video summarization by hierarchical reinforcement learning. In Proceedings of the ACM MM Asia, Beijing, China, 16–18 December 2019. [Google Scholar]
  52. Li, Z.; Yang, L. Weakly Supervised Deep Reinforcement Learning for Video Summarization With Semantically Meaningful Reward. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, WACV, Waikoloa, HI, USA, 5–9 January 2021. [Google Scholar]
  53. He, X.; Hua, Y.; Song, T.; Zhang, Z.; Xue, Z.; Ma, R.; Robertson, N.; Guan, H. Unsupervised video summarization with attentive conditional generative adversarial networks. In Proceedings of the ACM International Conference on Multimedia, ACM MM, Nice, France, 21–25 October 2019. [Google Scholar]
  54. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  55. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  56. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, San Diego, CA, USA, 20–25 June 2005. [Google Scholar]
  57. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv 2018, arXiv:1808.06670. [Google Scholar]
  58. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  59. Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of the European Conference on Computer Vision, ECCV, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  60. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, ICML, Virtual, 13–18 July 2020. [Google Scholar]
  61. Wang, F.; Liu, H. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  62. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, NeurIPS, Lake Tahoe, NV, USA, 3–6 December 2012. [Google Scholar]
  63. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  64. Nguyen, P.X.; Ramanan, D.; Fowlkes, C.C. Weakly-supervised action localization with background modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  65. Liu, D.; Jiang, T.; Wang, Y. Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  66. Lee, P.; Uh, Y.; Byun, H. Background suppression network for weakly-supervised temporal action localization. In Proceedings of the Conference on Artificial Intelligence, AAAI, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  67. Lee, P.; Wang, J.; Lu, Y.; Byun, H. Weakly-supervised temporal action localization by uncertainty modeling. In Proceedings of the Conference on Artificial Intelligence, AAAI, Vancouver, BC, Canada, 2–9 February 2021. [Google Scholar]
  68. De Avila, S.E.F.; Lopes, A.P.B.; da Luz, A., Jr.; de Albuquerque Araújo, A. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognit. Lett. 2011, 32, 56–68. [Google Scholar] [CrossRef]
  69. Otani, M.; Nakashima, Y.; Rahtu, E.; Heikkila, J. Rethinking the evaluation of video summaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  70. Kendall, M.G. The treatment of ties in ranking problems. Biometrika 1945, 33, 239–251. [Google Scholar] [CrossRef] [PubMed]
  71. Beyer, W.H. Standard Probability and Statistics: Tables and Formulae; CRC Press: Boca Raton, FL, USA, 1991. [Google Scholar]
  72. Potapov, D.; Douze, M.; Harchaoui, Z.; Schmid, C. Category-specific video summarization. In Proceedings of the European Conference on Computer Vision, ECCV, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  73. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  74. Saquil, Y.; Chen, D.; He, Y.; Li, C.; Yang, Y.L. Multiple Pairwise Ranking Networks for Personalized Video Summarization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  75. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  76. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  77. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  78. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  79. Hara, K.; Kataoka, H.; Satoh, Y. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6546–6555. [Google Scholar]
  80. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  81. Bao, H.; Dong, L.; Wei, F. Beit: Bert pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
  82. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
Figure 1. A comparison between our method and previous work.
Figure 2. A conceptual illustration of the three metrics in the semantic space: local dissimilarity, global consistency, and uniqueness. The images come from the SumMe [34] and TVSum [31] datasets. Dots of the same color indicate features from the same video. For conciseness, we show only one frame each for “Video 2” and “Video 3” to illustrate uniqueness.
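As a concrete, deliberately simplified illustration of how such frame-level metrics can be derived from pre-trained features, the sketch below treats local dissimilarity as the mean cosine distance of a frame to its temporal neighbours, global consistency as the cosine similarity to the video-level mean feature, and uniqueness as the mean cosine distance to frames from other videos. These specific formulas are assumptions for illustration only; the definitions actually used in this work are given in the method section.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def frame_metrics(frames, other_videos, window=5):
    """frames: (T, D) pre-trained features of one video.
    other_videos: non-empty list of (T_i, D) feature arrays from different videos.
    Returns per-frame local dissimilarity, global consistency, and uniqueness."""
    f = l2_normalize(np.asarray(frames, dtype=np.float32))
    T = f.shape[0]
    sim = f @ f.T  # pairwise cosine similarities within the video

    # Local dissimilarity: 1 - mean similarity to frames in a temporal window.
    local = np.zeros(T)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        neighbours = np.delete(np.arange(lo, hi), t - lo)
        if neighbours.size:
            local[t] = 1.0 - sim[t, neighbours].mean()

    # Global consistency: similarity to the video-level mean feature.
    centroid = l2_normalize(f.mean(axis=0, keepdims=True))
    global_consistency = (f @ centroid.T).ravel()

    # Uniqueness: dissimilarity to frames sampled from other videos.
    others = l2_normalize(np.concatenate([np.asarray(v, dtype=np.float32) for v in other_videos]))
    uniqueness = 1.0 - (f @ others.T).mean(axis=1)

    return local, global_consistency, uniqueness
```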
Figure 3. t-SNE plots for all 25 SumMe videos. Many videos contain features that evolve slowly and maintain an overall similarity across all frames.
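Plots like those in Figure 3 can be reproduced for any set of per-frame features with scikit-learn. The sketch below uses a hypothetical `video_features` list of (T, D) arrays and colours points by frame index, so slowly evolving content appears as a smooth colour gradient.

```python
import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne_grid(video_features, cols=5, perplexity=30, seed=0):
    """video_features: list of (T, D) frame-feature arrays, one per video."""
    rows = math.ceil(len(video_features) / cols)
    fig, axes = plt.subplots(rows, cols, figsize=(3 * cols, 3 * rows))
    axes = np.atleast_1d(axes).ravel()
    for ax, feats in zip(axes, video_features):
        emb = TSNE(n_components=2, perplexity=perplexity,
                   random_state=seed).fit_transform(np.asarray(feats))
        ax.scatter(emb[:, 0], emb[:, 1], c=np.arange(len(emb)), cmap="viridis", s=5)
        ax.set_xticks([]); ax.set_yticks([])
    for ax in axes[len(video_features):]:
        ax.axis("off")  # hide unused panels
    plt.tight_layout()
    plt.show()
```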
Figure 4. The histogram (density) of $\bar{L}^{*}_{\mathrm{uniform}}$ (before normalization) for TVSum and SumMe videos. SumMe videos have distinctly higher values than those for TVSum videos.
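Assuming that the per-video value behind Figure 4 follows the common contrastive uniformity objective, $\log \mathbb{E}\left[\exp\left(-t\lVert x - y \rVert^{2}\right)\right]$ over pairs of L2-normalized frame features (an assumption made only for illustration; the exact definition appears in the method section), the histogram could be produced as follows, where `tvsum_feats` and `summe_feats` are hypothetical lists of per-video feature arrays.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def video_uniformity(frames, t=2.0):
    """frames: (T, D) array or tensor of frame features for one video."""
    x = F.normalize(torch.as_tensor(frames, dtype=torch.float32), dim=-1)
    sq_dist = torch.cdist(x, x).pow(2)  # pairwise squared Euclidean distances
    return torch.exp(-t * sq_dist).mean().log().item()

def plot_uniformity_histogram(tvsum_feats, summe_feats, bins=20):
    plt.hist([video_uniformity(f) for f in tvsum_feats],
             bins=bins, density=True, alpha=0.6, label="TVSum")
    plt.hist([video_uniformity(f) for f in summe_feats],
             bins=bins, density=True, alpha=0.6, label="SumMe")
    plt.xlabel("per-video uniformity (before normalization)")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```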
Figure 5. Ablation results over $\lambda_1$ and $a$ when using $\bar{L}_{\mathrm{align}}$ & $\bar{H}_{\hat{\theta}}$ to produce importance scores.
Figure 6. Ablation results over $\lambda_1$ and $a$ when using $\bar{L}_{\mathrm{align}}$ & $\bar{H}_{\hat{\theta}}$ & $\bar{L}_{\mathrm{uniform}}$ to produce importance scores.
Figure 7. Qualitative analysis of two video examples. The left column contains importance scores, where “GT” stands for ground truth. The green bar selects an anchor frame with high $\bar{L}_{\mathrm{align}}$ but low $\bar{L}_{\mathrm{uniform}}$ or $\bar{H}_{\hat{\theta}}$, the red bar selects one with a non-trivial magnitude for both metrics, and the black bar selects one with low $\bar{L}_{\mathrm{align}}$ but high $\bar{L}_{\mathrm{uniform}}$ or $\bar{H}_{\hat{\theta}}$. For each selected anchor frame, we show five samples from its top 10 semantic nearest neighbors within the dashed boxes on the right.
Table 1. Model and optimization details.
| Setting | Layers | Heads | d_model | d_head | d_inner | Optimizer | LR | Weight Decay | Batch Size | Epoch | Dropout |
| Standard | 4 | 1 | 128 | 64 | 512 | Adam | 0.0001 | 0.0001 | 32 (TVSum) / 8 (SumMe) | 400 | - |
| YT8M | 4 | 8 | 128 | 64 | 512 | Adam | 0.0001 | 0.0005 | 128 | 400 | - |
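The “Standard” row of Table 1 describes a small Transformer encoder. A minimal PyTorch sketch with those hyperparameters is given below (4 layers, 1 attention head, d_model = 128, feed-forward size 512, Adam with learning rate 1e-4 and weight decay 1e-4). The input projection from the pre-trained feature dimension, the scoring head, and the dropout value are illustrative assumptions; note also that PyTorch ties the per-head dimension to d_model / heads, so the table's d_head = 64 would require a custom attention module.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Assigns an importance score to every frame of a video."""
    def __init__(self, feat_dim=2048, d_model=128, n_heads=1,
                 n_layers=4, d_inner=512, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # project pre-trained features to d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_inner,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)          # per-frame importance score

    def forward(self, frames):                     # frames: (B, T, feat_dim)
        hidden = self.encoder(self.proj(frames))
        return self.head(hidden).squeeze(-1)       # (B, T)

model = FrameScorer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
```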
Table 2. Ablation results in terms of $\tau$ and $\rho$, along with comparisons to previous work in the canonical setting. DR-DSN60 refers to DR-DSN trained for 60 epochs, and similarly for DR-DSN2000. Our scores with superscript ∗ are computed directly from pre-trained features. The results were generated with $(\lambda_1, a) = (0.5, 0.1)$. Boldfaced scores represent the best among supervised methods, and blue scores are the best among methods that use no annotations. Methods with † are vision–language approaches. Please refer to the text for analyses of the results.
| Method | TVSum $\tau$ | TVSum $\rho$ | SumMe $\tau$ | SumMe $\rho$ |
| Human baseline [74] | 0.1755 | 0.2019 | 0.1796 | 0.1863 |
| Supervised | | | | |
| VASNet [5,74] | 0.1690 | 0.2221 | 0.0224 | 0.0255 |
| dppLSTM [2,69] | 0.0298 | 0.0385 | −0.0256 | −0.0311 |
| SumGraph [48] | 0.094 | 0.138 | - | - |
| Multi-ranker [74] | 0.1758 | 0.2301 | 0.0108 | 0.0137 |
| Clip-It † [23] | 0.108 | 0.147 | - | - |
| A$^2$Summ † [24] | 0.137 | 0.165 | 0.108 | 0.129 |
| Unsupervised | | | | |
| DR-DSN60 [12,69] | 0.0169 | 0.0227 | 0.0433 | 0.0501 |
| DR-DSN2000 [12,74] | 0.1516 | 0.198 | −0.0159 | −0.0218 |
| SUM-FCN$_{\mathrm{unsup}}$ [9,74] | 0.0107 | 0.0142 | 0.0080 | 0.0096 |
| SUM-GAN [8,74] | −0.0535 | −0.0701 | −0.0095 | −0.0122 |
| CSNet + GL + RPE [14] | 0.070 | 0.091 | - | - |
| Training-free | | | | |
| $\bar{L}^{*}_{\mathrm{align}}$ | 0.1055 | 0.1389 | 0.0960 | 0.1173 |
| $\bar{L}^{*}_{\mathrm{align}}$ & $\bar{L}^{*}_{\mathrm{uniform}}$ | 0.1345 | 0.1776 | 0.0819 | 0.1001 |
| Contrastively refined | | | | |
| $\bar{L}_{\mathrm{align}}$ | 0.1002 | 0.1321 | 0.0942 | 0.1151 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{L}_{\mathrm{uniform}}$ | 0.1231 | 0.1625 | 0.0689 | 0.0842 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{H}_{\hat{\theta}}$ | 0.1388 | 0.1827 | 0.0585 | 0.0715 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{L}_{\mathrm{uniform}}$ & $\bar{H}_{\hat{\theta}}$ | 0.1609 | 0.2118 | 0.0358 | 0.0437 |
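The rank correlations reported in Table 2 compare predicted frame-importance scores against the annotators' reference scores using Kendall's $\tau$ [70] and Spearman's $\rho$ [71]. A minimal sketch of the per-video computation is given below; averaging first over annotators and then over videos is one common convention and is an assumption here.

```python
import numpy as np
from scipy import stats

def rank_correlations(pred_scores, human_scores):
    """pred_scores: (T,) predicted importance scores for one video.
    human_scores: (N_annotators, T) reference scores. Returns mean tau and rho."""
    taus, rhos = [], []
    for ref in np.atleast_2d(human_scores):
        tau, _ = stats.kendalltau(pred_scores, ref)
        rho, _ = stats.spearmanr(pred_scores, ref)
        taus.append(tau)
        rhos.append(rho)
    return float(np.mean(taus)), float(np.mean(rhos))
```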
Table 3. Ablation results regarding F1 and comparisons with previous unsupervised methods. The boldfaced results are the best ones. Please refer to Table 2's caption for the explanation of the notations and to the text for analyses of the results.
| Method | TVSum C | TVSum A | TVSum T | SumMe C | SumMe A | SumMe T |
| Unsupervised | | | | | | |
| DR-DSN60 [12] | 57.6 | 58.4 | 57.8 | 41.4 | 42.8 | 42.4 |
| SUM-FCN$_{\mathrm{unsup}}$ [9] | 52.7 | - | - | 41.5 | - | 39.5 |
| SUM-GAN [8] | 51.7 | 59.5 | - | 39.1 | 43.4 | - |
| UnpairedVSN [11] | 55.6 | - | 55.7 | 47.5 | - | 41.6 |
| CSNet [13] | 58.8 | 59 | 59.2 | 51.3 | 52.1 | 45.1 |
| CSNet + GL + RPE [14] | 59.1 | - | - | 50.2 | - | - |
| SumGraph$_{\mathrm{unsup}}$ [48] | 59.3 | 61.2 | 57.6 | 49.8 | 52.1 | 47 |
| Training-free | | | | | | |
| $\bar{L}^{*}_{\mathrm{align}}$ | 56.4 | 56.4 | 54.6 | 43.5 | 43.5 | 39.4 |
| $\bar{L}^{*}_{\mathrm{align}}$ & $\bar{L}^{*}_{\mathrm{uniform}}$ | 58.4 | 58.4 | 56.8 | 47.2 | 46.07 | 41.7 |
| Contrastively refined | | | | | | |
| $\bar{L}_{\mathrm{align}}$ | 54.6 | 55.1 | 53 | 46.8 | 47.1 | 41.5 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{L}_{\mathrm{uniform}}$ | 58.8 | 59.9 | 57.4 | 46.7 | 48.4 | 41.1 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{H}_{\hat{\theta}}$ | 53.8 | 56 | 54.3 | 45.2 | 45 | 45.3 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{L}_{\mathrm{uniform}}$ & $\bar{H}_{\hat{\theta}}$ | 59.5 | 59.9 | 59.7 | 46.8 | 45.5 | 43.9 |
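The F1 scores in Table 3 measure the overlap between a predicted summary and a reference summary at the frame level. A minimal sketch of that overlap-based F1 is given below; the full benchmark protocol additionally converts frame scores into key shots under a 15% length budget (e.g., via knapsack selection), which is omitted here.

```python
import numpy as np

def summary_f1(pred_mask, gt_mask):
    """pred_mask, gt_mask: (T,) binary arrays marking frames selected for the summary."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    if pred.sum() == 0 or gt.sum() == 0:
        return 0.0
    overlap = np.logical_and(pred, gt).sum()
    precision = overlap / pred.sum()
    recall = overlap / gt.sum()
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```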
Table 4. The transfer evaluation setting with the YouTube-8M dataset, where training is conducted solely on the collected YouTube-8M videos and then evaluated on TVSum and SumMe. The results from DR-DSN [12] are also provided for comparison.
| Method | TVSum F1 | TVSum $\tau$ | TVSum $\rho$ | SumMe F1 | SumMe $\tau$ | SumMe $\rho$ |
| Unsupervised | | | | | | |
| DR-DSN [12] | 51.6 | 0.0594 | 0.0788 | 39.8 | −0.0142 | −0.0176 |
| Training-free | | | | | | |
| $\bar{L}^{*}_{\mathrm{align}}$ | 55.9 | 0.0595 | 0.0779 | 45.5 | 0.1000 | 0.1237 |
| $\bar{L}^{*}_{\mathrm{align}}$ & $\bar{L}^{*}_{\mathrm{uniform}}$ | 56.7 | 0.0680 | 0.0899 | 42.9 | 0.0531 | 0.0649 |
| Contrastively refined | | | | | | |
| $\bar{L}_{\mathrm{align}}$ | 56.2 | 0.0911 | 0.1196 | 46.6 | 0.0776 | 0.0960 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{L}_{\mathrm{uniform}}$ | 57.3 | 0.1130 | 0.1490 | 40.9 | 0.0153 | 0.0190 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{H}_{\hat{\theta}}$ | 58.1 | 0.1230 | 0.1612 | 48.7 | 0.0780 | 0.0964 |
| $\bar{L}_{\mathrm{align}}$ & $\bar{L}_{\mathrm{uniform}}$ & $\bar{H}_{\hat{\theta}}$ | 59.4 | 0.1563 | 0.2048 | 43.2 | 0.0449 | 0.0553 |
Table 5. Ablation results for the model size and comparison with DR-DSN [12] trained on the same YouTube-8M videos, where 2L2H represents “2 layers, 2 heads” and the rest follow similarly. All three components $\bar{L}_{\mathrm{align}}$ & $\bar{H}_{\hat{\theta}}$ & $\bar{L}_{\mathrm{uniform}}$ are used with $(a, \lambda_1) = (0.05, 0.25)$ for both SumMe and TVSum for a fair comparison with DR-DSN, which also uses a representativeness-based training objective.
| Model | TVSum F1 | TVSum $\tau$ | TVSum $\rho$ | SumMe F1 | SumMe $\tau$ | SumMe $\rho$ |
| DR-DSN [12] | 51.6 | 0.0594 | 0.0788 | 39.8 | −0.0142 | −0.0176 |
| 2L2H | 58.0 | 0.1492 | 0.1953 | 42.9 | 0.0689 | 0.0850 |
| 2L4H | 58.1 | 0.1445 | 0.1894 | 42.8 | 0.0644 | 0.0794 |
| 2L8H | 58.8 | 0.1535 | 0.2011 | 44.0 | 0.0584 | 0.0722 |
| 4L2H | 57.4 | 0.1498 | 0.1963 | 45.3 | 0.0627 | 0.0776 |
| 4L4H | 58.3 | 0.1534 | 0.2009 | 43.1 | 0.0640 | 0.0790 |
| 4L8H | 58.5 | 0.1564 | 0.2050 | 42.7 | 0.0618 | 0.0765 |
Table 6. Evaluation results with different pre-trained features. The results were produced under the transfer setting with $a = 0.1$. For each dataset, the first three columns (F1, $\tau$, $\rho$) are computed with $\bar{L}^{*}_{\mathrm{align}}$ alone and the last three with $\bar{L}^{*}_{\mathrm{align}}$ & $\bar{L}^{*}_{\mathrm{unif}}$.
| Backbone | TVSum ($\bar{L}^{*}_{\mathrm{align}}$) F1 | $\tau$ | $\rho$ | TVSum (& $\bar{L}^{*}_{\mathrm{unif}}$) F1 | $\tau$ | $\rho$ | SumMe ($\bar{L}^{*}_{\mathrm{align}}$) F1 | $\tau$ | $\rho$ | SumMe (& $\bar{L}^{*}_{\mathrm{unif}}$) F1 | $\tau$ | $\rho$ |
| Supervised (2D) | | | | | | | | | | | | |
| VGG19 [75] | 50.62 | 0.0745 | 0.0971 | 55.91 | 0.1119 | 0.1473 | 45.16 | 0.0929 | 0.1151 | 43.28 | 0.0899 | 0.1114 |
| GoogleNet [54] | 54.67 | 0.0985 | 0.1285 | 57.09 | 0.1296 | 0.1699 | 41.89 | 0.0832 | 0.1031 | 40.97 | 0.0750 | 0.0929 |
| InceptionV3 [73] | 55.02 | 0.1093 | 0.1434 | 55.63 | 0.0819 | 0.1082 | 42.71 | 0.0878 | 0.1087 | 42.30 | 0.0688 | 0.0851 |
| ResNet50 [76] | 51.19 | 0.0806 | 0.1051 | 55.19 | 0.1073 | 0.1410 | 42.30 | 0.0868 | 0.1076 | 43.86 | 0.0737 | 0.0914 |
| ResNet101 [76] | 51.75 | 0.0829 | 0.1081 | 54.88 | 0.1118 | 0.1469 | 42.32 | 0.0911 | 0.1130 | 44.39 | 0.0736 | 0.0913 |
| ViT-S-16 [77] | 53.48 | 0.0691 | 0.0903 | 56.15 | 0.1017 | 0.1332 | 40.30 | 0.0652 | 0.0808 | 40.88 | 0.0566 | 0.0701 |
| ViT-B-16 [77] | 52.85 | 0.0670 | 0.0873 | 56.15 | 0.0876 | 0.1152 | 42.10 | 0.0694 | 0.0860 | 41.65 | 0.0582 | 0.0723 |
| Swin-S [78] | 52.05 | 0.0825 | 0.1082 | 57.58 | 0.1120 | 0.1475 | 41.18 | 0.0880 | 0.1090 | 41.63 | 0.0825 | 0.1022 |
| Supervised (3D) | | | | | | | | | | | | |
| R3D50 [79] | 52.09 | 0.0590 | 0.0766 | 53.35 | 0.0667 | 0.0869 | 37.40 | 0.0107 | 0.0138 | 41.03 | 0.0150 | 0.0190 |
| R3D101 [79] | 49.77 | 0.0561 | 0.0727 | 52.15 | 0.0644 | 0.0834 | 33.62 | 0.0173 | 0.0216 | 34.96 | 0.0212 | 0.0264 |
| Self-supervised (2D) | | | | | | | | | | | | |
| MoCo [80] | 51.31 | 0.0797 | 0.1034 | 55.97 | 0.1062 | 0.1390 | 42.01 | 0.0768 | 0.0953 | 43.19 | 0.0711 | 0.0882 |
| DINO-S-16 [15] | 52.50 | 0.0970 | 0.1268 | 57.57 | 0.1200 | 0.1583 | 42.77 | 0.0848 | 0.1050 | 42.67 | 0.0737 | 0.0913 |
| DINO-B-16 [15] | 52.48 | 0.0893 | 0.1170 | 57.02 | 0.1147 | 0.1515 | 41.07 | 0.0861 | 0.1066 | 44.14 | 0.0679 | 0.0843 |
| BEiT-B-16 [81] | 49.64 | 0.1125 | 0.1468 | 56.34 | 0.1270 | 0.1665 | 36.91 | 0.0554 | 0.0686 | 38.48 | 0.0507 | 0.0629 |
| MAE-B-16 [82] | 50.40 | 0.0686 | 0.0892 | 54.58 | 0.1013 | 0.1327 | 40.32 | 0.0560 | 0.0695 | 39.46 | 0.0484 | 0.0601 |
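Table 6 varies only the backbone that produces the frame features. A minimal torchvision sketch for extracting per-frame features with, for example, ResNet-50 [76] or GoogLeNet [54] is shown below; frame decoding and sampling are assumed to have happened already, and any dataset-specific preprocessing used in this work is not reproduced here.

```python
import torch
import torchvision.models as models
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def build_feature_extractor(name="resnet50"):
    """Returns an ImageNet-pretrained backbone with its classification head removed."""
    if name == "resnet50":
        net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    elif name == "googlenet":
        net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
    else:
        raise ValueError(f"unknown backbone: {name}")
    net.fc = torch.nn.Identity()  # keep the pooled feature vector instead of class logits
    return net.eval()

@torch.no_grad()
def extract_features(pil_frames, extractor, batch_size=64):
    """pil_frames: list of PIL images sampled from one video. Returns a (T, D) tensor."""
    feats = []
    for i in range(0, len(pil_frames), batch_size):
        batch = torch.stack([preprocess(img) for img in pil_frames[i:i + batch_size]])
        feats.append(extractor(batch))
    return torch.cat(feats)
```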