2.1. Approaches for Video Summarization
Over the last few years, a significant number of works have introduced a wide range of video summarization techniques, with notable results.
In [14], the authors formulate video summarization as a sequential decision-making process and develop a deep summarization network, trained with an end-to-end reinforcement-learning-based framework, that predicts for each frame of a video a probability indicating whether that frame will be part of the video summary. The model follows an encoder-decoder architecture, where the encoder is a convolutional neural network (CNN) responsible for frame feature extraction and the decoder is a long short-term memory (LSTM) network responsible for the frame probabilities.
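To make this architecture concrete, the following is a minimal sketch of such a per-frame scoring network; the layer sizes are illustrative, and the CNN encoder is represented here by pre-computed feature vectors rather than the exact network of [14]:

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Encoder-decoder frame scorer: pre-computed CNN features in,
    per-frame keyframe probabilities out (sizes are illustrative)."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        # LSTM "decoder" over the sequence of CNN frame features
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                           # (B, T, feat_dim)
        h, _ = self.lstm(feats)                         # (B, T, 2*hidden)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # (B, T) in [0, 1]

# Example: score a 120-frame video described by 1024-d CNN features
scores = FrameScorer()(torch.randn(1, 120, 1024))
```

In the reinforcement learning setting, such per-frame probabilities parameterize a frame-selection policy whose reward reflects the quality of the resulting summary.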
A novel supervised technique for summarizing videos based on an LSTM architecture was proposed in [15]. This approach automatically selects keyframes or keyshots, deriving compact and meaningful video summaries. In addition, the authors report that techniques such as domain adaptation may further improve the summarization process. A generic video summarization algorithm was proposed in [16], which fuses features from different multimodal streams. A low-level feature fusion approach that takes visual, auditory, and textual streams as input is employed to build a well-formed representation of the input video, so that the summary is constructed from the informative parts of all streams.
In [17], it is pointed out that the main goal of a video summarization methodology is to produce a more compact version of the initial raw video without losing much semantic information, while keeping the result comprehensible for the viewer. The authors present an innovative solution, namely SASUM, which, in contrast to previous techniques that consider only the diversity of the summary, extracts the most descriptive parts of the video. Specifically, SASUM consists of a frame selector and video descriptors that compose a final summary whose generated description minimizes the distance to descriptions already created by humans. A memory- and computation-efficient technique based on a hierarchical graph-based algorithm, able to perform spatio-temporal segmentation of long video sequences, was presented in [18]. The algorithm repeatedly partitions the video into space–time regions clustered by their frequencies, constructing a tree of such spatio-temporal segments. Moreover, the algorithm is boosted by introducing dense optical flow to describe the temporal connections in the aforementioned graph.
In [19], it is emphasized that the huge number of videos produced on a daily basis calls for summarization techniques that present a condensed version of a video without the unnecessary information. More specifically, their approach, namely SalSum, makes use of a generative adversarial network (GAN) that has been pre-trained on human eye fixations. The model combines color and visual saliency cues in an unsupervised manner. The salient regions, along with the color information derived from the visual stream of the video, compose the video summary.
The work proposed in [20] focuses on developing a computational model based on visual attention in order to summarize videos, mostly from television archives. The computational model uses several techniques to assemble a static video summary, such as face detection, motion estimation, and saliency map computation. The final video summary produced by this model consists of a collection of keyframes or saliency images extracted from the raw video. A novel video summarization approach, namely VISCOM, was proposed in [21]; it is based on color co-occurrence matrices computed from the video, which are used to describe each video frame. Then, a synopsis of the most informative frames of the original video is composed. VISCOM was tested on a large number of videos from a variety of categories, in order to make the summarization model robust. In [22], the authors focused on the importance of video summaries for tasks such as video search and retrieval. Departing from approaches based on recurrent neural networks, they tested a fully convolutional sequence network, adapted from semantic segmentation, treating video summarization as a sequence labeling problem.
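A minimal sketch of this fully convolutional idea is given below; the layer sizes are illustrative, and the stack of 1-D temporal convolutions is a simplified stand-in, not the exact architecture of [22]:

```python
import torch
import torch.nn as nn

# 1-D convolutions over time replace the recurrent decoder, treating
# summarization as dense sequence labeling (one score per frame),
# analogous to per-pixel labeling in semantic segmentation.
fcn = nn.Sequential(
    nn.Conv1d(1024, 256, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv1d(256, 1, kernel_size=1),    # one importance score per frame
    nn.Sigmoid(),
)

feats = torch.randn(1, 1024, 120)        # (batch, feat_dim, n_frames)
frame_scores = fcn(feats).squeeze(1)     # (batch, n_frames) in [0, 1]
```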
A deep video feature extraction process was proposed in [23], aiming to find the most informative parts of a video, which are required in order to analyze the video content. The authors included various levels of content to train their deep feature extraction technique. Their deep neural network also incorporated the description of the video in order to extract the video features, and then constructed the video summary by applying clustering-based techniques, as also discussed by the authors of [24]. Their evaluation is based on their own video summaries constructed by humans. The main goal of [24] was to remove redundant frames of an input video by clustering informative frames, which appeared to be the most effective way to construct a static video summary, built from all cluster centers. The frame representation used within the clustering process was based on the Bag-of-Visual-Words model.
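The clustering step of such static-summary methods can be sketched as follows; generic per-frame descriptors stand in for the Bag-of-Visual-Words histograms, and the number of keyframes is an illustrative parameter:

```python
import numpy as np
from sklearn.cluster import KMeans

def keyframes_by_clustering(descriptors, n_keyframes=5, seed=0):
    """Cluster per-frame descriptors (e.g., Bag-of-Visual-Words
    histograms) and return, for each cluster, the index of the frame
    closest to its center, forming a static summary."""
    km = KMeans(n_clusters=n_keyframes, n_init=10, random_state=seed)
    labels = km.fit_predict(descriptors)
    picks = []
    for c in range(n_keyframes):
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(
            descriptors[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[dists.argmin()]))
    return sorted(picks)

# Example: 300 frames, each described by a 128-bin visual-word histogram
histograms = np.random.rand(300, 128)
print(keyframes_by_clustering(histograms))
```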
KVS, a novel video summarization approach proposed in [25], is driven by the provided video category, obtained mainly from the title or the description of the video. A temporal segmentation is initially applied to a given video, and its result is used as input to the supervised KVS algorithm, in order to build higher-quality video summaries compared to unsupervised approaches that are blind to the video category.
Ma et al. [26] proposed an approach for keyframe extraction and video skimming based on a user attention model. They extracted visual, audio, and linguistic features and built a motion attention model based on the motion vector field. They created three types of maps, based on intensity and on spatial and temporal coherence, which were then fused to form a saliency map. They also incorporated a static model to select salient background regions, extracted face and camera attention features, and finally created audio, speech, and music models. The aforementioned attention components were linearly fused to create an “attention” curve; local maxima of this curve within shots were used for keyframe extraction, while skim segments were selected using several criteria.
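In the same spirit, a minimal sketch of linear attention fusion with keyframe selection at local maxima is shown below; the fusion weights and the minimum peak distance are illustrative choices, not the values used in [26]:

```python
import numpy as np
from scipy.signal import find_peaks

def attention_keyframes(visual, audio, motion, w=(0.5, 0.3, 0.2)):
    """Linearly fuse per-frame attention components into a single
    "attention" curve and return local maxima as keyframe candidates."""
    curve = w[0] * visual + w[1] * audio + w[2] * motion
    peaks, _ = find_peaks(curve, distance=15)  # min frame gap between peaks
    return peaks, curve

# Example with synthetic, normalized attention components (300 frames)
t = np.linspace(0, 10, 300)
visual = np.abs(np.sin(t))
audio = np.abs(np.cos(1.3 * t))
motion = 0.2 * np.random.rand(300)
keyframes, curve = attention_keyframes(visual, audio, motion)
```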
Mahasseni et al. [27] trained a deep adversarial LSTM network consisting of a “summarizer” and a “discriminator”, so as to minimize the distance between ground-truth videos and their summarizations, based on deep features extracted by a CNN. More specifically, the former consists of a selector and an encoder that select interesting frames from the input video and encode them into a deep feature representation, while the latter is a decoder that classifies a given sequence as “original” or “summary”. The proposed network tries to fool the discriminator by presenting the video summary as if it were the original input video, until the two representations become indistinguishable.
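A minimal sketch of this adversarial setup follows; the scorer, decoder, and discriminator below are simplified stand-ins with illustrative dimensions, not the exact components of [27]:

```python
import torch
import torch.nn as nn

feat_dim = 1024  # dimensionality of the CNN frame features (illustrative)

# "Summarizer": a frame scorer plus a decoder that reconstructs a
# feature sequence from the score-weighted input features.
scorer = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
decoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)

class Discriminator(nn.Module):
    """Classifies a feature sequence as 'original' vs. 'summary'."""
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, seq):
        _, (h, _) = self.rnn(seq)   # final hidden state summarizes the clip
        return self.out(h[-1])      # P(sequence comes from the original)

disc = Discriminator()
x = torch.randn(1, 120, feat_dim)            # CNN features of one video
recon, _ = decoder(scorer(x) * x)            # summary-based reconstruction

bce = nn.BCELoss()
d_loss = (bce(disc(x), torch.ones(1, 1)) +
          bce(disc(recon.detach()), torch.zeros(1, 1)))
g_loss = bce(disc(recon), torch.ones(1, 1))  # summarizer fools the critic
```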
We should note that all methods and techniques presented in this section are significant contributions to video summarization, with some of them representing the current state of the art. However, most of them do not consider both visual and aural information. Moreover, none of the aforementioned works is applied to user-generated videos. Our work, which concentrates on combining the information of the different modalities extracted from a user-generated video stream, can address this need.
2.2. Related Data Sets
As has already been mentioned, in this work we aim to automatically generate summaries from user-generated videos, mostly of action and extreme sports. Therefore, in the following we present recent, publicly available data sets for related video summarization tasks.
The “MED Summaries” dataset [25] is used for the evaluation of dynamic video summaries; it contains annotations of 160 videos in total, with ten event categories in the test set. Indicative categories are “birthday party”, “changing a vehicle tire”, “flash mob gathering”, “getting a vehicle unstuck”, “grooming an animal”, and so forth. The “TVSum” (Title-based Video Summarization) dataset [28] aims to address the challenging task of exploiting prior knowledge of the main topic of a video. It consists of 50 videos of various genres (e.g., “news”, “how-to”, “documentary”, “vlog”, “egocentric”) and 1000 shot-level importance score annotations obtained via crowd-sourcing (20 per video), while video duration ranges between 2 and 10 min. The video and annotation data permit automatic evaluation of video summarization techniques, without having to conduct an (expensive) user study. The “SumMe” dataset [29] consists of 25 videos covering holidays, events, and sports, downloaded from YouTube, each annotated with at least 15 human-created summaries (390 in total), while the length of the videos ranges from 1 to 6 min. The “UT Ego” (Univ. of Texas at Austin Egocentric) dataset [30] contains 10 videos (only 4 of which are available, due to privacy reasons) captured from head-mounted cameras during a variety of activities such as “eating”, “shopping”, “attending a lecture”, “driving”, and “cooking”. Each video is about 3–5 h long, captured at 15 fps and at 320 × 480 resolution in an uncontrolled setting; therefore, the videos contain shots with fast motion. Finally, the VSUMM dataset [31] was initially used to produce static video summaries, together with a novel evaluation method able to remove the subjectivity of summary quality assessment by allowing objective comparisons between different approaches. This dataset, also known as the “YouTube dataset”, consists of 50 videos from the Open Video Project (http://www.open-video.org/). The duration of the videos varies from 1 to 4 min, with a total duration of approximately 75 min. The videos originate from a variety of genres, such as documentary, educational, ephemeral, historical, and lecture. There exist 250 user summaries created manually by 50 individuals, each annotating five videos; that is, each video has five summaries created by five different users.
However, in all of the aforementioned cases, the datasets are either not sufficiently large or they cover a wider domain, that is, they do not explicitly consist of user-generated data. Therefore, in this work we also aim to compile a well-defined user-generated dataset for training and evaluating the proposed methodology.