Deep Reinforcement Learning for Query-Conditioned Video Summarization

Query-conditioned video summarization requires (1) finding a diverse set of video shots/frames that are representative of the whole video, and (2) ensuring that the selected shots/frames are related to a given query. It can thus be tailored to different user interests, leading to a better personalized summary, and differs from generic video summarization, which only focuses on video content. Our work targets this query-conditioned video summarization task by first proposing a Mapping Network (MapNet) that expresses how related a shot is to a given query. MapNet helps establish the relation between the two modalities (videos and queries), which allows mapping of visual information to the query space. After that, a deep reinforcement learning-based summarization network (SummNet) is developed to provide personalized summaries by integrating relatedness, representativeness and diversity rewards. These rewards jointly guide the agent to select the most representative and diverse video shots that are most related to the user query. Experimental results on a query-conditioned video summarization benchmark demonstrate the effectiveness of our proposed method, indicating the usefulness of both the proposed mapping mechanism and the reinforcement learning approach.


Introduction
It is estimated that by 2021 it will take a person about 5 million years to watch all the videos that are uploaded each month [1]. In order to process the enormous number of videos that emerge every day, tools are required that can significantly reduce the effort of digesting this ever-increasing amount of content. Video summarization helps to address this problem and aims to automatically select a small subset of the frames/shots that captures the most interesting parts of a video in a concise manner [2][3][4][5][6][7]. It thus reduces the time and cost required to analyze video information, provides significant advantages for efficient video browsing, searching and understanding, and further enhances various downstream applications such as video-based question answering [8], robot learning [9] and surveillance data analysis [10,11].
Different from traditional video summarization, which only focuses on video content, query-conditioned video summarization is tasked with generating user-oriented summaries conditioned on a given query in the form of text [4,[12][13][14][15]. As shown in Figure 1, different user queries for the same video will lead to different summary results, so that the summary can be tailored to different user interests, leading to a better personalized summary. Given a user query that reveals a specific user interest, query-conditioned video summarization focuses on predicting summaries that are (1) in close correspondence with the semantic meaning of the user query; and (2) most representative of the original video, producing a minimal and compact set of video frames/shots. Therefore, this task can be viewed as a step towards personalized video summarization [13]. It also enables more accurate evaluation by addressing the issue that generic video summarization is a highly subjective task where users may have different video-content preferences. Moreover, it enables users to more efficiently search video content via text with multiple keywords instead of only searching for video titles [14].
Figure 1. Illustration of the query-conditioned video summarization task. Given a user query, query-conditioned video summarization aims to predict a summary that is relevant to the query in a concise manner. Summary 1 and Summary 2 are two summarization results for the same video but for different user queries.
Queries can be supplied in the form of text and can be either storyline-based [16] or keyword-based. Storyline-based summarization was proposed in order to retrieve video results based on a query that can be represented as a graph. Xiong et al. [16] generate a storyline representation consisting of the following four story elements along a timeline: {Actors, Location, Supporting objects, Events}. However, their method assumes prior knowledge of the list of locations and events in order to train location/event-specific classifiers based on web images, which can be challenging in many situations. Keyword-based summarization requires fewer fine-grained annotations and has been explored in several previous works. In [17,18], the authors investigated an attention-based personalized summarization technique tailored to cultural heritage scenarios by looking for scenes that are similar to web images matching user preferences. However, being tailored to cultural heritage, it may not generalize to wider scenarios. More recently, Sharghi et al. [13] proposed a more general query-conditioned summarization task including datasets and evaluation metrics. They collect a set of concepts containing representative semantic information for a wide range of commonly used terms such as specific objects, people, and fine-grained entities. The annotation for each video shot is either 0 or 1, indicating the absence/presence of the shot in the summary for a given query. Thus, for each query and each video, the ground truth consists of a binary semantic vector. Due to its general applicability to a wide range of scenarios as well as its personalized query-specific evaluation, our proposed algorithm targets this task.
The key challenge for the query-conditioned video summarization task is to identify the relevance between a video and a given query, and simultaneously generate summaries that are of great interest with a minimum number of video shots. To address these issues, we propose a deep reinforcement learning-based approach for query-conditioned video summarization. Reinforcement learning is used in this work to guide our summarization agent using a set of rewards that encode our underlying intuition of what qualities a successful summarization result should have. Deep reinforcement learning approaches [19][20][21] have been extensively used in a variety of computer vision tasks such as object segmentation [22], video captioning [23], action recognition [24], and also generic video summarization [25][26][27][28]. For example, Zhou and Qiao [25] develop a deep reinforcement learning-based summarization network with a diversity-representativeness reward to generate summaries, and achieve good performance on generic video summarization. Inspired by their work, we hypothesize that deep reinforcement learning can also be applied to the query-conditioned video summarization task to provide personalized summary results instead of a single generic summary for all users. We propose to use deep reinforcement learning to solve this sequential decision-making problem by learning a good policy for video shot selection in order to generate good video summaries. A recurrent neural network is used to learn a sequence of shot-selection actions, and a reward is assigned based on the complete set of actions. Reinforcement learning in this scenario allows us to iteratively refine the selection process, allowing the model to take better and better actions.
Specifically, we develop a deep reinforcement learning query-conditioned video summarization network (SummNet) to learn a robust video shot selection policy by jointly modeling relatedness, diversity and representativeness. We introduce a notion of relatedness that expresses how related a video shot is to a given query by proposing a mapping network (MapNet) that maps video shots to the query space. Establishing this relation between the videos and the queries allows us to formalize a relatedness reward. Together with the diversity and representativeness rewards inspired by the work of Zhou and Qiao [25], SummNet is able to provide personalized summary results taking different user interests into consideration. The contributions of our paper include: (1) We propose SummNet, a deep reinforcement learning-based framework for predicting summaries given different user queries. To the best of our knowledge, this paper is the first to use deep reinforcement learning for this task. (2) We introduce MapNet for modeling relatedness to capture the relations between the video shots and the user queries, and jointly encourage the agent to select summary results by computing relatedness, representativeness and diversity rewards. (3) We conduct comprehensive experiments on a benchmark dataset that is particularly designed for query-conditioned video summarization. The proposed method outperforms current state-of-the-art algorithms, which demonstrates its effectiveness.

Video Summarization Using Reinforcement Learning
For generic video summarization, the first work using reinforcement learning was by Masumitsu and Echigo [29], who propose a voting scheme-based method to predict the importance score for each frame and thereby generate the video summary. The voting scheme is obtained from the user's actions of watching (Accept) or skipping (Reject) previous similar frames. They use low-level feature vectors projected into an eigenspace to reduce the influence of correlation between different elements. However, the performance of the approach is restricted due to the limited learning abilities of the feature representations. Instead of hand-crafted feature representations for video frames, recent advances in the field of deep learning have enabled the learning of more robust features directly from data and have thus led to better reinforcement learning-based video summarization performance.
Zhou and Qiao [25] propose the Deep Summarization Network (DSN), an unsupervised video summarization technique that makes use of a diversity-representativeness reward to mimic the way humans summarize videos. The summaries are generated by predicting the probability that a given frame is a key frame and then sampling summary frames based on this probability. Subsequently, they propose another reinforcement learning-based method that relies on video-level category labels to address the problem in a weakly supervised manner [27]. To this end, a summarization network with deep Q-learning (DQSN) is explored that guides the agent using a global recognizability reward based on a companion classification network, which is tasked with computing whether a summary can be recognized. Note that this task differs from the query-conditioned task in that it only requires that the summaries maintain the category label for each video; it is, therefore, still a form of generic summarization and not personalized towards a specific query. Another work [28] mainly targets the challenge of summarizing extremely lengthy videos and implements a reinforcement learning-based approach that relies on SeqDPPs [30] to provide a guiding principle about how to best partition a video sequence into different segments. This differs from prior work, which manually partitions the video into segments of the same length and thereby does not take the video content into account. Lei et al. [26] propose another reinforcement learning-based summarization approach that also dynamically segments videos. The video is first segmented using a trained action classifier, so that each clip contains a single action; then a deep recurrent neural network is applied to select the most distinct frames for each clip.
More recently, Lan et al. [31] propose a framework tasked with fast-forwarding a video to present a representative subset of frames. This can be regarded as another form of video summarization for producing key information. In their work, FastForwardNet (FFNet), a reinforcement learning framework using a Markov decision process, is developed. It automatically fast-forwards a video and decides the number of frames to skip next while presenting the most important subsets to the users, which makes it computationally efficient as not all frames are processed. However, different from query-conditioned video summarization, the model only processes a subset of the video instead of the whole video due to the forwarding prediction.

Query-Conditioned Video Summarization
Query-conditioned video summarization aims to produce different summaries corresponding to different user interests. Oosterhuis et al. [32] present a graph-based approach that incorporates videos and queries in order to construct relevant and visually appealing trailers for video summarization. Each video segment and query is represented by a node; the weights of the edges between video segments and queries are computed based on semantic matching, while visual similarity is used to add edges between nodes corresponding to video segments. A segment is thus regarded as related to a query if it is either semantically similar to the query, or visually similar to other segments that are related to the query. In [14], a quality-aware relevance estimation is proposed that relies on measuring the cosine similarity between embeddings of the frames and the text queries, where the textual representation is obtained by first using the word2vec model [33] and then a Long Short-Term Memory (LSTM) [34] to encode each query into a single fixed-length embedding. Summaries are then obtained by selecting key frames using a linear combination of submodular objectives that jointly consider diversity, representativeness and quality of the visual features, as well as the similarity between the query and the frames. In [17,18], the authors investigate summarization of personalized videos in cultural heritage scenarios with user preferences as input. They propose a 3D Convolutional Neural Network (CNN)-based approach to encode both visually apparent motion features and real motion features, which are measured by GPS sensors. Then visual semantic classifiers, one trained for each user preference, are used to assess the relatedness between the users' preferences and the extracted key shots in the cultural heritage scenario.
The problem has also been addressed by making use of Determinantal Point Processes (DPPs) [35] to model temporal relations. In [12], the authors propose a Sequential and Hierarchical Determinantal Point Process (SH-DPP) to generate summaries consisting of the key shots in the video that are most related to a given query, and study its experimental performance on two densely annotated benchmark datasets. However, their datasets were originally collected for the generic video summarization task, which leads to restrictions when performing comprehensive evaluation due to the differences between the two tasks. In follow-up work, they propose a more comprehensive dataset with an efficient evaluation metric for the query-conditioned summarization task [13]. Instead of focusing on low-level visual features or only temporal overlap, they evaluate similarity between the predicted and ground-truth video shots based on the shots' semantic information, where each video shot is annotated with a list of semantic concepts. Their proposed approach to combining the query and visual information uses a query-conditioned DPP that integrates a memory network for modeling query-related and contextual importance.
More recently, Zhang et al. [36] apply a generative adversarial network (GAN) [37] architecture and propose a three-player query-conditioned adversarial loss (prediction loss, ground-truth loss, and random loss) to force the generator to learn better summary results. Though they achieve state-of-the-art performance, GAN training can often be unstable, making the model difficult to train and the approach highly reliant on parameter choices. This is true even when more stable GAN modules, such as Wasserstein GANs [38], are used, as in their proposed framework. We, therefore, propose to exploit a deep reinforcement learning framework (SummNet) for the query-conditioned video summarization task to achieve more stable and robust performance, by introducing a mapping mechanism (MapNet) to measure relatedness between the given query and the video shots, and then jointly training the agent with additional diversity and representativeness rewards.

Our Approach
Our approach consists of two main parts: (1) MapNet maps the video shots to the user query space in order to provide a link between the two modalities and enable us to express the relatedness reward. (2) SummNet uses the trained MapNet and, given a certain query, predicts importance scores based on a deep reinforcement learning policy that is guided by three rewards. Importance scores in this context refer to the probability that a given video shot is included in the summary. In this section, we first introduce the architecture of MapNet and describe how it can be used to relate the visual information to the text queries. Then we present the details of SummNet and describe how the three rewards work jointly to obtain the summarization results.

Video Embedding
As shown in Figure 2, we evenly segment each video V into 75-frame video shots (5 s long). This corresponds to the procedure proposed in [13] and enables us to compare our method to their collected ground truths and use their evaluation metrics. We denote V = {v_t}_{t=1}^{T}, where T is the total number of video shots in the video. We make use of both 2D and 3D CNNs to extract the visual representation and encode the video shots more robustly. The C3D video descriptor captures shot-level feature representations along both spatial and temporal dimensions, while the ResNet feature extractor is used to obtain frame-level video representations. By integrating both feature vectors, the model can utilize the enhanced feature representation and perform better. We use the ResNet-152 model [39], pretrained on the ILSVRC 2015 dataset [40], to obtain the features after the "fc7" layer, where "fc" denotes a fully connected layer. The 3D CNN feature extractor we apply is C3D trained on the Sports1M dataset [41], with features extracted after the "fc6" layer. The shots are down-sampled to a temporal length of 16 frames, and the features extracted by the 2D and 3D CNNs for these shots are denoted as {f_t^{2d}}_{t=1}^{T} and {f_t^{3d}}_{t=1}^{T}. Note that we average the feature representations of the 16 frames for the 2D CNN extractor to obtain a single feature vector for each shot. We down-sample {f_t^{2d}}_{t=1}^{T} and {f_t^{3d}}_{t=1}^{T} in order to reduce computational complexity, and concatenate them to obtain a joint visual representation, denoted as {f_t^c}_{t=1}^{T}. Afterwards, we feed them into an fc layer to get the encoded visual representation {f_t}_{t=1}^{T} for each video.
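The feature-fusion step above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the toy dimensions and the single-layer linear encoding stand in for the paper's fc layer, and the pretrained ResNet/C3D extractors are assumed to have already produced the per-shot features.

```python
import numpy as np

def encode_video_shots(f2d, f3d, W, b):
    """Fuse per-shot 2D (ResNet) and 3D (C3D) features by concatenation,
    then apply a fully connected layer to get the encoded representation."""
    fc = np.concatenate([f2d, f3d], axis=1)  # joint representation {f^c_t}
    return fc, fc @ W + b                    # encoded {f_t}

# Toy example (the paper down-samples both feature types to 1024-dim).
T, d2, d3, d_out = 4, 1024, 1024, 512
rng = np.random.default_rng(0)
f2d = rng.standard_normal((T, d2))
f3d = rng.standard_normal((T, d3))
W = rng.standard_normal((d2 + d3, d_out)) * 0.01
b = np.zeros(d_out)
fc, f = encode_video_shots(f2d, f3d, W, b)
```

The concatenated vector keeps the frame-level (2D) and shot-level spatio-temporal (3D) information side by side, letting the fc layer learn how to weight the two modalities.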

Query Embedding
We use the Skip-gram model trained on the Google News dataset [33] to encode each word into a feature vector. We sum the feature vectors of the two concepts for each query, and the resulting feature vector is used as the query embedding {f_i^w}_{i=1}^{Q}, where Q is the total number of queries. Note that, following the setting of the dataset proposed by Sharghi et al. [13], each query contains two concepts in order to account for the fact that users often enter a query that contains more than one word. In order to obtain an embedding vector for each video shot, the embeddings of the different user queries corresponding to that shot are averaged to obtain the whole-query embedding {f_t^q}_{t=1}^{T}.
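The two pooling operations above can be written compactly. A minimal sketch, assuming the word2vec vectors have already been looked up (the real vectors are 300-dimensional; the toy vectors below are 2-dimensional):

```python
import numpy as np

def query_embedding(concept_vecs):
    """Sum the word2vec vectors of a query's two concepts -> f^w_i."""
    return np.sum(np.asarray(concept_vecs, dtype=float), axis=0)

def whole_query_embedding(query_embs):
    """Average the embeddings of all queries tied to one shot -> f^q_t."""
    return np.mean(np.asarray(query_embs, dtype=float), axis=0)
```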

Mapping Mechanism
We develop the mapping mechanism to model the mapping between the visual space and the query space. We take the visual representation {f_t}_{t=1}^{T} as input and use three fc embedding layers to predict the query embedding {ḟ_t^q}_{t=1}^{T} for the video shots. Between the fc layers we utilize leaky ReLUs to improve the nonlinearity learning ability of the model, and dropout to avoid overfitting. MapNet, illustrated in Figure 2, aims to predict the query embedding given a video, so that {ḟ_t^q}_{t=1}^{T} is as close to the ground-truth query embedding {f_t^q}_{t=1}^{T} as possible. The loss L between {ḟ_t^q}_{t=1}^{T} and {f_t^q}_{t=1}^{T} is computed using the mean squared error:

L = (1/T) Σ_{t=1}^{T} ||ḟ_t^q − f_t^q||_2^2,

where {f_t^q}_{t=1}^{T} is the combined whole-query embedding obtained by averaging over all queries for each video shot.
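The forward pass and loss can be sketched as below. This is an illustrative NumPy version, not the training code: the layer weights are assumed to be given, and dropout (a training-time regularizer) is omitted.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """Leaky ReLU with the paper's slope of 0.2."""
    return np.where(x > 0, x, slope * x)

def mapnet_forward(f, layers):
    """MapNet sketch: fc layers with leaky ReLUs in between, mapping the
    visual representation {f_t} to the 300-dim query-embedding space."""
    h = f
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:  # no activation after the last fc layer
            h = leaky_relu(h)
    return h

def mapnet_loss(pred_q, gt_q):
    """Mean squared error between predicted and ground-truth embeddings."""
    return float(np.mean((pred_q - gt_q) ** 2))
```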

Summarization Network
As shown in Figure 3, SummNet takes both the video shots and the user query as input and first encodes them into a joint embedding space. Then the importance score for each video shot is predicted, with a higher score implying a larger probability of a shot being selected for the summary. Here we utilize the same video embedding vectors {f_t^c}_{t=1}^{T} and the word embedding f_i^w (the embedding for each query instead of the whole-query embedding) as illustrated in Section 3.1. Each time we take the query embedding of a single query as input, so here we use f_i^w instead of {f_i^w}_{i=1}^{Q} as the notation.

Importance Scores Prediction
We transform the visual information {f_t^c}_{t=1}^{T} and the query information f_i^w using one fc layer each and then concatenate the feature representations to generate the joint video-query embedding. Note that, before the concatenation, the query feature vector is repeated to match the number of shots in the batch in order to fuse the visual information at each time step with the query information. A Bidirectional LSTM (Bi-LSTM) [42] module is then applied to capture long-term temporal dependencies in the video and model both past and future temporal relations. After that, the output is transformed by two more fc layers, with a dropout layer and a batch normalization layer in between. A sigmoid activation is used in the end to predict the importance score {s_t}_{t=1}^{T} for each video shot, which corresponds to the probabilities that are used to generate the video summaries.
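The query-repetition and scoring steps can be sketched as follows. This is a simplified illustration, not the full SummNet head: the Bi-LSTM, dropout and batch-normalization layers are omitted, and the final fc + sigmoid is applied directly to the joint embedding.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def importance_scores(video_emb, query_emb, W, b):
    """Repeat the query embedding to match the number of shots, concatenate
    with the per-shot visual embedding, and apply fc + sigmoid to obtain
    per-shot importance scores in (0, 1)."""
    T = video_emb.shape[0]
    q = np.tile(query_emb, (T, 1))                   # repeat query per shot
    joint = np.concatenate([video_emb, q], axis=1)   # joint video-query embedding
    return sigmoid(joint @ W + b).ravel()            # scores {s_t}
```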

Video Summary Prediction
During the training phase, the three rewards are used to guide the agent to select a good set of video shots as a summary. The video shot selection is performed using a Bernoulli distribution, which is defined as:

p(a_t | π(s_t; θ)) = s_t^{a_t} (1 − s_t)^{1 − a_t}, i.e., a_t ∼ Bernoulli(s_t),

where p(·) is the probability of taking the action a_t given the score s_t under the learned policy π with parameters θ. The action a_t ∈ {0, 1} indicates whether the current video shot is selected.
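Sampling actions from the Bernoulli policy and scoring them can be sketched as below (illustrative; in practice the log-probabilities would come from an autodiff framework such as PyTorch so their gradients are available):

```python
import numpy as np

def sample_actions(scores, rng):
    """Sample shot-selection actions a_t ~ Bernoulli(s_t)."""
    return (rng.random(len(scores)) < scores).astype(int)

def log_prob_of_actions(scores, actions):
    """log pi(a | s) = sum_t [a_t log s_t + (1 - a_t) log(1 - s_t)]."""
    scores, actions = np.asarray(scores), np.asarray(actions)
    return float(np.sum(actions * np.log(scores)
                        + (1 - actions) * np.log(1 - scores)))
```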
Figure 3. The architecture of our proposed Summarization Network (SummNet). Given a user query, it takes both the visual embedding {f_t^c}_{t=1}^{T} and the query embedding f_i^w as input, processes each using an fc layer and then concatenates both representations to generate the joint video-query embedding. A Bi-LSTM module is then used to model long temporal relations, and the output is passed to two fc layers, with a dropout layer and a batch normalization layer in between. Afterwards, a sigmoid activation is used to predict the importance scores {s_t}_{t=1}^{T} for the video shots. Based on the predicted importance scores, the reinforcement learning-based loss is used to guide the agent to take good actions. A policy π is learned based on the following three rewards to perform the right actions {a_t}_{t=1}^{T} in order to select the summary shots: (1) The relatedness reward R_rel uses MapNet and measures the distances between the predicted and ground-truth query embeddings. (2) The diversity reward R_div reduces redundancies in the generated summaries. (3) The representativeness reward R_rep aims to provide the most representative video shots of the original video as the summary.

Relatedness Reward Using MapNet
We take the video embedding after the concatenation layer, {f_t^c}_{t=1}^{T}, as illustrated in Section 3.1, and feed it into the pretrained MapNet. MapNet maps each shot to the query embedding space, producing {ḟ_t^q}_{t=1}^{T}. These feature vectors are used to compute the relatedness reward in our proposed framework by considering their distance to the input query. We propose that with a larger relatedness reward (smaller distance between the input query embedding and the predictions), the selected video shots are more relevant to the query, making them stronger candidates for the predicted summary. We therefore compute the average squared difference between the mapping results produced by MapNet and the current user query for the selected shots as the relatedness reward, and use the exponential function to ensure a positive reward. The relatedness reward is thus defined as:

R_rel = exp( − (1/|M|) Σ_{t∈M} ||ḟ_t^q − f^q||_2^2 ),

where f^q denotes the query embedding for the current user query, and M denotes the set of selected video shots in the video.
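The reward can be computed directly from MapNet's outputs. A minimal sketch, assuming `pred_query_embs` holds MapNet's per-shot predictions and `selected` is the list of indices chosen by the agent:

```python
import numpy as np

def relatedness_reward(pred_query_embs, query_emb, selected):
    """exp of the negative mean squared distance between the predicted
    query embeddings of the selected shots and the current query."""
    d = np.mean((pred_query_embs[selected] - query_emb) ** 2)
    return float(np.exp(-d))
```

Because of the exponential, the reward lies in (0, 1] and reaches 1 only when the selected shots map exactly onto the query embedding.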

Diversity and Representativeness Rewards
We apply the frame-level diversity and representativeness rewards designed by Zhou and Qiao [25] to our shot-level summarization task due to their effectiveness in measuring the quality of generated summaries in visual space. Ideally, the predicted summaries should be the most representative of the original video content and diverse in the sense that summaries should be compact and not include redundancies. The diversity reward is based on a pairwise dissimilarity measure computed between every two video shots, defined as:

d(f_t^c, f_{t'}^c) = 1 − (f_t^c · f_{t'}^c) / (||f_t^c|| ||f_{t'}^c||),     (4)

where f_t^c and f_{t'}^c are the video embeddings of different video shots. We follow [25] by restricting the number of close neighbor shots that are considered, and set d(f_t^c, f_{t'}^c) = 1 if |t − t'| > λ. Restricting the number of neighbors is required to ensure that similarity between two video shots that are far from each other along the temporal dimension of the video is not penalized, as this would cause problems especially for long video sequences [30]. Here, λ is a hyperparameter used to control the restriction. The diversity reward for the selected video shots M in the video can thus be computed based on the dissimilarity function (4):

R_div = (1 / (|M|(|M| − 1))) Σ_{t∈M} Σ_{t'∈M, t'≠t} d(f_t^c, f_{t'}^c).

The representativeness reward is computed using the k-medoids algorithm, as proposed in Gygli et al. [43], by first selecting a set of medoids (the selected video shots) and then measuring the distances between all video shot embeddings and their nearest medoids. It aims to minimize the mean squared error of those distances as follows:

R_rep = exp( − (1/T) Σ_{t=1}^{T} min_{t'∈M} (1/D) ||f_t^c − f_{t'}^c||_2^2 ),

where f_{t'}^c denotes a video shot selected for the summary, each video shot f_t^c in the video is compared to all selected shots in M to compute the minimal distance, and D is the feature dimension. We average along the feature dimension in the exponent to avoid the representativeness reward being consistently very small for high-dimensional feature vectors.
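Both rewards can be sketched directly from the definitions above. This is an illustrative NumPy version with the equations transcribed literally, not an optimized implementation:

```python
import numpy as np

def diversity_reward(feats, selected, lam=20):
    """Mean pairwise cosine dissimilarity among the selected shots; pairs
    more than lam shots apart in time count as maximally dissimilar (d=1)."""
    total, n = 0.0, len(selected)
    for i in selected:
        for j in selected:
            if i == j:
                continue
            if abs(i - j) > lam:
                total += 1.0
            else:
                cos = feats[i] @ feats[j] / (
                    np.linalg.norm(feats[i]) * np.linalg.norm(feats[j]))
                total += 1.0 - cos
    return total / (n * (n - 1))

def representativeness_reward(feats, selected):
    """exp of the negative mean squared distance (averaged over the feature
    dimension D) between each shot and its nearest selected shot (medoid)."""
    D = feats.shape[1]
    dists = [min(np.sum((f - feats[j]) ** 2) / D for j in selected)
             for f in feats]
    return float(np.exp(-np.mean(dists)))
```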

Policy Gradient Descent
Our goal is to train a summarization agent that maximizes the rewards under the policy π with parameters θ. The expected reward J(θ) can be defined as:

J(θ) = E_{p(a|π(s;θ))} [ R_rel + R_div + R_rep ],

where p(a|π(s; θ)) denotes the probability distribution over action sequences. We follow [25] to compute the derivative of the objective function and approximate the gradient by taking the average over N repeated episodes for each video, while subtracting a constant baseline b that is computed as the moving average of the previous rewards. Introducing the constant baseline and averaging over a number of repeated episodes helps lower the variance of the gradients and thereby allows the network to converge faster to a good solution. The derivative of the objective function J(θ) is formalized as:

∇_θ J(θ) ≈ (1/N) Σ_{j=1}^{N} Σ_{t=1}^{T} (R_rel^j + R_div^j + R_rep^j − b) ∇_θ log π(a_t | s_t; θ),

where R_rel^j, R_div^j and R_rep^j are the three rewards of the j-th of the N episodes. To optimize θ, we further introduce a ground-truth loss term L_summ to allow the network to predict better summary results through a direct supervised feedback signal. Here we use the output {s_t}_{t=1}^{T} from Section 3.2 and compare it to the ground-truth summary {s_t^g}_{t=1}^{T} (s_t^g ∈ {0, 1}) by computing the mean squared error:

L_summ = (1/T) Σ_{t=1}^{T} (s_t − s_t^g)^2.

Thus θ can be optimized with the learning rate α as:

θ ← θ + α (∇_θ J(θ) − ∇_θ L_summ).
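The gradient estimate and baseline update above can be sketched as follows. An illustrative scalar version, assuming the per-episode gradients of log π have already been computed by an autodiff framework (here they are plain numbers for clarity):

```python
def reinforce_gradient(grad_log_probs, episode_rewards, baseline):
    """REINFORCE estimate: average over N episodes of (R_j - b) times the
    summed gradient of log pi over that episode's actions."""
    N = len(episode_rewards)
    return sum((R - baseline) * g
               for g, R in zip(grad_log_probs, episode_rewards)) / N

def update_baseline(b, reward, momentum=0.9):
    """Moving average: 90% previous accumulated reward, 10% current reward
    (the weighting used in the implementation details)."""
    return momentum * b + (1 - momentum) * reward
```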

Video Summarization Inference
As shown in Figure 4, the inference phase is independent of MapNet as well as of the rewards. Given a video and a certain user query, the predicted scores {s_t}_{t=1}^{T} are generated as outlined in the Importance Scores Prediction and Video Summary Prediction paragraphs in Section 3.2. A threshold φ is then applied to these raw scores in order to obtain the binarized summary prediction.
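The binarization step is a simple threshold over the raw scores (illustrative; φ itself is found by grid search as described in the implementation details):

```python
def binarize_summary(scores, phi):
    """Shot t enters the summary iff its score s_t >= phi."""
    return [1 if s >= phi else 0 for s in scores]
```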

Videos
The proposed framework is evaluated on the query-conditioned video summarization dataset proposed by Sharghi et al. [13]. The dataset contains dense per-video-shot concept annotations, which are used as semantic descriptions in order to evaluate the method. It contains four 3∼5 h long videos that capture everyday life in uncontrolled settings. The video content contains noise due to changes in camera viewpoint, illumination changes and motion blur [44], which makes predicting video summaries a challenging task. The queries are selected from a dictionary of 48 concepts constructed by Sharghi et al. [13] that are comprehensive for the videos. Each query consists of two concepts in order to account for the fact that users often enter a query that contains more than one word. In total, 46 unique queries, each containing two concepts, are formed, including one empty query, which aims to learn generic video summarization that only depends on visual features. Following previous works, we use two videos for training, one for validation, and the remaining one for testing.

Evaluation Metrics
We follow the protocol developed by Sharghi et al. [13] to compare our framework to previous approaches. In order to match the ground-truth and predicted summaries, it uses maximum-weight matching of a bipartite graph, where the predicted and ground-truth summaries form the two opposing sides, and the edge weights are computed by the Intersection over Union (IoU). Here the IoU for the matched shots is measured semantically based on the overlapping concepts at the shot level. The concepts considered for each shot are the aggregated concepts over all queries in the ground truth. If one shot is labelled with a set of concepts denoted as A, and another shot's label set is B, the IoU between the two shots can be defined as:

IoU(A, B) = (number of overlapping concepts between A and B) / (total number of concepts in A and B).
For example, if one shot is labelled as {Car, Street} and another as {Street, Tree, Sign}, the IoU for the two shots is 1/4 = 0.25 [13]. Based on the number of matched video shots corresponding to the maximum weights, Precision, Recall and F-measure scores are computed to evaluate the performance of the predicted summaries. In this way, the evaluation metric focuses on higher-level semantic information instead of lower-level visual features or temporal overlaps, which is a key requirement for measuring the quality of video summaries [45].
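The semantic IoU between two shots reduces to a set operation on their concept labels, reproducing the worked example above:

```python
def shot_iou(a, b):
    """Semantic IoU between two shots' concept sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)
```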

Implementation Details
We implement our framework using PyTorch [46] on a Tesla V100 GPU. We down-sample both the 2D and 3D visual features to 1024 dimensions. In MapNet, the three fc layers transform the learned features to 1024, 512 and 300 dimensions, and a dropout rate of 0.6 and a leaky ReLU with a slope of 0.2 are used. Further, Adam [47] is used for the optimization of MapNet. In SummNet, the outputs of the fc layers for the visual and text features are 2048- and 300-dimensional, respectively. The Bi-LSTM module we use has one layer, with 1024 hidden units for each direction, followed by another fc layer to transform the features to 128 dimensions. After that, a dropout layer is used with a drop rate of 0.2. In the end, the last fc layer embeds the features from 128 dimensions to 1 dimension in order to predict the importance score. We use a zero vector for the scenario where none of the video shots are related to the two concepts of the given query. Following the example of Zhou and Qiao [25], we set the temporal distance λ equal to 20, and the number of episodes N equal to 5. The constant baseline b is computed as a weighted moving average, where the previously accumulated rewards account for 90% and the current reward for 10%. We use Adam for optimizing SummNet, with a weight decay of 0.00001, and clip gradients with a norm larger than 5. The learning rate α, the step and the ratio for learning rate decay are optimized via cross-validation. The threshold φ is set experimentally by performing a grid search for the parameter that yields the best binary summarization results. The training times for MapNet and SummNet are on average 2.99 ms and 408.98 ms per batch, respectively. The overall training time of the model is around 3 h. During the inference phase, given a query, it takes on average 23.78 ms to obtain the prediction. This demonstrates the efficiency of our proposed method, which allows for short training and inference times.

Comparisons to the State-of-the-Art Methods
We compare our method to four previous methods that have been evaluated on this query-conditioned video summarization benchmark. The results are shown in Table 1. It can be observed that our method outperforms the previous four approaches on all datasets, and especially improves performance on the fourth video by a large margin. The first three methods, SeqDPP [30], SH-DPP [12] and QC-DPP [13], are all based on a sequential DPP, a probabilistic model which defines a probability distribution over subset selections that integrates both individual importance and collective diversity. However, DPP-based models may suffer from an exposure bias problem, as pointed out in Sharghi et al. [4]: during training the model is only exposed to the training data distribution and not to its own predictions, unlike during the inference phase. While the GAN architecture of the state-of-the-art adversarial learning method allows the learning of good summaries [36], training of GANs can be unstable, making the approach more reliant on particular parameter choices. Our approach, instead, is more stable and robust, and outperforms QueryGAN consistently on all videos. On average, it achieves an improvement of 1.15% in terms of F-measure, which demonstrates that our proposed deep reinforcement learning model, together with the proposed mapping mechanism, facilitates learning a model that predicts more accurate summaries by jointly considering relatedness, representativeness and diversity.

We analyze the effect of the mapping mechanism by dropping the MapNet network from our method and comparing the performance to that of the full model. Note that MapNet is used to express the relatedness reward, so dropping MapNet removes the relatedness reward from our model. The W/O MapNet model is therefore trained with only two rewards: the representativeness and diversity rewards. The results of this comparison are shown in Table 2. We can see that the combined proposed model outperforms the model without MapNet (W/O MapNet) on all four videos, and that the average performance decreases by 3.29% when removing the mapping mechanism of MapNet. This indicates that the mapping mechanism allows the model to incorporate the connection between the visual features and the text features by learning a transformation between the two modalities. It also shows that the model is able to exploit this transformation to provide a good reward signal for the summarization model in order to learn query-conditioned video summaries. Although the ground-truth labels are included in the summarization model as part of the supervised loss, and should enable the model to learn the relation between the visual and the query information, the proposed relatedness reward of MapNet further enhances the robustness of the results.

The Effect of Ground-Truth Regularization
We train the model without the ground-truth loss term (W/O L_summ) to evaluate the effect of including the ground-truth labels. The results are shown in Table 2; the performance on the four videos decreases by 4.0%∼8.5%. From these results we conclude that the ground-truth loss L_summ makes a large contribution to the learning process, which makes sense, as the supervised feedback through the ground-truth labels provides extensive information about what the "real" summaries should look like. In particular, the link between the visual information and the query is weakened considerably when removing the ground-truth loss, as it would then only be supplied indirectly through the MapNet reward. This can also be seen by comparing the model without the ground-truth loss (W/O L_summ) to the model trained without MapNet (W/O MapNet), where the average performance decreases considerably from 43.91% to 40.86% in terms of F-measure. This illustrates that both the ground-truth loss and the relatedness reward contribute to the overall performance; however, the ground-truth labels, which provide direct supervised feedback, contribute more.
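The interaction between the supervised term and the reinforcement learning term can be sketched as one training step that combines an MSE loss against the ground-truth scores with a REINFORCE term, using the weighted moving-average baseline (90%/10%) described in the implementation details. The function name, tensor shapes and the unit weighting of the two terms are assumptions, not the authors' exact formulation:

```python
import torch

def train_step(scores, gt_scores, log_probs, reward, baseline):
    """Sketched SummNet objective: supervised MSE (L_summ) plus a REINFORCE term
    that increases the log-probability of sampled shot selections whose total
    reward R exceeds a moving-average baseline b; returns the loss and the
    updated baseline (b <- 0.9*b + 0.1*R)."""
    l_summ = torch.nn.functional.mse_loss(scores, gt_scores)
    # REINFORCE: minimize -(R - b) * sum(log pi); (R - b) is treated as a constant
    l_rl = -(float(reward) - baseline) * log_probs.sum()
    new_baseline = 0.9 * baseline + 0.1 * float(reward)
    return l_summ + l_rl, new_baseline
```

In practice the REINFORCE term would be averaged over the N = 5 sampled episodes per video mentioned above; the sketch shows a single episode for brevity.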

The Effect of Rewards
We conduct another ablation experiment by dropping all three rewards (W/O rewards) to analyze the effect of the joint relatedness, representativeness and diversity rewards. The results are shown in Table 3. We observe that the F-measure drops by 5.20%, from 47.20% to 42.00%. This indicates that the model learns a better summary by making use of the three rewards. After dropping the three rewards, the model is trained independently of both MapNet and reinforcement learning; that is, it is trained only with the mean squared error between the ground-truths and the predictions in a supervised manner. The results demonstrate that while a model trained only with a supervised loss is able to predict some good summaries, reinforcement learning further enables the model to predict higher-quality summaries by large margins.

Table 3. We compare the proposed model to two other models, one that does not make use of the three rewards, and one that does not make use of the representativeness+diversity rewards, but that are otherwise identical. Note that we repeat the results of the combined full model in this table to provide a more convenient comparison. The best results (F-measure) are highlighted in bold.

The Effect of Representativeness+Diversity Rewards
We further compare our model to a model that does not make use of the representativeness and diversity rewards (W/O rep+div rewards); the results are shown in Table 3. The representativeness and diversity rewards were originally designed for the traditional generic video summarization task, which does not take user queries into account. However, it can be observed that the two rewards can also be utilized in this query-conditioned task. The performance decreases by 2.57% on average after dropping the two rewards, with decreases of around 1.0%∼2.5% for the first three videos and 4.66% for the last one. This indicates that modeling the representativeness and diversity of the summaries, in addition to the relatedness reward that establishes the relation between the videos and the query, helps to produce a minimal and concise set of video shots.
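Assuming the representativeness and diversity rewards follow the formulation of Zhou et al. [25] (which the implementation details suggest, with λ = 20), they can be sketched in NumPy as follows; `x` is a hypothetical (T, d) matrix of shot features and `picks` the indices of selected shots:

```python
import numpy as np

def diversity_reward(x, picks, lam=20):
    """Mean pairwise dissimilarity (1 - cosine similarity) among selected shots;
    pairs more than lam shots apart are treated as maximally dissimilar,
    following the temporal-distance restriction with lambda = 20."""
    if len(picks) < 2:
        return 0.0
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    total = 0.0
    for i in picks:
        for j in picks:
            if i == j:
                continue
            total += 1.0 if abs(i - j) > lam else 1.0 - xn[i] @ xn[j]
    return total / (len(picks) * (len(picks) - 1))

def representativeness_reward(x, picks):
    """exp(-mean over all shots of the distance to the nearest selected shot):
    high when every shot is close to some selected shot."""
    d = np.linalg.norm(x[:, None, :] - x[None, picks, :], axis=2).min(axis=1)
    return float(np.exp(-d.mean()))
```

For instance, selecting every shot makes every nearest-selected distance zero, so the representativeness reward reaches its maximum of 1; selecting two identical shots drives the diversity reward to zero.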

Qualitative Results
We provide two qualitative results of our framework for Video1. In Figure 5, the ground-truths and the predictions for the user queries {Drink, Food} and {Hat, Phone} are presented. Figure 5a,d illustrate several video shot samples of Video1 for the two user queries, and Figure 5b,c,e,f present the video shots selected for the summary by the ground-truths (blue lines) and the predictions (green lines). Note that the evaluation metric was proposed in [13] and that we use the matched pairs in the bipartite graph based on the semantic user annotations, instead of low-level visual features. The similarity between two shots is defined by the IoU, which is computed based on the overlap of the corresponding concepts of the two shots. It aims to match semantic concepts in the video instead of unique video shots, in order to provide an evaluation metric that is more robust to subjectiveness and to highly similar video shots in the video. Thus the matched pairs do not necessarily map from the predictions to the ground-truths in a tight and strict sequential order.

Conclusions
In this paper, we propose a deep reinforcement learning framework for query-conditioned video summarization, which applies relatedness, diversity and representativeness rewards to guide the agent in learning a video shot selection policy. To measure the relatedness, we design a MapNet that maps video shots from the visual space to the query space, and based on this mapping we design a reward that encourages the agent to select the video shots most related to a given query. Combined with additional rewards that encourage diversity and representativeness of the video shots in the summary, our reinforcement learning-based approach enables us to learn the video shot selection policy. Comprehensive experiments demonstrate the effectiveness of our approach, which outperforms the current state-of-the-art algorithm on all videos. In future work, we plan to further explore the link between the visual features and the text features in order to improve the modelling of the relatedness, which is a key challenge in the query-conditioned video summarization task.

Figure 2 .
Figure 2. The architecture of our proposed Mapping Network (MapNet) has two branches. (1) Video shots are encoded using a ResNet and a C3D model to generate 2D/3D visual features {f_t^2d}_{t=1}^T and {f_t^3d}_{t=1}^T. After down-sampling, the visual features are fused through concatenation ({f_t^c}_{t=1}^T), followed by a fc layer to obtain {f_t}_{t=1}^T as the video embedding. Then three groups of fc layers, dropout layers and leaky ReLUs are used to map the video shot embeddings to the predicted query embeddings {ḟ_t^q}_{t=1}^T. (2) User queries, each consisting of two concepts, are encoded via a Skip-gram model to generate the query embeddings {f_{w_i}}. The combined whole-query embedding {f

Figure 4 .
Figure 4. The pipeline for the inference phase. Given a video and a certain user query, the predicted scores are generated using SummNet, without MapNet and the reinforcement learning rewards. A threshold φ is applied to the raw scores to obtain the binarized summary prediction.

Figure 5 .
Figure 5. Visualization results of our proposed method given the queries {Drink, Food} and {Hat, Phone} for Video1. (a,d) show several video shot samples. (b,e) and (c,f) show the ground-truths and the predicted summaries, respectively. The x-axis is the shot number, and the blue/green lines represent the selected video shots in the summary.

Table 2 .
We compare the proposed model to two models, one that does not make use of MapNet, and one that does not make use of the ground-truth loss L_summ, but that are otherwise identical. The best results (F-measure) are highlighted in bold.