Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed instances because it is difficult to match visual dynamics in videos to textual features in sentences, and a single space is not enough to accommodate the variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We produce the final similarity between instances by fusing the similarities measured in each embedding space with a weighted sum. Because the weights are determined from the query sentence, the framework can flexibly emphasize one embedding space over the others. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved results that are competitive with state-of-the-art methods, and these results demonstrate the effectiveness of the proposed multiple embedding approach compared to existing methods.


I. INTRODUCTION
Video has become an essential source for humans to learn and acquire knowledge. Due to the increased demand for sharing and accumulating information, a massive amount of video is produced every day. However, compared to images, videos usually contain much more semantic information, and thus it is hard for humans to organize them. Therefore, it is critical to develop algorithms that can efficiently perform multimedia analysis and automatically understand the semantic information of videos.
A common approach for video analysis and understanding is to form a joint embedding space of video and sentence using multimodal learning. Just as humans experience the world with multiple senses, the goal of multimodal learning is to develop a model that can simultaneously process multiple modalities, such as visual, text, and audio, in an integrated manner by constructing a joint embedding space. Such models can map various modalities into a shared Euclidean space where distances and directions capture useful semantic relationships. This not only enables us to learn semantic representations of videos by leveraging information from other modalities but also shows great potential in tasks such as cross-modal retrieval and visual recognition.
Recent works in learning an embedding space bridge the gap between sentence and visual information by utilizing advancements in image and language understanding [1], [2]. Most approaches build the embedding space by connecting visual and textual embedding paths. Generally, a visual path uses a Convolutional Neural Network (CNN) to transform visual appearances into a vector. Likewise, a textual path uses a Recurrent Neural Network (RNN) to embed sentences. However, capturing the relationship between video and sentence remains challenging, and recent works struggle to extract the visual dynamics in a video [3]-[6].
In this paper, we propose a novel framework equipped with multiple embedding networks so that we can capture various relationships between video and sentence, leading to more compelling video retrieval. Specifically, one network captures the relationship between the overall appearance of the video and a textual feature. Others consider consecutive appearances or action features. Thus, each network learns its own embedding space. We fuse the similarities measured in the multiple spaces using a weighted sum to produce the final similarity between video and sentence.
The main contribution of this paper is a novel approach to measure the similarity between video and sentence by fusing similarities in multiple embedding spaces. Consequently, we can measure the similarity with multiple understandings and relationships of video and sentence. In addition, the proposed method can be readily extended with further embedding spaces. We conducted video retrieval experiments using query sentences on a standard benchmark dataset and demonstrated an improvement of our approach over existing methods.

Vision and Language Understanding
There have been many efforts in connecting vision and language, which focus on building a joint visual-semantic space [7]. Various applications in the area of computer vision need such a joint space to realize tagging [1], retrieval [8], captioning [8], and visual question answering [9].
Recent works in image-to-text retrieval embed image and sentence into a joint embedding space trained with a ranking loss. A penalty is applied when an incorrect sentence is ranked higher than the correct sentence [10]-[14]. Another popular loss function is the triplet ranking loss [1], [15]-[17]. The VSE++ model improves the triplet ranking loss by focusing on the hardest negative samples (the most similar yet incorrect sample in a mini-batch) [18].

Video and Sentence Embedding
As with image-text retrieval approaches, most video-to-text retrieval methods learn a joint embedding space [19]-[21]. The method in [3] incorporates web images searched with a sentence into the embedding process to take into account fine-grained visual concepts. However, the method treats video frames as a set of unrelated images and averages them to get a final video embedding vector. Thus, it may lead to inefficiency in learning an embedding space since the temporal information of the video is lost.
Mithun et al. [22] tackle this issue by learning two different embedding spaces that consider temporal and appearance features. They also extract audio features from the video for learning the space. This approach achieved strong performance on the sentence-to-video retrieval task. However, it puts equal importance on both embedding spaces. In practice, equal importance does not work well, because there are cases in which one embedding space is more important than the other in capturing the semantic similarity between video and sentence. Therefore, we propose a novel mechanism that emphasizes one space so that we can identify the critical visual cues.

A. Overview
We describe the problem of learning a joint embedding space for video and sentence embedding. Given video and sentence instances, which are sequences of frames and words, the aim is to train embedding functions that map them into a joint embedding space. Formally, we use embedding functions f : X → Z and g : Y → Z, where X and Y are the video and sentence domains, respectively, and Z is a joint embedding space. The similarity s(f(x), g(y)) between a video x ∈ X and a sentence y ∈ Y is calculated in Z by a certain measurement. Therefore, the ultimate goal is to learn f and g satisfying the following inequality: s(f(x_i), g(y_i)) > s(f(x_i), g(y_j)), ∀ i ≠ j. This encourages the similarity to increase for a matching pair x_i and y_i and to decrease for a mismatched pair x_i and y_j.
As illustrated in Fig. 1, the proposed framework has two networks for embedding videos: the global and the sequential visual networks. Each of these networks has a counterpart that embeds the sentence. Thus, we develop multiple visual and textual networks that are responsible for embedding a video and a sentence, respectively. We form one embedding space by merging two networks (one visual and one textual) so that we can align a visual feature to textual information. Specifically, the global visual network aligns the average visual information of the video to the sentence. Likewise, the sequential visual network aligns the temporal context to the sentence. Consequently, the networks receive a video and a sentence as inputs and map them into the joint embedding spaces.
We propose to fuse similarities measured in the embedding spaces to produce the final similarity. One embedding space has the global visual network extracting a visual object feature from the video. Thus, this embedding space describes a relationship between global visual appearances and textual features. The other embedding space captures the relationship between sequential appearances and textual features. The similarity scores of the two embedding spaces are then combined using the weighted sum to put more emphasis on one embedding space than the other space.
Our motivation is twofold. Firstly, since videos and sentences require different attention, we aim to develop a more robust similarity measurement by combining multiple embedding spaces in a weighted manner instead of using a hardcoded average. Secondly, in order to highlight spatial and temporal information of videos according to a sentence, we need to develop a mechanism that utilizes textual information to emphasize spatial and temporal features.

B. Textual Embedding Network
We decompose the sentence into a variable-length sequence of tokens. Then, we transform each token into a 300-dimensional vector representation by using the pre-trained GloVe model [23]. The number of tokens depends on the sentence. Therefore, in order to obtain a fixed-length meaningful representation, we encode the GloVe vectors using a Gated Recurrent Unit (GRU) [24] with an H-dimensional hidden state, resulting in a vector φ(y) ∈ R^H. We set H = 512. This embedded vector φ(y) goes to four processing modules: the global and sequential visual networks, the spatial attention mechanism, and the similarity aggregation. We further transform φ(y) in each processing module.

C. Global Visual Network
The global visual network aims to obtain a vector representation that is a general visual feature over the video frames. We divide the input video frames into N chunks, and one frame is sampled from each chunk by random sampling. We set N = 20. We extract visual features from the sampled frames using the ResNet-152 pre-trained on the ImageNet dataset [25]. Specifically, we resize the sampled frames to 224 × 224, and then the ResNet encodes them, resulting in N 2048-dimensional vectors. Note that we extract the vectors directly from the last fully connected layer of the ResNet. Subsequently, we apply average-pooling to the N vectors to merge them. Consequently, we obtain a feature vector ψ_g(x) ∈ R^2048 containing a global visual feature of the video.
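The sampling and pooling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `sample_frame_indices` and `average_pool` are hypothetical, the equal-width chunking rule is an assumption, and short lists of floats stand in for the 2048-dimensional ResNet-152 features.

```python
import random


def sample_frame_indices(num_frames, num_chunks=20, seed=None):
    """Split frame indices into contiguous chunks and draw one index per chunk."""
    if num_frames < num_chunks:
        raise ValueError("need at least one frame per chunk")
    rng = random.Random(seed)
    # Chunk c covers the half-open index range [bounds[c], bounds[c + 1]).
    # Equal-width chunking is an assumption; the text only says "N chunks".
    bounds = [c * num_frames // num_chunks for c in range(num_chunks + 1)]
    return [rng.randrange(bounds[c], bounds[c + 1]) for c in range(num_chunks)]


def average_pool(vectors):
    """Element-wise mean of equally sized per-frame feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]
```

With a 100-frame video and N = 20, each sampled index falls inside its own 5-frame chunk, and `average_pool` reduces the N per-frame vectors to a single vector of the same dimensionality.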
We learn a joint embedding space of the global visual and textual networks. As defined in Eq. (1) and (2), we use embedding functions to embed ψ_g(x) and φ(y) into a D-dimensional space.
There are learnable parameters W_g ∈ R^{2048×D}, Ŵ_g ∈ R^{H×D}, and b_g, b̂_g ∈ R^D. We set D = 512 in this paper. We use cosine similarity to measure the similarity s_g(x, y) between the video and sentence in the joint embedding space.
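As a rough illustration of how a similarity in one joint space might be computed, the sketch below projects a visual and a textual feature and compares them with cosine similarity. Eq. (1) and (2) are not reproduced in this text, so the affine form W^T v + b used here is an assumption inferred from the stated parameter shapes; the helper names `affine` and `cosine_similarity` are hypothetical.

```python
import math


def affine(weights, bias, vector):
    """Compute W^T v + b, with W stored as len(vector) rows of len(bias) columns.

    The affine form is an assumption; the source only gives parameter shapes.
    """
    return [
        sum(row[d] * value for row, value in zip(weights, vector)) + bias[d]
        for d in range(len(bias))
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def joint_space_similarity(visual, textual, w_vis, b_vis, w_txt, b_txt):
    """Project both features into the joint space and compare them there."""
    return cosine_similarity(
        affine(w_vis, b_vis, visual), affine(w_txt, b_txt, textual)
    )
```

Because cosine similarity ignores vector length, only the direction of each projected feature affects the score.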

D. Sequential Visual Network
Similar to the global visual network, we divide the input video frames into N chunks and take the first frame of each chunk as the input of the sequential visual network. Then, we use the ResNet-152 to extract a 7 × 7 × 2048-dimensional feature map ψ_s(x) from each frame at the last convolution layer of the ResNet. The ψ_s(x) contains visual features in spatial regions. Because the spatial regions that deserve attention may change with the sentence, we need to explore relationships between spatial and textual features. Therefore, we incorporate a spatial attention mechanism into the sequential visual network. We apply the attention mechanism to ψ_s(x) to emphasize spatial regions. Finally, we capture the sequential information of the video using a single-layer Long Short-Term Memory (LSTM) [26] with an H-dimensional hidden state. We denote the vector embedded by the sequential visual network as f_s(x) ∈ R^D.
The sequential visual network uses the LSTM to capture a meaningful sequential representation of the video. However, the LSTM transforms all spatial details of a video into a flat representation, which loses the spatial relationship with the sentence. Therefore, we employ the spatial attention mechanism in order to obtain a spatial relationship between video and sentence. Inspired by the work in [9], we develop the spatial attention mechanism to learn which regions in a frame to attend to for each word in the sentence. The spatial attention mechanism is illustrated in Fig. 2. The mechanism associates the visual feature ψ_s(x) with the textual vector φ(y) to produce a spatial attention map a ∈ R^{7×7}. Then, we combine ψ_s(x) with a to produce the spatially attended feature ψ_a(x). Formally, ψ_a(x) and a are defined in Eq. (5) and (6), respectively.
Specifically, Eq. (4) defines an embedding function of the sequential visual network. γ represents the LSTM, and ⊗ is the element-wise product. The p and q denote intermediate outputs, which are 512-dimensional vectors. We have learn-
Fig. 2. The spatial attention mechanism used in the sequential visual network.
The joint embedding space of the sequential visual and textual networks is formed by f_s(x) and g_s(y). We measure the similarity s_s(x, y) in this joint embedding space. The formulations are described below, where Ŵ_s ∈ R^{H×D} and b̂_s ∈ R^D are learnable parameters.
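The general idea behind the spatial attention step in this section can be sketched as below. This is only an illustrative simplification: Eq. (4) to (6) are not reproduced in this text, so the scoring rule (summing the element-wise product of a region feature and a text feature, then applying a softmax over regions) is an assumption, the learned projections that produce p and q are omitted, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(region_features, text_feature):
    """Weight spatial regions by their agreement with a text feature.

    region_features: one vector per spatial region (49 for a 7 x 7 map),
        assumed to be already projected to the text feature's dimension.
    Returns the attention weights and the attention-weighted region sum.
    """
    # Score each region by summing the element-wise product (an assumption).
    scores = [
        sum(p * q for p, q in zip(region, text_feature))
        for region in region_features
    ]
    attention = softmax(scores)
    attended = [
        sum(a * region[c] for a, region in zip(attention, region_features))
        for c in range(len(text_feature))
    ]
    return attention, attended
```

In this simplified form, regions whose features align with the sentence receive larger weights, so the feature passed on to the LSTM is dominated by those regions.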

E. Similarity Aggregation
There are many approaches to aggregating similarities. Averaging is a straightforward approach that works well in some cases. However, the average may cause unexpected behavior if an inaccurate similarity is considerably high or low. Therefore, we adopt the Mixture of Experts fusion strategy [27] for aggregating similarities with weights that change according to the input sentence. Consequently, we can emphasize one embedding space using the weights for merging the multiple similarities.
We propose to aggregate the similarities measured in multiple embedding spaces so that we can produce the final similarity s(x, y) with various understandings of videos and sentences. We illustrate the proposed similarity aggregation in Fig. 3. Specifically, we merge the similarities using the weight W_m ∈ R^2 generated by considering the textual feature. W_t ∈ R^{D×2} is a learnable parameter. concat() is a function that concatenates given scalar values.
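A sentence-dependent weighted sum of this kind can be sketched as follows. The aggregation equations are not reproduced in this text, so the softmax gating over W_t^T t is an assumption in the spirit of Mixture of Experts fusion, and the names `gating_weights` and `aggregate` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gating_weights(text_feature, w_t):
    """Turn a D-dimensional text feature into M weights that sum to 1.

    w_t is stored as D rows of M columns. Softmax gating is an assumption.
    """
    num_spaces = len(w_t[0])
    logits = [
        sum(row[m] * t for row, t in zip(w_t, text_feature))
        for m in range(num_spaces)
    ]
    return softmax(logits)


def aggregate(similarities, weights):
    """Weighted sum of the per-space similarities."""
    return sum(w * s for w, s in zip(weights, similarities))
```

When every entry of `w_t` is zero the weights are uniform and the result reduces to the plain average; a trained `w_t` would shift the weight toward one space depending on the sentence. The same functions work unchanged for more than two spaces.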

F. Optimization
As described in Section III-A, we optimize the proposed architecture by enforcing that the similarity of a video x_i and its counterpart sentence y_i is greater than the similarity of x_i and any other sentence y_j, and greater than the similarity of any other video x_j and y_i, i.e., s(x_i, y_i) ≥ s(x_i, y_j) and s(x_i, y_i) ≥ s(x_j, y_i). We achieve this by using the triplet ranking loss [28]-[30], where α is a margin.
Given a dataset D = {(x_i, y_i)}_{i=1}^{N} with N pairs, we optimize the following equation by stochastic gradient descent [31], [32].
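For illustration, a bidirectional triplet ranking loss over a mini-batch can be sketched as below. The loss equation is not reproduced in this text, so this follows the commonly used hinge formulation and should be read as an assumption; the margin value 0.2 and the function name are hypothetical. Setting `hardest_only=True` keeps only the hardest negative per direction, in the style of VSE++.

```python
def triplet_ranking_loss(sim, margin=0.2, hardest_only=True):
    """Hinge-based ranking loss for a square similarity matrix.

    sim[i][j] is s(x_i, y_j); matching pairs lie on the diagonal.
    The margin default is a placeholder, not a value from the paper.
    """
    size = len(sim)
    total = 0.0
    for i in range(size):
        positive = sim[i][i]
        # Video i compared against every non-matching sentence.
        sentence_terms = [
            max(0.0, margin - positive + sim[i][j])
            for j in range(size) if j != i
        ]
        # Sentence i compared against every non-matching video.
        video_terms = [
            max(0.0, margin - positive + sim[j][i])
            for j in range(size) if j != i
        ]
        if hardest_only:
            total += max(sentence_terms, default=0.0)
            total += max(video_terms, default=0.0)
        else:
            total += sum(sentence_terms) + sum(video_terms)
    return total
```

The loss is zero once every matching pair beats all mismatched pairs in its row and column by at least the margin.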

IV. EXTENDABILITY
The proposed architecture can be extended with new embedding networks. The steps of the extension are straightforward. We build visual and textual networks and then merge them to form a joint embedding space. In this paper, we add embedding networks using the Inflated 3D Convolutional Neural Network (I3D) model [33] so that the networks can capture video activities. We utilize the pre-trained RGB-I3D model to extract embedding vectors from 16 consecutive frames of video. Consequently, a 1024-dimensional embedding vector f_e(x) is produced for each video.
The joint space is learned by using f_e(x) and the textual embedding vector g_e(y), and the similarity s_e(x, y) is measured in this joint embedding space. We transform φ(y) using Ŵ_e ∈ R^{D×1024} and b̂_e ∈ R^{1024}:

$$g_e(y) = \hat{W}_e\,\phi(y) + \hat{b}_e \qquad (16)$$

$$s_e(x, y) = \frac{f_e(x) \cdot g_e(y)}{\lVert f_e(x) \rVert \, \lVert g_e(y) \rVert}$$

The similarity aggregation is also straightforward to extend. Specifically, we concatenate all similarities and merge them with a weight W_m ∈ R^M, where M represents the number of embedding spaces; M = 3 in this case.
We stress that the extendability is vital since it enables us to incorporate other feature extraction models into the framework quickly. There are abundant approaches to extract features from videos [34]-[43]. Various aspects and approaches are necessary for the understanding of videos and sentences.

V. EXPERIMENTS
We carried out the sentence-to-video retrieval task on the benchmark dataset to evaluate the proposed method. The task is to retrieve the video associated with the query sentence from the test videos. We calculated the similarities between the query sentence and all test videos, and then we ranked the videos by similarity in descending order.
We reported the experimental results using rank-based performance metrics, i.e., Recall@k and median rank. Recall@k is the percentage of queries for which the correct video appears in the top-k retrieved results. In this paper, we set k = 1, 5, 10. Median rank is the median of the ranks of the ground-truth videos. For Recall@k, a larger value indicates better performance. A lower median rank means that the ground-truth items appear closer to the top of the ranking and therefore indicates better retrieval performance.
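The two metrics can be computed from per-query ranks as in the following sketch. The function names are hypothetical, and the tie-handling rule (a tie does not push the ground-truth video down) is an assumption, since the text does not specify one.

```python
from statistics import median


def rank_of_ground_truth(similarities, gt_index):
    """1-based rank of the ground-truth video under descending similarity.

    Ties are resolved in favour of the ground truth (an assumption).
    """
    target = similarities[gt_index]
    return 1 + sum(1 for s in similarities if s > target)


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median of the ground-truth ranks over all queries."""
    return median(ranks)
```

For example, ground-truth ranks of 1, 3, 12, and 7 over four queries give Recall@1 = 25, Recall@5 = 50, Recall@10 = 75, and a median rank of 5.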
Following the convention in sentence-to-video retrieval, we used the Microsoft Research Video to Text dataset (MSR-VTT) [44], which is a large-scale video benchmark for video understanding. The MSR-VTT contains 10000 video clips from YouTube with 41.2 hours in total. The dataset provides videos in 20 categories, e.g., music, people, and sport. Each video is associated with 20 different description sentences. The dataset consists of 6513 videos for training, 497 videos for validation, and 2990 videos for testing.
We evaluated four variants of the proposed method: the single-space, sequential dual-space, I3D dual-space, and triple-space models. Firstly, the single-space model represents the proposed framework composed of the global visual and textual embedding networks. These two networks form a single embedding space, and we measure the final similarity in this space. Secondly, the sequential dual-space model (dual-S) is the proposed framework using two embedding spaces learned by the global and sequential visual networks and their textual embedding networks. We measure the final similarity by merging two similarities, s_g and s_s, as described in Eq. (11). Thirdly, the I3D dual-space model (dual-I) has the global visual and I3D embedding networks. Lastly, the triple-space model adds the I3D and textual embedding networks to the dual-S model. We produce the final similarity by merging s_g, s_s, and s_e with the similarity aggregation.

A. Sentence-to-Video Retrieval Results
We summarized the results of the sentence-to-video retrieval task on the MSR-VTT dataset in Table I. We compared the proposed method to the existing methods [18], [22], [28], [45]. The proposed method obtained 7.1 (dual) at R@1, 21.2 (triple) at R@5, 32.4 (triple) at R@10, and 29 (triple) at MR. These are competitive with [22], [45], which are the state-of-the-art.
TABLE I. THE RESULTS OF SENTENCE-TO-VIDEO RETRIEVAL TASK ON MSR-VTT DATASET. THE BOLD AND UNDERLINED RESULTS REPRESENT THE FIRST- AND THE SECOND-BEST, RESPECTIVELY.
VSE and VSE++ adopt a triplet ranking loss, and VSE++ incorporates hard-negative samples into the loss to facilitate practical training [46]. We adopted this strategy. The results show that VSE++ performs significantly better than VSE at every R@k. Although the single-space model and VSE++ adopted similar loss functions, we found slight improvements in performance. However, the dual-space model achieves much better performance compared to VSE++. The results demonstrate the importance of using the sequential visual information of videos for learning an embedding space.
Multi-Cues [22] calculates two similarities in separate embedding spaces and then averages them to produce a final similarity. The proposed method achieves higher performance than Multi-Cues. The similarity aggregation strategy is the main difference between the proposed method and Multi-Cues. Note that averaging struggles to align videos and sentences because of their variety: some videos need global visual features, and some need sequential visual features. The proposed method has a flexible strategy for merging similarities. The experimental results show that the proposed strategy is more useful for measuring similarity than a naive merging approach that gives equal importance to each embedding space, such as the average.
Cat-Feats [45] embeds videos and sentences by concatenating feature vectors extracted by three embedding modules: CNN, bi-directional GRU, and max pooling. Cat-Feats is slightly better than the proposed method at R@1 and R@5, e.g., 7.7 and 7.1 for Cat-Feats and dual-space at R@1, respectively. In contrast, the proposed method (triple-space) outperforms Cat-Feats at R@10 and median rank, e.g., median ranks of 32 and 29 for Cat-Feats and the triple-space model, respectively. These results imply that the proposed method and Cat-Feats can complement each other. There is a possibility to improve performance by incorporating the feature concatenation mechanism used in Cat-Feats into the proposed method.
Finally, the proposed method with triple-space achieves better results than single- and dual-space on three metrics: R@5, R@10, and median rank. Therefore, the results show that integrating multiple similarities can lead to better and more reliable retrieval performance.

VI. ABLATION STUDY
We carried out ablation studies to evaluate the components of the proposed method. We conducted the following three experiments.

A. Embedding Spaces
We changed the number of embedding spaces and developed the four variants of the proposed architecture. The experimental results are shown in Table II. There are certain improvements from single to multiple spaces on all the metrics. Therefore, we can verify the effectiveness of the proposed method that integrates multiple spaces. Subsequently, we compared the two dual-space models. Dual-S is better than dual-I at R@1 and R@10, whereas dual-I is superior at R@5. Thus, dual-S and dual-I can complement each other. The triple-space model contains the embedding spaces of both dual-space models, and it outperforms the single-space and dual-space models. Therefore, we can confirm the effectiveness of combining multiple embedding spaces, which is the key insight of the proposed method for video and sentence understanding.

B. Spatial Attention Mechanism
We conducted experiments using dual-S with or without the spatial attention mechanism. The dual-S without the attention encodes each frame using ResNet into a 2048-dimensional vector. Then, the LSTM processes the sequence of vectors. Table III shows the results. The dual-S with the attention achieves better results than the dual-S without it on all metrics: R@1, R@5, R@10, and median rank. Therefore, we can confirm that the proposed spatial attention mechanism improves performance.
Besides, we observed that the performance of the dual-S with attention is almost competitive with dual-I. However, dual-S without the attention is worse than dual-I. Considering that the I3D model extracts useful spatial and temporal features for action recognition, this suggests that the dual-S without attention could not extract sequential features effectively, whereas the dual-S with attention obtained such features. We stress that this is further supporting evidence for the effectiveness of the proposed attention mechanism.

C. Similarity Aggregation
We investigated the impact of similarity aggregation using the average or the proposed weighted sum. We used dual-S and dual-I for this experiment. The experimental results are shown in Table IV. The weighted sum brings improvements at R@1, R@10, and MR for both dual-S and dual-I. Therefore, we confirmed the effectiveness of the proposed similarity aggregation module.
Furthermore, we performed an analysis of the weights in the similarity aggregation. As described in Eq. (12), the weights are flexibly determined according to the given sentence. In other words, the weights represent the importance of the embedding spaces. We analyzed the weights to understand this behavior further. For simplicity, we used the dual-S model in this analysis. Therefore, the analysis concerns the importance of global and sequential visual features. We summarized the statistics of the weights in Table V. The statistics show that the proposed method assigns larger weights to the global features.
We show the cumulative step histogram for the global weight in Fig. 4. The ratio reached 0.75 at the weight 0.5. Thus, 75% of the instances received larger weights for the global feature. In contrast, the weights of the sequential feature are larger only in the remaining 25% of the instances. Therefore, the global feature is more critical than the sequential feature, and dual-S relied heavily on global features. Figure 5 shows examples of videos and sentences with the weight assigned to the global feature. Videos containing explicit scenes tend to have larger weights on the global visual feature since objects in the videos are relatively easy to detect. On the other hand, videos containing unclear scenes tend to have larger weights on the sequential visual feature.

VII. CONCLUSION
In this paper, we presented a novel framework for embedding videos and sentences into multiple embedding spaces. The proposed method uses distinct embedding networks to capture various relationships between visual and textual features, such as global appearance, sequential visual, and action features. We produce the final similarity between a video and a sentence by merging the similarities measured in the embedding spaces using a weighted sum. The proposed method can flexibly determine the weights according to a given sentence. Hence, the final similarity can incorporate an essential relationship between video and sentence.
We carried out sentence-to-video retrieval experiments on the MSR-VTT dataset and demonstrated that the performance of the proposed framework improved as the number of embedding spaces increased. The proposed method achieved competitive results compared to the state-of-the-art methods [22], [45]. Furthermore, we verified all the critical components of the proposed method through the ablation experiments. Even though the components are individually useful, their cooperation can generate significant improvements.