The VISIONE Video Search System: Exploiting Off-the-Shelf Text Search Engines for Large-Scale Video Retrieval

This paper describes in detail VISIONE, a video search system that allows users to search for videos using textual keywords, the occurrence of objects and their spatial relationships, the occurrence of colors and their spatial relationships, and image similarity. These modalities can be combined to express complex queries and meet users’ needs. The peculiarity of our approach is that we encode all information extracted from the keyframes, such as visual deep features, tags, color and object locations, using a convenient textual encoding that is indexed in a single text retrieval engine. This offers great flexibility when results corresponding to various parts of the query (visual, text and locations) need to be merged. In addition, we report an extensive analysis of the retrieval performance of the system, using the query logs generated during the Video Browser Showdown (VBS) 2019 competition. This allowed us to fine-tune the system by choosing the optimal parameters and strategies from those we tested.


Introduction
With the pervasive use of digital cameras and social media platforms, we witness a massive daily production of multimedia content, especially videos and photos. This phenomenon poses several challenges for the management and search of visual archives. On the one hand, the use of content-based retrieval systems and automatic data analysis is crucial to deal with visual data that are typically poorly annotated (think, for example, of user-generated content). On the other hand, there is an increasing need for scalable systems and algorithms to manage ever larger data collections.

In this work, we present a video search system, called VISIONE, which provides users with various functionalities to easily search for target videos. It relies on artificial intelligence techniques to automatically analyze and annotate visual content, and it employs an efficient and scalable search engine to index and search the video data. A demo of VISIONE running on the VBS V3C1 dataset, described in the following, is publicly available at http://visione.isti.cnr.it/.
A first version of VISIONE [1] was presented at the Video Browser Showdown (VBS) 2019 challenge [2]. VBS is an international video search competition [3,4,2] that evaluates the performance of interactive video retrieval systems. Held annually since 2012, it is becoming increasingly challenging as its video archive grows and new query tasks are introduced in the competition. The V3C1 dataset [5], used in the competition since 2019, consists of 7475 videos gathered from the web, for a total of about 1000 hours. The competition search tasks are: Known-Item Search (KIS), textual KIS, and Ad-hoc Video Search (AVS). Figure 1 gives an example of each task. The KIS task models the situation in which someone wants to find a particular video clip that is visually presented, assuming that it is contained in a specific collection of data. The textual KIS is a variation of the KIS task, where the target video clip is no longer visually presented to the participants of the challenge but is rather described in detail by a text. This task simulates situations in which a user wants to find a particular video clip, without having seen it before, but knowing its content exactly. For the AVS task, instead, a general textual description is provided and participants need to find as many correct examples as possible, i.e., video shots that fit the given description.
VISIONE can be used to solve both Known-Item and Ad-hoc Video Search tasks. It integrates several content-based analysis and retrieval modules, including keyword search, spatial object-based search, spatial color-based search, and visual similarity search. The main novelty of our system is that we specifically designed a textual encoding to be used for indexing and searching video content. This aspect of our system is crucial: we can exploit the latest text search engine technologies, which nowadays are characterized by high efficiency and scalability, without the need to define dedicated data structures or worry about implementation issues such as software maintenance, updates to new hardware technologies, etc.
The main contribution of this work is to show how all these search functionalities can be implemented and integrated into a unified framework that relies on an off-the-shelf full-text search engine, such as Apache Lucene. Moreover, for the proposed textual encoding, we experimentally studied which text scoring function (ranker) is best suited for the video search task.
The rest of the paper is organized as follows. The next section reviews related work. Section 3 gives an overview of our system and its functionalities. Key notions on our proposed textual encoding and other aspects regarding the indexing and search phases are presented in Section 4. Section 5 presents an experimental evaluation to determine which text scoring function (ranker) is the best in the context of a Known-Item Search task. Section 6 draws the conclusions.

Related Work
Video search is a challenging problem of great interest in the multimedia retrieval community. It employs various information retrieval and extraction techniques, such as content-based image and text retrieval, computer vision, speech and sound recognition, and so on. The Video Browser Showdown (VBS) contest provides a live and fair performance assessment of video retrieval systems and in recent years has therefore become a reference point for comparing state-of-the-art video search tools. During the competition, the participants have to perform various KIS and AVS tasks in a limited amount of time (generally within 5-8 minutes for each task). To evaluate the interactive search performance of each video retrieval system, several search sessions are performed involving both expert and novice users.
Several video retrieval systems have participated in the VBS in recent years [4,2,6,7]. Most of them, including our system, support multimodal search with interactive query formulation. The various systems differ mainly in (i) the search functionalities supported (e.g., query-by-keyword, query-by-example, query-by-sketch, etc.), (ii) the data indexing and search mechanisms used at the core of the system, (iii) the techniques employed during video preprocessing to automatically annotate selected keyframes and extract image features, and (iv) the functionalities integrated into the user interface, including advanced visualization and relevance feedback. Among all the systems that participated in VBS, we recall VIRET [8], vitrivr [9], and SOM-Hunter [10], which won the competition in 2018, 2019, and 2020, respectively.
VIRET [8,11] is an interactive frame-based video retrieval system that provides three main retrieval modules (query by keyword, query by color sketch, and query by example). The keyword search relies on automatic annotation of video keyframes. In the latest versions of the system, the annotation is performed using a retrained deep CNN (NasNet [12]) with a custom set of 1243 class labels. A retrained NasNet is also used to extract deep features of the images, which are then employed for similarity search. An interesting functionality supported by VIRET is the temporal sequence search, which allows a user to describe more than one frame of a target video sequence by also specifying the expected temporal ordering of the searched frames.
Vitrivr [13] is an open-source multimedia retrieval system that supports content-based retrieval of several media types (images, audio, 3D data, and video). For video retrieval, it offers different query modes, including query by sketch (both visual and semantic), query by keywords (concept labels), object instance search, speech transcription search, and similarity search. For the query by sketch and query by example, vitrivr uses several low-level image features and a DNN pixel-wise semantic annotator [14]. The textual search is based on scene-wise descriptions, structured metadata, OCR, and ASR data extracted from the videos. Faster R-CNN [15] (pre-trained on the OpenImages V4 dataset) and a ResNet-50 [7] (pre-trained on ImageNet) are used to support object instance search. The latest version of vitrivr also supports temporal queries.

SOM-Hunter [10] is an open-source video retrieval system that supports keyword search functionality (the same used in VIRET). The main novelty of this system is that it relies on the user's relevance feedback to dynamically update the search results displayed using self-organizing maps (SOMs).

One of the main peculiarities of our system compared to the other VBS participants is that it has been designed with an emphasis on scalability, employing a textual encoding that allows us to take advantage of a full-text search engine to efficiently index and search video content. In previous papers, we already exploited the idea of using a textual encoding, named Surrogate Text Representation [16], to index and search images by their deep features [16,17,18,19]. In this work, we extend this idea to also index information regarding the position of objects and colors that appear in the images, as well as tags.
Our system, like almost all current video retrieval systems, relies on artificial intelligence techniques for automatic video content analysis (including automatic annotation and object recognition). Nowadays, content-based image retrieval (CBIR) systems are a viable solution to the problem of retrieving and exploring the large volume of images resulting from the exponential growth of accessible image data. Many of these systems use both visual and textual features of the images, but often most of the images are not annotated, or only partially annotated. Since manual annotation of a large volume of images is impractical, Automatic Image Annotation (AIA) techniques aim to bridge this gap. For the most part, AIA approaches rely solely on the visual features of the image and use different techniques: one of the most common approaches consists of training a classifier for each concept and obtaining the annotation results by ranking the class probabilities [20,21]. Other AIA approaches aim to improve the quality of image annotation by using the knowledge implicit in a large collection of unstructured texts describing images, and are able to label images without having to train a model (Unsupervised Image Annotation approaches [22,23,24]). In particular, the image annotation technique we exploited is an Unsupervised Image Annotation technique originally introduced in [25].
Recently, image features built upon Convolutional Neural Networks (CNNs) have been used as an effective alternative to descriptors built using local image features, such as SIFT, ORB, and BRIEF, to name but a few. CNNs have been used to perform several tasks, including image classification, as well as image retrieval [26,27,28] and object detection [29]. Moreover, it has been shown that the representations learned by CNNs on specific tasks (typically supervised) can be transferred successfully across tasks [26,30]. The activations of neurons of specific layers, in particular the last ones, can be used as features to semantically describe the visual content of an image. Tolias et al. [31] proposed the Regional Maximum Activations of Convolutions (R-MAC) feature representation, which encodes and aggregates several regions of the image into a dense and compact global image representation. Gordo et al. [32] inserted the R-MAC feature extractor in an end-to-end differentiable pipeline in order to learn a representation optimized for visual instance retrieval through back-propagation. The whole pipeline is composed of a fully convolutional neural network, a region proposal network, the R-MAC extractor, and PCA-like dimensionality reduction layers, and it is trained using a ranking loss based on image triplets. In our work, as a feature extractor for video frames, we used a version of R-MAC that uses the ResNet-101 trained model provided by [33] as its core. This model has proven to perform best on standard benchmarks.
Object detection and recognition techniques also provide valuable information for the semantic understanding of images and videos. In [34], the authors propose a model for object detection and classification that integrates Tensor features. The latter are invariant under spatial transformations and, together with SIFT features (which are invariant to scaling and rotation), allow improving the classification accuracy of detected objects using a Deep Neural Network. In [35,36], the authors present a cloud-based system that analyses video streams for object detection and classification. The system is based on a scalable and robust cloud computing platform for performing automated analysis of thousands of recorded video streams. The framework requires a human operator to specify the analysis criteria and the duration of the video streams to analyze. The streams are then fetched from cloud storage, decoded, and analyzed on the cloud. The framework executes the intensive parts of the analysis on GPU-based servers in the cloud. Recently, in [37], the authors proposed an approach that combines Deep Convolutional Neural Networks and SIFT. In particular, they extract features from the analyzed images with both approaches and fuse the features using a serial-based method that produces a matrix which is fed to an ensemble classifier for recognition.
In our system, we used YOLOv3 [38] as the CNN architecture to recognize and locate objects in the video frames. The architecture of YOLOv3 jointly performs a regression of the bounding box coordinates and a classification for every proposed region. Unlike other techniques, YOLOv3 performs these tasks in an optimized fully-convolutional pipeline that takes pixels as input and outputs both the bounding boxes and their respective proposed categories. This CNN has the great advantage of being particularly fast (an important aspect given the large number of images we had to process) while at the same time exhibiting remarkable accuracy. To increase the number of categories of recognizable objects, we used three different variants of the same network trained on different datasets, namely YOLOv3, YOLO9000 [39], and YOLOv3 OpenImages [40].

The VISIONE video search tool
VISIONE is a visual content-based retrieval system designed to support large-scale video search. It allows a user to search for a video by formulating textual or visual queries describing the content of a scene of a target video (see Figure 2). VISIONE, in fact, integrates several search functionalities and exploits deep learning technologies to mitigate the semantic gap between text and image. Specifically, it supports:
• query by keywords: the user can specify keywords, including scenes, places or concepts (e.g., outdoor, building, sport), to search for video scenes;
• query by object location: the user can draw simple diagrams on a canvas to specify the objects that appear in a target scene and their spatial locations;
• query by color location: the user can specify some colors present in a target scene and their spatial locations (similarly to object location above);
• query by visual example: an image can be used as a query to retrieve video scenes that are visually similar to it.
Moreover, the search results can be filtered by indicating whether the keyframes are in color or in b/w, or by specifying their aspect ratio.

The User Interface
The VISIONE user interface is designed to be simple, intuitive and easy to use, even for users who interact with it for the first time. As shown in Figure 2, it integrates the search and browsing functionalities in the same display window.
The search interface (Figure 3) provides:
• a text box, named "Scene tags", where the user can type keywords describing the target scene (e.g., "park sunset tree walk");
• a color palette and an object palette that can be used to easily drag & drop a desired color or object onto the canvas (see below);
• a canvas, where the user can sketch objects and colors that appear in the target scene simply by drawing bounding boxes that approximately indicate the positions of the desired objects and colors (both selected from the palettes above) in the scene;
• a text box, named "Max obj. number", where the user can specify the maximum number of instances of the objects appearing in the target scene (e.g., two glasses);
• two checkboxes where the user can filter the type of keyframes to be retrieved (B/W or color images, 4:3 or 16:9 aspect ratio).
The canvas is split into a grid of 7×7 cells, where the user can draw boxes and then move, enlarge, reduce or delete them to refine the search. The user can select the desired color from the palette, drag & drop it onto the canvas and then resize or move the corresponding box as desired. There are two options to insert objects into the canvas: (i) directly draw a box on the canvas using the mouse and then type the name of the object in a dialog box (auto-complete suggestions are shown to the user), or (ii) drag & drop one of the object icons appearing in the object palette onto the canvas. For convenience, a selection of 38 common (frequently used) objects is included in the object palette.
Note that when objects are inserted into the canvas (e.g., a "person" and a "car"), the system filters out all the images not containing the specified objects (e.g., all the scenes without a person or without a car). However, images with multiple instances of those objects can be returned in the search results (e.g., images with two or three people and one or more cars). The user can use the "Max obj. number" text box to specify the maximum number of instances of an object appearing in the target scene. For example, by typing "1 person 3 car 0 dog" the system returns only images containing at most one person, three cars and no dogs.

The "Scene tags" text box provides auto-complete suggestions to the users and for each tag also indicates the number of keyframes in the databases that are annotated with it.For example, by typing "music" the system suggests "music (204775); musician (1374); music hall (290); ...", where the numbers indicates how many images in the database are annotated with the corresponding text (e.g.204775 images for "music", 1374 images for "musician", etcetera).This information can be exploited by the user when formulating the queries.Moreover, the keyword-based search supports wildcard matching.For example, with "music * " the system searches for any tag that starts with "music".
Every time the user interacts with the search interface (e.g., types some text or adds/moves/deletes a bounding box), the system automatically updates the list of search results, which are displayed in the browsing interface immediately below the search panel. In this way the user can interact with the system and gradually compose the query, also taking into account the search results obtained so far to refine the query itself. Moreover, while browsing the results, the user can use one of the displayed images to perform an image Similarity Search and retrieve frames visually similar to the one selected. A Similarity Search is executed by double-clicking on an image displayed in the search results.
The search results can also be grouped so that keyframes belonging to the same video are displayed together. This visualization option can be enabled/disabled by clicking on the "Group by video" checkbox.
Finally, the browsing part of the user interface (Figure 4) allows accessing the information associated with the video each displayed keyframe belongs to, viewing a keyframe-based video summary, and playing the video starting from the selected keyframe. In this way, the user can easily check if the selected image belongs to the searched video.

System Architecture Overview
The general architecture of our system is illustrated in Figure 5. Each component of the system will be described in detail in the following sections; here we give an overview of how it works. To support the search functionalities introduced above, our system exploits deep learning technologies to understand and represent the visual content of the database videos. Specifically, it employs:
• an image annotation engine, to extract scene tags (see Sec. 4.1);
• state-of-the-art object detectors, like YOLO, to identify and localize objects in the video keyframes (see Sec. 4.2);
• spatial color histograms, to identify dominant colors and their locations (see Sec. 4.2);
• the R-MAC [31] deep visual descriptors, to support the Similarity Search functionality (see Sec. 4.3).
The peculiarity of the approach used in VISIONE is to represent all the different types of descriptors extracted from the keyframes (visual features, scene tags, color/object locations) with a textual encoding that is indexed in a single text search engine. This choice allows us to exploit mature and scalable full-text search technologies and platforms for indexing and searching a large-scale video database without the need to implement dedicated data structures. In particular, VISIONE relies on the Apache Lucene full-text search engine. The textual encoding used to represent the various types of information associated with every keyframe is discussed in Section 4.
The queries formulated by the user through the search interface (e.g., the keywords describing the target scene and/or the diagrams depicting object and color locations) are also transformed into textual encodings in order to be processed. We designed a specific textual encoding for each type of data descriptor as well as for the user queries (see Section 4).
Records containing the information extracted from every keyframe are composed of four textual fields, as shown in Figure 5:
• Scene Tags, containing the automatically associated tags;
• Object&Color BBoxes, containing the textual encoding of color and object locations;
• Object&Color Classes, containing global information on the objects and colors appearing in the keyframe;
• Visual Features, containing the textual encoding of the extracted visual features.
These four fields are indexed by the full-text search engine and are used to serve the four main search operations of our system:
• Annotation Search, which searches for keyframes associated with specified annotations;
• BBox Search, which searches for keyframes having specific spatial relationships among objects/colors;
• OClass Search, which searches for keyframes containing specified objects/colors;
• Similarity Search, which searches for keyframes visually similar to a query image.
These four search operations are fully described in Section 4. The user query is broken down into four sub-queries (one for each search operation), and a query rescorer (the Lucene QueryRescorer implementation in our case) is used to combine the search results of all the sub-queries.
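To make the record layout concrete, the following minimal sketch shows what one indexed keyframe record might look like; the field names and values are purely illustrative and do not reflect the exact schema used by VISIONE.

# Illustrative shape of one indexed keyframe record; each field holds a textual
# encoding that is indexed by the full-text search engine. Field names and
# values are hypothetical, not the exact VISIONE schema.
keyframe_record = {
    "id": "video_00123_shot_0042",
    "scene_tags": "beach beach beach sea sea sunset",                      # Sec. 4.1
    "object_color_bboxes": "e3car f3car g5car a2person a3person e3black",  # Sec. 4.2
    "object_color_classes": "car1 vehicle1 person1 black1",                # Sec. 4.2
    "visual_features": "tau12 tau12 tau87 tau203",                         # Sec. 4.3
}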

Indexing and Searching Implementation
In VISIONE, as already anticipated, the content of keyframes is represented and indexed using automatically generated annotations, the positions of occurring objects, the positions of colors, and deep visual features. In the following, we describe how these descriptors are extracted, indexed, and searched.

Image Annotation
One of the most natural ways of searching in a large multimedia dataset is using a keyword-based query. To support this kind of query, we employed the automatic annotation system we introduced in [25]. Our system is based on an unsupervised image annotation approach that exploits the knowledge implicitly contained in a huge collection of unstructured texts describing images. This allows us to annotate the images without using a specifically trained model. The advantage is that the target vocabulary we used for the annotation reflects well the way people actually describe their pictures. Specifically, our system uses the tags and the descriptions contained in the metadata of a large set of media selected from the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset [41]. Those tags are validated using WordNet [42], cleaned, and then used as the knowledge base for the automatic annotation.
The subset of the YFCC100M dataset that we used for building the knowledge base was selected by identifying images with relevant textual descriptions and tags. To this end, we used a metadata cleaning algorithm that leverages the semantic similarity between images. Its core idea is that if a tag is contained in the metadata of a group of very similar images, then that tag is likely to be relevant for all these images. The similarity between images was measured by means of visual deep features; specifically, we used the output of the sixth layer of the Hybrid-CNN neural network as visual descriptors. As a result of our metadata cleaning algorithm, we selected about 16 thousand terms associated with about one million images. The set of deep features extracted from those images was then indexed using the MI-file index [43] in order to allow us to access the data and perform similarity search in a very efficient way.
The annotation engine is based on a k-NN classification algorithm. An image is annotated with the most frequent tags associated with the most similar images in the cleaned YFCC100M subset. The specific definition of the annotation algorithm is out of the scope of this paper; we refer to [25] for further details. In Figure 6, we show an example of annotation obtained with our system. Please note that our system also provides a relevance score for each tag associated with the image: the bigger the score, the more relevant the tag. We used our annotation system to label the video keyframes of the V3C1 dataset. For each keyframe, we produce a "tag textual encoding" by concatenating all the tags associated with the image. In order to represent the relevance of each associated tag, the tag is repeated a number of times equal to its relevance score (the relevance of each tag is approximated to an integer using the ceiling function). The ordering of the tags in the concatenation is not important, because what matters are the tag frequencies. In Figure 6, the box named Textual Document shows an example of the concatenation associated with a keyframe.
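As an illustration, the following minimal sketch (in Python) shows how such a tag textual encoding could be produced; the tag names and relevance scores are hypothetical.

import math

def tag_text_encoding(tags_with_scores):
    # Each tag is repeated ceil(relevance) times; the ordering is irrelevant,
    # since only term frequencies matter to the text search engine.
    terms = []
    for tag, score in tags_with_scores:
        terms.extend([tag] * math.ceil(score))
    return " ".join(terms)

# Hypothetical (tag, relevance score) pairs for one keyframe
annotations = [("beach", 3.2), ("sea", 2.1), ("sunset", 1.0)]
print(tag_text_encoding(annotations))
# -> "beach beach beach beach sea sea sea sunset"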

Annotation Search
The annotations, generated as described above, can be used to retrieve videos by typing keywords in the "Scene tags" text box of the user interface (see Figure 3). As already anticipated in Section 3.2, we call this searching option Annotation Search. The Annotation Search is executed by performing a full-text search. As described in Section 5, during the VBS competition the BM25 similarity was used as the text scoring function.

Objects and Colors
Information related to objects and colors in a keyframe is treated in a similar way in our system. Given a keyframe, we store both local and global information about the objects and colors contained in it. As discussed in Section 3.2, the positions where objects and colors occur are stored in the Object&Color BBoxes field, while all the objects and colors occurring in a frame are stored in the Object&Color Classes field.

Objects
To extend the number of detectable objects, we used a combination of three different versions of YOLO to perform object detection: YOLOv3 [38], YOLO9000 [39], and YOLOv3 OpenImages [40].
The idea of using YOLO to detect objects within videos has already been exploited in VBS, e.g., by Truong et al. [44]. The peculiarity of our approach is that we combine and encode the spatial positions of the detected objects in a single textual description of the image. In particular, each object detected in an image I is indexed using a specific textual encoding ENC = (cod_loc cod_class) that puts together the location cod_loc and the class cod_class corresponding to the object. To obtain this information, we use a grid of 7×7 cells overlaid on the image to determine where (over which cells) each object is located. The textual encoding of this information is created as follows. For each image, we have a space-separated concatenation of ENCs, one for every cell (cod_loc) in the grid that contains the object (cod_class): for example, for the image in Figure 7 the rightmost car is indexed with the sequence {e3car f3car ... g5car}, where "car" is the cod_class of the object car, e3 is the cod_loc of the cell at column "e" and row 3, f3 is the cod_loc of the cell at column "f" and row 3, and so on. This information is stored in the Object&Color BBoxes field of the record associated with the keyframe.
In addition to the position of objects, we also maintain global information about the objects contained in a keyframe, in terms of the number of occurrences of each object detected in the image (see Figure 7). Occurrences of objects in a keyframe are encoded by repeating the object class (cod_class) as many times as the number of occurrences of the object itself. This information is stored using an encoding that composes the classes with their occurrences in the image: (cod_class cod_occ). For example, in Figure 7, YOLO detected 2 persons, 3 cars (which are also classified as vehicles by the detector) and 1 horse (also classified as animal and mammal), and this results in the Object Classes encoding "person1 person2 vehicle1 vehicle2 vehicle3 car1 car2 car3 mammal1 horse1 animal1". This information is stored in the Object&Color Classes field of the record associated with the keyframe.
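The following sketch illustrates how both encodings could be generated for one keyframe. It assumes bounding boxes in normalized [0,1] coordinates, grid columns named a-g and rows numbered 1-7 as in the example of Figure 7, and a simple "every overlapped cell" rule; the actual implementation details may differ.

from collections import Counter

COLS = "abcdefg"  # 7x7 grid: columns a-g, rows 1-7

def object_location_encoding(boxes):
    # boxes: list of (class_name, x0, y0, x1, y1) with normalized coordinates.
    # Emits one "cod_loc cod_class" term per grid cell overlapped by each box.
    terms = []
    for cls, x0, y0, x1, y1 in boxes:
        c0, c1 = int(x0 * 7), min(int(x1 * 7 - 1e-9), 6)
        r0, r1 = int(y0 * 7), min(int(y1 * 7 - 1e-9), 6)
        for c in range(c0, c1 + 1):
            for r in range(r0, r1 + 1):
                terms.append(f"{COLS[c]}{r + 1}{cls}")
    return " ".join(terms)

def object_class_encoding(boxes):
    # Emits one "cod_class cod_occ" term per detected instance of each class.
    counts = Counter(cls for cls, *_ in boxes)
    return " ".join(f"{cls}{i}" for cls, n in counts.items() for i in range(1, n + 1))

# Hypothetical detections for one keyframe
detections = [("car", 0.60, 0.30, 0.95, 0.70), ("person", 0.10, 0.20, 0.25, 0.90)]
print(object_location_encoding(detections))  # -> "e3car e4car e5car f3car ... g5car"
print(object_class_encoding(detections))     # -> "car1 person1"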

Colors
To represent colors, we use a palette of 32 colors, which represents a good trade-off between the huge miscellany of existing colors and simplicity of choice for the user at search time. For the creation of the color textual encoding we used the same approach employed to encode the object classes and locations, using the same grid of 7×7 cells. To assign the colors to each cell of the grid, we proceed as follows. We first evaluate the color of each pixel using the CIELAB color space. Then, we map the evaluated color of the pixel to our 32-color palette. To do so, we perform a k-NN similarity search between the evaluated color and our 32 colors to find the colors in our palette that best match the color of the current pixel. The metric used for this search is the Earth Mover's Distance [45]. We take into consideration the first two colors in the k-NN results. The first color is assigned to the pixel. We then compute the ratio between the scores of the two colors and, if it is greater than 0.5, we also assign the second color to that pixel. This is done to allow matching of very similar colors during searching. We repeat this for each pixel of a cell in the grid and then sum the occurrences of each color of our palette over all the pixels in the cell. Finally, we assign to the cell all the colors whose occurrence is greater than 7% of the number of pixels contained in the cell. So more than one color may be assigned to a single cell. This redundancy helps reduce colors misclassified with respect to how they appear to the human eye.
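The per-cell color assignment described above can be sketched as follows. This is a simplification: Euclidean distance in CIELAB stands in for the Earth Mover's Distance used in the paper, and the exact definition of the score ratio is our assumption.

import numpy as np

def assign_cell_colors(cell_lab, palette_lab, ratio_thr=0.5, cell_thr=0.07):
    # cell_lab: (H, W, 3) CIELAB pixels of one 7x7 grid cell.
    # palette_lab: (32, 3) CIELAB values of the color palette.
    # Returns the palette indices assigned to the cell.
    h, w, _ = cell_lab.shape
    counts = np.zeros(len(palette_lab), dtype=int)
    for p in cell_lab.reshape(-1, 3):
        d = np.linalg.norm(palette_lab - p, axis=1)    # distance to each palette color
        first, second = np.argsort(d)[:2]
        counts[first] += 1
        if d[first] / (d[second] + 1e-9) > ratio_thr:  # the two best colors are close: keep both
            counts[second] += 1
    # keep every color covering more than 7% of the cell's pixels
    return [i for i, c in enumerate(counts) if c > cell_thr * h * w]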
The colors assigned to all the 7×7 image cells are then encoded into two textual documents, one for the color locations and one for the global color information, using the same approach employed to encode object classes and locations, discussed above. Specifically, the textual document associated with the color locations is obtained by concatenating textual encodings of the form cod_loc cod_class, where cod_loc is the identifier of a cell and cod_class is the identifier of a color assigned to the cell. This information is stored in the Object&Color BBoxes field of the record associated with the keyframe. The textual document for the color classes is obtained by concatenating the text identifiers (cod_class) of all the colors assigned to the image. This information is stored in the Object&Color Classes field of the record associated with the keyframe.
Object and Color Location Search
At run time, the search functionalities for both the query by object location and the query by color location are implemented using two search operations: the bounding box search (BBox Search) and the object/color class search (OClass Search).
The user can draw a bounding box at a specific position of the canvas and specify which object he/she wants to find in that position, or he/she can drag & drop a particular object or color from the palettes in the user interface and resize the corresponding bounding box as desired (as shown in the "Query by object/colors" part of Figure 3). All the bounding boxes present in the canvas, related both to colors and to objects, are then converted into the two textual encodings described above (the Object&Color BBoxes and Object&Color Classes encodings) in the same way as for the index generation.
Then, for the actual search phase, an instance of the OClass Search operator is executed first. This operator tries to find a match between all the objects represented in the canvas and the frames that contain these objects as stored in the index. During the VBS2019 competition, the metric used to find this match was the dot product. This produces a result set containing the subset of the dataset with all the frames that match the objects represented in the canvas. After this, the BBox Search operator performs a rescoring of the result set by matching the Object&Color BBoxes encoding of the query (the encoding generated from all the bounding boxes present in the canvas) against all the corresponding encodings in the index. The metric used in this case during the VBS competition was BM25. After the execution of these two search operators, the frames that satisfy both searches are shown in the browsing part of the user interface, ordered by descending score.

Deep Visual Features
VISIONE also supports a content-based visual search functionality, i.e., it allows users to retrieve keyframes visually similar to a query image given by example. In order to represent and compare the visual content of the images, we use the Regional Maximum Activations of Convolutions (R-MAC) [31], which is a state-of-the-art descriptor for image retrieval. The R-MAC descriptor effectively aggregates several local convolutional features (extracted at multiple positions and scales) into a dense and compact global image representation. We use the ResNet-101 trained model provided by Gordo et al. [32] as the R-MAC feature extractor, since it achieved the best performance on standard benchmarks. The resulting R-MAC descriptors are 2048-dimensional real-valued vectors.
To efficiently index the R-MAC descriptors, we transform the deep features into a textual encoding suitable for being indexed by a standard full-text search engine. Specifically, we used the Scalar Quantization-based Surrogate Text Representation proposed in [19]. The idea behind this approach is to map the real-valued vector components of the R-MAC descriptor into a (sparse) integer vector that acts as the term-frequency vector of a synthetic codebook. The integer vector is then transformed into a text document by simply concatenating synthetic codewords so that the term frequency of the i-th codeword is exactly the i-th element of the integer vector. For example, the four-dimensional integer vector [2, 1, 0, 1] is encoded with the text "τ1 τ1 τ2 τ4", where {τ1, τ2, τ3, τ4} is a codebook of four synthetic alphanumeric terms. The overall process used to transform an R-MAC descriptor into a textual encoding is summarized in Figure 8 (for simplicity, the R-MAC descriptor is depicted as a 10-dimensional vector in the figure).
The mapping of the deep features into term-frequency vectors is designed (i) to preserve as much as possible the rankings, i.e., similar features should be mapped into similar term-frequency vectors (for effectiveness), and (ii) to produce sparse vectors, since each data object will be stored in as many posting lists as the non-zero elements in its term-frequency vector (for efficiency). To this end, the deep features are first centered using their mean and then rotated using a random orthogonal transformation. The random orthogonal transformation is particularly useful to distribute the variance over all the dimensions of the vector, as it provides good balancing for high-dimensional vectors without the need to search for an optimal balancing transformation. In this way, we try to increase the cases where the dimensional components of the feature vectors have the same mean and variance, with mean equal to zero. Moreover, the used roto-translation preserves the rankings according to the dot product (see [19] for more details). Since search engines, like the one we used, rely on an inverted file to store the data, as a second step we have to sparsify the features: sparsification guarantees the efficiency of these indexes. To achieve this, the Scalar Quantization approach keeps only the components above a certain threshold, zeroing all the others, and quantizes the non-zero elements to integer values. To deal with negative values, the Concatenated Rectified Linear Unit (CReLU) transformation [46] is applied before the thresholding. Note that the CReLU simply makes an identical copy of the vector elements, negates it, concatenates the original vector and its negation, and then zeros out all the negative values. Eventually, as the last operation, we apply the Surrogate Text Representation technique [16], which allows us to transform an integer vector into text by generating a text document that repeats the codewords associated with the components of the vector a number of times proportional to the value of the components themselves.
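A compact sketch of this transformation is shown below; the threshold and quantization scale are illustrative values, not those used in the paper, and the small dimensionality is only for readability.

import numpy as np

def surrogate_text(feature, mean, rotation, threshold=0.5, scale=10):
    # Scalar Quantization-based surrogate text sketch for one deep feature vector.
    x = rotation @ (feature - mean)              # center and randomly rotate
    x = np.maximum(np.concatenate([x, -x]), 0)   # CReLU: copy, negate, concatenate, zero negatives
    x[x < threshold] = 0                         # sparsification: keep only large components
    tf = np.floor(x * scale).astype(int)         # quantize to integer term frequencies
    # emit the codeword tau_i repeated tf[i] times
    return " ".join(f"tau{i}" for i, f in enumerate(tf) for _ in range(f))

rng = np.random.default_rng(0)
dim = 8                                              # real R-MAC descriptors are 2048-dimensional
rot, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # random orthogonal transformation
feat = rng.normal(size=dim)                          # stand-in for an R-MAC descriptor
print(surrogate_text(feat, mean=np.zeros(dim), rotation=rot))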
In our system, the Surrogate Text Representation of a dataset image is stored in the "Visual Features" field of our index (Figure 5).

Similarity Search
In VISIONE, to start a Similarity Search the user can select a keyframe as the query (e.g., any one presented in the result set of a previous search). The system then relies on the surrogate text encodings of the images to perform the Similarity Search. The surrogate texts are compared using the dot product, as in [19]. This choice achieved very good performance in large-scale image retrieval tasks.

Overview of the Search Process
As described so far, our system relies on four search operations: Annotation Search, BBox Search, OClass Search, and Similarity Search. Every time a user interacts with the VISIONE interface (adds/removes/updates a bounding box, adds/removes a keyword, clicks on an image, etc.), a new query Q is executed, where Q is the sequence of the instances of the search operations currently active in the interface. The query is then split into subqueries, where each subquery contains the instances of a single search operation. In a nutshell, the system runs all the subqueries using the appropriate search operation and then combines the search results using a sequence of reorderings. In particular, we designed the system so that the OClass Search operation has priority, and a first ranking of all the images in our index is returned by using this search operation. The result set contains all the images that match the classes drawn in the canvas (both objects and colors), without taking into account their spatial locations. If the query also includes some scene tags (text box of the user interface), then the Annotation Search is performed, but only on the result set generated by the first OClass Search. In this case, the Annotation Search therefore only produces a rescoring of the results obtained at the previous step. Finally, another rescoring is performed using the BBox Search. If the user does not issue any annotation keyword in the interface, only the OClass Search and BBox Search are used. If, on the other hand, only one or more keywords are entered in the interface, only the Annotation Search is used to find the results. Similarity Search, in contrast, is the only search operation that is stand-alone in our system, i.e., it is never combined with other search operations. However, we note that in future versions of VISIONE it may be interesting to also include the possibility of using Similarity Search to reorder the results obtained from other search operations.
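The combination logic can be summarized by the following sketch, where search(field, text) is a hypothetical wrapper around the full-text engine returning a dictionary mapping keyframe identifiers to scores; in the real system the rescoring is delegated to Lucene's QueryRescorer, so the exact score combination may differ.

def run_query(oclass_q, annotation_q, bbox_q, search):
    results = None
    if oclass_q:                          # 1. object/color classes have priority
        results = search("object_color_classes", oclass_q)
    if annotation_q:                      # 2. scene tags rescore the result set (or start it)
        ann = search("scene_tags", annotation_q)
        results = ann if results is None else {k: ann.get(k, 0.0) for k in results}
    if bbox_q and results is not None:    # 3. spatial locations provide the final rescoring
        bb = search("object_color_bboxes", bbox_q)
        results = {k: bb.get(k, 0.0) for k in results}
    results = results or {}
    return sorted(results, key=results.get, reverse=True)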

Evaluation
A user query is executed as a combination of search operations (Annotation Search, BBox Search, OClass Search, and Similarity Search), as discussed in Sections 3.2 and 4. The final result set returned to the user highly depends on the results returned by each executed search operation. Each search operation is implemented in Apache Lucene using a specific ranker that determines how the textual encodings of the database items are compared with the textual encoding of the query in order to provide the ranked list of results. Examples of rankers commonly used in Lucene are those based on the TF-IDF and BM25 weighting schemes.
In the first implementation of the system, which we used to participate in the Video Browser Showdown (VBS) competition in 2019, we selected the rankers for each search operation by performing some very preliminary tests and by (subjectively) estimating the performance of the system using our personal experience and feeling. Specifically, we tested a set of queries with different rankers and selected the ranker that provided good results in the top positions of the returned items. However, given the lack of a ground truth, this qualitative analysis was based on the subjective feedback of a member of our team, who explicitly looked at the top-returned images obtained in the various tested scenarios and judged how good the results were.
As the choice of the rankers strongly influences the performance of the system, in this section we present a more in-depth and objective analysis based on a set of realistic queries submitted by multiple users. Specifically, we used the query logs acquired during our participation in the challenge, where all the sequences of search operations executed as a consequence of users interacting with the system are stored. By using these query logs, we were able to re-execute the same user sessions using different rankers. In this way, we objectively measured the performance of the system obtained when the same sequence of operations was executed with different rankers. The final goal of this analysis is to find the best combination of rankers for our system for the typical behavior of users interacting with it. Intuitively, the best combination of rankers is the one that, on average, more often places good results (that is, target results for the search challenge) at the top of the result list for the various users interacting with the system during the challenge.
We focus mainly on the rankers for the BBox Search, OClass Search, and Annotation Search. We do not consider the Similarity Search, as it is an independent search operation in our system, and previous work [19] already proved that the dot product (TF ranker) works well with the surrogate text encodings of the R-MAC descriptors, which are the features adopted in our system for the Similarity Search.

Experiment Design and Evaluation Methodology
As anticipated before, our analysis makes use of the log of queries executed on our system during the 2019 VBS competition. The competition was divided into three content search tasks: visual Known-Item Search (visual KIS), textual Known-Item Search (textual KIS) and Ad-hoc Video Search (AVS), already described in Section 1. For each task, a series of runs is executed. In each run, the users are requested to find one or more target videos. When the user believes that he/she has found the target video, he/she submits the result to the organization team, which evaluates the submission.
After the competition, the organizers of VBS provided us with the VBS2019 server dataset, which contains all the tasks issued at the competition (target video/textual description, start/end times of the target video for KIS tasks, and ground-truth segments for KIS tasks), the client logs of all the systems participating in the competition, and the submissions made by the various teams. We used the ground-truth segments and the log of the queries submitted to our system to evaluate the performance of our system under different settings. We restricted the analysis to the logs related to textual and visual KIS tasks, since ground truths for AVS tasks were not available.
During the VBS competition, a total of four users (two experts and two novices) interacted with our system to solve 23 tasks (15 visual KIS and 8 textual KIS). The total number of queries executed on our system for those tasks was 1600.

Figure 9: Example of the ground-truth keyframes for a 20-second video clip used as a KIS task at VBS2019. During the competition, our team correctly found the target video by formulating a query describing one of the keyframes depicting a lemon. However, note that most of the keyframes in the ground truth were not relevant for the specific query submitted to our system.
In our analysis, we considered four different rankers to sort the results obtained by each search operation of our system. Specifically, we tested the rankers based on the following text scoring functions:
• BM25: Lucene's implementation of the well-known similarity function BM25, introduced in [47].
• TFIDF: Lucene's implementation of the weighting scheme known as tf-idf (Term Frequency-Inverse Document Frequency), introduced in [48] and referred to in Lucene as "Classic Similarity".
• TF: implementation of the dot product similarity over the term-frequency vectors.
• NormTF: implementation of the cosine similarity (the normalized dot product of the two weight vectors) over the term-frequency vectors.
Since we have three search operations and four rankers, we have a total of 64 possible combinations for the implementation of our system. We denote each configuration with a triplet R_BB-R_AN-R_OC, where R_BB is the ranker used for the BBox Search, R_AN is the ranker used for the Annotation Search, and R_OC is the ranker used for the OClass Search.
In our first implementation of the system, the one used at the 2019 VBS competition, we employed the combination BM25-BM25-TF. With the analysis reported in this section, we compare all the different combinations in order to find the one that is most suited for the video search task.
For the analysis reported in this section, we went through the logs and automatically re-executed all the queries using the 64 different combinations of rankers in order to find the one that, with the highest probability, places a relevant result (i.e., a keyframe in the ground truth) among the top k returned results. Each combination was obtained by selecting a specific ranker (among BM25, NormTF, TF, and TFIDF) for each search operation (BBox Search, Annotation Search, and OClass Search).

Evaluation Metrics
The user has to retrieve a video segment from the database using the functionalities of the system. A video segment is composed of various keyframes, which can be significantly different from one another. As an example, see Figure 9, which shows the keyframes of a video that was found using our system.
In our analysis, we assume that the user stops examining the ranked result list as soon as he/she finds one relevant result, that is, one of the keyframes belonging to the target video. Therefore, given that relevant keyframes can be significantly different from one another, we do not take into account the rank positions of all the keyframes composing the ground truth of a query, as required for performance measures like Mean Average Precision or Discounted Cumulative Gain. We rather want to measure how good the system is at placing at least one of the target keyframes in the top positions. In this respect, we use the Mean Reciprocal Rank (defined below) as a quality measure, since it allows us to evaluate how good the system is at returning at least one relevant result (one of the keyframes of the target video) in the top positions of the result set.
Formally, given a set Q of queries, for each q ∈ Q we define:
• rank(I_j^{(q)}) as the rank of the ground-truth image I_j^{(q)} in the ranked results returned by our system after executing the query q;
• r_q = \min_{j=1,...,n_q} rank(I_j^{(q)}) as the rank of the first correct result in the ranked result list for the query q, where n_q is the number of ground-truth keyframes of q.
The Mean Reciprocal Rank (MRR) for the query set Q is given by

MRR = \frac{1}{|Q|} \sum_{q \in Q} RR(q),

where the Reciprocal Rank (RR) for a single query q is defined as RR(q) = 1/r_q. We evaluated the Mean Reciprocal Rank for each different combination of rankers of our system. Moreover, as we expect a user to inspect just a small portion of the results returned in the browsing interface, we also evaluate the performance of each configuration in finding at least one correct result in the top k positions of the result list (k can be interpreted as the maximum number of images inspected by a user). To this end, we computed the Mean Reciprocal Rank at position k (MRR@k):

MRR@k = \frac{1}{|Q|} \sum_{q \in Q} RR@k(q),  with  RR@k(q) = 1/r_q if r_q ≤ k, and 0 otherwise.

In the experiments we consider values of k smaller than 1000, with a focus on values between 1 and 100, as we expect cases where a user inspects more than 100 results to be less realistic.
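The computation of MRR@k from the ranks of the first relevant results is straightforward; the following sketch, with hypothetical rank values, illustrates it.

def mrr_at_k(first_relevant_ranks, k):
    # first_relevant_ranks: for each query, the rank r_q of the first correct
    # keyframe in the result list (None if no relevant result is returned).
    rr = [1.0 / r if r is not None and r <= k else 0.0 for r in first_relevant_ranks]
    return sum(rr) / len(rr)

ranks = [1, 4, None, 120, 7]         # hypothetical r_q values for five queries
print(mrr_at_k(ranks, k=100))        # -> (1 + 1/4 + 0 + 0 + 1/7) / 5 ≈ 0.279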

Results
In our analysis, we used |Q| = 521 queries (out of the 1600 mentioned above) to calculate both MRR and MRR@k. The rest of the queries executed on our system during the VBS2019 competition are not eligible for our analysis, since they are not informative for choosing the best ranker configuration. In fact:
• about 200 queries involved the execution of a Similarity Search, a video summary or a filtering, whose results are independent of the rankers used in the three search operations considered in our analysis;
• the search result sets of about 800 queries do not contain any correct result, due to the lack of alignment between the text associated with the query and the text associated with the images relevant to the target video. For those cases, the system is not able to display the relevant images in the result set whatever ranker is used. In fact, the choice of a specific ranker only affects the ordering of the results and not their actual selection.
Figure 10 shows the MRR of all 64 combinations of rankers. The graph shows that there is a significant difference between the best and the worst combination. Note that the combination we used at VBS 2019 (indicated with diagonal lines in the graph), which was chosen according to subjective feelings, has good performance, but it is not the best. In fact, we noticed that there exist some patterns in the combinations of the rankers used for the OClass and the Annotation Search that are particularly effective and some that, instead, provide very poor results. For example, the combinations that use TF for the OClass Search and BM25 for the Annotation Search gave us the overall best results, while the combinations that use BM25 for the OClass Search and NormTF for the Annotation Search have the worst performance. Specifically, we obtained an MRR of 0.023 for the best combination (NormTF-BM25-TF) and 0.004 for the worst (BM25-NormTF-BM25).
From a further analysis of the MRR results, it turned out quite clearly that for the Annotation Search the BM25 ranker is particularly effective, while the use of the TF ranker highly degrades the performance. This is even more evident in the results shown in Figure 11, where for each search operation we report the MRR values obtained for a fixed ranker while varying the rankers used with the other search operations. It also turned out that for the BBox Search the TFIDF and NormTF rankers perform better than the TF and BM25 rankers. Moreover, for the OClass Search, BM25 has the worst performance in general and the TF ranker is the one that provides the best results. Furthermore, to complete the analysis of the performance of the rankers, we analyzed the MRR@k, where k is the parameter that controls how many results are shown to the user in the result set. The results are reported in Figure 12, where we varied k between 1 and 1000. In this case, for a better understanding of the chart, we report only eight combinations (the four with the best MRR@k, the four with the worst MRR@k, and the configuration used at VBS2019). In conclusion, we identified the configuration NormTF-BM25-TF as the best one for the BBox-Annotation-OClass search operations, which provides a relative improvement of 38% in MRR and 40% in MRR@100 with respect to the setting previously used at the VBS competition.

Conclusions
In this paper, we described a frame-based interactive video retrieval system, named VISIONE, which participated in the Video Browser Showdown (VBS) contest in 2019. VISIONE includes several retrieval modules and supports complex multi-modal queries, including query by keywords (tags), query by object location sketch, query by color location sketch, and query by visual example. A demo of VISIONE running on the VBS V3C1 dataset is publicly available at http://visione.isti.cnr.it/.
VISIONE exploits a combination of artificial intelligence techniques to automatically analyze the visual content of the video keyframes and extract annotations (tags), information on the objects and colors appearing in the keyframes (including the spatial relationships among them), and deep image descriptors. A distinctive aspect of our system is that all these extracted features are converted into specifically designed textual encodings that are then indexed using a full-text search engine. The main advantage is that, in this way, VISIONE can build on the latest search engine technologies, which today guarantee high efficiency and scalability.
The evaluation reported in this work shows that the effectiveness of the retrieval is highly influenced by the similarity measures used to compare the textual encodings of the video features. In fact, by performing an extensive evaluation of the system under several configurations, we observed that an optimal choice of the text scoring functions used to sort the search results can improve the performance in terms of Mean Reciprocal Rank by up to an order of magnitude. Specifically, for our system we found that TF, NormTF, and BM25 are particularly effective for comparing the textual representations of object/color classes, object/color bounding boxes, and concept tags, respectively.

Figure 1: Examples of KIS, textual KIS and AVS tasks.

Figure 2: The VISIONE user interface, composed of two parts: the search interface and the browsing interface.

Figure 3: The video search functionalities as designed in the VISIONE User Interface

Figure 6: Example of our image annotation and its representation as a single textual document. In the textual document, each tag is repeated a number of times equal to the least integer greater than or equal to the tag relevance.

Figure 7: Example of our textual encoding for objects and their spatial locations. The textual encoding for color locations is obtained in a similar way.

Figure 10: Mean Reciprocal Rank of the 64 combinations of rankers; the one filled with diagonal lines in the graph is the combination used at the VBS2019 competition.

Figure 12: MRR@k for eight combinations of the rankers (the four best, the four worst and the setting used at VBS2019), varying k from 1 to 1000.