A Knowledge-Driven Multimedia Retrieval System Based on Semantics and Deep Features

Abstract: In recent years, user information needs have changed due to the heterogeneity of web contents, which increasingly involve multimedia. Although modern search engines provide visual queries, it is not easy to find systems that allow searching within a particular domain of interest and that perform such a search by combining text and visual queries. Different approaches have been proposed over the years and, in the semantic research field, many authors have proposed techniques based on ontologies. On the other hand, in the context of image retrieval systems, techniques based on deep learning have obtained excellent results. In this paper we present novel approaches for image semantic retrieval and a possible combination for multimedia document analysis. Several results are presented to show the performance of our approach compared with literature baselines.


Introduction
The main aim of a search engine is to satisfy user information needs [1] by retrieving information relevant to the user [2]. Relevance may be divided into two main classes, called objective (system-based) and subjective (user-based) relevance respectively [3][4][5]. Objective relevance takes into account the direct match between the topic of the retrieved document and the desired topic, according to a user query. Several studies on human relevance [6][7][8] show that several criteria are involved in the output evaluation of the information retrieval process. Subjective relevance is related to the concepts of aboutness and appropriateness of the retrieved information, so it depends on the user and his/her judgment. Relevance can also be divided into five typologies [9]: an algorithmic relevance between the query and the set of retrieved information objects; a topicality-like type, associated with aboutness; cognitive relevance, related to the user information need; situational relevance, depending on the task interpretation; and motivational and affective relevance, which is goal-oriented. Moreover, relevance has two main features: it is multidimensional, because different users can grasp and evaluate the acquired information differently, and dynamic, because users' skills and knowledge about specific information can change over time. Information must be represented in order to be analyzed. We can use symbols to represent information, and these symbols are called signs [10]. A sign can be defined as "something that stands for something else, to someone in some capacity" [11]. A sign is not limited to words: images, sounds, videos and more can be signs. Starting from these considerations, the design of a modern information retrieval system takes into account that information can be represented in different forms, in order to improve the efficiency and effectiveness of the whole retrieval process.
The existence of different kinds of information representation increases the semantic gap between low-level features and high-level concepts. Moreover, in a general content-based approach, the query is unknown to the system if it is not represented with low-level features. The main effect of the semantic gap is that a query expressed in terms of low-level features can return wrong results if the conceptual content is not given. For this reason, the representation of low-level features is a crucial task in information retrieval systems.
In recent years, the extraction of low-level features has been an important topic, highly investigated in the literature. New techniques based on deep learning have been presented by the scientific community. In particular, novel features called deep features are extracted as the output of deep neural networks (DNNs).
In our work we propose novel techniques to analyze and combine semantic information and visual features for multimedia web document retrieval. Our techniques form the basis of a whole framework for multimedia query posing and document analysis. Our approach has been extensively tested and compared with well-known semantic and content-based retrieval techniques.
The paper is organized as follows: in Section 2 we present a literature overview, highlighting the differences with our approach; Section 3 gives a top-level view of our system, while in Section 4 we discuss the strategy used for our experiments; finally, discussion and conclusions are reported in Section 5.

Related Work
Retrieval of multimedia objects has received contributions from several scientific communities, such as artificial intelligence, pattern recognition, computational vision and deep learning. In this section, we present the topics of our work, introducing and analyzing several approaches and techniques for text and image retrieval.
A user tends to handle high-level concept representations, such as keywords and textual descriptors, to interpret images and measure their similarity [12]. The difference in knowledge representation between low-level features and user semantic knowledge is referred to as the semantic gap [13,14]. The authors of References [15,16] propose interesting surveys on low-level features and high-level semantics for image analysis and retrieval. Moreover, an interesting approach operates at the query level [17]: the authors represent a query by three levels: (i) low-level features, such as text, colour and shape; (ii) derived features with the same degree of logical inference; (iii) abstract features. In the authors' opinion there are two ways to mitigate the semantic gap. The first consists of a combined visual and textual query posed by the user; the second is the use of a multimedia knowledge base to drive the multimedia analysis. In this paper we investigate the second approach and propose a framework based on semantic similarity to analyze multimedia contents. In Reference [18], the authors discuss semantic similarity measures and divide them into four types: (i) path-based; (ii) information content-based; (iii) features-based; and (iv) hybrid methods. Path-based measures express the similarity between two concepts as a function of the path length between the concepts in a knowledge structure (e.g., a taxonomy or a semantic network). The main idea of information content-based measures is that each concept represents well-defined information in a knowledge structure, and two or more concepts are similar if they share common information. Features-based measures are independent of the knowledge structure and derive similarity by exploiting the properties (e.g., glosses or attributes) of the concepts themselves. A hybrid measure combines the measure types described above and different relations between concepts, such as is-a and part-of.
In our work we use path-based and information content-based semantic similarity measures.
The features used to represent images have a low dimension compared with the original raw data; they are generally composed of a series of numerical values and can be represented through different data formats (e.g., a vector). The authors of Reference [19] propose a comprehensive review of feature extraction for content-based image retrieval systems (CBIRs), and Reference [20] presents a review of SIFT and of different features extracted from convolutional neural networks (CNNs) for image retrieval. In general, we can divide features into global, local and deep features. Global features aim to represent an image as a whole, considering, for example, colour, shape and texture.
Local features describe the key-points of an image object and are generally used in object detection and recognition. Deep features are a new approach based on DNNs. In some cases, deep features are the output of the last layer of the DNN, or of other layers with additional operators [20]. Many authors consider the second-to-last layer of the DNN as a deep feature, applying an aggregation layer based on Global Max Pooling or Global Average Pooling.
Several descriptors have been presented in the literature; in the remainder of this section we introduce the baselines used in our work, as discussed in the following sections. PHOG (Pyramid of Histograms of Orientation Gradients) is described in detail in [21], where the authors propose a descriptor based on HOG (Histograms of Orientation Gradients). Its goal is the representation of an image through its local shapes and their spatial arrangement. A local feature is an image pattern which differs from its immediate neighbourhood; it is usually related to a change of one image property or of several properties simultaneously [22]. Examples of local features are SIFT [23] and SURF [24]. ORB [25] was proposed as a valid alternative to mitigate the high computational cost of SIFT and SURF; it is a fusion of the FAST keypoint detector and the BRIEF descriptor. In this paper, we use PHOG as a global descriptor due to the good results reported in the discussed literature. In addition, we use ORB as a local feature because it has an accuracy comparable with SIFT and SURF but a faster computation [26]. In recent years, the progress of Computer Vision based on Deep Learning has achieved impressive performance in terms of accuracy and image understanding [27]. In this context, different DNNs have been proposed, and the specific architectures called Convolutional Neural Networks present the best results. A Convolutional Neural Network (CNN/ConvNet) is a type of feed-forward artificial neural network inspired by the organization of the animal visual cortex. A CNN is organized in different layers: (i) input layer; (ii) convolution; (iii) normalisation; (iv) pooling; (v) full connection [27].
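To make the idea behind PHOG concrete, the following is a minimal numpy sketch (not the authors' implementation) of an orientation-gradient histogram concatenated over a spatial pyramid; the function names `orientation_histogram` and `phog` are our own illustrative choices, and real PHOG uses edge contours rather than raw gradients:

```python
import numpy as np

def orientation_histogram(img, bins=8):
    """Histogram of gradient orientations, the building block of HOG/PHOG."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    s = hist.sum()
    return hist / s if s > 0 else hist

def phog(img, levels=2, bins=8):
    """Concatenate cell histograms over a pyramid: level l has 2^l x 2^l cells."""
    feats = []
    h, w = img.shape
    for l in range(levels + 1):
        n = 2 ** l
        for i in range(n):
            for j in range(n):
                cell = img[i * h // n:(i + 1) * h // n, j * w // n:(j + 1) * w // n]
                feats.append(orientation_histogram(cell, bins))
    return np.concatenate(feats)
```

With `levels=2` and `bins=8` the descriptor has 8 * (1 + 4 + 16) = 168 dimensions, illustrating why pyramid descriptors stay compact compared with raw pixels.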
Over the years, the scientific community has proposed different CNN architectures. In this section we briefly introduce the ones used in our work. VGGNet [28], developed by the VGG (Visual Geometry Group) of the University of Oxford and presented at the ImageNet LSVRC (Large Scale Visual Recognition Competition) classification task in 2014, is one of the CNNs most used in research works and has become one of the most cited in the literature. ResNet [29] won first place in the ILSVRC 2015 classification competition with a top-5 error rate of 3.57%. ResNet is a residual network. Residual networks solve the learning problems that arise when many layers are added to a convolutional network: in this condition, the performance of previous CNN architectures degrades quickly. In 2014, Google researchers introduced the Inception network [30], which ranked first in the ILSVRC 2014 competition. The issues addressed by the authors in designing Inception are mainly related to the difficulty of choosing the kernel size, the creation of too-deep networks which generate overfitting, and the reduction of the computational cost. The authors present different versions of Inception and, in each version, they apply a set of optimizations to improve accuracy and decrease the computational complexity. MobileNet [31,32] is a neural network proposed by Google researchers. This neural network can be used on mobile devices or in cases where processing power is limited.
In our system, we use the CNNs reported above for feature extraction. In References [33,34] the authors propose a literature review of deep learning in Content-Based Image Retrieval systems (CBIR) and, according to them, we use deep descriptors extracted from the second-to-last layer of a CNN and apply max or average pooling.
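The aggregation step described above can be sketched as follows: given the activation map produced by the second-to-last convolutional layer, global average or max pooling collapses the spatial dimensions into a single descriptor. This numpy sketch only illustrates the pooling arithmetic; the actual maps come from a pre-trained CNN (e.g., MobileNetV2, whose final convolutional output is 7×7×1280 for a 224×224 input), and the random map below is a stand-in:

```python
import numpy as np

def global_average_pooling(fmap):
    """fmap: (H, W, C) activation map -> C-dim deep descriptor (mean per channel)."""
    return fmap.mean(axis=(0, 1))

def global_max_pooling(fmap):
    """fmap: (H, W, C) activation map -> C-dim deep descriptor (max per channel)."""
    return fmap.max(axis=(0, 1))

# Stand-in for a MobileNetV2 final conv output (7 x 7 spatial grid, 1280 channels)
fmap = np.random.rand(7, 7, 1280)
deep_feature = global_average_pooling(fmap)  # 1280-dim descriptor
```

Either pooling turns a three-dimensional activation volume into the mono-dimensional array required by cosine similarity, which is exactly the role it plays in the proposed system.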
Over the years, many frameworks for CBIR and information retrieval systems based on ontologies and multimedia features have appeared in the literature. The authors of Reference [35] proposed an approach to support multi-modal image retrieval based on the Bayes point machine to associate words and images. In Reference [36], the authors use latent semantic indexing together with both textual and visual features to extract the underlying semantic structure of a web page. The authors of Reference [37] propose an iterative similarity propagation approach to explore the inter-relationships between web images and their textual annotations for image retrieval. Reference [38] introduces a semantic combination technique to efficiently fuse text and image retrieval systems in multimedia information retrieval. In Reference [39], the authors describe an ontology model that integrates domain-specific features and processing algorithms focused on the domain specified by the user. YaSemIR [40] is a free and open-source semantic IR system based on Lucene, which uses conceptual labels to annotate documents and questions. In Reference [41], the authors discuss and present a study of different multimedia retrieval techniques based on ontologies in the semantic web. They compare these techniques to highlight the advantages of text, image, video and audio-based retrieval systems. The authors of Reference [42] propose a recommendation system for e-business applications. The recommendation strategy is based on a hybrid approach, combining intrinsic characteristics of objects, past behaviours of users in terms of usage patterns, and user interests expressed by ontologies to compute customized recommendations. In Reference [37], the authors present two methods to improve the performance of an image retrieval system. The first proposed method defines the most efficient GLCM texture configuration for the retrieval process.
It also increases the retrieval precision by combining the most efficient GLCM structure with DWT decomposition. The second proposal combines colour and texture characteristics to further improve retrieval. This method combines the HSV colour features with the most efficient GLCM texture features and with the combined GLCM and DWT texture features.
The authors of Reference [43] propose an efficient "bag-of-words" model that uses deep local descriptors from a convolutional neural network. The selection of high-quality descriptors provides a simple and effective way to choose the most discriminating local descriptors, which significantly improves retrieval accuracy. They evaluate different methods of pre-processing the descriptors and report RootCFM to be the best. The model uses a large visual codebook combined with an inverted index for efficient storage and fast retrieval. The authors of Reference [44] show a Semantic Event Retrieval System that includes high-level concepts and uses concept selection based on semantic embeddings. In Reference [45], the authors propose a new multimedia embedding for few-example event recognition and translation.
In this article, we propose a novel framework for a multimedia web document retrieval system combining semantic similarity measures, based on a formal and semantic multimedia knowledge base, with different image descriptors. In particular, deep descriptors have been computed using the best-performing aggregation functions, reducing their dimension and considerably improving the accuracy of the results.

The Proposed System
In this section, we present the architecture of the implemented system and describe its modules in detail. Figure 1 shows the system at a glance.
The system consists of two main subsystems. The first one is in charge of the creation and population of the database from a document collection; moreover, it extracts images and their descriptors and normalizes the text. The second one implements the retrieval process, using semantic measures and image retrieval techniques. The Multimedia Web Documents Repository Processor creates the document collection used in the retrieval tasks. Figure 2 shows an activity diagram of this subsystem. Web documents are processed to obtain a structured object, which is then stored in a NoSQL document-based database. In particular, the subsystem extracts image and text from the document, normalizing the text, while the images are processed to extract features. The results of these tasks are then merged into a single structured object if the document has at least one image and its text is composed of at least 25 terms. We remove documents without images because we want to evaluate a combination of textual and visual descriptors.
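The filtering rule above (at least one image and at least 25 terms) can be sketched as follows; the `doc` dictionary layout is a hypothetical simplification of the stored structured object:

```python
def keep_document(doc):
    """Keep a document only if it has at least one image and at least 25 terms."""
    has_image = len(doc.get("images", [])) >= 1
    long_enough = len(doc.get("text", "").split()) >= 25
    return has_image and long_enough

docs = [
    {"text": "word " * 30, "images": ["img1.jpg"]},  # kept
    {"text": "too short",  "images": ["img1.jpg"]},  # dropped: fewer than 25 terms
    {"text": "word " * 30, "images": []},            # dropped: no images
]
kept = [d for d in docs if keep_document(d)]  # only the first document survives
```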
Text normalization is a fundamental step in an information retrieval system because it allows us to obtain documents with only clean text, without stopwords and other kinds of "noise" such as misspelled terms [1]. Moreover, we use stemming and lemmatization tasks to reduce each word to its basic form. As shown in Figure 3, the Tags Removal module removes the HTML code from the text. The second module transforms the text into lowercase and removes special chars (i.e., all chars except numbers, letters, _ and '). Stopwords Removal processes the normalized text, removing words without any particular meaning (e.g., articles, adverbs, conjunctions). The Stemming-Lemmatization module analyzes the output of the previous blocks and performs text transformations. Stemming transforms words into their canonical form to achieve more effective matching. The problem in using stemming is that words with different meanings can be associated with the same root. To solve this issue, we use a more sophisticated algorithm which uses lemmas instead of roots. The use of lemmas is more effective because a lemma represents the canonical form of the word (e.g., for verbs it uses the infinitive form). The last step before document storing is performed by the Out-of-vocabulary Words Removal module. It removes all words that have no meaning in the English language (e.g., misspelled words).
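A minimal sketch of the normalization chain (tags removal, lowercasing, special-char removal, stopword removal) is shown below. The stopword list is a tiny illustrative sample, and the lemmatization and out-of-vocabulary steps are only indicated by a comment, since they require external resources (e.g., a lemmatizer and an English dictionary):

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}  # tiny sample set

def normalize(html_text):
    text = re.sub(r"<[^>]+>", " ", html_text)      # Tags Removal: strip HTML markup
    text = text.lower()                            # lowercase
    text = re.sub(r"[^a-z0-9_' ]+", " ", text)     # keep only letters, digits, _ and '
    tokens = [t for t in text.split() if t not in STOPWORDS]  # Stopwords Removal
    return tokens  # lemmatization and OOV-word removal would follow here

normalize("<p>The Cat AND the dog!</p>")  # -> ['cat', 'dog']
```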
The semantic similarity module is in charge of computing the similarity on the textual document information. In our work we use different kinds of similarity metrics, as described in the previous section. The semantic measures based on path are:

• Shortest Path-based Measure: this measure only takes into account len(c1, c2). It assumes that sim(c1, c2) depends on how close the two concepts are in the taxonomy, and that the similarity between two terms is proportional to the number of edges between them.
• Wu & Palmer's Measure: it introduces a scaled measure. This similarity measure considers the positions of the concepts c1 and c2 in the taxonomy relative to the position of their most specific common concept lso(c1, c2). It assumes that the similarity between two concepts is a function of both path length and depth.
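Using the document's notation, the two path-based measures above are commonly formulated as follows (standard formulations from the literature; the exact variants adopted by the system may differ):

```latex
\mathrm{sim}_{path}(c_1, c_2) = \frac{1}{1 + \mathrm{len}(c_1, c_2)},
\qquad
\mathrm{sim}_{WP}(c_1, c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{lso}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
```

Here len(c1, c2) is the number of edges on the shortest path between the two concepts and depth(c) is the distance of concept c from the taxonomy root.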
Semantic measures based on information content (IC) assume that each concept carries a certain amount of information in a knowledge base (e.g., WordNet [46]). Similarity measures are based on the information content of each concept. The measures are:
• Jiang's Measure: it uses both the amount of information needed to state the information shared between the two concepts and the information needed to fully describe these terms. The value is a semantic distance between two concepts; semantic similarity is the opposite of this semantic distance.
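As a toy illustration of Jiang's distance (not the system's WordNet-backed implementation), the sketch below computes IC values from hypothetical corpus probabilities and measures the distance between two concepts through their least common subsumer; the concept names and probabilities are invented for the example:

```python
import math

# Hypothetical probability of encountering each concept in a corpus
p = {"entity": 1.0, "animal": 0.2, "dog": 0.05, "cat": 0.04}
ic = {c: -math.log(pc) for c, pc in p.items()}  # information content IC(c) = -log p(c)

def jiang_distance(c1, c2, lcs):
    """Jiang-Conrath distance: information to describe both concepts minus
    twice the shared information, i.e. the IC of the least common subsumer."""
    return ic[c1] + ic[c2] - 2 * ic[lcs]

def jiang_similarity(c1, c2, lcs):
    """Similarity as the inverse of the distance (one common convention)."""
    return 1.0 / (1.0 + jiang_distance(c1, c2, lcs))

d = jiang_distance("dog", "cat", "animal")  # both subsumed by "animal"
```

Note that the distance shrinks as the subsumer becomes more specific (higher IC), matching the intuition that concepts sharing more information are more similar.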
The Image Extractor module fetches the images in each HTML document. The retrieved images are analyzed and filtered by size and format; in our case, we consider only JPEG images, but we can easily consider other formats. The same module extracts visual features from these images. In our framework, we consider three types of features: local, global and deep. We use ORB as a local feature due to its performance compared with other similar descriptors [26]. The same consideration is made regarding the global descriptor PHOG [47]. We consider four deep descriptors derived from CNNs (VGG16, ResNet50, InceptionV3, MobileNetV2) due to the novelty of this kind of features. For each CNN we have implemented a global average pooling and a global max pooling. In this work the CNNs are pre-trained on ImageNet. The Multimedia Web Document Composer module builds the documents to store in our NoSQL database (i.e., MongoDB). In particular, we use a JSON document consisting of the following fields: The Retrieval subsystem has been configured for three different search cases. The first case (Case A) is the text-only query. In Figure 5 the activity diagram is shown. The system normalizes the text of the query posed by the user and considers the domain to extract the concept from the knowledge base. It computes the semantic similarity between the concept and each term of the documents. In a previous step, we apply a word sense disambiguation algorithm, and for the assignment of the score to the single documents Equation (6) is used:

score(tq, d) = (1/N) ∑ i=1..N sim(tq, d_i), (6)
where tq is a concept, d is a document, N is the number of tokens in the document and d_i is a document term. The second case (Case B) is the visual query. In Figure 6 the activity diagram of this step is shown. The system extracts the feature from the image query and computes its similarity with respect to the document collection images. The similarity score between the query image and a multimedia document is obtained as the average of the cosine similarity between the query and each image contained in the multimedia document, as expressed in Equation (7); the documents are then ranked based on the obtained results:

score(vq, d) = (1/N) ∑ i=1..N cos(vq, d_i), (7)
where vq is the descriptor of the query image, d represents all the visual descriptors extracted from the images in the document, N is the number of images in the document and d_i is the visual descriptor of a single image. The third case (Case C) is the combined visual and text query; Figure 7 shows the activity diagram. In this case, the system performs a union of the two cases illustrated above, then sums the scores to combine the results.
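A minimal sketch of the Case B scoring of Equation (7) follows, assuming descriptors are plain numpy vectors; `visual_score` is our illustrative name for the scoring routine:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def visual_score(vq, doc_descriptors):
    """Equation (7): average cosine similarity between the query descriptor vq
    and the descriptors of the N images contained in a document."""
    return sum(cosine(vq, d_i) for d_i in doc_descriptors) / len(doc_descriptors)

vq = np.array([1.0, 0.0, 1.0])
doc = [np.array([1.0, 0.0, 1.0]),   # identical to the query: similarity 1.0
       np.array([0.0, 1.0, 0.0])]   # orthogonal to the query: similarity 0.0
score = visual_score(vq, doc)       # (1.0 + 0.0) / 2 = 0.5
```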
In this paper the knowledge base is implemented following the multimedia model proposed in References [48,49]. This formal representation uses signs as defined in Reference [11]. A concept can be represented in various multimedia forms. The model is a triple <S, P, C>, defined as: • S: the set of signs; • P: the set of properties used to relate signs to concepts; • C: the set of constraints on the set P. The knowledge base is an ontology logically represented by a semantic network (SN). We can see a SN as a graph where the nodes are the concepts and the arcs the semantic relations between them. The language chosen to describe this model is the DL version of the Web Ontology Language (OWL) [50], a standard language in the semantic web, and we use a NoSQL technology, in particular the Neo4j graph DB, to implement the SN. The hierarchies used to represent the objects of interest in our model are shown in Figure 8. The knowledge base has been populated using ImageNet [51]. ImageNet is based on WordNet, adding multimedia contents to a portion of it. The multimedia nodes added to the graph contain only the metadata, while the raw image data is stored in the document-based data structure previously described. The separation of image metadata from its raw content is performed to improve the global performance of our knowledge base, due to the difficulty graph databases have in storing and managing large documents. Therefore we design and implement a hybrid technology solution based on document-based and graph databases. We explicitly point out that this is a novelty of our work. Figure 9 shows a sketch of our knowledge base.
Word Sense Disambiguation (WSD) is a fundamental process in semantic similarity computation because it associates each lemma of the document with the correct concept (i.e., the right sense). Our implementation uses the Lesk algorithm [52]. The assumption at the base of this algorithm is that words have neighbourhoods which tend to share a topic. A simplified version of the algorithm has been proposed in Reference [53]. The Image Similarity Module works in two modes: in the case of features extracted with PHOG or with a CNN, we use the cosine distance, while in the case of ORB, we use the best matcher. This difference is due to the dimensionality of the descriptors. To use the cosine similarity with features extracted from the CNNs, we apply global max pooling or global average pooling to obtain a feature expressed as a mono-dimensional array [54]. In the case of ORB, we use the best match as suggested in Reference [55].
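The simplified Lesk variant [53] can be sketched as follows: the sense whose gloss shares the most words with the query context wins. The sense inventory below is a hypothetical stand-in for WordNet glosses:

```python
def simplified_lesk(word, context, senses):
    """senses: {sense_id: gloss}. Pick the sense whose gloss has the largest
    word overlap with the surrounding context (simplified Lesk)."""
    ctx = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(ctx & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Hypothetical glosses for the paper's running "mars" example
senses = {
    "mars.planet": "the fourth planet from the sun in the solar system",
    "mars.god": "the roman god of war",
}
simplified_lesk("mars", "a red planet orbiting the sun", senses)  # -> 'mars.planet'
```

A production implementation would also remove stopwords from the context and glosses, since function words inflate the overlap count.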
The main aim is to improve the performance of the system, ensuring the best accuracy and precision of the results. Several studies have shown that the use of different combined techniques can achieve better results, because one technique can compensate for the shortcomings of the others. Figure 10 shows the combining process. In this work, we use the sum as the combining function [56].
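A sketch of the sum-based combination (CombSUM-style [56]) is given below; the min-max normalization step is our assumption, added so that textual and visual scores are on comparable scales before summing:

```python
def minmax(scores):
    """Min-max normalize a {doc_id: score} map into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def combine(text_scores, visual_scores):
    """Sum the normalized textual and visual scores per document."""
    t, v = minmax(text_scores), minmax(visual_scores)
    return {d: t.get(d, 0.0) + v.get(d, 0.0) for d in set(t) | set(v)}

fused = combine({"d1": 0.9, "d2": 0.1, "d3": 0.5},
                {"d1": 0.8, "d2": 0.2, "d3": 0.3})
ranking = sorted(fused, key=fused.get, reverse=True)  # d1 first, d2 last
```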

Experimental Results
In this section, we present the used document collections, the testing strategy and the obtained results.
We used three datasets: 20 Newsgroups [57], PASCAL VOC2012 [58] and DMOZ [59]. Their specific characteristics in terms of document features, contents and size allow us to evaluate in depth all the components of our framework. In particular, we use the first dataset to evaluate and select retrieval metrics based on semantic similarity, the second one to evaluate and select the deep features in a CBIR context, and the last one to perform the final tests in which we combine the different strategies.
20 Newsgroups is one of the most used document collections in the literature. It is available in scikit-learn and can be managed with the Python language. This version allows us to get the dataset with headers, footers and quotes removed. The 20 Newsgroups collection contains twenty topics, as shown in Table 1. The document collection has been pre-processed by the normalization module. The query set used for the evaluation of the semantic similarity metrics is composed of ten queries; for each one, the set of relevant categories of 20 Newsgroups has been identified. For example, for the query "mars" as a planet the corresponding category is "sci.space".
PASCAL VOC2012 is an image database composed of 20 object classes, as shown in Table 2.
The query set used for the evaluation of the visual descriptors is composed of thirteen classes each consisting of five images. The classes have a direct correspondence with the classes pre-assigned in the dataset.
DMOZ has been one of the most popular and rich multilingual open-source web directories. The project, initially called ODP (Open Directory Project), was born in 1998. The purpose of DMOZ was to collect and index URLs to create a directory of hierarchically organized web contents. From the DMOZ dump, both images and text have been extracted as described in Section 3. Tables 3 and 4 report the statistics resulting from this process.
We use the Precision-Recall curve and Mean Average Precision (MAP) as evaluation metrics [1].
Precision is expressed as:

Precision = |{relevant} ∩ {retrieved}| / |{retrieved}|

It is the percentage of retrieved documents that are relevant. Recall is expressed as:

Recall = |{relevant} ∩ {retrieved}| / |{relevant}|

It is the percentage of relevant documents in the collection that are retrieved.
The Precision-Recall curve is obtained as an interpolation of the Precision values at 11 standard Recall values ranging from 0 to 1 with step 0.1. The interpolated value is estimated with the following criterion: the interpolated Precision at a Recall level r_j is the maximum Precision observed at any Recall level r ≥ r_j, that is, P_interp(r_j) = max_{r ≥ r_j} P(r). We use the Precision-Recall curve to compare different retrieval algorithms considering the whole set of retrieved documents. Moreover, one of the most used measures in the literature to evaluate web information retrieval performance is the Mean Average Precision over the top k results. We chose MAP because in a web search engine the user is generally interested in the first k results. Mean Average Precision is expressed as:

MAP = (1/|Q|) ∑ q∈Q AP(q)

where Q is the query set and AP(q) is the average of the Precision values obtained at the ranks of the relevant documents retrieved for query q. In this work, we consider MAP@10 because it is the most used in web document retrieval, considering the first 10 results.
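The evaluation metrics can be sketched as follows (standard definitions, not the authors' evaluation code); `runs` is a list of (ranked results, relevant set) pairs, one per query:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision_at_k(ranked, relevant, k=10):
    """AP@k: mean of the precision values at the ranks where relevant docs appear."""
    score, hits = 0.0, 0
    for i, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def mean_average_precision(runs, k=10):
    """MAP@k: average of AP@k over the query set."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

ranked = ["d1", "d2", "d3", "d4"]
ap = average_precision_at_k(ranked, {"d1", "d3"}, k=10)  # (1/1 + 2/3) / 2
```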
As previously stated, the measurement of semantic similarity has been performed considering: path similarity, Leacock-Chodorow similarity, Wu-Palmer similarity, Resnik similarity and Jiang-Conrath similarity. We use the Lesk algorithm as the disambiguation technique. The query set used consists of twelve text queries with polysemic meanings. An example of a polysemic query is "mars", which in WordNet has two meanings, "Mars" as a god and "Mars" as a planet.
In this context we are interested in text analysis, and the document collection used is 20 Newsgroups. Figure 11 shows the Precision-Recall curve, where we can observe that Jiang-Conrath similarity is the best measure, while in Figure 12 path similarity is the best measure with respect to MAP@10. According to the results shown in Figures 11 and 12, we chose path similarity and Resnik similarity; the first one is path-based, the second information content-based. We consider two measures of semantic textual similarity due to the very small difference obtained in the experiments. In this way we can investigate more deeply the use of a combination of these measures and multimedia descriptors.
The visual descriptors used are PHOG, ORB and deep descriptors; in particular, the deep descriptors are those extracted from VGG16, ResNet-50, Inception V3 and MobileNet V2, each with global average and global max pooling. The query set used consists of sixty-five images, divided into thirteen classes. The dataset used is PASCAL VOC2012. Figure 13 shows the Precision-Recall curve, where we can see that the descriptor extracted from ResNet-50 with global average pooling has the best result. Moreover, in Figure 14, MobileNet V2 with global average pooling is the best considering MAP@10.
The descriptor chosen according to Figures 13 and 14 is the one extracted with MobileNet V2 with global average pooling. In this case we consider only one descriptor due to the large difference in accuracy shown by the comparison of the results.
The evaluation strategy has been defined by different test cases performed on the DMOZ dataset, in order to have a complete analysis of our framework in a real scenario. Figures 15 and 16 report the results of all the experiments on the DMOZ collection. Considering the results, the best measure is the combination of path similarity with the deep descriptor. The experiments show that on a real and general web document collection such as DMOZ, path similarity is better than Resnik similarity. We argue that the previous results depend on the specific datasets used. Moreover, the combination with the deep descriptor improves the results in both cases.

Conclusions and Future Work
In this paper, we proposed a novel and complete framework for multimedia information retrieval using semantic retrieval techniques combined with content-based image descriptors. The paper presents several novelties. The use of a formal multimedia knowledge base allows us to obtain excellent results, improving the precision of the adopted techniques compared with the literature. The documents retrieved for the user are more relevant to the query due to the possibility of discriminating the different meanings of the words used in the query. In this context, the implementation of an automatic WSD task in the information retrieval process considerably improves the performance of the whole retrieval pipeline. We also presented a system based on a hybrid big data technology that integrates graph-based knowledge representation and a document-based approach. Moreover, different kinds of image descriptors have been added to our knowledge base to improve the representation of concepts. Extensive testing shows very promising results for the combination of textual semantic metrics and deep image descriptors. Finally, we implemented and tested a visual query with automatic concept extraction, which simplifies the query posing process while obtaining results comparable with the other test cases. We explicitly point out that the modularity of the proposed framework allows an easy extension of our system with other functionalities. Our future work will focus on the definition and implementation of a novel multimedia semantic measure considering both the textual and multimedia information stored in our knowledge base, and on different combination strategies. Moreover, we will investigate the automatic generation of stories starting from the retrieved relevant documents, to improve serendipity for the user, and the integration of specific knowledge domains [60,61]. In future work we will also investigate efficient techniques to increase the number of multimedia contents in ImageNet.
We think that this task could improve our results by providing a more comprehensive knowledge base. Moreover, we will design and implement a novel WSD method based on the analysis of the relationships among concepts in a document, using our knowledge base. In addition, the use of structured text such as HTML fields can enhance the efficiency and effectiveness of our retrieval methodology. We will improve the evaluation of our novel approach and of additional methods using a larger query set and a document collection with highly polysemic document categories. Our approach will be compared with other similar baselines proposed in the literature in order to prove the effectiveness of our framework.