Semantic Search Enhanced with Rating Scores

Abstract: This paper presents SemSim e, a method based on semantic similarity for searching over a set of digital resources previously annotated by means of concepts from a weighted reference ontology. SemSim e is an enhancement of SemSim and, with respect to the latter, it uses a frequency approach for weighting the ontology, and refines both the user request and the digital resources with the addition of rating scores. Such scores are High, Medium, and Low. In the user request, they indicate the preferences assigned by the user to each of the concepts representing the search criteria, whereas in the annotation of the digital resources they represent the levels of quality associated with each concept in describing the resources. SemSim e has been evaluated, and the results of the experiment show that it performs better than SemSim and an evolution of it, referred to as SemSim RV.


Introduction
The most significant improvement within the Semantic Web research area pertains to reasoning and searching abilities. In this perspective, semantic similarity reasoning, which relies on the knowledge coded in a reference ontology [1], is a different technique with respect to the well-known deductive reasoning used in expert systems. In [2], we proposed SemSim, a semantic search method based on a Weighted Reference Ontology (WRO). In SemSim, both the resources in the search space and the requests of users are represented by means of an Ontology Feature Vector (OFV), which is a set of concepts from the WRO. We distinguish the user request, also denoted as Request Vector (RV), from the description of a resource, also referred to as Annotation Vector, indicated by AV. In the search process, SemSim contrasts the RV against each AV, and the result is a ranking of the resources that exhibit the highest similarity degree with respect to the request defined by the user.
In [2], we analyzed two different approaches for weighting the reference ontology, namely the frequency-based and the uniform probabilistic approaches. In the experiment described in that paper, we showed that SemSim with the frequency-based approach outperforms SemSim with the uniform probabilistic approach, as well as the most representative similarity methods from the literature.
In this work, we present a new method, referred to as SemSim e. It relies on the frequency-based approach and revises SemSim along two directions. According to the first direction, in contrasting the RV with the AV, SemSim e takes into consideration the cardinality of the set of concepts (features) in the user request rather than the maximal cardinality of the compared OFV. This choice allows us to give more relevance to the features requested by the user than to the extra features contained in the annotation vectors available in the search space. Along the second direction, SemSim has been enhanced with the rating scores High (H), Medium (M), and Low (L) in the OFV, with regard to both the request and the search space resources. Within the request, rating scores denote the preferences given by the user to the concepts of the WRO used to specify the query whereas, within the annotation vectors, rating scores represent the levels of quality associated with the concepts when they describe the resources. Consider an example rooted in the tourism domain, where the user is searching for a vacation package by specifying the following features: InternationalHotel (H), LocalTransportation (M), CulturalActivity (H), and Entertainment (L). On the basis of the given rating scores, he/she gives a high preference to resorts which are international hotels offering cultural activities, and less priority to the remaining features, in particular to entertainment. Analogously, a holiday package annotated with HorseRiding (H), Museum (M), and ThaiMeal (L) is characterized by a high quality level with regard to the horse riding service, and by lower quality levels with regard to museum visits and Thai meals at lunch or dinner. Note that, in [3], a proposal concerning rating scores was given, where the concepts of the WRO are weighted according to the uniform probabilistic approach [4], rather than the frequency-based one.
Furthermore, in our approach we assumed that, given a facility (for instance, HorseRiding) included in a tourist package, the higher the user's priority for that facility, the higher the expectation about its quality and, therefore, the greater the willingness of the user to consider more expensive solutions.
In this paper, we have evaluated SemSim e in the domain of tourism and we have compared it to the SemSim method defined in [2] and a further evolution of SemSim, referred to as SemSim RV. Essentially, SemSim RV is the original SemSim method where, in line with the first direction adopted in SemSim e illustrated above, more priority has been given to the features indicated by the user in his/her request. The results of the experiment show that SemSim e outperforms both these methods.
The organization of the rest of the paper is the following: Section 2 is about the related work; Section 3 shows the SemSim e method and recalls SemSim; Section 4 illustrates the experimental results; Section 5 presents the conclusion and future activities.

Related Work
In the literature, several proposals regarding semantic similarity reasoning have been defined, as for instance [5][6][7][8][9]. In general, the matchmaking between vectors of concepts is computed on the basis of their intersection, as performed in the Dice and Jaccard methods [10], considering neither the information content of the concepts in the ontology nor the hierarchical relationships among them. In [11], the Weighted Sum method has been proposed, where the similarity of hierarchically related concepts is evaluated, although by using a fixed value (i.e., 0.5). In [12], the similarity between sets of ontology concepts is computed on the basis of the shortest path in a graph by applying an extended version of the Dijkstra algorithm [13]. Finally, other proposals, such as [14], consider the inverse document frequency (IDF) method, and combine it with the term frequency (TF) approach.
With respect to the mentioned papers, in this work the semantic matchmaking is performed in accordance with the information content approach defined by Lin in [15], which is an evolution of [16]. Lin's method results in a better correlation with human judgment when compared with other approaches, such as edge-counting [17][18][19]. Furthermore, concerning the evaluation of the similarity between vectors of concepts, we borrowed the Hungarian algorithm for solving the maximum weighted matching problem in bipartite graphs [20].
In [21], the representation of users' preferences relies on the definition of users' profiles, which are built according to an ontology-based model. The ontology has been exploited in order to establish relationships among user profiles and features. Then, the level of interest of the user is modeled by taking into account both features and historical data. Reference [22] copes with the evaluation of the similarity between users' profiles. In particular, it proposes a collaborative filtering method in order to identify similar users rating a given set of items, and to analyze the ratings associated with these users. In [23], the traditional feature-based measures for computing similarity are used, and the authors present an ontology-based approach for evaluating similarity between classes where attributes replace features. The underpinning of this approach is that the more attributes two classes have in common, the more similar they are.
Finally, SemSim e is a novel approach because, as opposed to the above-mentioned proposals, it introduces rating scores for enriching the semantic annotations of both user requests and resources.

The Semantic Similarity Method
The SemSim e method is an extension of SemSim proposed in [2,24], which is based on the information content approach applied to ontologies [15]. According to this approach, concepts of the ontology are associated with weights such that, along the hierarchy, as the weights of concepts decrease, their information content increases. A Weighted Reference Ontology (WRO) is a pair:
$$WRO = \langle Ont, w \rangle$$
and, in particular:
• $Ont = \langle C, H \rangle$, where C is a set of concepts, also referred to as features, and H is the set of pairs of concepts of C that are related according to the ISA hierarchy (specialization hierarchy);
• w is a function, referred to as weight, such that, given a concept c in the ontology, w(c) is a number in [0, 1].
In Figure 1, a WRO related to the tourism domain is shown, whose weights are defined by using the frequency-based approach.
Given a WRO and a concept c ∈ C, the information content ic(c) is defined as [25]:
$$ic(c) = -\log(w(c))$$
The Resource Space represents all the searchable and available digital resources. To define the semantic content of a given resource, a structure gathering a set of concepts from the WRO is associated with it. This structure will be referred to as Ontology Feature Vector (OFV) and is represented as follows:
$$ofv = (c_1, \ldots, c_n), \quad c_i \in C, \; i = 1, \ldots, n$$
Similarly, an OFV describes a user request. As mentioned in the Introduction, in the following, AV (Annotation Vector) and RV (Request Vector) will denote the semantics of a resource and a user request, respectively.
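As a small illustration of the information content introduced above, the following sketch assumes the standard definition ic(c) = -log(w(c)). The concept names and weights are hypothetical and are not the values of Figure 1; the function name `ic` and the dictionary `weights` are introduced here only for illustration.

```python
import math

# Hypothetical weights for a tiny ISA chain (NOT the values of Figure 1).
# More specific concepts receive smaller weights.
weights = {
    "Thing": 1.0,
    "Accommodation": 0.4,
    "InternationalHotel": 0.1,
}


def ic(concept, w):
    """Information content, assuming ic(c) = -log(w(c)) with w(c) in (0, 1]."""
    weight = w[concept]
    if not 0.0 < weight <= 1.0:
        raise ValueError("weight must lie in (0, 1] for ic to be defined")
    return -math.log(weight)


for name in weights:
    print(name, round(ic(name, weights), 3))
```

Under these assumptions the root carries no information (its weight is 1), and the information content grows as the weights decrease along the hierarchy.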
In order to search for the resources in the Resource Space, the SemSim method has been defined. This method evaluates the semantic similarity between an AV and an RV by using the semsim function, which relies on the consim function. The latter allows the evaluation of the similarity between two concepts, say $c_i$ and $c_j$, as defined below:
$$consim(c_i, c_j) = \frac{2 \cdot ic(lca(c_i, c_j))}{ic(c_i) + ic(c_j)}$$
where lca (lowest common ancestor) is the least abstract concept in the hierarchy subsuming $c_i$ and $c_j$. Given an instance of AV, say av, and an instance of RV, say rv, semsim computes consim for each pair of concepts belonging to the Cartesian product of av and rv. We then aim at identifying the set of pairs of concepts from av and rv that maximizes the sum of consim, according to the Hungarian algorithm for the maximum weighted matching problem in bipartite graphs [20]. In particular, given:
$$rv = \{c^r_1, \ldots, c^r_m\}, \quad av = \{c^a_1, \ldots, c^a_n\}$$
where $\{c^r_1, \ldots, c^r_m\} \cup \{c^a_1, \ldots, c^a_n\} \subseteq C$, let S be defined as the following Cartesian product:
$$S = rv \times av$$
and let $\mathcal{P}(rv, av)$ be defined as:
$$\mathcal{P}(rv, av) = \{P \subset S : \forall (c^r_i, c^a_j), (c^r_h, c^a_k) \in P, \; c^r_i \neq c^r_h, \; c^a_j \neq c^a_k, \; |P| = \min\{n, m\}\}$$
Thus, semsim(rv, av) is defined as follows:
$$semsim(rv, av) = \frac{\max_{P \in \mathcal{P}(rv, av)} \sum_{(c^r_i, c^a_j) \in P} consim(c^r_i, c^a_j)}{\max\{n, m\}} \qquad (2)$$
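The computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: consim is taken to be the Lin-style ratio 2·ic(lca(c_i, c_j)) / (ic(c_i) + ic(c_j)), the sum over the best pairing is normalized by max{n, m}, and the best pairing is found by brute force over permutations instead of the Hungarian algorithm (adequate only for short vectors). The tree `PARENT`, the weights in `WEIGHT`, and all function names are hypothetical.

```python
import math
from itertools import permutations

# Hypothetical ISA tree (child -> parent) and weights; NOT those of Figure 1.
PARENT = {
    "Thing": None,
    "Accommodation": "Thing",
    "Attraction": "Thing",
    "InternationalHotel": "Accommodation",
    "Hostel": "Accommodation",
    "Museum": "Attraction",
}
WEIGHT = {
    "Thing": 1.0,
    "Accommodation": 0.5,
    "Attraction": 0.5,
    "InternationalHotel": 0.2,
    "Hostel": 0.3,
    "Museum": 0.25,
}


def ic(c):
    """Information content, assuming ic(c) = -log(w(c))."""
    return -math.log(WEIGHT[c])


def ancestors(c):
    """Return c followed by its ancestors, up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def lca(ci, cj):
    """Lowest common ancestor: first ancestor of ci that also subsumes cj."""
    subsumers_of_cj = set(ancestors(cj))
    for c in ancestors(ci):
        if c in subsumers_of_cj:
            return c
    raise ValueError("concepts do not share a root")


def consim(ci, cj):
    """Concept similarity (assumed Lin-style ratio)."""
    if ci == cj:
        return 1.0
    denominator = ic(ci) + ic(cj)
    return 2 * ic(lca(ci, cj)) / denominator if denominator else 0.0


def semsim(rv, av):
    """Best one-to-one pairing of rv and av, normalized by max{n, m}."""
    if not rv or not av:
        raise ValueError("both vectors must be non-empty")
    # consim is symmetric, so fix the shorter vector and permute the longer.
    short, longer = (rv, av) if len(rv) <= len(av) else (av, rv)
    best = max(
        sum(consim(a, b) for a, b in zip(short, perm))
        for perm in permutations(longer, len(short))
    )
    return best / max(len(rv), len(av))


print(semsim(["InternationalHotel", "Museum"], ["Hostel", "Museum", "Attraction"]))
```

For realistic vector sizes, an assignment solver would replace the permutation loop, since the number of candidate pairings grows factorially.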

SemSim e
The SemSim e method, which is based on SemSim, has been conceived in order to manage a more refined representation of an OFV, where concepts are associated with scores. In SemSim e, an OFV is indicated as OFV e and, analogously to the previous section, the request and annotation vectors are indicated as RV e and AV e, respectively.
In our approach, the presence of the scores in the AV e and RV e represents a refined annotation and a refined request vector, respectively. In the case of the AV e, each score, denoted by $s^a$, stands for the quality level of the feature characterizing the resource, while in the case of the RV e, each score, denoted by $s^r$, indicates the priority degree assigned by the user to the requested feature. As mentioned above, these scores are H (High), M (Medium), or L (Low). Therefore, a given OFV e, say $ofv^e$, is represented as follows:
$$ofv^e = \{(c_1, s_1), \ldots, (c_n, s_n)\}, \quad c_i \in C, \; s_i \in \{H, M, L\}, \; i = 1, \ldots, n$$
The semsim e function evaluates the similarity between an AV e and an RV e, represented as $av^e$ and $rv^e$, respectively, by coupling their elements. The value calculated by semsim e is obtained by multiplying the consim value by the score matching value, namely sm, illustrated in Table 1, on the basis of both the $s^a$ and $s^r$ scores. Formally, given:
$$rv^e = \{(c^r_1, s^r_1), \ldots, (c^r_m, s^r_m)\}, \quad av^e = \{(c^a_1, s^a_1), \ldots, (c^a_n, s^a_n)\}$$
$semsim^e(rv^e, av^e)$ is defined as follows:
$$semsim^e(rv^e, av^e) = \frac{\sum_{(c^r_i, c^a_j) \in P_M} consim(c^r_i, c^a_j) \cdot sm(s^r_i, s^a_j)}{m} \qquad (3)$$
where $(c^r_i, s^r_i) \in rv^e$, $(c^a_j, s^a_j) \in av^e$, and $P_M \in \mathcal{P}(rv, av)$ is the set of pairs used to evaluate the semsim value between rv and av, without considering the rating scores of $rv^e$ and $av^e$.

As mentioned above, in the experimentation presented in the next section, we also address an evolution of SemSim, referred to as SemSim RV, where, in contrasting the RV with the AV, analogously to SemSim e, the number of features defined in the user request is considered rather than the maximal cardinality of the compared OFV. Formally, the $semsim_{RV}(rv, av)$ function is defined as semsim(rv, av) (see formula (2)), where max{n, m} is replaced by the cardinality m of the request vector, i.e.,
$$semsim_{RV}(rv, av) = \frac{\max_{P \in \mathcal{P}(rv, av)} \sum_{(c^r_i, c^a_j) \in P} consim(c^r_i, c^a_j)}{m} \qquad (4)$$
Note that max{n, m} in Equation (2) has been replaced in both Equations (3) and (4) by m in order to give relevance to the user request and, in particular, to the features searched by the user.
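The two variants described above can be sketched as follows, under explicit assumptions. The score matching function `sm` below is a placeholder (1.0 for equal scores, 0.8 for adjacent levels, 0.5 for H against L) and does not reproduce the values of Table 1. The concept similarity `toy_consim` is a stand-in for consim, and the best pairing is found by brute force rather than by the Hungarian algorithm. All names are hypothetical.

```python
from itertools import permutations

LEVEL = {"L": 0, "M": 1, "H": 2}


def sm(s_r, s_a):
    """Placeholder score matching value; NOT the values of Table 1."""
    return {0: 1.0, 1: 0.8, 2: 0.5}[abs(LEVEL[s_r] - LEVEL[s_a])]


def toy_consim(ci, cj):
    """Stand-in for consim: exact concept match only."""
    return 1.0 if ci == cj else 0.0


def best_matching(rv, av, consim):
    """Index pairs (i, j) of the one-to-one pairing maximizing the consim sum."""
    m, n = len(rv), len(av)
    if m <= n:
        candidates = (list(zip(range(m), p)) for p in permutations(range(n), m))
    else:
        candidates = (list(zip(p, range(n))) for p in permutations(range(m), n))
    return max(
        candidates,
        key=lambda pairs: sum(consim(rv[i], av[j]) for i, j in pairs),
    )


def semsim_rv(rv, av, consim):
    """Sum over the best pairing, normalized by the request cardinality m."""
    if not rv:
        raise ValueError("the request vector must be non-empty")
    pairs = best_matching(rv, av, consim)
    return sum(consim(rv[i], av[j]) for i, j in pairs) / len(rv)


def semsim_e(rv_e, av_e, consim):
    """Reuse the score-free best pairing, then weight each pair by sm."""
    if not rv_e:
        raise ValueError("the request vector must be non-empty")
    rv = [c for c, _ in rv_e]
    av = [c for c, _ in av_e]
    pairs = best_matching(rv, av, consim)
    total = sum(
        consim(rv[i], av[j]) * sm(rv_e[i][1], av_e[j][1]) for i, j in pairs
    )
    return total / len(rv_e)


rv_e = [("InternationalHotel", "H"), ("Flight", "M")]
av_e = [("InternationalHotel", "M"), ("Flight", "M"), ("Museum", "L")]
print(semsim_rv([c for c, _ in rv_e], [c for c, _ in av_e], toy_consim))
print(semsim_e(rv_e, av_e, toy_consim))
```

Because the placeholder sm never exceeds 1, the value returned by `semsim_e` cannot exceed the one returned by `semsim_rv` for the same vectors, which mirrors the behaviour discussed in the evaluation.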

SemSim e Evaluation
In the experimentation, we asked 15 colleagues at work to identify their preferred tourist packages (request vector, rv) by choosing 4 or (at most) 5 concepts of the ontology illustrated in Figure 1. In particular, assume the user desires to reside in an international hotel, in a location connected by flight and local transportation services, and to have the opportunity to enjoy cultural activities and entertainment. The related request vector is given below:
rv = (InternationalHotel, LocalTransportation, CulturalActivity, Entertainment, Flight)
By associating a score among H, M, and L with each concept in the rv, the user can indicate the level of his/her priority for that concept. Accordingly, the user can express his/her priority in the above rv by the following:
rv^e = ((InternationalHotel, H), (LocalTransportation, M), (CulturalActivity, H), (Entertainment, L), (Flight, H))
where InternationalHotel, CulturalActivity, and Flight received a higher preference with respect to LocalTransportation and Entertainment, the latter with the lowest score.
Then, we requested our colleagues to examine the 10 packages reported in Table 2, where each concept has been associated with a score among H, M, and L, specifying, similar to TripAdvisor, the quality level of the offered services. Our colleagues were also asked to choose the 5 packages, out of 10, most similar to their rv, and to assign a score to each chosen one representing the degree of similarity between the package and their rv (Human Judgment, HJ). The request vectors defined by the colleagues are shown in Table 3.

For the purpose of illustration, let us consider the request vector defined by the user U3. This user desires to get to his/her destination by flight, reside in an international hotel, commute by public transportation, and enjoy international meals and local attractions. In the request vector of U3, all the features have been refined by the score H, except Flight, which has been refined by M. Table 4 shows the experimental results related to the user U3. The first column contains the packages chosen by the user, namely P1, P3, P4, P5, and P8, with the related similarity values 0.95, 0.8, 0.5, 0.4, and 0.2, respectively. The other columns illustrate the 10 packages ranked according to SemSim, SemSim RV, and SemSim e, respectively, with the associated similarity values with U3. In Table 5, the correlations of SemSim, SemSim RV, and SemSim e with HJ are given. We observe that SemSim e has higher values in 73% of cases. Table 6 shows the precision and recall for all the users, where "-" stands for undefined because the set of retrieved resources is empty. Note that, with respect to SemSim, the precision of SemSim e increases in 5 cases and decreases in 1 case, while the recall increases in most of the cases. With respect to SemSim RV, the precision of SemSim e is the same, while the recall increases in 4 cases and is the same for the remaining cases.
Note that, due to the presence of the rating scores, the similarity values computed according to SemSim e never increase with respect to SemSim and SemSim RV. For this reason, the evaluation of the precision and the recall has been performed on the basis of two different thresholds: 0.5 with regard to HJ, SemSim, and SemSim RV, as indicated in Table 4, and 0.4 concerning SemSim e. This distinction allows us to balance the decrease of the similarity values obtained by SemSim e due to the presence of rating scores. In particular, considering U3 in Table 4, the precision remains invariant (i.e., 1) across SemSim, SemSim RV, and SemSim e, while the recall increases (0.33, 0.66, and 1.00, respectively). For SemSim e, both the precision and the recall are equal to 1 because it retrieves all and only the resources relevant to the user whereas, for instance, the recall of SemSim RV is equal to 0.66 because, among the relevant packages P1, P3, and P4, P1 is not retrieved.
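The threshold-based evaluation for U3 can be illustrated with a short sketch. The HJ values and the retrieved sets for SemSim RV and SemSim e are those stated in the text above; the similarity values of Table 4 are not reproduced, so the retrieved sets are passed in directly. The function names are hypothetical, and treating the threshold as inclusive (value ≥ threshold) is an assumption consistent with P4 (HJ = 0.5) being counted as relevant.

```python
def above(scores, threshold):
    """Resources whose score reaches the threshold (assumed inclusive)."""
    return {name for name, value in scores.items() if value >= threshold}


def precision_recall(retrieved, relevant):
    """Return (precision, recall); None marks an undefined value ("-")."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


# Human judgment of user U3, as reported in the text.
hj_u3 = {"P1": 0.95, "P3": 0.8, "P4": 0.5, "P5": 0.4, "P8": 0.2}
relevant = above(hj_u3, 0.5)  # P1, P3, P4

# Retrieved sets as described in the text for U3.
retrieved_semsim_rv = {"P3", "P4"}      # P1 is not retrieved
retrieved_semsim_e = {"P1", "P3", "P4"}  # all and only the relevant ones

print(precision_recall(retrieved_semsim_rv, relevant))
print(precision_recall(retrieved_semsim_e, relevant))
```

Returning None for an empty retrieved set corresponds to the "-" entries of Table 6.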
Overall, in selecting the resources most similar to users' requests, the experimental results show that SemSim e improves on the performance of both SemSim and SemSim RV. This improvement is achieved because, with respect to SemSim and SemSim RV, SemSim e relies on the additional knowledge carried by the preference and quality rating scores.

Conclusions and Future Work
In this work, a semantic similarity method, referred to as SemSim e, has been introduced. It relies on a previous proposal of the authors, namely SemSim. In particular, SemSim e has been endowed with the rating scores High (H), Medium (M), and Low (L) in the OFV for representing both the preferences of the user and the quality of the resources in the search space. An experiment with SemSim e has been performed in the tourism domain by comparing it with the original method SemSim, and a variant of it, referred to as SemSim RV. The experiment reveals that SemSim e outperforms both SemSim and SemSim RV with respect to precision, recall, and correlation.
Currently, we are planning to apply the proposed approach to a large dataset of annotated resources, with a large participation in the human judgment activity. In particular, we are setting up an experiment focusing on the Digital Library of the Association for Computing Machinery (ACM). In this context, the reference ontology is represented by the ACM Computing Classification System (ACM-CCS), and the semantic annotations of the resources (more than one thousand papers) are the sets of keywords selected by the authors according to the ACM-CCS.
As a future investigation, SemSim e will be applied to the automotive domain, in the framework of a collaboration with an Italian SME working in this sector, in order to improve the decision-making process on both the dealer and the customer sides. In this domain, the idea of giving more relevance to the offer than to the request will be investigated in order to make the customer aware of the new features offered by car companies.