Mathematics
  • Article
  • Open Access

19 September 2023

A Conceptual Graph-Based Method to Compute Information Content

1 Instituto Politécnico Nacional, Centro de Investigación en Computación, UPALM-Zacatenco, Ciudad de México 07320, Mexico
2 Instituto Politécnico Nacional, Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas, Ciudad de México 07340, Mexico
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Graph Theory and Applications

Abstract

This research uses conceptual distance computation to measure the information content of Wikipedia categories. The proposed metric, generality, relates information content to conceptual distance as the ratio of the information that a concept provides to others to the information that it receives. The DIS-C algorithm calculates generality values for each concept, considering each relationship's conceptual distance and distance weight. The findings of this study are compared with current methods in the field and found to be comparable to results obtained using the WordNet corpus. This method offers a new approach to measuring information content applicable to any relationship or topology in a conceptualization.

1. Introduction

The success of the information society and the World Wide Web has substantially increased the availability and quantity of information. The computational analysis of texts has aroused great interest in the scientific community, since it allows the adequate exploitation, management, classification, and retrieval of textual data. Significant contributions have improved comprehension in different areas using conceptual representations, such as semantic networks, hierarchies, and ontologies. These structures are models that conceptualize domains, considering concepts from these representations to define semantic relationships between them [1]. Semantic similarity assessment is a very timely topic related to the analysis of electronic corpora, documents, and textual information, providing novel approaches in recommender systems, information retrieval, and question-answering applications. According to psychological tests conducted by Goldstone (1994) [2], semantic similarity provides an underlying foundation by which human beings arrange and classify objects or entities.
Semantic similarity is a metric that states how close two words (representing objects from a conceptualization) are by exploring whether they share any feature of their meaning. For example, horse and donkey are similar in the context that they are mammals. Conversely, for examples such as boat and oar or hammer and screwdriver, their semantic relations do not directly depend on a higher concept of a semantic structure. Moreover, other relationships, such as meronymy, antonymy, functionality, and cause–effect, do not have a taxonomic definition but are part of the conceptualization. In the same way, arrhythmia and tachycardia are close because both diseases are related to the cardiovascular system. Additionally, concepts are not necessarily associated only through taxonomic relationships; for example, insulin assists in the treatment of diabetes. In this sense, assessing a semantic relationship, in both cases, consists of evaluating the semantic evidence presented in a knowledge source (an ontology or a domain corpus).
It is important to clarify that semantic similarity is not by itself a relationship, at least not in the same sense as meronymy or antonymy. In this context, similarity is a measure calculated between pairs of concepts, while the “other” relationships serve to “connect” and give meaning to the concepts. Within these relationships, some generate a hierarchical or taxonomic structure; for example, they can represent hyperonymy or hyponymy, which is the type of structure needed by many algorithms that compute semantic similarity. Thus, the DIS-C algorithm works with this type of relationship and those not generating taxonomies (for example, meronymy or antonymy).
The similarity measures assign a numerical score that quantifies this proximity based on the semantic evidence defined in one or more sources of knowledge [3]. These resources traditionally consist of more general taxonomies and ontologies, providing a formal and machine-readable way of expressing a shared conceptualization through integrated vocabulary and semantic linkings [4].
In particular, semantic similarity is suitable in tasks oriented to identify objects or entities that are conceptually near. According to the state of the art, this approach is appropriate in information systems [5]. Recently, semantic similarity has represented a pivotal issue in the technological advances concerning the semantic search field. In addition, the semantic similarity supplies a comparison of data infrastructures in different knowledge environments [6,7].
In the literature, semantic similarity is applicable in different fields of computer science, particularly novel applications focused on the information retrieval task to increase precision and recall [8,9,10,11]; to find matches between ontology concepts [12,13]; to assure or restore ontology alignment [14]; for question-answering systems [15]; for natural language processing tasks, such as tokenization, stopword removal, lemmatization, word sense disambiguation, and named entity recognition [16,17]; for recommender systems [18,19]; for data and feature mining [20,21,22]; for multimedia content search [23]; for semantic data and intelligent integration [24,25]; for ontology learning based on web-scraping techniques, where new definitions connected to existing concepts should be acquired from document resources [26]; for text clustering [27]; for the biomedical context [28,29,30,31]; and for geographic information and cognitive sciences [6,32,33,34,35]. In a pragmatic perception, semantic similarity helps us to comprehend human judgment, cognition, and understanding to categorize and classify various conceptualizations [36,37,38]. Thus, similitude is an essential theoretical foundation in semantic-processing tasks [39,40].
According to the evidence modeled on an ontology (taxonomy), the similarity measurements based on an ontological definition evaluate how concepts are similar by their meaning. So, the intensive mining of multiple ontologies produces further insights to enhance the approximation of similitude and addresses circumstances where concepts are not defined in a single ontology [9]. Based on the state of the art, various semantic similarity measurements are context-independent [41,42,43,44]; most of them were designed specifically for one problem and expressed on the basis of domain-specific or application-oriented formalisms [31]. Thus, a person who is not a specialist can only interpret the great diversity of avant-garde proposals as an extensive list of measures. Consequently, selecting an appropriate measurement for a specific usage context is a challenging task [1].
Thus, to compute semantic similarity automatically, we may consult different knowledge sources [45], such as domain ontologies like gene ontology [29,46], SNOMED CT [30,31,47], well-defined semantic networks like WordNet [48,49], and theme directories like the Open Directory Project [50] or Wikipedia [51].
Pirró (2009) [52] classified the approaches to assess similarity concerning the use of the information they manage. The literature proposed diverse techniques based on how an ontology determines similarity values. Nevertheless, Meng et al. (2013) [53] stated a classification for the semantic similarity measures: edge-counting techniques, information content approaches, feature-based methods, and hybrid measurements.
  • Edge-counting techniques [44] evaluate semantic similarity by computing the number of edges and nodes separating two concepts (nodes) within semantic representation structures. These techniques are preferably defined over taxonomic relationships in a semantic network.
  • Information content-based approaches assess the similitude by applying a probabilistic model. It takes as input the concepts of an ontology and employs an information content function to determine their similarity values in the ontology [41,54,55]. The literature bases the information content computation on the distribution of tagged concepts in the corpora. Obtaining information content from concepts consists of structured and formal methods based on knowledge discovery [31,56,57,58].
  • Feature-based methods assess similarity by combining conventional and non-conventional features through a weighted sum [19,59]. Thus, Sánchez et al. (2012) [4] designed a model of non-taxonomic and taxonomic relationships. Moreover, refs. [34,60] proposed using interpretations of concepts retrieved from a thesaurus. These methods improve on edge-counting techniques, since the evaluation considers semantic reinforcement. In contrast, they do not consider non-taxonomic properties, because these rarely appear in an ontology [61], and they demand fine tuning of the weighting variables to merge diverse semantic reinforcements [60]. Additionally, edge-counting techniques examine similarity via the shortest path, in number of taxonomic links, separating two concepts in an ontology [42,44,62,63].
  • Hybrid measurements integrate various data sets; in these methods, weights establish the portion that each data set contributes to the similarity values, so that the result is balanced [5,63,64,65].
In this work, we are interested in approaches based on information content to evaluate the similarity between concepts within semantic representations. In principle, the information content (IC) is computed from the presence of concepts in a corpus [41,43,54]. Some authors have proposed computing the IC from a knowledge structure modeled in an ontology in various ways [3,40,56]. These IC measurements rely on ontological knowledge, which is a drawback because they depend entirely on the coverage and detail of the input ontology [3]. With the appearance of social networks [66,67], diverse concepts or terms, such as proper names, brands, acronyms, and new words, are not contained in application and domain ontologies, so the information content of such terms cannot be computed from these knowledge resources. Domain ontologies also have the problem that their construction takes a long time and their maintenance requires much effort; computation methods based on domain ontologies inherit the same problem. An alternative is crowdsourced resources, such as Wikipedia [51], which are created and maintained by a user community: they are updated very dynamically while maintaining a set of good practices.
Additionally, this paper proposes a network model-based approach that uses an algorithm that iteratively evaluates how close two concepts are (i.e., their conceptual distance) based on the semantics that an ontology expresses. A metric defined as the generality of concepts is computed and mapped directly to the IC of these same concepts. Network-based models represent knowledge in different manners and semantic structures, such as ontologies, hierarchies, and semantic networks. Frequently, the topology of these models consists of concepts, properties, entities, or objects depicted as nodes, and relations defined by edges that connect the nodes and give causality to the structure. With this model, we used the DIS-C algorithm [68] to compute the conceptual distance between the concepts of an ontology, using the generality metric (it describes how visible a concept is to any other in the ontology), which is mapped directly to the IC. The generality assumes that a strongly connected graph characterizes the ontological structure. Our method establishes the graph topology for the relationships between nodes and edges. Subsequently, each relationship receives a weighting value, considering the proximity between nodes (concepts). At first, a domain specialist could assign the weighting values, or they could be established randomly.
The metric computation takes the inbound and outbound relationships of a concept. Thus, we perform an iterative adjustment to obtain an optimal change in the weighting values. In this way, the DIS-C algorithm evaluates the conceptual distances without any manual intervention, eliminating the subjectivity of human perception concerning weights proposed by subjects. The network model applicable to DIS-C supports any relationship (hierarchies, meronymies, and hyponymies). We applied the DIS-C algorithm and the GEONTO-MET method [69] to compute the similitude in the feature-based approach [5], which is one of the most common models to represent the knowledge domain.
The research paper is structured as follows: Section 2 comprises the state of the art for similarity measures, approaches to computing information content, and their computer science applications. Section 3 presents the methodology and foundations concerning the proposed algorithm. Section 4 shows and discusses the results of the experiments that characterize its performance. Section 5 presents the conclusions of our research.

3. Methods and Materials

This section describes the use of the DIS-C algorithm for computing information content based on the generality of concepts in the corpus.

3.1. The DIS-C Algorithm for Information Content Computation

This work defines conceptual distance as the space separating two concepts within a particular conceptualization described by an ontology. Another definition of conceptual distance addresses the dissimilarity in the information content supplied by two concepts, including their specific conceptions.
The main contribution is the suitability of the proposed method for any conceptualization, such as a taxonomy, semantic network, hierarchy, or ontology. Notably, the method establishes a distance value for each relationship (all types of relations in the conceptualization structure) and converts the conceptualization into a conceptual graph (a weighted, directed graph). Each node represents a concept, and each edge a relationship between a pair of concepts.
We apply diverse theoretical foundations from graph theory to treat the fundamental knowledge encoded within the ontological structure. Thus, once we generate the conceptual graph, a shortest-path computation yields the distance value between unrelated concepts.
The Wikipedia category structure is a very complex network. Compared to traditional taxonomy structures, Wikipedia is a graph in which the semantic similarity between concepts is evaluated using the DIS-C algorithm, integrating the information-theoretic approaches based on information content. The DIS-C algorithm thus computes the IC value of each concept (node) in the graph, and the process guarantees coverage of the whole search space.

3.2. Generality

According to Resnik (1999) [55], the information content of a concept c can be represented by the formula I(c) = −log p(c). Here, p(c) is the probability that c is associated with any other concept, determined by dividing the number of concepts that have c as an ancestor by the total number of concepts. This method is suitable when considering taxonomic structures, where concepts at the bottom of the hierarchy inherit information from their ancestors, including themselves. Therefore, the information content is proportional to the depth of the taxonomy.
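Resnik's formulation can be sketched in a few lines. The taxonomy, counts, and concept names below are hypothetical, chosen only to illustrate that the root concept, which subsumes everything, carries zero information content, while deeper concepts carry more:

```python
import math

def resnik_ic(descendant_counts, concept, total):
    """Resnik IC: I(c) = -log p(c), where p(c) is the fraction of concepts
    that have c as an ancestor (every concept counts itself)."""
    return math.log(total / descendant_counts[concept])

# Hypothetical 4-concept taxonomy: 'entity' subsumes everything.
counts = {"entity": 4, "animal": 2, "dog": 1, "rock": 1}

print(resnik_ic(counts, "entity", 4))  # 0.0: the root carries no information
print(resnik_ic(counts, "dog", 4))     # log(4) ≈ 1.386: a leaf is most informative
```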
Similarly to the Resnik approach, we propose the “generality” to describe the information content of a concept. However, our method deals with ontologies and taxonomies that can contain multifold types of relations (not only a “is-a” relationship type). Moreover, the “generality” analyzes the information content of the concepts allocated in the ontology, considering how related they are. Thus, the “generality” quantifies how a concept connects with the entire ontology.
In Figure 1a, we have a taxonomy where the concept x is very general, providing information to the concept y; x is located "above" in the conceptualization, so it only provides information to the concepts "below" it and does not obtain any information from them. On the other hand, y obtains information from x and all the concepts found on the path from x to y. Moreover, y does not provide information to any concept.
Figure 1. Taxonomy (it does not have a partition) vs. Ontology (it has partitions).
In Figure 1b, we have an ontology in which concept x not only provides information to concept y but also receives information from y and the rest of the concepts in the ontology. Suppose there is no relationship between x and some other concept of the ontology. In that case, little information is available to identify that concept or to denote whether it is very general or abstract. Thus, if a concept relates to only a few concepts, its conceptual distance to the others can be greater than the average, since the routes linking it with most of the rest will also be longer. In contrast, more detailed concepts are derived from more general concepts. Thus, let x be a general concept; the rest of the concepts will then be near x in their meanings. If x is the most general concept, the mean distance from the other concepts in the ontological representation to x will be the smallest. We conclude that the "generality" of concept x refers to the balanced proportion between the information content that x needs from other concepts for its meaning and the information content that x gives to the rest of the concepts in the ontology.
We propose that information provided by a concept x to all others is proportional to the average distance of x towards all the others. Similarly, information obtained by x from all other concepts is proportional to the average distance from all concepts to x. Thus, a first approximation to the definition of generality is shown in Equation (6):
$$ g(x) = \frac{\sum_{y \in C} \Delta_K(y,x) / |C|}{\sum_{y \in C} \Delta_K(x,y) / |C|} = \frac{\sum_{y \in C} \Delta_K(y,x)}{\sum_{y \in C} \Delta_K(x,y)}, $$
where Δ K ( x , y ) is the conceptual distance from concept x to concept y in the conceptualization K.
In the case of the taxonomy of Figure 1, the distance from any concept y to a more general concept x will be infinite, as no path connects y with x, so the generality of x is ∞; conversely, the generality of y will be 0. To avoid these singularities, we normalize the generality of x. Let K(C, ℜ, R) be a shared conceptualization of a domain (with C the set of concepts, ℜ the set of relation types, and R the set of relations), in which x, y ∈ C are concepts and Δ_K(x,y) refers to the conceptual distance from x to y. Then, for all x ∈ C, the generality g(x) is represented by Equation (7). Thus, the generality of x will be in the range [0, 1], where 0 is the maximum generality and 1 is the minimum. On the other hand, this form of generality can be read as the probability of finding a concept related to x. Then, following the proposal of Resnik (1995) [43], the IC is defined by Equation (8):
$$ g(x) = \frac{\sum_{y \in C} \Delta_K(y,x)}{\sum_{y \in C} \left[ \Delta_K(x,y) + \Delta_K(y,x) \right]} $$
$$ I_{\mathrm{DIS\text{-}C}}(x) = -\log g(x) $$
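Equations (7) and (8) can be sketched directly, assuming the pairwise conceptual distances are already available. The three concepts and distance values below are invented for illustration:

```python
import math

# Toy conceptualization: three concepts with asymmetric conceptual distances.
concepts = ["a", "b", "c"]
dist = {("a", "b"): 1.0, ("b", "a"): 3.0,
        ("a", "c"): 1.0, ("c", "a"): 3.0,
        ("b", "c"): 2.0, ("c", "b"): 2.0}

def generality(dist, x, concepts):
    """Normalized generality (Eq. (7)): incoming distance mass over the
    total incoming + outgoing distance mass for concept x."""
    incoming = sum(dist[(y, x)] for y in concepts if y != x)
    total = sum(dist[(x, y)] + dist[(y, x)] for y in concepts if y != x)
    return incoming / total

def information_content(dist, x, concepts):
    """Eq. (8): I_DIS-C(x) = -log g(x)."""
    return -math.log(generality(dist, x, concepts))

print(generality(dist, "a", concepts))           # (3+3)/(4+4) = 0.75
print(information_content(dist, "a", concepts))  # -log(0.75) ≈ 0.288
```

Note that, per the convention above, a value nearer 0 marks a more general concept, which then receives a larger information content through the negative logarithm.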
The generality computation requires knowing the conceptual distances between every pair of concepts. In [68], we presented the DIS-C algorithm to calculate such distances. The theory behind DIS-C is based on treating an ontology as a connected graph and computing the weight of each edge by applying the generality definition to each concept (node in the graph). We compute the generality to determine the conceptual distance, considering the semantics and the intention of the corpus developer when introducing the concepts and their definitions in the conceptual representation. In short, the nearest concepts are more significant in the conceptualization domain because they explain the corpus. Therefore, the generality of a concept gives information about the relationships in the conceptualization, and we use this approach to define the weighting of each edge.
Since the conceptual distance is calculated from the generality definition, we initially assume that all nodes (concepts) are equally generic, and that the topology of the conceptual representation suffices to capture the semantics and causality of the corpus. The in-degree and out-degree of each vertex are also used. The "generality" of each concept and its conceptual distance are computed as follows.
Letting K(C, ℜ, R) be a conceptualization as defined above, the directed graph G_K(V_G, A_G) is generated by converting each concept c ∈ C into a node of the graph G_K: V_G = C. Subsequently, for each relationship a ρ b ∈ R, where a, b ∈ C, the edge (a, b, ρ) is incorporated into A_G.
The next step is to iteratively create, from G_K, the weighted directed graph Γ_K^j(V_γ^j, A_γ^j). For this purpose, in the j-th iteration, we make V_γ^j = V_G, A_γ^j = ∅ and, for each edge (a, b, ρ) ∈ A_G, the edges (a, b, ω_ab^j) and (b, a, ω_ba^j) are incorporated into Γ_K^j, where ω_ab^j is the arithmetic-mean approximation of the conceptual distance from vertex a to vertex b at the j-th iteration. These values are computed by applying Equation (9):
$$ \omega_j(a,b) = p_w \left[ \omega_o(a)\, g_{j-1}(a) + \omega_i(b)\, g_{j-1}(b) \right] + (1 - p_w)\, \delta_{j-1}^{\rho}, $$
where p_w ∈ [0, 1] is a variable that specifies how much importance is given to new values versus old values; generally, p_w = 1/2. g_j(x) is the generality of the vertex x ∈ V_G at the j-th iteration (the value of g_j(x) is computed considering the graph Γ_K^j). We establish that, for all x ∈ V_G, g_0(x) = 1, i.e., the initial value of generality for all nodes is equal to 1. Additionally, δ_j^ρ and δ̄_j^ρ are the conceptual distance values of the relation type ρ linking a and b (forward and backward, respectively), which are updated at each iteration. In the beginning, these conceptual distances are 0, i.e., δ_0^ρ = 0 and δ̄_0^ρ = 0 for all ρ.
Thus, ω_i(x) is the cost of "entering" vertex x, given by the likelihood of not meeting an edge coming into vertex x, i.e., ω_i(x) = 1 − i(x)/(i(x) + o(x)). Moreover, ω_o(x) is the cost of "leaving" vertex x, determined by the likelihood of not meeting an edge leaving vertex x, i.e., ω_o(x) = 1 − o(x)/(i(x) + o(x)), where i(x) is the in-degree of vertex x, and o(x) is the out-degree of the same vertex x.
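The degree-based weights and Equation (9) translate directly into code. This is a sketch with the generality values, p_w, and the previous relation distance passed in explicitly (function and parameter names are ours, not from the paper):

```python
def w_in(i_deg, o_deg):
    """omega_i(x): likelihood of not meeting an edge entering x."""
    return 1 - i_deg / (i_deg + o_deg)

def w_out(i_deg, o_deg):
    """omega_o(x): likelihood of not meeting an edge leaving x."""
    return 1 - o_deg / (i_deg + o_deg)

def edge_weight(p_w, wo_a, g_a, wi_b, g_b, delta_prev):
    """Eq. (9): blend the topology term with the previous conceptual
    distance value delta_prev of the edge's relation type rho."""
    return p_w * (wo_a * g_a + wi_b * g_b) + (1 - p_w) * delta_prev

# A vertex with 2 incoming and 2 outgoing edges is equally easy to enter/leave:
print(w_in(2, 2), w_out(2, 2))                    # 0.5 0.5
# First iteration (g = 1, delta = 0, p_w = 1/2) between two such vertices:
print(edge_weight(0.5, 0.5, 1.0, 0.5, 1.0, 0.0))  # 0.5
```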
With the graph Γ_K^j, the values of generality for each vertex are computed in the j-th iteration using Equation (7), considering Δ_K(a,b) as the shortest path from a to b in the graph Γ_K^j.
Furthermore, a new conceptual distance value is computed for each relation type in the conceptualization. This value is the mean of the distances ω_j over the edges sharing the same relation type, obtained by applying Equation (10):
$$ \delta_j^{\rho} = \frac{\sum_{(a,b,\rho) \in \rho^*} \omega_{ab}^j}{|\rho^*|}, \qquad \bar{\delta}_j^{\rho} = \frac{\sum_{(a,b,\rho) \in \rho^*} \omega_{ba}^j}{|\rho^*|}, $$
where ρ* = {(a, b, ρ) ∈ A_G} is the set of edges that represents the relation type ρ.
The procedure starts with j = 1 and increases j by one until the condition of Equation (11) is satisfied, where ϵ_K is the convergence threshold, with the whole procedure depicted in Figure 2:
$$ \frac{\sum_{x \in V_\gamma^j} \left[ g_j(x) - g_{j-1}(x) \right]^2}{|V_\gamma^j|} \leq \epsilon_K $$
Figure 2. Flow diagram of the DIS-C algorithm.
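The whole iteration described by Equations (7), (9), (10), and (11) can be sketched as follows. This is a simplified reading of the flow in Figure 2, not the authors' implementation: it assumes the backward weight of Equation (9) mirrors the forward one with the roles of a and b exchanged, and it uses Floyd–Warshall for the shortest paths (practical for toy graphs only; the Wikipedia-scale graphs of Section 3.3 need a far more scalable strategy):

```python
import math
from collections import defaultdict

def shortest_paths(nodes, w):
    """All-pairs shortest paths (Floyd-Warshall) over the weighted digraph."""
    d = {(u, v): (0.0 if u == v else math.inf) for u in nodes for v in nodes}
    for (u, v), wt in w.items():
        d[(u, v)] = min(d[(u, v)], wt)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[(i, k)] + d[(k, j)] < d[(i, j)]:
                    d[(i, j)] = d[(i, k)] + d[(k, j)]
    return d

def dis_c(nodes, edges, p_w=0.5, eps=1e-6, max_iter=100):
    """edges: (a, b, rho) triples. Returns the generality g(x) per node."""
    i_deg, o_deg = defaultdict(int), defaultdict(int)
    for a, b, _ in edges:
        o_deg[a] += 1
        i_deg[b] += 1
    wi = {x: 1 - i_deg[x] / (i_deg[x] + o_deg[x]) for x in nodes}
    wo = {x: 1 - o_deg[x] / (i_deg[x] + o_deg[x]) for x in nodes}
    g = {x: 1.0 for x in nodes}              # g_0(x) = 1
    delta = defaultdict(float)               # delta_0 = 0 for every relation type
    for _ in range(max_iter):
        # Eq. (9): weight both directions of every edge.
        w = {}
        for a, b, rho in edges:
            w[(a, b)] = p_w * (wo[a] * g[a] + wi[b] * g[b]) + (1 - p_w) * delta[(rho, "f")]
            w[(b, a)] = p_w * (wo[b] * g[b] + wi[a] * g[a]) + (1 - p_w) * delta[(rho, "b")]
        d = shortest_paths(nodes, w)
        # Eq. (7): recompute generality from the shortest-path distances.
        g_new = {}
        for x in nodes:
            inc = sum(d[(y, x)] for y in nodes if y != x)
            tot = sum(d[(x, y)] + d[(y, x)] for y in nodes if y != x)
            g_new[x] = inc / tot if tot else 1.0
        # Eq. (10): per-relation-type mean of the new edge weights.
        grouped = defaultdict(list)
        for a, b, rho in edges:
            grouped[rho].append((w[(a, b)], w[(b, a)]))
        for rho, vals in grouped.items():
            delta[(rho, "f")] = sum(f for f, _ in vals) / len(vals)
            delta[(rho, "b")] = sum(bk for _, bk in vals) / len(vals)
        # Eq. (11): stop when the generality values stabilize.
        drift = sum((g_new[x] - g[x]) ** 2 for x in nodes) / len(nodes)
        g = g_new
        if drift <= eps:
            break
    return g

# Hypothetical 3-concept taxonomy: 'dog' and 'cat' are kinds of 'animal'.
g = dis_c(["animal", "dog", "cat"],
          [("dog", "animal", "is-a"), ("cat", "animal", "is-a")])
print(g)  # 'animal' gets generality 0 (maximum generality, per Eq. (7))
```

Because each edge is inserted in both directions, the weighted graph is strongly connected even when the input relations are not, matching the assumption stated for the generality metric.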

3.3. Corpus Used for the Testing: Wikipedia and WordNet

The categories and pages are taken as graph nodes to calculate the conceptual distance. The categories are structured hierarchically through category links, which we use as hypernymy or hyponymy relationships. Hyperlinks are employed as generic relationships; since these vary in their nature, there is no information on the specific semantics of each link.
The experiments to assess the proposed method employed the Wikipedia and WordNet corpora to obtain the information content values, constraining the test to particular categories. Concerning Wikipedia, its entire content is available in a specific format, which allows us to copy, modify, and redistribute it with few restrictions. Moreover, we used two downloaded dumps: the structure of categories, and the list of pages and the hyperlinks between them. These data are suitable for generating a database for each corpus graph feature. In the first case, there are 1,773,962 categories represented by nodes and 128,717,503 category links defining the edges. In the second case, 8,846,938 pages correspond to nodes, and 318,917,123 links define their edges.
In the case of WordNet, the synsets become nodes of the DIS-C graph, and every relationship is taken into account; each relation type is labeled so that the semantic value of each type of relationship can later be calculated, as defined in [68]. The information is accessed through a provided API and a dump file that contains the data. This corpus composes a graph of 155,287 nodes and 324,600 edges.

4. Results and Discussion

In [68], the results of testing the DIS-C algorithm using WordNet as a corpus are presented. Thus, we compare our results with other similarity measures in Table 1.
Indeed, diverse difficulties arose in obtaining the results; the largest was the corpus size, on the order of several million concepts and relationships. We faced this issue by loading all the information from Wikipedia into a MySQL database, which allowed structured access to the data through a robust search engine. The database schema occupies 75.1 GB, with the relations (category links and hyperlinks) occupying the most space at 62.7 GB. Regarding the equipment on which the tests were executed, we used a Mac Pro with an Intel Xeon @ 3.5 GHz processor and 16 GB of DDR3 RAM @ 1866 MHz.
On the other hand, Rubenstein and Goodenough (1965) [95] compiled a set of synonymy judgments composed of 65 pairs of nouns. The compilation gathered 51 judges, who assigned a score between 0 and 4 to each pair, indicating their semantic similarity. Afterward, Miller and Charles (1991) [75] performed the same test but used only 30 pairs of nouns selected from the previous set. The experiment split words into groups of high, medium, and low similarity values.
Jarmasz and Szpakowicz (2003) [96] replicated both tests and showed the outcomes of six similarity measures based on the WordNet corpus. The first one was the edge-counting approach, which serves as a baseline, considering that this measure is the easiest and most cognitive method. Hirst et al. (1998) [97] designed a method based on the length of the path and the values concerning its direction. The semantic relationships of WordNet defined these changes.
In the same context, Jiang and Conrath (1997) [41] developed a mixed method that improves the edge-counting approach with the node-based technique for computing the information content stated by Resnik (1995) [43]. Thus, Leacock and Chodorow (1998) [62] counted the length of the path in a set of nodes instead of relationships, with the length adjusted according to the taxonomy depth. Moreover, Lin (1998) [54] used the fundamental equation of information theory to calculate semantic similarity. Alternatively, Resnik (1995) [43] computed the information content by the subsumption of concepts in the taxonomy or hierarchy. Those similarity measures and the values obtained by our algorithm are presented in Table 1 (the symmetry property does not hold for conceptual distance, i.e., there exist a, b ∈ C such that Δ_K(a,b) ≠ Δ_K(b,a); as a result, we report the conceptual distance from term A to term B (DIS-C(to) column), from term B to term A (DIS-C(from) column), the average of these distances (DIS-C(avg) column), the minimum (DIS-C(min) column), and the maximum (DIS-C(max) column)).
As mentioned previously, in the Miller and Charles (1991) [75] study, the analysis consisted of asking 51 individuals to judge the similarity of 30 pairs of words on a scale from 0 to 4, in which 0 is not at all similar and 4 is very similar. Several authors have proposed different scales for evaluating similarity; in general, 0 means entirely different, and some positive value corresponds to identical concepts. However, the DIS-C method does not calculate similarity (directly); it computes distance. Hence, it takes values starting at 0 (for identical concepts) and has no upper bound, since it can take distance values as large as the corpus is vast. For this reason, Table 2 shows the correlation between the values obtained by different methods (including ours) and those obtained by Miller and Charles (1991) [75]. In other words, it shows the correlation coefficient between the human judgments of Miller and Charles (1991) [75] and the values attained by the other techniques, including our method and the best result reported by Jiang et al. (2017) [45]. According to the results, the proposed approach achieves the best correlation coefficient among the compared methods. These outcomes suggest that conceptual distances calculated by applying the DIS-C algorithm are highly consistent with human judgments.
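The correlation reported here is a plain Pearson coefficient between human scores and method outputs; since DIS-C produces distances rather than similarities, the distances must be negated (or the sign of the coefficient flipped) before comparing. A sketch using five pairs taken from Table 1 (asylum-madhouse, bird-cock, cemetery-woodland, chord-smile, coast-shore):

```python
def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Human similarity scores vs. DIS-C(avg) distances for the same five pairs.
human = [3.61, 3.05, 0.95, 0.13, 3.70]
disc_avg = [1.43, 0.48, 2.85, 3.31, 0.48]
# Distance anti-correlates with similarity, so negate before correlating.
r = pearson(human, [-d for d in disc_avg])
print(r)  # strongly positive: high similarity pairs have low distances
```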
Table 1. The similarity of pairs of nouns proposed by Miller and Charles (1991) [75].
| Word A | Word B | Miller and Charles (1991) [75] | WordNet Edges | Hirst et al. (1998) [97] | Jiang and Conrath (1997) [41] | Leacock and Chodorow (1998) [62] | Lin (1998) [54] | Resnik (1995) [43] | DIS-C(to) | DIS-C(from) | DIS-C(avg) | DIS-C(min) | DIS-C(max) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| asylum | madhouse | 3.61 | 29.00 | 4.00 | 0.66 | 2.77 | 0.98 | 11.28 | 1.22 | 1.64 | 1.43 | 1.22 | 1.64 |
| bird | cock | 3.05 | 29.00 | 6.00 | 0.16 | 2.77 | 0.69 | 5.98 | 0.63 | 0.33 | 0.48 | 0.33 | 0.63 |
| bird | crane | 2.97 | 27.00 | 5.00 | 0.14 | 2.08 | 0.66 | 5.98 | 1.51 | 1.35 | 1.43 | 1.35 | 1.51 |
| boy | lad | 3.76 | 29.00 | 5.00 | 0.23 | 2.77 | 0.82 | 7.77 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 |
| brother | monk | 2.82 | 29.00 | 4.00 | 0.29 | 2.77 | 0.90 | 10.49 | 0.33 | 0.63 | 0.48 | 0.33 | 0.63 |
| car | automobile | 3.92 | 30.00 | 16.00 | 1.00 | 3.47 | 1.00 | 6.34 | 1.26 | 0.59 | 0.92 | 0.59 | 1.26 |
| cemetery | woodland | 0.95 | 21.00 | 0.00 | 0.05 | 1.16 | 0.07 | 0.70 | 3.21 | 2.49 | 2.85 | 2.49 | 3.21 |
| chord | smile | 0.13 | 20.00 | 0.00 | 0.07 | 1.07 | 0.29 | 2.89 | 2.67 | 3.95 | 3.31 | 2.67 | 3.95 |
| coast | forest | 0.42 | 24.00 | 0.00 | 0.06 | 1.52 | 0.12 | 1.18 | 1.84 | 2.89 | 2.37 | 1.84 | 2.89 |
| coast | hill | 0.87 | 26.00 | 2.00 | 0.15 | 1.86 | 0.69 | 6.38 | 1.22 | 1.58 | 1.40 | 1.22 | 1.58 |
| coast | shore | 3.70 | 29.00 | 4.00 | 0.65 | 2.77 | 0.97 | 8.97 | 0.33 | 0.63 | 0.48 | 0.33 | 0.63 |
| crane | implement | 1.68 | 26.00 | 3.00 | 0.09 | 1.86 | 0.39 | 3.44 | 1.55 | 1.82 | 1.69 | 1.55 | 1.82 |
| food | fruit | 3.08 | 23.00 | 0.00 | 0.09 | 1.39 | 0.12 | 0.70 | 0.85 | 1.58 | 1.21 | 0.85 | 1.58 |
| food | rooster | 0.89 | 17.00 | 0.00 | 0.06 | 0.83 | 0.09 | 0.70 | 2.10 | 1.94 | 2.02 | 1.94 | 2.10 |
| forest | graveyard | 0.84 | 21.00 | 0.00 | 0.05 | 1.16 | 0.07 | 0.70 | 2.27 | 1.55 | 1.91 | 1.55 | 2.27 |
| furnace | stove | 3.11 | 23.00 | 5.00 | 0.06 | 1.39 | 0.24 | 2.43 | 1.26 | 0.62 | 0.94 | 0.62 | 1.26 |
| gem | jewel | 3.84 | 30.00 | 16.00 | 1.00 | 3.47 | 1.00 | 12.89 | 0.58 | 1.31 | 0.94 | 0.58 | 1.31 |
| glass | magician | 0.11 | 23.00 | 0.00 | 0.06 | 1.39 | 0.12 | 1.18 | 2.08 | 2.58 | 2.33 | 2.08 | 2.58 |
| journey | car | 1.16 | 17.00 | 0.00 | 0.08 | 0.83 | 0.00 | 0.00 | 1.24 | 1.59 | 1.42 | 1.24 | 1.59 |
| journey | voyage | 3.84 | 29.00 | 4.00 | 0.17 | 2.77 | 0.70 | 6.06 | 0.26 | 0.68 | 0.47 | 0.26 | 0.68 |
| lad | brother | 1.66 | 26.00 | 3.00 | 0.07 | 1.86 | 0.27 | 2.46 | 1.55 | 2.16 | 1.85 | 1.55 | 2.16 |
| lad | wizard | 0.42 | 26.00 | 3.00 | 0.07 | 1.86 | 0.27 | 2.46 | 1.55 | 2.23 | 1.89 | 1.55 | 2.23 |
| magician | wizard | 3.50 | 30.00 | 16.00 | 1.00 | 3.47 | 1.00 | 9.71 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 |
| midday | noon | 3.42 | 30.00 | 16.00 | 1.00 | 3.47 | 1.00 | 10.58 | 0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
| monk | oracle | 1.10 | 23.00 | 0.00 | 0.06 | 1.39 | 0.23 | 2.46 | 2.78 | 2.49 | 2.63 | 2.49 | 2.78 |
| monk | slave | 0.55 | 26.00 | 3.00 | 0.06 | 1.86 | 0.25 | 2.46 | 1.90 | 1.47 | 1.69 | 1.47 | 1.90 |
| noon | string | 0.08 | 19.00 | 0.00 | 0.05 | 0.98 | 0.00 | 0.00 | 2.49 | 2.86 | 2.68 | 2.49 | 2.86 |
| rooster | voyage | 0.08 | 11.00 | 0.00 | 0.04 | 0.47 | 0.00 | 0.00 | 2.53 | 3.10 | 2.81 | 2.53 | 3.10 |
| shore | woodland | 0.63 | 25.00 | 2.00 | 0.06 | 1.67 | 0.12 | 1.18 | 1.92 | 1.92 | 1.92 | 1.92 | 1.92 |
| tool | implement | 2.95 | 29.00 | 4.00 | 0.55 | 2.77 | 0.94 | 6.00 | 0.68 | 0.26 | 0.47 | 0.26 | 0.68 |
Table 2. Correlation between the human judgments and the similarity approaches using the WordNet corpus.
Nevertheless, we validated the computation of the conceptual distance in a larger corpus, Wikipedia, by using the 30 pairs of Wikipedia categories proposed by Jiang et al. (2017) [45]. The pairs are presented in Table 3, including the similarity score given by a group of people; the results of the conceptual distance calculation are also depicted there, represented asymmetrically as in Table 1, and shown in Figure 3. It is worth mentioning that Jiang et al. (2017) [45] did not report the corresponding similarity values for each pair; their study presented different methods to calculate the similarity and the correlation of each result set with the reference set (human scores). Table 4 also presents the generality and information content values for the categories proposed in the same work; in this sense, there are no reported results to compare with ours.
Table 3. Set of 30 concepts (Wikipedia categories) presented in [45] with averaged similarity scores.
Figure 3. Similarity scores obtained with the DIS-C for the set of 30 categories presented in [45].
Table 4. The information content and generality for concepts presented in [45].
On the other hand, Jiang et al. (2017) [45] carried out a study similar to that of [75], but using 30 pairs of Wikipedia categories. Table 3 contrasts their results with those of our algorithm. In addition to Table 2, Table 5 reports the correlation values between the human judgments of [45] and the distance values yielded by DIS-C.
Table 5. The correlation between the human judgments and the similarity approaches using the Wikipedia corpus.
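The correlation-based evaluation reported in Tables 2 and 5 can be reproduced in miniature. The sketch below computes the Spearman rank correlation between human similarity judgments and a distance measure, using five word pairs and the human scores and mean distances from the table above (forest-graveyard, furnace-stove, gem-jewel, glass-magician, journey-voyage); since a distance is inversely related to similarity, a good measure shows a strong negative rank correlation.

```python
def ranks(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

human = [0.84, 3.11, 3.84, 0.11, 3.84]  # human similarity judgments
dist = [1.91, 0.94, 0.94, 2.33, 0.47]   # conceptual distances (mean column)
print(round(spearman(human, dist), 2))  # -0.92
```

The strongly negative rho on this subset illustrates the agreement between the conceptual distance and the human judgments that the full tables quantify over all pairs.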

5. Conclusions

This paper presents the definition and application of conceptual distance computation for determining the information content of Wikipedia categories. The proposed notion of generality relates information content to conceptual distance, and is essential for computing the latter. In the literature, information content has been used to calculate semantic similarity; however, as we argued in our previous research, it is important to recall that semantic similarity is not the same as conceptual distance.
We introduced a novel metric called generality, defined as the ratio between the information a concept provides to other concepts and the information it receives from them. The proposed DIS-C algorithm computes the generality value of each concept, taking into account the conceptual distance between any pair of concepts and the weight associated with each type of relationship in the conceptualization.
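To make the definition concrete, the following is a minimal toy sketch of the generality idea: the ratio between the information a concept provides to the rest of the graph and the information it receives. Here "information provided/received" is approximated by inverse shortest-path distances over a small hand-built directed graph; the edge weights are illustrative values only (chosen so that the hierarchy's root comes out as the most general concept), whereas the actual DIS-C algorithm derives distance weights per relationship type iteratively.

```python
import math

INF = math.inf

def shortest_paths(nodes, edges):
    """All-pairs shortest paths via Floyd-Warshall; edges: {(u, v): weight}."""
    d = {(u, v): (0.0 if u == v else edges.get((u, v), INF))
         for u in nodes for v in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i, k] + d[k, j] < d[i, j]:
                    d[i, j] = d[i, k] + d[k, j]
    return d

def generality(nodes, edges, c):
    """Ratio of information c provides to others vs. information it receives,
    with 'information' approximated as inverse conceptual distance."""
    d = shortest_paths(nodes, edges)
    provided = sum(1.0 / d[c, v] for v in nodes if v != c and d[c, v] < INF)
    received = sum(1.0 / d[v, c] for v in nodes if v != c and d[v, c] < INF)
    return provided / received if received else INF

# Toy conceptualization: entity <- animal <- dog, with illustrative weights.
nodes = ["entity", "animal", "dog"]
edges = {("entity", "animal"): 1.0, ("animal", "dog"): 1.0,   # toward specializations
         ("animal", "entity"): 2.0, ("dog", "animal"): 2.0}   # toward generalizations
print(generality(nodes, edges, "entity") > generality(nodes, edges, "dog"))  # True
```

With these toy weights the root concept "entity" provides more information than it receives (generality 2.0) while the leaf "dog" shows the opposite ratio (0.5), matching the intuition that generality orders concepts from specific to general.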
The results presented in this research work were compared against other state-of-the-art methods. First, the set of words presented by Miller (1995) [37] served as a baseline for comparing and calibrating our proposed method. Then, using Wikipedia as a corpus, the results were satisfactory and very similar to those obtained with the WordNet corpus. In addition, we used the 30 concepts (Wikipedia categories) presented by Jiang et al. (2017) [45] to evaluate our results against those reported in their work, which has in turn been compared with others in the literature. These results were also satisfactory, as can be appreciated in the corresponding tables in Section 4.
On the other hand, previous studies do not report the information content value of each concept or category in these sets. We therefore present these results as a novel contribution; since there is no prior evidence in the literature to compare against, this particular aspect cannot be evaluated against earlier work.
It is worth mentioning that, to obtain these results, we used algorithms to extract the most relevant subgraphs of the huge graph generated from all Wikipedia categories and pages; together these comprise more than 10 million entities, which makes analyzing the full graph infeasible. Therefore, future work is oriented toward analyzing these subgraph extraction algorithms and their repercussions on the computation of conceptual distance and information content. Moreover, we want to address one of the most challenging problems of the DIS-C algorithm: its computational complexity of O(n³). For this reason, we will study alternatives for accelerating the DIS-C method based on optimized and approximation algorithms. According to our experiments, the bottleneck can be avoided using the proposed 2-hop covers, bringing DIS-C close to linear time. Our efforts will also center on providing new visualization schemes for these graphs, considering GPU-based programming and augmented reality approaches to improve information visualization.
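The 2-hop/landmark idea mentioned above can be sketched as follows: instead of the O(n³) all-pairs computation, precompute distances from every node to a few hub nodes and answer any query as the best route through a hub. This toy sketch (with a hypothetical four-node graph) yields an upper bound on the true distance, which is exact whenever every shortest path crosses a chosen hub; the hub-selection and pruning strategies of a full 2-hop cover are omitted here.

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest paths; adj: {u: [(v, w), ...]}."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def build_hub_labels(adj, radj, hubs):
    """For each hub h: distances h -> v (forward graph) and v -> h (reversed graph)."""
    return {h: (dijkstra(adj, h), dijkstra(radj, h)) for h in hubs}

def query(labels, u, v):
    """Distance estimate for u -> v as the cheapest route u -> hub -> v."""
    best = float("inf")
    for out_d, in_d in labels.values():
        if u in in_d and v in out_d:
            best = min(best, in_d[u] + out_d[v])
    return best

# Toy chain a -> b -> c -> d, with its reversed adjacency for inbound distances.
adj = {"a": [("b", 1.0)], "b": [("c", 1.0)], "c": [("d", 1.0)]}
radj = {"b": [("a", 1.0)], "c": [("b", 1.0)], "d": [("c", 1.0)]}
labels = build_hub_labels(adj, radj, hubs=["b", "c"])
print(query(labels, "a", "d"))  # 3.0 (a -> b -> c -> d through hub b or c)
```

Each query now costs time proportional to the number of hubs rather than a full graph traversal, which is the source of the near-linear behavior observed in the experiments.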

Author Contributions

Conceptualization, R.Q. and M.T.-R.; methodology, R.Q.; software, R.Q. and C.G.S.-M.; validation, R.Q. and M.T.-R.; formal analysis, R.Q. and M.T.-R.; investigation, R.Q.; resources, F.M.-R. and M.S.-P.; data curation, M.S.-P. and F.M.-R.; writing—original draft preparation, R.Q.; writing—review and editing, M.T.-R.; visualization, C.G.S.-M.; supervision, M.T.-R.; project administration, M.S.-P.; funding acquisition, F.M.-R. All authors have read and agreed to the published version of the manuscript.

Funding

Work partially sponsored by Instituto Politécnico Nacional under grants 20231372 and 20230454. It is also sponsored by Consejo Nacional de Humanidades, Ciencias y Tecnologías (CONAHCYT) under grant 7051 and by Secretaría de Educación, Ciencia, Tecnología e Innovación (SECTEI).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The databases used in this paper are available at https://dumps.wikimedia.org/backup-index.html accessed on 23 March 2023, and https://wordnet.princeton.edu/download accessed on 12 April 2023.

Acknowledgments

We are thankful to the reviewers for their invaluable and constructive feedback that helped improve the quality of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Harispe, S.; Sánchez, D.; Ranwez, S.; Janaqi, S.; Montmain, J. A framework for unifying ontology-based semantic similarity measures: A study in the biomedical domain. J. Biomed. Inform. 2014, 48, 38–53. [Google Scholar] [CrossRef] [PubMed]
  2. Goldstone, R.L. Similarity, interactive activation, and mapping. J. Exp. Psychol. Learn. Mem. Cogn. 1994, 20, 3. [Google Scholar] [CrossRef]
  3. Sánchez, D.; Batet, M. A semantic similarity method based on information content exploiting multiple ontologies. Expert Syst. Appl. 2013, 40, 1393–1399. [Google Scholar] [CrossRef]
  4. Sánchez, D.; Solé-Ribalta, A.; Batet, M.; Serratosa, F. Enabling semantic similarity estimation across multiple ontologies: An evaluation in the biomedical domain. J. Biomed. Inform. 2012, 45, 141–155. [Google Scholar] [CrossRef] [PubMed]
  5. Rodríguez, M.; Egenhofer, M. Comparing geospatial entity classes: An asymmetric and context-dependent similarity measure. Int. J. Geogr. Inf. Sci. 2004, 18, 229–256. [Google Scholar] [CrossRef]
  6. Schwering, A.; Raubal, M. Measuring semantic similarity between geospatial conceptual regions. In GeoSpatial Semantics; Springer: Berlin/Heidelberg, Germany, 2005; pp. 90–106. [Google Scholar]
  7. Wang, H.; Wang, W.; Yang, J.; Yu, P.S. Clustering by pattern similarity in large data sets. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, WI, USA, 4–6 June 2002; ACM: New York, NY, USA, 2002; pp. 394–405. [Google Scholar]
  8. Al-Mubaid, H.; Nguyen, H. A cluster-based approach for semantic similarity in the biomedical domain. In Proceedings of the Engineering in Medicine and Biology Society, 2006, EMBS’06, 28th Annual International Conference of the IEEE, New York, NY, USA, 30 August–3 September 2006; pp. 2713–2717. [Google Scholar]
  9. Al-Mubaid, H.; Nguyen, H. Measuring semantic similarity between biomedical concepts within multiple ontologies. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2009, 39, 389–398. [Google Scholar] [CrossRef]
  10. Budanitsky, A.; Hirst, G. Evaluating WordNet-Based Measures of Semantic Distance. Comput. Linguist. 2006, 32, 13–47. [Google Scholar]
  11. Hliaoutakis, A.; Varelas, G.; Voutsakis, E.; Petrakis, E.G.; Milios, E. Information retrieval by semantic similarity. Int. J. Semant. Web Inf. Syst. (IJSWIS) 2006, 2, 55–73. [Google Scholar] [CrossRef]
  12. Kumar, S.; Baliyan, N.; Sukalikar, S. Ontology Cohesion and Coupling Metrics. Int. J. Semant. Web Inf. Syst. (IJSWIS) 2017, 13, 1–26. [Google Scholar] [CrossRef]
  13. Pirrò, G.; Ruffolo, M.; Talia, D. SECCO: On building semantic links in Peer-to-Peer networks. In Journal on Data Semantics XII; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–36. [Google Scholar]
  14. Meilicke, C.; Stuckenschmidt, H.; Tamilin, A. Repairing ontology mappings. In Proceedings of the AAAI, Vancouver, BC, Canada, 22–26 July 2007; Volume 3, p. 6. [Google Scholar]
  15. Tapeh, A.G.; Rahgozar, M. A knowledge-based question answering system for B2C eCommerce. Knowl.-Based Syst. 2008, 21, 946–950. [Google Scholar] [CrossRef]
  16. Patwardhan, S.; Banerjee, S.; Pedersen, T. Using measures of semantic relatedness for word sense disambiguation. In Computational Linguistics and Intelligent Text Processing; Springer: Berlin/Heidelberg, Germany, 2003; pp. 241–257. [Google Scholar]
  17. Sinha, R.; Mihalcea, R. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In Proceedings of the International Conference on Semantic Computing (ICSC 2007), Irvine, CA, USA, 17–19 September 2007; pp. 363–369. [Google Scholar]
  18. Blanco-Fernández, Y.; Pazos-Arias, J.J.; Gil-Solla, A.; Ramos-Cabrer, M.; López-Nores, M.; García-Duque, J.; Fernández-Vilas, A.; Díaz-Redondo, R.P.; Bermejo-Muñoz, J. A flexible semantic inference methodology to reason about user preferences in knowledge-based recommender systems. Knowl.-Based Syst. 2008, 21, 305–320. [Google Scholar] [CrossRef]
  19. Likavec, S.; Osborne, F.; Cena, F. Property-based semantic similarity and relatedness for improving recommendation accuracy and diversity. Int. J. Semant. Web Inf. Syst. (IJSWIS) 2015, 11, 1–40. [Google Scholar] [CrossRef]
  20. Atkinson, J.; Ferreira, A.; Aravena, E. Discovering implicit intention-level knowledge from natural-language texts. Knowl.-Based Syst. 2009, 22, 502–508. [Google Scholar] [CrossRef]
  21. Sánchez, D.; Isern, D. Automatic extraction of acronym definitions from the Web. Appl. Intell. 2011, 34, 311–327. [Google Scholar] [CrossRef]
  22. Stevenson, M.; Greenwood, M.A. A semantic approach to IE pattern induction. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, MI, USA, 25–30 June 2005; pp. 379–386. [Google Scholar]
  23. Rissland, E.L. AI and similarity. IEEE Intell. Syst. 2006, 21, 39–49. [Google Scholar] [CrossRef]
  24. Fonseca, F. Ontology-Based Geospatial Data Integration. In Encyclopedia of GIS; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; pp. 812–815. [Google Scholar]
  25. Kastrati, Z.; Imran, A.S.; Yildirim-Yayilgan, S. SEMCON: A semantic and contextual objective metric for enriching domain ontology concepts. Int. J. Semant. Web Inf. Syst. (IJSWIS) 2016, 12, 1–24. [Google Scholar] [CrossRef]
  26. Sánchez, D. A methodology to learn ontological attributes from the Web. Data Knowl. Eng. 2010, 69, 573–597. [Google Scholar] [CrossRef]
  27. Song, W.; Li, C.H.; Park, S.C. Genetic algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures. Expert Syst. Appl. 2009, 36, 9095–9104. [Google Scholar] [CrossRef]
  28. Batet, M.; Sánchez, D.; Valls, A. An ontology-based measure to compute semantic similarity in biomedicine. J. Biomed. Inform. 2011, 44, 118–125. [Google Scholar] [CrossRef]
  29. Couto, F.M.; Silva, M.J.; Coutinho, P.M. Measuring semantic similarity between Gene Ontology terms. Data Knowl. Eng. 2007, 61, 137–152. [Google Scholar] [CrossRef]
  30. Pedersen, T.; Pakhomov, S.V.; Patwardhan, S.; Chute, C.G. Measures of semantic similarity and relatedness in the biomedical domain. J. Biomed. Inform. 2007, 40, 288–299. [Google Scholar] [CrossRef] [PubMed]
  31. Sánchez, D.; Batet, M. Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective. J. Biomed. Inform. 2011, 44, 749–759. [Google Scholar] [CrossRef] [PubMed]
  32. Moreno, M. Similitud Semantica Entre Sistemas de Objetos Geograficos Aplicada a la Generalizacion de Datos Geo-Espaciales. Ph.D. Thesis, Instituto Politécnico Nacional, Ciudad de México, Mexico, 2007. [Google Scholar]
  33. Nedas, K.; Egenhofer, M. Spatial-Scene Similarity Queries. Trans. GIS 2008, 12, 661–681. [Google Scholar] [CrossRef]
  34. Rodríguez, M.A.; Egenhofer, M.J. Determining semantic similarity among entity classes from different ontologies. Knowl. Data Eng. IEEE Trans. 2003, 15, 442–456. [Google Scholar] [CrossRef]
  35. Sheeren, D.; Mustière, S.; Zucker, J.D. A data mining approach for assessing consistency between multiple representations in spatial databases. Int. J. Geogr. Inf. Sci. 2009, 23, 961–992. [Google Scholar] [CrossRef]
  36. Goldstone, R.L.; Medin, D.L.; Halberstadt, J. Similarity in context. Mem. Cogn. 1997, 25, 237–255. [Google Scholar] [CrossRef]
  37. Miller, G.A. WordNet: A lexical database for English. Commun. ACM 1995, 38, 39–41. [Google Scholar] [CrossRef]
  38. Tversky, A.; Gati, I. Studies of similarity. Cogn. Categ. 1978, 1, 79–98. [Google Scholar]
  39. Chu, H.C.; Chen, M.Y.; Chen, Y.M. A semantic-based approach to content abstraction and annotation for content management. Expert Syst. Appl. 2009, 36, 2360–2376. [Google Scholar] [CrossRef]
  40. Sánchez, D.; Isern, D.; Millan, M. Content annotation for the semantic web: An automatic web-based approach. Knowl. Inf. Syst. 2011, 27, 393–418. [Google Scholar] [CrossRef]
  41. Jiang, J.J.; Conrath, D.W. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Madrid, Spain, 7–12 July 1997; pp. 19–33. [Google Scholar]
  42. Wu, Z.; Palmer, M. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, Las Cruces, New Mexico, 27–30 June 1994; pp. 133–138. [Google Scholar]
  43. Resnik, P. Using information content to evaluate semantic similarity in a taxonomy. arXiv 1995, arXiv:cmp-lg/9511007. [Google Scholar]
  44. Rada, R.; Mili, H.; Bicknell, E.; Blettner, M. Development and application of a metric on semantic nets. Syst. Man Cybern. IEEE Trans. 1989, 19, 17–30. [Google Scholar] [CrossRef]
  45. Jiang, Y.; Bai, W.; Zhang, X.; Hu, J. Wikipedia-based information content and semantic similarity computation. Inf. Process. Manag. 2017, 53, 248–265. [Google Scholar] [CrossRef]
  46. Mathur, S.; Dinakarpandian, D. Finding disease similarity based on implicit semantic similarity. J. Biomed. Inform. 2012, 45, 363–371. [Google Scholar] [CrossRef] [PubMed]
  47. Batet, M.; Sánchez, D.; Valls, A.; Gibert, K. Semantic similarity estimation from multiple ontologies. Appl. Intell. 2013, 38, 29–44. [Google Scholar] [CrossRef]
  48. Ahsaee, M.G.; Naghibzadeh, M.; Naeini, S.E.Y. Semantic similarity assessment of words using weighted WordNet. Int. J. Mach. Learn. Cybern. 2014, 5, 479–490. [Google Scholar] [CrossRef]
  49. Liu, H.; Bao, H.; Xu, D. Concept vector for semantic similarity and relatedness based on WordNet structure. J. Syst. Softw. 2012, 85, 370–381. [Google Scholar] [CrossRef]
  50. Maguitman, A.G.; Menczer, F.; Erdinc, F.; Roinestad, H.; Vespignani, A. Algorithmic computation and approximation of semantic similarity. World Wide Web 2006, 9, 431–456. [Google Scholar] [CrossRef]
  51. Medelyan, O.; Milne, D.; Legg, C.; Witten, I.H. Mining meaning from Wikipedia. Int. J. Hum.-Comput. Stud. 2009, 67, 716–754. [Google Scholar] [CrossRef]
  52. Pirró, G. A semantic similarity metric combining features and intrinsic information content. Data Knowl. Eng. 2009, 68, 1289–1308. [Google Scholar] [CrossRef]
  53. Meng, L.; Huang, R.; Gu, J. A review of semantic similarity measures in wordnet. Int. J. Hybrid Inf. Technol. 2013, 6, 1–12. [Google Scholar]
  54. Lin, D. An information-theoretic definition of similarity. In Proceedings of the ICML, Madison, WI, USA, 24–27 July 1998; Volume 98, pp. 296–304. [Google Scholar]
  55. Resnik, P. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. (JAIR) 1999, 11, 95–130. [Google Scholar] [CrossRef]
  56. Sánchez, D.; Batet, M.; Isern, D. Ontology-based information content computation. Knowl. Based Syst. 2011, 24, 297–303. [Google Scholar] [CrossRef]
  57. Seco, N.; Veale, T.; Hayes, J. An intrinsic information content metric for semantic similarity in WordNet. In Proceedings of the ECAI, Valencia, Spain, 22–27 August 2004; Volume 16, p. 1089. [Google Scholar]
  58. Zhou, Z.; Wang, Y.; Gu, J. A new model of information content for semantic similarity in WordNet. In Proceedings of the FGCNS’08, Second International Conference on Future Generation Communication and Networking Symposia, Washington, DC, USA, 13–15 December 2008; Volume 3, pp. 85–89. [Google Scholar]
  59. Sánchez, D.; Batet, M.; Isern, D.; Valls, A. Ontology-based semantic similarity: A new feature-based approach. Expert Syst. Appl. 2012, 39, 7718–7728. [Google Scholar] [CrossRef]
  60. Petrakis, E.G.; Varelas, G.; Hliaoutakis, A.; Raftopoulou, P. X-similarity: Computing semantic similarity between concepts from different ontologies. JDIM 2006, 4, 233–237. [Google Scholar]
  61. Ding, L.; Finin, T.; Joshi, A.; Pan, R.; Cost, R.S.; Peng, Y.; Reddivari, P.; Doshi, V.; Sachs, J. Swoogle: A search and metadata engine for the semantic web. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, Washington, DC, USA, 8–13 November 2004; ACM: New York, NY, USA, 2004; pp. 652–659. [Google Scholar]
  62. Leacock, C.; Chodorow, M. Combining Local Context and WordNet Similarity for Word Sense Identification; MIT Press: Cambridge, MA, USA, 1998; Volume 49, pp. 265–283. [Google Scholar]
  63. Li, Y.; Bandar, Z.; McLean, D. An approach for measuring semantic similarity between words using multiple information sources. Knowl. Data Eng. IEEE Trans. 2003, 15, 871–882. [Google Scholar]
  64. Schickel-Zuber, V.; Faltings, B. OSS: A Semantic Similarity Function based on Hierarchical Ontologies. In Proceedings of the IJCAI, Hyderabad, India, 6–12 January 2007; Volume 7, pp. 551–556. [Google Scholar]
  65. Schwering, A. Hybrid model for semantic similarity measurement. In On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE; Springer: Berlin/Heidelberg, Germany, 2005; pp. 1449–1465. [Google Scholar]
  66. Martinez-Gil, J.; Aldana-Montes, J.F. Semantic similarity measurement using historical google search patterns. Inf. Syst. Front. 2013, 15, 399–410. [Google Scholar] [CrossRef]
  67. Retzer, S.; Yoong, P.; Hooper, V. Inter-organisational knowledge transfer in social networks: A definition of intermediate ties. Inf. Syst. Front. 2012, 14, 343–361. [Google Scholar] [CrossRef]
  68. Quintero, R.; Torres-Ruiz, M.; Menchaca-Mendez, R.; Moreno-Armendariz, M.A.; Guzman, G.; Moreno-Ibarra, M. DIS-C: Conceptual distance in ontologies, a graph-based approach. Knowl. Inf. Syst. 2019, 59, 33–65. [Google Scholar] [CrossRef]
  69. Torres, M.; Quintero, R.; Moreno-Ibarra, M.; Menchaca-Mendez, R.; Guzman, G. GEONTO-MET: An Approach to Conceptualizing the Geographic Domain. Int. J. Geogr. Inf. Sci. 2011, 25, 1633–1657. [Google Scholar] [CrossRef]
  70. Zadeh, P.D.H.; Reformat, M.Z. Assessment of semantic similarity of concepts defined in ontology. Inf. Sci. 2013, 250, 21–39. [Google Scholar] [CrossRef]
  71. Albertoni, R.; De Martino, M. Semantic similarity of ontology instances tailored on the application context. In On the Move to Meaningful Internet Systems 2006: CoopIS, DOA, GADA, and ODBASE; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1020–1038. [Google Scholar]
  72. Li, Y.; McLean, D.; Bandar, Z.; O’shea, J.D.; Crockett, K. Sentence similarity based on semantic nets and corpus statistics. Knowl. Data Eng. IEEE Trans. 2006, 18, 1138–1150. [Google Scholar] [CrossRef]
  73. Cilibrasi, R.L.; Vitanyi, P. The google similarity distance. Knowl. Data Eng. IEEE Trans. 2007, 19, 370–383. [Google Scholar] [CrossRef]
  74. Bollegala, D.; Matsuo, Y.; Ishizuka, M. Measuring semantic similarity between words using web search engines. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, AB, Canada, 8–12 May 2007; Volume 7, pp. 757–766. [Google Scholar]
  75. Miller, G.A.; Charles, W.G. Contextual correlates of semantic similarity. Lang. Cogn. Process. 1991, 6, 1–28. [Google Scholar] [CrossRef]
  76. Sánchez, D.; Moreno, A.; Del Vasto-Terrientes, L. Learning relation axioms from text: An automatic Web-based approach. Expert Syst. Appl. 2012, 39, 5792–5805. [Google Scholar] [CrossRef]
  77. Saruladha, K.; Aghila, G.; Bhuvaneswary, A. Information content based semantic similarity for cross ontological concepts. Int. J. Eng. Sci. Technol. 2011, 3, 45–62. [Google Scholar]
  78. Formica, A. Ontology-based concept similarity in formal concept analysis. Inf. Sci. 2006, 176, 2624–2641. [Google Scholar] [CrossRef]
  79. Albacete, E.; Calle-Gómez, J.; Castro, E.; Cuadra, D. Semantic Similarity Measures Applied to an Ontology for Human-Like Interaction. J. Artif. Intell. Res. (JAIR) 2012, 44, 397–421. [Google Scholar] [CrossRef]
  80. Goldstone, R. An efficient method for obtaining similarity data. Behav. Res. Methods Instruments Comput. 1994, 26, 381–386. [Google Scholar] [CrossRef]
  81. Niles, I.; Pease, A. Towards a standard upper ontology. In Proceedings of the International Conference on Formal Ontology in Information Systems-Volume, Ogunquit, ME, USA, 17–19 October 2001; ACM: New York, NY, USA, 2001; pp. 2–9. [Google Scholar]
  82. Fellbaum, C. WordNet: An Electronic Database; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  83. Jain, P.; Yeh, P.Z.; Verma, K.; Vasquez, R.G.; Damova, M.; Hitzler, P.; Sheth, A.P. Contextual ontology alignment of lod with an upper ontology: A case study with proton. In The Semantic Web: Research and Applications; Springer: Berlin/Heidelberg, Germany, 2011; pp. 80–92. [Google Scholar]
  84. Héja, G.; Surján, G.; Varga, P. Ontological analysis of SNOMED CT. BMC Med. Inform. Decis. Mak. 2008, 8, S8. [Google Scholar] [CrossRef] [PubMed]
  85. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32, D258–D261. [Google Scholar] [CrossRef] [PubMed]
  86. Gangemi, A.; Guarino, N.; Masolo, C.; Oltramari, A.; Schneider, L. Sweetening ontologies with DOLCE. In Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web; Springer: Berlin/Heidelberg, Germany, 2002; pp. 166–181. [Google Scholar]
  87. Buggenhout, C.V.; Ceusters, W. A novel view on information content of concepts in a large ontology and a view on the structure and the quality of the ontology. Int. J. Med. Inform. 2005, 74, 125–132. [Google Scholar] [CrossRef] [PubMed]
  88. Fellbaum, C. WordNet. In Theory and Applications of Ontology: Computer Applications; Springer: Berlin/Heidelberg, Germany, 2010; pp. 231–243. [Google Scholar]
  89. Ponzetto, S.P.; Strube, M. Knowledge derived from Wikipedia for computing semantic relatedness. J. Artif. Intell. Res. 2007, 30, 181–212. [Google Scholar] [CrossRef]
  90. Ittoo, A.; Bouma, G. Minimally-supervised extraction of domain-specific part–whole relations using Wikipedia as knowledge-base. Data Knowl. Eng. 2013, 85, 57–79. [Google Scholar] [CrossRef]
  91. Kaptein, R.; Kamps, J. Exploiting the category structure of Wikipedia for entity ranking. Artif. Intell. 2013, 194, 111–129. [Google Scholar] [CrossRef]
  92. Nothman, J.; Ringland, N.; Radford, W.; Murphy, T.; Curran, J.R. Learning multilingual named entity recognition from Wikipedia. Artif. Intell. 2013, 194, 151–175. [Google Scholar] [CrossRef]
  93. Sorg, P.; Cimiano, P. Exploiting Wikipedia for cross-lingual and multilingual information retrieval. Data Knowl. Eng. 2012, 74, 26–45. [Google Scholar] [CrossRef]
  94. Yazdani, M.; Popescu-Belis, A. Computing text semantic relatedness using the contents and links of a hypertext encyclopedia. Artif. Intell. 2013, 194, 176–202. [Google Scholar] [CrossRef]
  95. Rubenstein, H.; Goodenough, J.B. Contextual correlates of synonymy. Commun. ACM 1965, 8, 627–633. [Google Scholar] [CrossRef]
  96. Jarmasz, M.; Szpakowicz, S. Roget’s Thesaurus and Semantic Similarity. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, Online, 1–3 September 2003; pp. 212–219. [Google Scholar]
  97. Hirst, G.; St-Onge, D. Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms; MIT Press: Cambridge, MA, USA, 1998; Volume 305, pp. 305–332. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.