Algorithm for the Accelerated Calculation of Conceptual Distances in Large Knowledge Graphs

Conceptual distance refers to the degree of proximity between two concepts within a conceptualization. It is closely related to semantic similarity and relatedness, but its measurement strongly depends on the context of the given concepts. DIS-C represents an advancement in the computation of semantic similarity/relatedness that is independent of the type of knowledge structure and semantic relations when generating a graph from a knowledge base (ontologies, semantic networks, and hierarchies, among others). This approach determines the semantic similarity between two indirectly connected concepts in an ontology by propagating local distances with an algorithm based on the All Pairs Shortest Path (APSP) problem. This process is applied to each pair of concepts to establish the most effective and efficient paths connecting them. The algorithm identifies the shortest path between concepts, which allows the most relevant relationships between them to be inferred. However, one of the critical issues with this process is its computational complexity, combined with the cost of APSP algorithms such as repeated Dijkstra, which is O(n³). This paper studies different alternatives to improve the DIS-C approach by adapting approximation algorithms, focusing on Dijkstra, pruned Dijkstra, and sketch-based methods, to compute the conceptual distance. Since DIS-C must scale to analyze very large graphs, reducing the related computational complexity is critical. Tests were performed using different datasets to calculate the conceptual distance when using the original version of DIS-C and when using the influence area of nodes. In situations where time optimization is necessary for generating results, the original DIS-C model is not optimal. Therefore, we propose a simplified version of DIS-C to calculate conceptual distances based on centrality estimation.
The obtained results for the simple version of DIS-C indicated that the processing time decreased by a factor of 2.381 when compared to the original DIS-C version. Additionally, for both versions of DIS-C (normal and simple), the two-hop coverage-based APSP approach decreased the computational cost.


Introduction
The spread of information and communication technologies has significantly increased the amount of available information. Thus, we can decode this information through concepts. Conceptual matching is essential for human and machine reasoning, enabling problem-solving and data-driven inference. Researchers studying the human brain use the term "concept" to describe a unit of meaning stored in memory. The ability to relate concepts and generate knowledge is inherent to humans. Giving computers this capability is a significant advance in computer science.
Transferring the relationship between concepts from the human domain to the computational domain is complex. Sociocultural and language aspects are closely related to the types of concepts acquired by human beings. Determining the similarity between concepts so that a computer can process them is one of the most fundamental and relevant problems in various research fields. Over time, people have explored multiple structures and methods to represent the semantic relationships underlying conceptual distance. The graph is one of the most commonly used abstract data types to represent this information, designed to model connections between people, objects, or entities. Nodes and edges are the two main elements of a graph. Moreover, graphs have specific properties that make them unique and important in different applications.
According to Mejia Sanchez-Bermejo (2013) [1], measuring the conceptual distance and semantic similarity between words has been studied for many years in the fields of linguistic computing and artificial intelligence (AI) because it represents a generic procedure in a wide variety of applications, such as natural language processing, word disambiguation, detection and error correction in text documents, and text classification, among others. Based on psychological assessments, semantic similarity refers to how humans categorize and organize objects or entities [2]. Semantic similarity is a metric determining how closely two terms representing objects from a conceptualization are related. It is determined by analyzing whether the terms share a common meaning [3].
Reducing the computational cost of computing conceptual distances for each pair of concepts in massive graphs will accelerate the evaluation of similarity. As a result, it will have practical benefits and contribute to the development of more efficient applications to solve practical problems, for instance, reducing the computational time required to search for pairs of documents with high semantic similarity in large datasets composed of millions of documents and analyzing transport networks obtained automatically using AI techniques, as described by Chen et al. (2022) [4]. It also extends the understanding of the phenomenon of semantic similarity in computing systems based on conceptual graphs. The research and development of efficient algorithms that reduce computation time by combining approximation or analytical techniques is essential to advance the study of semantic similarity.
In recent years, conceptual graphs generated from ontologies have gained importance. These graphs are strongly connected, where each concept is a vertex, and each relationship is represented by two edges (one in each direction). Consequently, it is possible to apply various graph theory search algorithms to study and analyze the relations represented in a conceptual graph. DIS-C, by Quintero et al. (2019) [5], is a method with which to measure the conceptual distance between concepts. It requires a conversion of a conceptualization into a conceptual graph and the calculation of the All Pairs Shortest Path (APSP) through the graph. Thus, shortest path algorithms are crucial for solving various optimization problems. The primary purpose of these algorithms is to find the path with the minimum cost or distance between a given pair of vertices in a graph or network. The study of shortest path algorithms and the proposal of different approaches to reduce computational cost have been largely studied, as per the research works proposed by Dreyfus (1969) [6], Gallo and Pallottino (1988) [7], Magzhan and Jani (2013) [8], and Madkour et al. (2017) [9].
Different research works have proposed several techniques and optimizations for efficient shortest path algorithm performance to handle massive data. Fuhao and Jiping (2009) [10] proposed a novel algorithm using adjacent nodes that was more effective at analyzing networks with massive spatial data. Chakaravarthy et al. (2016) [11] introduced a scalable parallel algorithm that significantly reduces internode communication traffic and achieves high processing rates on large graphs. Yang et al. (2019) [12] focused on routing optimization algorithms based on node compression in big data environments, addressing the common problem of finding the shortest path in a limited time given the number of connected nodes. Liu et al. (2019) [13] presented a navigation algorithm based on a navigator data structure that effectively navigates the shortest path with high probability in massive complex networks.
On the other hand, research has been developed that generates complex graphs with information representing vehicle paths or sensor networks, such as the work of Ma et al. (2021) [14]. Li et al. (2023) [15] investigated a massive MIMO uplink system, where a transmitter with two antennas has to upload data in real time to a base station (BS) with a larger number of antennas. Techniques and algorithms are required for high-speed and accurate graph processing in their different forms.
The shortest path algorithm propagates local distances to determine the distance between two concepts that are not directly connected by a relation in the ontology. This process is applied to each pair of concepts, allowing semantic proximity to be established. However, these techniques compute such distances at a high computational cost, since traditional algorithms, like Dijkstra applied from every source, have a complexity of O(n³), which is disadvantageous for massive graphs. The computational complexity of DIS-C is O(n² log(n)) when using the Johnson-Dijkstra algorithm; therefore, using techniques such as DIS-C in its original version is inadequate for handling very large graphs directly. This paper proposes the use of other approaches to solve the APSP calculation. The pruned Dijkstra and sketch-based algorithms were used in the first approach. Considering that the analysis of the obtained results includes different datasets and that the generated graphs are sparse, a centrality-based strategy is also presented to reduce the computational complexity.
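As a baseline for the complexity discussion above, the APSP computation that DIS-C relies on can be sketched by running Dijkstra's algorithm from every vertex. The following is a minimal illustration; the graph encoding and function names are ours, not DIS-C's actual implementation:

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest paths on a weighted digraph given as
    {u: [(v, w), ...]}. Returns {node: distance from src}."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale priority-queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def apsp(adj):
    """All-pairs shortest paths by running Dijkstra from every node.
    Cost is O(n * (m + n log n)), i.e., O(n^3) on dense graphs."""
    return {u: dijkstra(adj, u) for u in adj}
```

This per-source formulation is the point of departure for the pruned and sketch-based variants studied later.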
The manuscript is organized as follows: Section 2 summarizes the state-of-the-art related to this field. Section 3 presents an analysis of the complexity of the DIS-C algorithm to identify those parts that need to be improved, the algorithms proposed to enhance performance, and the demonstration of their precision and associated computational complexity. In Section 4, we establish the execution scenarios, which include the total number and type of tests, present the experimental results, and discuss their significance. Finally, Section 5 outlines the conclusion and future work.

Semantic Similarity
Conceptual distance is closely related to semantic relatedness and semantic similarity. The semantic relationship, in contrast, is not necessarily conditioned by a taxonomic relationship; in general terms, these relationships are given by any relationship between concepts, such as meronymy, antonymy, functionality, cause-effect, holonymy, hyponymy, hypernymy, and so on. Moreover, we define conceptual distance as the space that separates two concepts within a conceptualization; in other words, it is the degree of proximity between concepts Quintero et al. (2023) [3]. We can establish a rule that indicates conceptual closeness in terms of how close to 0 (in terms of distance) the assigned value is. However, computing such distances is entirely dependent on the context of the concepts. In order to calculate it, the concepts must first be represented in a model that allows for the expression of the semantic relations and for defining the measurement metric from the different underlying structures and representations.
In the network model, ontologies are used to address the problem of formally representing the semantics of information and its relationships Meersman (1999) [16]. The mentioned structures evaluate the conceptual distance between terms. The present work considers the assumption that an ontology can be represented as a strongly connected graph (conceptual graph); in this context, some authors have proposed many approaches to evaluate the closeness between concepts. The contributions of Bondy (1982) [17], West et al. (2001) [18], Bollobás (1998) [19], and Gross et al. (2018) [20] describe the basic concepts, mathematical foundations, and representative problems of graph theory.
On the other hand, since their introduction in 2012 by Google [21], knowledge graphs have been the subject of research in several articles. This term has been frequently used in academic and business contexts, typically associated with semantic web technologies, linked data, large-scale data analytics, and cloud computing Ehrlinger and Wöß (2016) [22].
The works presented by Fensel et al. (2020) [23] and Ehrlinger and Wöß (2016) [22] focused on reviewing several definitions of a knowledge graph to develop a formal definition. A knowledge graph describes structured information and can be applied to specific domains such as question answering, recommendation, and information retrieval Zou (2020) [24]. In addition, Pujara et al. (2013) [25] presented a method for transforming uncertain extractions about entities and their relationships into a knowledge graph by filtering noise, inferring missing information, and identifying relevant candidate facts.
In the literature, the authors have identified four groups of methods, models, and techniques for computing semantic similarity: (a) edge counting-based, (b) feature-based, (c) information content-based, and (d) hybrid. In this work, information content-based methods are used to determine semantic similarity based on their direct relationship to the measurement of conceptual distance. The underlying knowledge base provides these methods with a structured representation of terms or concepts connected by semantic relationships. In addition, this family proposes an unambiguous semantic measure, since it considers the real meaning of the terms Sanchez et al. (2012) [26].
Edge counting-based methods consider an ontology as a connected graph, where the nodes are concepts and the edges are relationships (almost always taxonomic). The number of edges separating two concepts determines the similarity between them. Table 1 summarizes some representative edge counting-based methods. Furthermore, it describes the expression used to calculate the similarity (sim)/distance (dis) and summarizes the variables involved: λ is the length of the shortest path between a and b, h is the minimum depth of the LCS (the most specific concept that is an ancestor of both a and b) in the hierarchy, and α ≥ 0 and β > 0 are parameters that scale the contribution of the length and depth of the shortest path, respectively.
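The edge-counting family can be illustrated with a minimal sketch, assuming the simple similarity transform sim = 1/(1 + λ) over the shortest-path length λ; the toy taxonomy, function names, and transform are illustrative placeholders, since the surveyed methods add depth and direction terms:

```python
from collections import deque

def shortest_hops(adj, a, b):
    """Unweighted shortest-path length (edge count, lambda) via BFS on
    an adjacency-list graph {node: [neighbor, ...]}."""
    if a == b:
        return 0
    seen = {a}
    frontier = deque([(a, 0)])
    while frontier:
        u, d = frontier.popleft()
        for v in adj.get(u, []):
            if v == b:
                return d + 1
            if v not in seen:
                seen.add(v)
                frontier.append((v, d + 1))
    return None  # disconnected pair

def path_similarity(adj, a, b):
    """Hypothetical edge-counting similarity: sim = 1 / (1 + lambda)."""
    lam = shortest_hops(adj, a, b)
    return None if lam is None else 1.0 / (1.0 + lam)
```

For a toy taxonomy where 'dog' and 'cat' both link to 'animal', λ = 2, so the similarity is 1/3.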

Shenoy et al. (2012) [31]
L is the shortest distance between a and b, calculated by taking into account the direction of the edges. Each vertical direction is assigned a value of 1, and one is added for every direction change. N is the depth of the entire tree. N_a and N_b are the distances from the root to the concepts a and b, respectively. γ is 1 for adjacent concepts and 0 for the rest.
Feature-based methods are based on the Tversky (1977) [32] similarity model, derived from set theory, which assesses the similarity between concepts based on their properties, weighing the common and uncommon features of the terms. This family attempts to overcome a limitation of edge counting-based measures, namely that taxonomic links in an ontology do not necessarily represent uniform distances Sanchez et al. (2012) [26]. These types of measures have the potential to be applied to the calculation of semantic similarity supported by crossed ontological structures. The main limitation is that they require ontologies or knowledge sources with semantic features, such as dictionaries, and most ontologies rarely contain semantic features other than relations.
The Lesk study [33] proposed the homonymous computing method, which consists of calculating the Lesk measure to disambiguate words in texts. The work presented by Banerjee and Pedersen (2003) [34] is an extension of this that uses the terms or synsets provided by WordNet. However, in the modern world, there are multiple sources of information, such as social networks and internet blogs, which provide a large amount of information. For this reason, other works have used different sources of knowledge, such as Wikipedia. In this context, the study of Jiang et al. (2015) [35] evaluated the existing feature-based methods using a formal representation of concepts.
Information content-based approaches improve on some of the limitations of edge counting-based methods. These methods are based on computing information content (IC) and similarity (sim). Quintero et al. (2019) [5] proposed the DIS-C method to compute the conceptual distance between two concepts in an ontology, based on a fundamental approach that uses the topology of an ontology to compute the weight of relationships between concepts. DIS-C can incorporate various relationships between concepts, such as meronymy, hyponymy, antonymy, functionality, and causality.
The DIS-C method considers the conceptual distance as the space separating two concepts within a conceptualization represented as a directed graph, and it is related to the difference in the information content of the concepts in their definitions. The conceptual structure is irrelevant, since the method assigns a distance value to each relationship type and transforms this into a weighted directed graph; as a consequence, DIS-C can be applied to any conceptual structure. The only requirement for the algorithm is that the conceptualization be a 3-tuple of elements K = (C, ℜ, R), where C is a set of concepts, ℜ is the set of relation types, and R is the set of relationships in the conceptualization. In summary, the method is described through the following steps:

1. Assign to each relation type ρ ∈ ℜ an arbitrary conceptual weight defined in each direction of the relationship.

2. Create a directed and weighted graph (conceptual graph), where the vertices are the concepts contained in the conceptualization, and the edges are the relationships that connect each pair of concepts. Each edge has its counterpart in the opposite direction with a different weighting. To compute the weighting, the generality of each concept is necessary.

3. Calculate the APSP length between each pair of vertices, i.e., diffuse the conceptual distance to all concepts.
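Step 2 of the method can be sketched as follows; the triple encoding, function name, and weight values are hypothetical placeholders, since DIS-C derives the actual weights iteratively from the generality of each concept:

```python
def build_conceptual_graph(concepts, relations, weights):
    """Sketch of step 2: turn a conceptualization K = (C, R-types, R)
    into a weighted digraph. `relations` is a list of (a, rho, b)
    triples; `weights[rho]` is a (forward, backward) pair of conceptual
    weights (placeholder values here). Each relation contributes one
    edge in each direction, with a different weight per direction."""
    adj = {c: [] for c in concepts}
    for a, rho, b in relations:
        fwd, bwd = weights[rho]
        adj[a].append((b, fwd))  # edge in the relation's direction
        adj[b].append((a, bwd))  # counterpart in the opposite direction
    return adj
```

Step 3 then amounts to running an APSP algorithm over the resulting weighted digraph.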

Reference Expression Description

Resnik (1995) [36]: sim_resnik(a, b). Based on the concept of the least common subsumer (LCS), information content (IC) is calculated when the terms share an LCS. A high IC value indicates that the term is more specific and clearly describes a concept with less ambiguity.

Jiang and Conrath (1997): Focuses on determining the link strength of the edge connecting a parent node to a child node. These taxonomic links between concepts are reinforced by the difference between a concept's IC and that of its LCS. Here, α > 0 is a constant and S(x) is the set of meanings of the concept x.

Jiang et al. (2017) [39]: h(a) is the set of hyponyms of a, p_g(c) is the set of pages in category c, and C_A is the set of categories. The second proposed approach is the combination of IC computed from the category structure with an extension of ontology-based methods.

Generalization of the Zhou et al. (2008) [40] approach: γ is an adjustment factor for the weight of the two features involved in the IC calculation, d(a) is the depth of the leaf a, and d_max is the maximum depth of a leaf.

Generalization of the Sanchez et al. (2011) [41] approach: l(a) is the set of leaves of a in the category hierarchy, H(a) is the set of hypernyms, and l_max is the maximum number of leaves in the hierarchy.

All Pairs Shortest Path Problem (APSP)
Computing the shortest path between each pair of concepts in large graphs is computationally expensive. This type of problem is known in graph theory as the "All Pairs Shortest Path" (APSP) problem [42]. In [5], we used two well-known algorithms: the Floyd-Warshall and Johnson-Dijkstra algorithms. The Floyd-Warshall algorithm is based on transitive Boolean matrix closure [43], with a computational complexity of O(n³) (n is the number of nodes, and m is the number of edges in the graph). On the other hand, Johnson-Dijkstra has a complexity of O(mn + n² log(n)) (using Fibonacci heaps); however, it requires that there be no negative cycle in the graph, and when m ∈ O(n²), the complexity is equal to O(n³) [44].
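For reference, the Floyd-Warshall recurrence mentioned above can be written compactly; this is the textbook O(n³) dynamic program over intermediate vertices, not the DIS-C implementation itself:

```python
def floyd_warshall(n, edges):
    """Classic Floyd-Warshall APSP on nodes 0..n-1.
    `edges` maps (u, v) pairs to weights. Returns an n x n distance
    matrix; unreachable pairs stay at infinity."""
    INF = float("inf")
    dist = [[INF] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
    # Allow each vertex k in turn as an intermediate waypoint.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

The triple loop is the source of the cubic cost that the approaches below try to avoid.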
When using accelerated matrix multiplication algorithms, the cost is O(n^(2+ε)) for some ε < 1 [45]. Techniques such as divide and conquer [46] or metaheuristic algorithms [47] are also employed. Another approach is to use new-generation hardware, such as GPUs, to parallelize existing algorithms and reduce runtime [48]. According to Reddy (2016) [49], most algorithms are based on one of the following two computational models:

•
Addition-comparison model: Assumes that the inputs are real-weighted graphs, where the only operations allowed on real data are comparison and addition.

•
Random access machine (RAM) model: Shortest path algorithms assume that the inputs are integer-weighted graphs manipulated by addition, subtraction, comparison, shift, and various logical operations [50] (the most commonly used model).
Despite the significant interest in the APSP problem, there have only been minor advances in computational complexity beyond O(n³) for general graphs, based on distance product computation, which is closely related to matrix multiplication. Although subcubic algorithms exist, they only exploit the properties of specific graphs or take advantage of sparse graphs, so there is no truly subcubic combinatorial algorithm. Table 3 describes the algorithms developed over time with their computational complexity, and Figure 1 shows a comparison of the execution time of each algorithm.
Although it is possible to solve the APSP problem in polynomial time, its application to large graphs is not practical. For this reason, it is interesting to analyze heuristic and metaheuristic methods and approximation algorithms for this problem. The clear advantage of this approach is the speed gain, whereas the disadvantage is the reduced accuracy of the computation. In the literature, several relevant classes of such algorithms have been applied to the APSP problem: genetic algorithms, evolutionary algorithms, and ant colony optimization.
Attiratanasunthron and Fakcharoenphol (2008) [51] presented a rigorous analysis of the running time of ACO on the shortest path problem. The n-ANT method is inspired by the 1-ANT [52] and AntNet [53] algorithms. Horoba and Sudholt (2009) [54] formalized and optimized the n-ANT algorithm to solve the APSP. The M_APSP algorithm is a generalization of M_SDSP, presented in the same research study, which solves the single-source shortest path problem. Its computational complexity is expressed in terms of ∆, the maximum degree of the graph; l, the maximum number of edges in any shortest path; and l* := max{l, ln(n)}.
Other research studies that improve the asymptotic bound with respect to O(n³) are Baswana et al. (2009) [68] and Yuster (2012) [69]. The first applied it to unweighted and undirected graphs and introduced the first O(n²) algorithm to compute the APSP with a 3-approximation, meaning that, for each pair of vertices u, v ∈ V, the reported distance is less than or equal to 3δ(u, v), where δ(u, v) is the shortest-path distance. In addition, two algorithms are presented for a 2-approximation, with complexities of O(m^(2/3) n log(n) + n²) and O(n² log^(3/2)(n)), and the reported distances being at most 2δ(u, v) + 1 and 2δ(u, v) + 3, respectively. In both cases, there is a predefined error bound.
Finally, there is the sketch-based approach, which involves two extensive processes. The first is preprocessing, whose result is a data structure that allows the distance between two nodes to be queried. This data structure, a distance oracle, was introduced in [70]. The second process is related to queries: the oracle is queried and returns, within a specified time limit, the shortest distance between two nodes.
In comparison to exact algorithms, these algorithms reduce the time by several orders of magnitude. Das Sarma et al. (2010) [71] is one of the first studies that applied this approach to complex graphs with hundreds of millions of nodes and billions of edges. The description of the algorithm is the following: sample a small number of node sets from the graph, called seed node sets. Then, for each node of the graph, the closest seed within each of these sets is found. The sketch of a node is the set of its closest seeds and the distances to them. Thus, given a pair of nodes u and v, the distance between them can be estimated by searching for a common seed in their sketches. However, there are no theoretical guarantees on the accuracy of the method for directed graphs; even so, the results showed that the estimated distances were, on average, close to the actual distances when applied to directed graphs.
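The query step just described reduces to intersecting seed sets. A minimal sketch, assuming sketches are stored as {seed: distance} maps (a hypothetical representation; function and variable names are ours):

```python
def estimate_distance(sketch_out_u, sketch_in_v):
    """Das Sarma-style estimate of the distance u -> v.
    `sketch_out_u` maps seeds to their distance from u (following
    outgoing edges); `sketch_in_v` maps seeds to their distance to v
    (following incoming edges). The estimate is the minimum, over
    seeds w common to both sketches, of d(u, w) + d(w, v)."""
    common = sketch_out_u.keys() & sketch_in_v.keys()
    if not common:
        return None  # no shared seed: no estimate available
    return min(sketch_out_u[w] + sketch_in_v[w] for w in common)
```

Because every estimated path is a real path through a seed, the estimate is an upper bound on the true distance.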
Wang et al. (2021) [72] focused on finding all the shortest paths in large-scale undirected graphs. The principal difference with respect to the work of Das Sarma et al. (2010) [71] is that Wang et al. (2021) [72] considered the computational complexity needed to perform the preliminary computation. The standard method for this problem involves applying a BFS algorithm to the seed node set. It is known that the complexity of BFS is O(m + n) when using an adjacency list, or O(n²) for an adjacency matrix. This complexity is a drawback when processing large graphs. For this reason, they investigated whether it is possible to speed up the preliminary computation using the pruned landmark-labeling technique of [73], i.e., applying a pruned BFS. The specific results of this stage showed linear construction time, i.e., O(m). The results indicated a significant improvement, especially in terms of runtime for large graphs.

Materials and Methods
The DIS-C method presented by Quintero et al. (2019) [5] uses two algorithms: a basic one, which calculates the conceptual distance within a conceptualization K = (C, ℜ, R) given the weights δ_ρ for each relationship type ρ ∈ ℜ. This algorithm converts the conceptualization into a graph G, from which we compute the conceptual distance by calculating the APSP in G. The solution described in this paper combines and evaluates several approaches to optimize the APSP. The second algorithm enables automatic weighting, which allows the conceptual distance to be computed without having to provide the weights of the relationship types. Figure 2 shows the flowchart corresponding to the DIS-C algorithm with automatic weighting (where i(a) is the in-degree of node a and o(a) is the out-degree of node a; w_i(a) and w_o(a) are the costs of entering and leaving node a, respectively; g(a) is the generality of node a; V_j and A_j are the nodes and edges, respectively, of the graph Γ_K^j, which is the graph corresponding to the conceptualization K in the j-th iteration; w_j(a, b) is the cost of the edge from node a to node b; p_w is a geometric weighting factor; M_Γ is the APSP distance matrix; ρ* is the set of edges representing the relation ρ; and ε_K is the convergence threshold of the algorithm), which will be referred to simply as DIS-C in this paper. Although the DIS-C algorithm consists of several processes, its computational complexity is dominated by the execution of the APSP.
At this stage, we can assume that the computational complexity of the DIS-C algorithm for a conceptualization with n concepts is O(n² log(n)) (a detailed analysis of this topic is presented in [74]).

The Pruned Dijkstra Algorithm
The implemented algorithm is described in Algorithm 1, and the index construction algorithm is presented in Algorithm 2; Appendix A explains the pruned Dijkstra algorithm. We execute the pruned Dijkstra in the order v_1, v_2, ..., v_n; this sequence of vertices is arbitrary, but it is relevant for good algorithm performance. Ideally, processing starts from the central vertices, since many shortest paths cross them. There are many strategies for ordering the vertices; one of the most common is based on centrality. In this work, we choose to order the vertices by degree centrality. The graphs generated from semantic networks behave like scale-free networks, where high-degree vertices provide high connectivity to the network; in this way, the shortest path between two vertices usually passes through these nodes.
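The ordering and labeling just described can be sketched as follows for undirected graphs; the degree-based ordering and all function names are illustrative (a directed version, as used by DIS-C, would maintain separate in/out labels), and this is a sketch of the pruned-labeling idea, not the paper's Algorithms 1 and 2:

```python
import heapq

def degree_order(adj):
    """Processing order for pruned Dijkstra: vertices sorted by degree
    centrality, highest first; ties broken by node id for determinism."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))

def two_hop_query(label_s, label_t):
    """Q(s, t, L): minimum of d(s, h) + d(h, t) over hubs h common to
    both labels; infinity if the labels share no hub."""
    best = float("inf")
    for h, d1 in label_s.items():
        d2 = label_t.get(h)
        if d2 is not None and d1 + d2 < best:
            best = d1 + d2
    return best

def pruned_dijkstra_labels(adj, order):
    """Pruned landmark labeling on an undirected weighted graph
    {u: [(v, w), ...]}. For each root in `order`, run Dijkstra but
    prune any vertex whose distance is already covered by the labels
    of earlier roots. Returns {node: {hub: dist}}."""
    L = {u: {} for u in adj}
    for root in order:
        dist = {root: 0.0}
        pq = [(0.0, root)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue  # stale entry
            if two_hop_query(L[root], L[u]) <= d:
                continue  # pruned: an earlier hub already covers (root, u)
            L[u][root] = d
            for v, w in adj[u]:
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(pq, (nd, v))
    return L
```

Processing high-degree vertices first tends to prune most of the search space, which is the motivation for the degree-centrality ordering chosen in this work.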

In order to formalize the proposed approach, it must be demonstrated that the algorithm is correct and produces the theoretically proposed result. To prove the correctness of the method, it suffices to show that the algorithm computes a two-hop coverage, i.e., Q(s, t, L_k) = d_G(s, t) for any s, t ∈ V. This statement is equivalent to proving that, given any vertex s ∈ V, all other vertices belong to the input and output coverages of vertex s. Let L'_k be the index built without the pruning process, so that it is a two-hop coverage, and let L_k be the index generated by applying the pruning process. Since a covering vertex exists in L'_k, the goal is to show that it also exists in L_k(s)_in and L_k(t)_out.

Theorem 1. For k = 1, ..., n, let s, t be vertices such that there is a path between them, and let j be the smallest index such that v_j covers the pair (s, t). It must be shown that (v_j, d_G(v_j, s)) and (v_j, d_G(v_j, t)) are contained in L_k(s)_in and L_k(t)_out, respectively; that is, Q(s, t, L_k) = Q(s, t, L'_k).

To prove that v_j ∈ L_k(t)_out, first note that, for any i < j, v_i ∉ P_G(s, v_j), where P_G(a, b) is a shortest path between a and b. Suppose v_i ∈ P_G(v_j, t); the resulting inequality contradicts the fact that j is the smallest such index. Therefore, v_i ∉ P_G(v_j, t) for any i < j. Now, we prove that (v_j, d_G(v_j, t)) ∈ L_k(t)_out. Suppose we perform the j-th pruned Dijkstra iteration from node v_j to construct the index L_j_out. Since there is no v_i ∈ P_G(v_j, t) for i < j, no vertex processed in an earlier pruned Dijkstra iteration covers the shortest path from v_j to t, and, therefore, Q(v_j, t, L_{j-1}) > d_G(v_j, t). Consequently, vertex t is visited without pruning, and (v_j, d_G(v_j, t)) is added to L_k(t)_out. Finally, suppose we execute the j-th pruned Dijkstra iteration from v_j, but this time construct L_j_in. If s and t are reachable in G, there is a path between v_j and s when considering the input edges or, equivalently, there is P_G(s, v_j). Since we have already proved that v_i ∉ P_G(v_j, t) for i < j, either v_j is a neighbor of s or v_j = s, and since v_j is the initial vertex of the path P_G(v_j, t), v_j is the closest such neighbor of s. In all cases, Q(s, v_j, L_{j-1}) > d_G(s, v_j); therefore, node s is visited without pruning (recall that this direction considers the input edges), and (v_j, d_G(v_j, s)) is added to L_k(s)_in.

As mentioned earlier, the order in which the nodes are processed affects the performance of the algorithm. Is there an execution order such that the construction time is minimized? The answer is yes, but finding such an order is challenging. There are two approaches: the first is to find the fewest nodes that cover all the shortest paths, which is a much harder problem, since it is an instance of the minimum vertex cover problem, which is NP-hard. The second approach takes a parameterized perspective; specifically, the parameter explored is the tree-width of the graph, and it relies on the centroid decomposition of a tree [75]. This approach yields the following lemma; the formal proof is given in [73].
Lemma 1. Let w be the tree-width of the graph G. There is a vertex order for which the pruned Dijkstra method preprocesses in O(wm log(n) + w²n log²(n)) time, stores an index in O(wn log(n)) space, and answers each query in O(w log(n)) time.
Considering that this research is not looking for a minimum two-hop coverage, using the upper bound of O(wm log(n) + w²n log²(n)) for the algorithm would be inconsistent; there is no guarantee that the selected order (degree centrality) matches the order required to achieve this computational complexity. We conclude that our implementation cannot be faster than the bound obtained with the optimal order, i.e., O(wm log(n) + w²n log²(n)); this means that Ω(wm log(n) + w²n log²(n)) is a lower bound for the implementation. In addition, with respect to the Dijkstra algorithm, we can hypothesize that the construction of the coverage using the pruning process, regardless of the execution order, is much cheaper than the construction of the naive coverage in highly connected graphs, i.e., than O(mn + n² log(n)).
Thus, we conclude that this lemma indicates that an optimal order for processing the vertices exists; unfortunately, it is extremely complicated to find.

The Sketch-Based Algorithms
Two sketches must be computed for each node: SKETCH_out(u) and SKETCH_in(u) store, for node u ∈ V, the distance to the nearest seed along outgoing and incoming edges, respectively. This is carried out by running the search algorithm twice, once over the output edges and once over the input edges. The seed set S can be chosen randomly, possibly guided by some criteria; for example, discarding nodes whose degree is less than two. Alternatively, the structure of the graph and its specific properties can be exploited (our strategy for choosing the seed nodes is to favor the nodes with the highest generality in the conceptual graph, so that the most general concepts are chosen). Appendix B provides some complementary aspects of sketch-based algorithms. Algorithm 3 describes the proposed pseudocode for generating the sketches of each node.

For each sample, the closest seed node and its distance are obtained for every node; this is carried out with the classic multi-source Dijkstra algorithm. In this way, the seed nodes are computed for each u ∈ V (see Algorithm 4).
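The offline phase just described (Algorithms 3 and 4) can be sketched in Python as follows. This is a hedged sketch assuming an undirected adjacency map; in the directed setting, the same routine would be run twice, on the forward and reversed edges, to obtain SKETCH_out and SKETCH_in. The function names and the plain-dict sketch representation are our own illustration.

```python
import heapq
import math
import random

def multi_source_dijkstra(adj, seeds):
    """Closest seed and its distance for every node (classic multi-source Dijkstra)."""
    dist, owner = {}, {}
    pq = [(0, s, s) for s in seeds]
    heapq.heapify(pq)
    while pq:
        d, u, s = heapq.heappop(pq)
        if u in dist:
            continue  # already settled by a closer seed
        dist[u], owner[u] = d, s
        for v, w in adj.get(u, []):
            if v not in dist:
                heapq.heappush(pq, (d + w, v, s))
    return owner, dist

def build_sketches(adj, k, rng=random):
    """Offline phase: k independent samples; in each, seed sets of size 2^i
    for i = 0..log2(n). Every node's sketch maps a seed to its distance."""
    nodes = list(adj)
    n = len(nodes)
    sketch = {u: {} for u in nodes}
    for _ in range(k):
        for i in range(int(math.log2(n)) + 1):
            seeds = rng.sample(nodes, min(2 ** i, n))
            owner, dist = multi_source_dijkstra(adj, seeds)
            for u in nodes:
                if u in owner:
                    s, d = owner[u], dist[u]
                    if s not in sketch[u] or d < sketch[u][s]:
                        sketch[u][s] = d
    return sketch
```

One multi-source Dijkstra run per seed set suffices regardless of the set's size, which is exactly the property the complexity analysis below relies on.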

In each iteration, a sketch is generated from an offline sample; therefore, at the end of the algorithm, there are k log(n) seeds per node. The second process of the method is the online one. The approximate minimum distance between nodes u and v is computed by examining every node w that appears in both S_out(u) and S_in(v): the length of the path from u to v through w is the distance from u to w plus the distance from w to v. If two or more shared nodes w exist, the minimum is taken. Algorithm 5 shows the proposed pseudocode for this calculation. For general graphs, the sketch-based method has no accuracy guarantee: the node labeling is not guaranteed to yield the exact distance between every pair of vertices. On the other hand, one of the objectives of this paper is to improve the computational complexity of the original algorithm (which is O(n³)). The following theorem demonstrates that the sketch-based method accomplishes this aim.
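The online calculation just described (Algorithm 5) reduces to a minimum over the seeds shared by the two sketches. A minimal sketch, assuming each sketch is stored as a seed-to-distance dictionary:

```python
def sketch_distance(u, v, sketch_out, sketch_in):
    """Online phase: approximate d(u, v) as the minimum of
    sketch_out[u][w] + sketch_in[v][w] over seeds w common to both sketches."""
    best = float("inf")
    for w, d_uw in sketch_out[u].items():
        if w in sketch_in[v]:
            best = min(best, d_uw + sketch_in[v][w])
    return best
```

If the two sketches share no seed, the function returns infinity, reflecting the lack of an accuracy guarantee noted above.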
Theorem 2. For a set of nodes S ⊆ V whose size is 2^w, with 0 ≤ w ≤ n, the complexity of the algorithm is O(w(m + n log(n))).
Proof. The sketch-based method selects a value w based on the number of nodes in the graph, i.e., w = f(V) ≤ |V|. Given w, subsets of nodes Q_i are created for i = 0, 1, 2, ..., log(n), the size of each being 2^i. For each subset Q_i, the multi-source Dijkstra algorithm is applied in order to find, for every node of G, the closest node of Q_i. Multi-source Dijkstra has a complexity of O(m + n log(n)), and a single execution suffices per subset (regardless of its size). Therefore, since there are w subsets of nodes, w multi-source Dijkstra processes must be performed; in total, the complexity is O(w(m + n log(n))).
For the implementation employed in this work, w was determined with the logarithmic function w = log_2(n), where n = |V|; this ensures that |S| < |V|. Therefore, the implemented method has a complexity of O(log_2(n)(m + n log_2(n))).

Results
This section describes the results of the different experiments. The first part analyzes the performance metrics of the generated graphs and compares the control algorithm with the proposed ones. The second part analyzes the results of the semantic metrics.

The Datasets
The datasets used contain pairs of words assigned a numerical value by human evaluation. These sets are divided into two categories, those that measure semantic similarity and those that measure semantic relationships, and are described in Table 4. The process is executed by a script that, for each dataset, selects a few words and generates the conceptual graph connecting them, applying the DIS-C algorithm to find the conceptual distances. This is repeated for all word pairs and, finally, all graphs generated from a set of words are merged into a single graph containing every word pair in the set; the DIS-C algorithm is then applied to the combined graph to obtain its conceptual distances. The purpose of applying DIS-C to the graphs holding only a few words is to measure the runtime for different graph sizes and, therefore, obtain a significantly larger record.

Evaluation Metrics
The evaluation metrics used in this research are divided into performance and semantic metrics. The first corresponds to the runtime measurement, which is related to the number of nodes and edges of the graph. The semantic metrics measure the precision of the similarity estimates, using the Pearson correlation coefficient (Equation (1)), the Spearman rank correlation coefficient (Equation (2)), and the harmonic score (Equation (3)). The Pearson correlation compares against human similarity vectors and has been widely used to evaluate methods that estimate similarity between words and concepts. The Spearman rank correlation is invariant under ranking and allows for a comparison of the underlying quality of word similarity measures. Additionally, the harmonic score combines the Pearson and Spearman correlations into a single weighted score for evaluating word similarity methods. These correlation measurements are essential for evaluating and comparing similarity methods in the literature.
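Equations (1)-(3) are not reproduced in this excerpt. Assuming they follow the standard definitions (Pearson as normalized covariance, Spearman as the Pearson correlation of rank vectors, and the harmonic score as the harmonic mean of the two), the three metrics can be computed as in the following pure-Python illustration; the paper's exact weighting of the harmonic score may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance normalized by standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x):
    """Average ranks (ties share the mean of their positions)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def harmonic_score(r, rho):
    """Harmonic mean of Pearson (r) and Spearman (rho)."""
    return 2 * r * rho / (r + rho)
```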

Generated Graphs and Performance
For each dataset, we generated a graph connecting each pair of words and later computed the corresponding unified graphs. Table 4 shows the size of these graphs. The last row of the table (labeled 'All') refers to the union of all graphs; the algorithms were also applied to this graph.
The following is a series of charts on the running time of each proposed algorithm. Figure 3 shows the runtime of the control algorithm, the pruned Dijkstra algorithm, and the sketch-based algorithms. At specific points, the runtime increases in an "abnormal" way; this is caused by a parameter other than the size of the graph, namely the convergence threshold of the algorithm. It indicates that more iterations are required for this dataset, and in particular for this graph structure, to reach the threshold.
The convergence threshold is a function of the total generality of the graph, i.e., the sum of the generalities of all nodes. The total generality depends on the distance between each pair of nodes provided by the APSP algorithm; for this reason, the algorithms that obtain the true shortest distances, in this case the control algorithm and the pruned Dijkstra, always reach the convergence threshold in the same number of iterations. The sketch-based algorithm, on the other hand, varies in the number of iterations needed to reach the threshold, but in general it cannot require fewer than the control algorithm. The reason is that the control algorithm determines the exact shortest distances, so the total generality of the graph is the smallest possible; since the sketch-based algorithm (like any other approximation algorithm) cannot compute a distance shorter than the real one, the generality of the graph cannot drop below this minimum.
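The interaction between the total generality and the convergence threshold can be pictured as a fixed-point loop. The sketch below is a generic stand-in: `update_generality` represents one propagation pass (a hypothetical interface, not the paper's API), and the loop stops when the change in total generality falls below the threshold, which is the mechanism that causes the varying iteration counts described above.

```python
def iterate_to_convergence(update_generality, g0, eps):
    """Repeat the generality update until the change in total generality
    (the sum over all nodes) drops below the threshold eps.

    update_generality: callable mapping {node: generality} to the next
    {node: generality}; a stand-in for one DIS-C pass (hypothetical)."""
    g = g0
    while True:
        g_next = update_generality(g)
        if abs(sum(g_next.values()) - sum(g.values())) < eps:
            return g_next
        g = g_next
```

With an exact APSP the update is deterministic, so the iteration count is fixed; an approximate distance oracle perturbs the update and can only lengthen, not shorten, the loop.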
In Figure 3, it can be seen that most of the proposed algorithms have a longer runtime than the control algorithm; this is primarily attributed to graph density: the generated graphs have few edges and can be considered sparse. Notice that only the sketch-based algorithm with k = 1 is faster than the control algorithm, and the acceleration is not significant. To improve the runtime significantly, it must be reduced by at least one order of magnitude, considering that the runtimes in the studied results are measured in days. Since the proposed algorithms are divided into two major processes (construction and query), the runtimes of these two processes were measured separately; Figure 4 depicts this breakdown. The query time is the dominant cost in the algorithms. When analyzing the complexity of the query process, we found that the complexity of the sketch-based algorithm depends on the size of the coverages, and these are a function of k, so the response when increasing the value of k is predictable, as shown in Figure 4. In the case of the pruned Dijkstra algorithm, the coverage size is huge for the more general nodes and decreases with decreasing generality. This causes some nodes to take longer to find their shortest distance, although this pattern speeds up the overall query time. Although a binary search can be used instead of a linear one, this does not remove the quadratic factor inherent in the APSP, whose behavior is reflected in the results. Considering this inconvenience, we conjecture that the graph generality can be computed using only the node coverages. This implies that solving the APSP to determine the generality is unnecessary, eliminating its quadratic factor. Based on this assumption, a framework for computing generality was established that only considers the node coverage, because generality was defined as the average of the conceptual distances from concept x to all other concepts, divided by the sum of the average conceptual distances over all concepts in the ontology. In the case of pruned Dijkstra, the more general concepts have larger coverages; their influence area is much larger than that of the less general concepts. This area can be used so that the generality of a node depends on the node itself. The principle of generality is respected by considering both the information that other concepts provide about concept x (the distances in the coverage L_in(x)) and the information that concept x provides to its related concepts (the distances in the coverage L_out(x)). This can be seen as a generality measure based on the extended neighborhood of the nodes represented by their coverages, which is possible because the covers map the topology of the graph. The generality computation was therefore performed as per Equation (4). Henceforth, the versions of DIS-C that use this type of generality are denoted as simple.
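Equation (4) is not reproduced in this excerpt. As a rough illustration of the shape of a coverage-based generality under the definition above (in-coverage distances relative to the combined in- and out-coverage distances), one might write the following; this is hypothetical, and the paper's Equation (4) may differ.

```python
def coverage_generality(L_in, L_out):
    """Hypothetical coverage-based generality: the average distance over the
    node's incoming coverage relative to the combined average incoming and
    outgoing coverage distances. Illustrates the shape of the computation
    only; not the paper's exact Equation (4).

    L_in, L_out: {covered node: distance} dictionaries for one concept."""
    avg_in = sum(L_in.values()) / len(L_in)
    avg_out = sum(L_out.values()) / len(L_out)
    return avg_in / (avg_in + avg_out)
```

The key point is that both dictionaries come from the node's coverages alone, so no all-pairs computation is needed.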
Figure 5 compares the total cost of running the simple algorithms with that of the others. An evident acceleration is obtained with respect to the first versions of the proposed algorithms.
For the simple pruned Dijkstra algorithm, there is a significant improvement in the runtime, reducing it by an order of magnitude, from days to hours, for the largest graph. For the sketch-based algorithms, there is a more predictable, straight-line relationship with k, i.e., the runtime for k = 2 is about twice that for k = 1, and so on.
An effective speedup is observed when using the simple version of the algorithms; conversely, the use of the proposed algorithms in their original form is not recommended in a time-limited context. It is important to emphasize that none of the algorithms uses any heuristic; all are exact procedures over the same input, and the results are obtained consistently. Consequently, a statistically significant analysis cannot detect differences between the proposed algorithms: with the same input, there are no sensed or monitored data to which descriptive or inferential statistics could be applied. The following section examines the usefulness of conceptual distance in practical situations, how much precision is lost in the computation, and whether this loss is compensated for by the acceleration obtained.

Conceptual Distance Results
In this section, the results of the conceptual distance obtained on the different datasets are analyzed, considering the previously proposed metrics. Correlation coefficients are frequently applied in knowledge representation systems; therefore, these coefficients are employed to compare the obtained results with other approaches. Before analyzing the whole group of datasets, we analyzed the results of a single dataset: MC30. The results of the pruned Dijkstra algorithm correspond exactly to those of the control algorithm because the process obtains identical shortest distances. The sketch-based algorithm adjusts adequately to the control conceptual distances; the accuracy of its results depends essentially on the value of k, and at higher values the difference from the original results decreases. This behavior yields a high correlation with the control algorithm and, in general, with the other versions of the algorithms.
Figures 6 and 7 show the Pearson and Spearman correlation matrices, respectively. The upper part of the diagonal represents the correlation between the words in the a − b direction (conceptual distance from word "a" to word "b"), while the lower half of the diagonal is the correlation in the b − a direction. Appendix C provides both the Pearson and Spearman correlation results. The matrices reveal a high correlation for all algorithms. Although the results are excellent in all cases, the correlation values with respect to the control algorithm are the most relevant for the analysis; these correspond to row 1 and column 1, with the values of a − b and b − a, respectively. By correlating the results of each algorithm with the control algorithm, an error analysis of each algorithm is implicitly performed: the higher the correlation, the lower the error, and vice versa, with the advantage of using a metric bounded to a specific interval.
In the first case, the lowest correlation value (0.94) corresponds to the sketch-based algorithm with k = 1; however, the correlation increases with the value of k. The same behavior is replicated in the Spearman matrix and in the results for the b − a direction of the concepts. For the pruned Dijkstra algorithms, the correlation of the standard version is 1, and the simple version obtains values close to 1 (0.99). These results indicate that using the standard or the simple version makes little difference, because in both cases the obtained correlation is very similar or, in some cases, identical.
All the algorithms produced a correlation considered high (greater than 0.7). For both types of correlation, the best-performing algorithm was pruned Dijkstra. The algorithm with the lowest correlation was the sketch-based algorithm with k = 1, with values ranging between 0.77 and 0.81; even if these cannot be considered bad results, they show a disparity compared to all the others. Focusing on the sketch-based algorithms, the pattern observed for the MC30 dataset is confirmed: increasing the value of k also increases the probability of obtaining a better correlation.
Finally, the harmonic values are shown in Table 5; these values unify the performance obtained by both correlations. As mentioned in the previous sections, this table shows that the pruned Dijkstra algorithm achieves the highest correlation values, while the performance of the sketch-based algorithms depends on the given value of the k parameter.

Conclusions and Future Work
This paper investigated the optimization of conceptual distance computation in an ontology using the DIS-C technique. When computing conceptual distances with a method that solves the APSP problem, both the pruned Dijkstra and the sketch-based algorithms can substitute the Dijkstra algorithm. However, the execution times obtained on the test datasets show that these two algorithms are not recommended when time constraints exist, because the growth of the runtime (as a function of the number of nodes in the graph) is above that obtained with the standard Dijkstra algorithm. Thus, to reduce the computational cost, a simple version of DIS-C was proposed, replacing the APSP subprocess that propagates the conceptual distances between all pairs of concepts with a generality computation that takes the vertex cover (sometimes referred to as node cover) into account. The experiments show that this simple version is up to 2.381 times faster than the original DIS-C approach. One of the main purposes of this paper was to keep the loss of precision in the distance computation small compared to the original DIS-C, and correlation values of 1.0 were obtained. The simple version can be useful in practical applications; if the highest precision in the distance calculation is required, the simple version with pruned Dijkstra is the best option. Moreover, using the complete topology of a graph to compute the conceptual distance in the DIS-C model is excessive and computationally costly; we have experimentally demonstrated that the vertex cover is a robust approximation for computing the generality.
On the other hand, if speed is a priority, the simple version of the sketch-based algorithm with moderate values of the parameter k (greater than 1 but less than 20) is a good alternative. The size of the conceptualization must also be considered; for fewer than 1000 concepts, the choice of algorithm is indifferent, and Dijkstra is the better option, since the computation time is not excessive and the calculation is exact, as proposed by the DIS-C model. An interesting way to exploit the speed of the sketch-based algorithm with low values of k (less than five) is when tests on systems using DIS-C must be performed quickly, or when the method is to be compared with others. The sketch-based approach can then serve as a reference guide: when good results are obtained with the simple version, it is reasonable to assume that the same results would be obtained by executing the original DIS-C approach (using the Dijkstra algorithm). This assumes that precision is the most crucial factor and that computation time is not a primary constraint. If time is a limitation, the pruned Dijkstra algorithm can instead replace Dijkstra, considering its high correlation.
Future work will be oriented toward evaluating the performance of these algorithms on other types of massive graphs, such as knowledge graphs, and analyzing the performance of the simple DIS-C approach with graphics processing unit (GPU) implementations. In addition, new advances in microprocessor architecture design, such as high-performance and high-efficiency cores, might be an exciting and suitable alternative to improve performance and computational time. Moreover, additional implementations in other application domains involving information theory could be considered to evaluate the effectiveness of the proposed method.

Figure A1 shows an example of the algorithm applied to the depicted graph, assuming that the order of execution is 1, 8, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12. The first pruned Dijkstra from vertex 1 visits all other vertices forwards (Figure A2a) and backwards (Figure A2b). In the next iteration, from vertex 8 (Figure A2c), when vertex 1 is visited, given that Q(8, 1, L_1) = δ(8, 1) = 1, vertex 1 is pruned and none of its neighbors is visited again; the same occurs in the opposite direction (see Figure A2d). With these two executions, it is possible to cover all the shortest paths, and after executing the process on the remaining vertices, they only refer to themselves.

Appendix C. Pearson and Spearman Correlation Results
According to the described experiments, Tables A1 and A2 present the results of the Pearson correlation obtained by applying all the algorithms and their configurations to the 11 datasets. In addition, Tables A3 and A4 show the values of the Spearman correlation.
Rada et al. (1989) [27]: dis_rada(a, b) = length of the path from a to b.
Wu and Palmer (1994) [28]: sim_wu(a, b) = 2N_c / (N_a + N_b + 2N_c), where a and b are concepts within the hierarchy, c is the least common super-concept of a and b, N_a is the number of nodes on the path from a to c, N_b is the number of nodes on the path from b to c, and N_c is the number of nodes on the path from c to the hierarchy root.
Hirst and St-Onge (1995) [29]: sim_hirst(a, b) = C − λ(a, b) − K · C_h(a, b), where C and K are constants, λ(a, b) is the length of the shortest path between a and b, and C_h(a, b) is the number of times the path changes direction.
Li et al. (2003) [30]: sim_li(a, b) = e^(−αλ(a,b)) · (e^(βh) − e^(−βh)) / (e^(βh) + e^(−βh)).

Figure 1. Comparison of the execution times of the APSP algorithms. This figure shows the number of operations (y-axis) required by each algorithm as the number of nodes increases (x-axis) [55-65].

Algorithm 5: Sketches_DIS-C, k sketches online.
Function sketchesDistance(u, v, S):
  foreach w ∈ s_u do
    if w ∈ s_v then
      d ← min(d, s_u(w) + s_v(w))

To solve the APSP problem, it is necessary to create another algorithm, executed for each pair of nodes (Algorithm 6: Sketches_DIS-C, k sketches online APSP), that invokes sketchesDistance(u, v, S).

Figure 4. Comparative offline and online processes using pruned Dijkstra and sketch-based algorithms.

Figure 5. Simple DIS-C runtime for the following algorithms: Dijkstra, pruned Dijkstra simple, and sketch-based simple.

Figure 6. The Pearson correlation matrix for the MC30 dataset.

Figure 7. The Spearman correlation matrix for the MC30 dataset.

Figure A1. Example of a graph for the pruned Dijkstra algorithm.

Table 1. Edge counting-based methods. dis ≡ conceptual distance; sim ≡ similarity between two concepts a and b.
Table 2 summarizes the IC-based methods; the table also gives the expressions used to calculate IC and briefly explains the variables involved. Hybrid methods combine two or more techniques for calculating semantic similarity, e.g., Quintero et al. (2019).

Table 3. Analytical runtimes for different APSP algorithms.

[Pruned Dijkstra pseudocode fragments: if query(u, u_k, L) ≤ d_u, the vertex is pruned (continue); otherwise, there is no previous path, so the current path is added to the index of the node being processed, and ordinary Dijkstra execution is performed. In the relaxation step, if d_uv < S_v, then S_v ← d_uv and (v, d_uv) is pushed onto the queue Q.]

Table 4. List of datasets used in the evaluation process.

Table 5. Harmonic values of the correlations.

Table A1. The Pearson correlation of all datasets in the direction a − b.

Table A2. The Pearson correlation of all datasets in the direction b − a.

Table A3. The Spearman correlation of all datasets in the direction a − b.

Table A4. The Spearman correlation of all datasets in the direction b − a.