Article

HitSim: An Efficient Algorithm for Single-Source and Top-k SimRank Computation

1
College of Information Technology, Shanghai Jian Qiao University, 1111 Hucheng Ring Road, Pudong New Area, Shanghai 201306, China
2
School of Computer Science and Technology, Donghua University, 2999 Renmin North Road, Songjiang District, Shanghai 201620, China
3
School of Information Management, Shanghai Lixin University of Accounting and Finance, 2800 Wenxiang Road, Songjiang District, Shanghai 201209, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2024, 15(6), 348; https://doi.org/10.3390/info15060348
Submission received: 19 March 2024 / Revised: 1 May 2024 / Accepted: 4 June 2024 / Published: 12 June 2024

Abstract

SimRank is a widely used metric for evaluating vertex similarity based on graph topology, with diverse applications such as large-scale graph mining and natural language processing. The objective of the single-source and top-k SimRank query problem is to retrieve the k vertices with the largest SimRank to the source vertex. However, existing algorithms are inefficient because they compute SimRank for all vertices to retrieve the top-k results. To address this issue, we propose an algorithm named HitSim that utilizes a branch and bound strategy for the single-source and top-k query. HitSim initially partitions vertices into distinct sets based on their shortest-meeting lengths to the source vertex. Subsequently, it computes an upper bound of SimRank for each set. If the upper bound of a set is no larger than the minimum value of the current top-k results, HitSim efficiently batch-prunes the unpromising vertices within the set. However, in scenarios where the graph becomes dense, certain sets with large upper bounds may contain numerous vertices with small SimRank, leading to redundant overhead when processing these vertices. To address this issue, we propose an optimized algorithm named HitSim-OPT that computes the upper bound of SimRank for each vertex instead of each set, resulting in a fine-grained and efficient pruning process. Experimental results on six real-world datasets demonstrate that our algorithms efficiently address the single-source and top-k query problem.

1. Introduction

In various real-world scenarios, measuring the similarity between entities is crucial. For instance, recommendation systems predict potential friendships from the similarity between individuals in a social network [1,2] or recommend items to users based on their behavior and preferences [3]. Security systems analyze email similarity to detect spam messages [4] or account similarity to identify fraudulent transactions [5]. Among existing similarity computation methods, those based on entity linkage relationships are most commonly employed, and among these SimRank is a widely used model for computing vertex similarity based on the topology of a directed graph. SimRank was introduced by Jeh and Widom [6] in 2002 and is formulated on two intuitive statements: (1) if two entities (vertices) are referenced by similar entities (i.e., in a directed graph, if the in-neighbors of two different vertices are similar or identical), then those two entities are also considered similar, and (2) an entity is most similar to itself. The classical algorithm, named the power method [6], along with its variations [7], serves as the foundation for computing SimRank between every pair of vertices. However, these algorithms suffer from time and space complexity of $\Omega(n^2)$, where $n$ represents the total number of vertices in the graph $G$, as there exist $\Omega(n^2)$ vertex pairs in $G$. To tackle this challenge, Ref. [8] proposed the single-source and top-k query problem, which focuses on efficiently retrieving the $k$ vertices with the largest SimRank to a specific source vertex. Existing approaches [8,9,10,11,12,13,14,15,16] for the single-source and top-k query problem primarily utilize the random walk method, resulting in notable improvements in both time and space efficiency. Ref. [10] proposed two heuristic algorithms employing truncated random walk and prioritized propagation strategies. Ref. [11] devised an index for SimRank computation based on the $\sqrt{c}$-walk, at the cost of notable space and preprocessing overhead. Ref. [12] proposed an index-free algorithm named ProbeSim that outperforms the index-based approaches. The state-of-the-art algorithm CrashSim [13] improves the computational efficiency of SimRank in ProbeSim by truncating walk lengths.
Challenges. In the single-source and top-k query problem, existing approaches require computing the SimRank for all the vertices and subsequently sorting the top-k results in descending order based on their SimRank. However, computing SimRank for vertices requires sampling a large number of random walks, which is a time-consuming operation. Moreover, in real-world applications, the desired result scale k specified by users is often much smaller than the vertex scale n of the network. Therefore, computing SimRank for vertices with negligible SimRank value is redundant.
Our approach. Motivated by the above observations, we propose a novel algorithm called HitSim for the single-source and top-k query problem, which employs a branch and bound strategy. Specifically, by leveraging the inherent property that the SimRank of a vertex will decrease as its meeting length increases, HitSim partitions vertices into distinct sets based on their shortest-meeting lengths (Definition 5) and computes the upper bound of SimRank for each set. To reduce the redundant computation for the vertices with negligible SimRank, HitSim preferentially processes the vertices within the set with a larger upper bound. If the upper bound of a set is less than the minimum SimRank of the current results, the vertices within the set can be efficiently pruned in batch.
However, in scenarios where the graph becomes dense, the number of vertices within the same set can grow rapidly. This may result in a scenario where vertices with the same shortest-meeting lengths but significantly different SimRank are partitioned into the same set. Consequently, the efficiency will decrease since the algorithm may preferentially process the vertices with the shortest-meeting lengths but with a small SimRank. To address this issue of inefficient pruning in HitSim, we propose an optimized algorithm called HitSim-OPT. HitSim-OPT computes the upper bound of SimRank for each vertex, allowing for fine-grained pruning of vertices. Similar to HitSim, HitSim-OPT maintains the minimum SimRank of the current top-k results and prunes unpromising vertices by comparing their upper bounds with the minimum value. Our contributions are as follows:
  • We propose an efficient algorithm, named HitSim, based on a branch and bound strategy to answer the single-source and top-k query. HitSim can efficiently prune the unpromising vertices within a set in batch, reducing redundant computation and improving overall efficiency.
  • We further propose an optimized algorithm, named HitSim-OPT. By computing the upper bound of SimRank for each vertex, HitSim-OPT performs a fine-grained pruning strategy, resulting in further improvements in efficiency.
  • We conduct experiments on six real-world datasets. The experimental results show that our algorithms can efficiently answer the single-source and top-k query.
Organization. The rest of this paper is organized as follows. Section 2 provides some preliminaries. In Section 3, we give a review of existing works. In Section 4, we propose an efficient algorithm based on a branch and bound strategy for the single-source and top-k query. In Section 5, we further propose an optimized algorithm. Section 6 shows the experimental results. Section 7 concludes this paper.

2. Preliminaries

In this section, we formally introduce the notation and definitions. Mathematical notations used throughout this paper are summarized in Table 1.
Definition 1
(SimRank). Given two vertices $u$ and $v$ in a directed graph $G = (V, E)$, the SimRank of $u$ and $v$, denoted as $s(u, v)$, is defined as
$$s(u, v) = \begin{cases} 1, & u = v \\ \sum_{p \in I(u)} \sum_{q \in I(v)} \dfrac{c \times s(p, q)}{|I(u)| \times |I(v)|}, & u \neq v \end{cases} \tag{1}$$
where $I(u)$ denotes the set of in-neighbors of $u$, and $c \in (0, 1)$ is a decay factor [12,13].
For example, in $G_1$ of Figure 1, if the decay factor $c = 0.64$, according to Equation (1), we have $s(v_0, v_0) = 1$. The SimRank between $v_1$ and $v_2$ is $s(v_1, v_2) = \sum_{p \in I(v_1)} \sum_{q \in I(v_2)} \frac{c \times s(p, q)}{|I(v_1)| \times |I(v_2)|} = 0.64$. Similarly, $s(v_1, v_3) = s(v_2, v_3) = 0.64$, $s(v_4, v_5) = s(v_5, v_6) = 0.486$, $s(v_4, v_6) = 0.467$, and the SimRank between other vertex pairs is 0.
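To make Equation (1) concrete, the following minimal sketch iterates the definition from the identity start, in the spirit of the power method [6]. The function name `simrank`, the iteration count, and the three-vertex graph are assumptions made for illustration; the graph is not $G_1$. Pairs in which either vertex has no in-neighbors are assigned 0, which is the usual convention.

```python
def simrank(in_nbrs, c=0.64, iters=20):
    """Iterate Equation (1); in_nbrs maps each vertex to its list of in-neighbors."""
    nodes = list(in_nbrs)
    # Start from s(u, u) = 1 and s(u, v) = 0 for u != v.
    s = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        nxt = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    nxt[(u, v)] = 1.0
                elif in_nbrs[u] and in_nbrs[v]:
                    total = sum(s[(p, q)] for p in in_nbrs[u] for q in in_nbrs[v])
                    nxt[(u, v)] = c * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
                else:
                    nxt[(u, v)] = 0.0  # no in-neighbors: similarity is 0 by convention
        s = nxt
    return s


# Hypothetical toy graph (not G_1): a -> b and a -> c.
toy_in_nbrs = {"a": [], "b": ["a"], "c": ["a"]}
toy_scores = simrank(toy_in_nbrs)
print(toy_scores[("b", "c")])
```

Here `b` and `c` share their only in-neighbor `a`, so $s(b, c) = c \times s(a, a) = 0.64$, the same pattern as the value quoted for $s(v_1, v_2)$. The nested loops touch all $n^2$ pairs, which is the cost the single-source formulation tries to avoid.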
Definition 2
(Reverse random walk). Given vertex $u$ in directed graph $G$, a reverse random walk from $u$ is a sequence of vertices $W(u) = (u_0, u_1, u_2, \ldots)$, such that $u_{i+1}$ ($i \geq 0$) is selected uniformly at random from the in-neighbors of $u_i$.
Definition 3
($\sqrt{c}$-walk [11]). Let $c$ denote the decay factor. A $\sqrt{c}$-walk in $G$ is a reverse random walk such that (1) in each step, the walk stops with probability $1 - \sqrt{c}$; (2) with the remaining probability $\sqrt{c}$, one of the in-neighbors of the current vertex is selected uniformly at random as the next step. We denote a $\sqrt{c}$-walk starting from $u$ as $W(u) = (u_0, u_1, \ldots, u_i, \ldots)$, where $u_0 = u$.
According to Definition 3, Ref. [11] also defined the SimRank estimation $\bar{s}(u, v)$ as the total probability that the $\sqrt{c}$-walk $W(u)$ starting from $u$ meets the $\sqrt{c}$-walk $W(v)$ starting from $v$, i.e., $\bar{s}(u, v) = \Pr[W(u) \text{ and } W(v) \text{ meet}]$.
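The meeting-probability view can be checked numerically with a small Monte Carlo sketch. Everything named below (`sqrt_c_walk`, `mc_simrank`, the cap `max_len`, the seed, and the toy graph) is an assumption made for illustration and is not taken from any of the cited implementations. Two walks are counted as meeting when they occupy the same vertex at the same step.

```python
import math
import random


def sqrt_c_walk(in_nbrs, start, c, rng, max_len=50):
    """Sample one sqrt(c)-walk: stop with probability 1 - sqrt(c) at every step."""
    walk = [start]
    cur = start
    while len(walk) <= max_len:  # max_len is an assumed safety cap
        if rng.random() >= math.sqrt(c) or not in_nbrs[cur]:
            break  # the walk stops, or there is no in-neighbor to move to
        cur = rng.choice(in_nbrs[cur])
        walk.append(cur)
    return walk


def mc_simrank(in_nbrs, u, v, c=0.64, num=20000, seed=0):
    """Estimate Pr[W(u) and W(v) meet] as the fraction of trials with a meeting."""
    rng = random.Random(seed)
    cnt = 0
    for _ in range(num):
        wu = sqrt_c_walk(in_nbrs, u, c, rng)
        wv = sqrt_c_walk(in_nbrs, v, c, rng)
        # A meeting requires the same vertex at the same step index.
        if any(a == b for a, b in zip(wu, wv)):
            cnt += 1
    return cnt / num


# Hypothetical toy graph: a -> b and a -> c.
toy_in_nbrs = {"a": [], "b": ["a"], "c": ["a"]}
estimate_bc = mc_simrank(toy_in_nbrs, "b", "c")
```

On this toy graph both walks must take their first step to meet at `a`, which happens with probability $\sqrt{c} \cdot \sqrt{c} = c$, so the estimate should be close to 0.64 up to sampling noise.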
Definition 4
(First-meeting probability [12]). Given a reverse path $\sqrt{c}$-walk $W(u) = (u_0, \ldots, u_i)$ starting from $u$ ($u_0 = u$) and $v \in V$ ($v \neq u_0$), the first-meeting probability of $v$ with respect to $W(u)$ is defined as
$$P(v, W(u)) = \Pr_{W(v)}[v_i = u_i, v_{i-1} \neq u_{i-1}, \ldots, v_0 \neq u_0], \tag{2}$$
where $W(v) = (v_0, \ldots, v_i, \ldots)$ is a random $\sqrt{c}$-walk starting from $v_0 = v$.
According to Definitions 3 and 4, Ref. [12] defined $\bar{s}(u, v)$ as the total probability that the $\sqrt{c}$-walk $W(u)$ starting from $u$ and the $\sqrt{c}$-walk $W(v)$ starting from $v$ first meet at each vertex $u_i$, i.e.,
$$\bar{s}(u, v) = \Pr[W(u) \text{ and } W(v) \text{ meet}] = \sum_{i} \Pr[W(u) \text{ and } W(v) \text{ first meet at } u_i]. \tag{3}$$
Definition 5
(Shortest-meeting length). Given any two $\sqrt{c}$-walks $W(v) = (v_0, \ldots, v_t, \ldots)$ starting from $v$ ($v_0 = v$) and $W(u) = (u_0, \ldots, u_t, \ldots)$ starting from $u$ ($u_0 = u$), with $u \in V$, $v \in V$, and $v \neq u$, the shortest-meeting length between $u$ and $v$ is defined as the least $t$ such that the probability of the two walks meeting after $t$ steps is nonzero.
For example, in Figure 1, any two $\sqrt{c}$-walks $W(v_4)$ starting from $v_4$ and $W(v_5)$ starting from $v_5$ may first meet at $v_1$ or $v_2$ after one step, or at $v_0$ after two steps. Thus, the shortest-meeting length between them is 1.
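Definition 5 can be read as a level-by-level reachability test: two walks can meet after $t$ steps with nonzero probability exactly when the sets of vertices reachable by $t$ reverse steps from $u$ and from $v$ intersect. The sketch below encodes this reading; the function name, the cap `l_max`, and the five-vertex graph are assumptions made for illustration (the graph is not Figure 1).

```python
def shortest_meeting_length(in_nbrs, u, v, l_max):
    """Least t <= l_max at which the t-step reverse frontiers of u and v intersect."""
    level_u, level_v = {u}, {v}
    for t in range(1, l_max + 1):
        # Advance both frontiers by exactly one reverse step.
        level_u = {p for x in level_u for p in in_nbrs[x]}
        level_v = {p for x in level_v for p in in_nbrs[x]}
        if level_u & level_v:
            return t
    return None  # no meeting is possible within l_max steps


# Hypothetical toy graph: a -> b, a -> c, b -> d, c -> e.
toy_in_nbrs = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"]}
print(shortest_meeting_length(toy_in_nbrs, "b", "c", 3))  # b and c share in-neighbor a
print(shortest_meeting_length(toy_in_nbrs, "d", "e", 3))  # d and e first share a after two steps
```

Computing this per pair would be wasteful for a single-source query, which is why the partitioning step described later works outward from the source instead.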
Definition 6
(Approximate single-source SimRank query). In the graph $G = (V, E)$, given a source $u$, the average absolute error $\varepsilon$ allowed in SimRank computation, and the failure probability $\delta$, an approximate single-source SimRank query returns, for any vertex $v$ ($v \neq u$), an estimate $\bar{s}(u, v)$ of the ground-truth SimRank $s(u, v)$ that satisfies
$$\Pr\{|s(u, v) - \bar{s}(u, v)| \leq \varepsilon\} \geq 1 - \delta. \tag{4}$$
Problem Statement. (Approximate single-source and top-k query) In graph $G = (V, E)$, given source vertex $u$, decay factor $c$, and integer $k$, return the top-k vertices $V_k(u) = \{v_i \neq u, 1 \leq i \leq k\}$ with the largest SimRank.

3. Related Works

In this section, we review the state-of-the-art algorithms for the single-source and top-k query problem, which are based on the widely used random walk method.
Ref. [11] first proposed the SLING algorithm, which utilizes the $\sqrt{c}$-walk to compute SimRank. Subsequently, several studies [9,11,14] adopted the Monte Carlo (MC) method to sample $\sqrt{c}$-walks for each vertex $v$ and source $u$ a certain number of times. By counting the number of times they meet ($cnt$) out of the total number of walks sampled ($num$), the SimRank is obtained as $\bar{s}(u, v) = cnt / num$. After computing the SimRanks with source $u$ for all vertices and sorting them in descending order, the algorithms return the top-k results. Ref. [15] proposed the TSF algorithm based on the MC method. TSF constructs an index comprising one-way graphs that contain the coupling of random walks of length $T$ from each vertex. This approach helps reduce storage space requirements. To address the computational cost of simulations, Ref. [16] proposed an index-based algorithm called READS. The index consists of compressed $\sqrt{c}$-walks, which significantly improves the efficiency of queries. MC-based algorithms waste work when many vertices have negligible SimRank: the $\sqrt{c}$-walks of such vertices, within a limited length, do not meet the path of the source $u$, so processing them is redundant. To tackle this issue, an index-free algorithm called ProbeSim was proposed by Ref. [12]. Instead of sampling $\sqrt{c}$-walks from each vertex $v$ to determine whether $v$ and source $u$ can meet at any $u_i$ within the $\sqrt{c}$-walk $W(u)$ from $u$, ProbeSim conducts a graph traversal from each $u_i$ to identify vertices that have a non-negligible probability of walking to $u_i$. This process is repeated for $n_r$ iterations to obtain results with approximate guarantees. By avoiding the processing of unpromising vertices, this approach significantly reduces computational overhead. However, it requires generating a large number of probing trees to determine whether $W(u)$ can meet every $v$ at each step of the $\sqrt{c}$-walk $W(u)$ starting from $u$.
To overcome this issue, Ref. [13] proposed an improved algorithm called CrashSim that builds upon the principles of ProbeSim. The main idea behind CrashSim is to consider the SimRank as the average probability of two $\sqrt{c}$-walks meeting within a limited length. CrashSim first computes a reverse reachable tree of source vertex $u$ with a limited $\sqrt{c}$-walk length, denoted as $l_{max}$. It then iteratively generates a $\sqrt{c}$-walk for each vertex $v$ and determines whether it can meet the limited $\sqrt{c}$-walk path from $u$ with a non-negligible probability. This process is repeated for $n_r$ iterations to obtain approximate results with certain guarantees. By traversing the reverse reachable tree for only the source vertex $u$ instead of exploring the entire graph for each vertex $u_i \in V \setminus \{u\}$, CrashSim significantly reduces redundant computations.
However, the above-mentioned algorithms require computing the SimRank for each vertex $u_i \in V \setminus \{u\}$, sorting them in descending order, and returning the top-k results. Such a method results in significant redundant computational overhead when querying for a small value of $k$, considering that $k$ is typically much smaller than the total number of vertices $n$ in real-world applications.

4. HitSim

Although the state-of-the-art algorithm CrashSim reduces redundant computation by traversing the reverse reachable tree for only source vertex $u$, it still generates $\sqrt{c}$-walks for each vertex $v$ to determine whether it can meet the $\sqrt{c}$-walks from $u$ within a limited length. However, since the parameter $k$ is typically much smaller than the total number of vertices $n$, processing unpromising vertices with negligible SimRank values is redundant. To overcome this issue, we propose an efficient algorithm called HitSim. It utilizes a branch and bound strategy to batch-prune unpromising vertices to avoid the redundant processing. The algorithm is implemented in three steps, as follows.
Step 1: Computing the reverse reachable tree U. In the first step of HitSim, we perform the same approach as CrashSim [13]. This involves computing the reverse reachable tree from the source vertex $u$ and generating a matrix $U$. Each element $U_{step}(v)$ in the matrix represents the probability of the $\sqrt{c}$-walk $W(u)$ stopping at vertex $v$ by walking $step$ steps.
Example 1.
We illustrate the computation of the reverse reachable tree $U$ using the graph $G_2$ shown in Figure 2. Given the source $v_0$, we set the limited length $l_{max}$ of the $\sqrt{c}$-walk $W(v_0)$ to 4 and the decay factor $c$ to 0.25 ($\sqrt{c} = 0.5$). The algorithm computes the reverse reachable tree of $v_0$, which is shown in Figure 3. Note that the in-neighbor of vertex $v$ that is equal to the parent of $v$ is ignored to avoid recomputing the probability due to the cycle in the graph [13]. Simultaneously, it computes the probability of the $\sqrt{c}$-walk stopping at different vertices with different lengths. For level 0, it sets the probability $U_0(v_0) = 1$. Next, for level 1, $U_1(v_1) = U_0(v_0) \cdot \frac{\sqrt{c}}{|I(v_1)|} = 1 \cdot \frac{0.5}{2} = 0.25$ and $U_1(v_2) = U_0(v_0) \cdot \frac{\sqrt{c}}{|I(v_2)|} = 1 \cdot \frac{0.5}{3} = 0.167$. Similarly, for level 2, $U_2(v_4) = 0.0625$, $U_2(v_1) = 0.0417$, $U_2(v_3) = 0.0417$, and for level 3, $U_3(v_7) = 0.0156$, $U_3(v_0) = 0.0104$, $U_3(v_4) = 0.0104$, $U_3(v_1) = 0.0104$.
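The level-by-level arithmetic of Example 1 can be sketched as follows. This is only an illustration of the update rule used in the example (multiply the parent's probability by $\sqrt{c}$ and divide by the in-degree of the newly reached vertex, skipping the in-neighbor that equals the parent); it is not the revReach routine of CrashSim [13]. The function name and the adjacency lists are assumptions: the graph is a hypothetical one chosen so that its in-degrees reproduce the numbers quoted in the example, and it is not claimed to be $G_2$.

```python
import math


def rev_reach_levels(in_nbrs, u, c, l_max):
    """Return levels[l] = list of (vertex, parent, probability) tree nodes at level l."""
    sqrt_c = math.sqrt(c)
    levels = [[(u, None, 1.0)]]  # level 0 holds only the source with probability 1
    for _ in range(l_max):
        nxt = []
        for vertex, parent, prob in levels[-1]:
            for w in in_nbrs[vertex]:
                if w == parent:
                    continue  # ignore the in-neighbor that equals the parent
                nxt.append((w, vertex, prob * sqrt_c / len(in_nbrs[w])))
        levels.append(nxt)
    return levels


# Hypothetical adjacency (vertex -> in-neighbors), assumed for illustration only.
toy_in_nbrs = {
    "v0": ["v1", "v2"], "v1": ["v0", "v4"], "v2": ["v0", "v1", "v3"],
    "v3": ["v1", "v2"], "v4": ["v1", "v7"], "v5": ["v2"],
    "v6": ["v2"], "v7": ["v3", "v4"],
}
levels = rev_reach_levels(toy_in_nbrs, "v0", c=0.25, l_max=3)
# p[l] is the maximum probability found at level l of the tree.
p = [max((prob for _, _, prob in level), default=0.0) for level in levels]
```

Each tree node is kept as its own entry rather than merged per vertex, so the sketch does not take a position on how repeated vertices within one level are combined.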
Step 2: Partitioning vertices. Based on CrashSim, assuming the $x$th sampling, if a $\sqrt{c}$-walk starting from $v$, i.e., $W(v) = (v_0, \ldots, v_t, \ldots, v_i)$ ($v = v_0$, $t \in [1, i]$, $1 \leq i \leq l_{max}$), first meets the $\sqrt{c}$-walk $W(u)$ starting from source $u$ at $v_t$, then
$$\bar{s}_x(u, v) = U_0(v_0) + \ldots + U_t(v_t) + \ldots + U_i(v_i). \tag{5}$$
Lemma 1.
Given a graph $G = (V, E)$ and a source vertex $u$, if there exists a vertex $v$ ($v \neq u$), and its shortest-meeting length is $t$ ($1 \leq t \leq l_{max}$), then the upper bound of SimRank for $v$, denoted as $B_v$, is $\sum_{l=t}^{l_{max}} p_l$, where $p_l$ represents the maximum probability in level $l$ of the reverse reachable tree of the source $u$.
Proof of Lemma 1. 
Since the shortest-meeting length of $v$ is $t$ ($1 \leq t \leq l_{max}$), any $\sqrt{c}$-walk $W_x(v) = (v_0, \ldots, v_t, \ldots, v_i)$ ($v_0 = v$, $t \in [1, i]$, $1 \leq i \leq l_{max}$) starting from $v$ will not meet the $\sqrt{c}$-walk $W_x(u)$ starting from source $u$ before $v_t$. Then, we have
$$U_0(v_0) = \ldots = U_{t-1}(v_{t-1}) = 0. \tag{6}$$
Furthermore, $p_l$ is the maximum probability in level $l$ of the reverse reachable tree of source $u$. Then,
$$U_t(v_t) \leq p_t, \ldots, U_i(v_i) \leq p_i, \quad (t \in [1, i], 1 \leq i \leq l_{max}). \tag{7}$$
Based on Equations (5)–(7), for any $\sqrt{c}$-walk $W_x(v) = (v_0, \ldots, v_t, \ldots, v_i)$ ($v_0 = v$, $t \in [1, i]$, $1 \leq i \leq l_{max}$) starting from $v$, we have
$$\bar{s}_x(u, v) \leq \sum_{l=t}^{i} p_l, \quad (t \in [1, i], 1 \leq i \leq l_{max}). \tag{8}$$
Since $i \leq l_{max}$, then $\sum_{l=t}^{i} p_l \leq \sum_{l=t}^{l_{max}} p_l$. The average value after $n_r$ trials is
$$\bar{s}(u, v) \leq \sum_{l=t}^{l_{max}} p_l. \tag{9}$$
Thus, for any vertex $v$, the upper bound of SimRank is $B_v = \sum_{l=t}^{l_{max}} p_l$, where $t$ is the shortest-meeting length, i.e., Lemma 1 holds.    □
According to Lemma 1, for any two different vertices $v$ and $w$ ($v \neq u$, $w \neq u$), if the shortest-meeting lengths of $v$ and $w$ are $t_v$ and $t_w$, respectively, then $B_v = \sum_{l=t_v}^{l_{max}} p_l$ and $B_w = \sum_{l=t_w}^{l_{max}} p_l$. If $t_v < t_w$, then $B_v = \sum_{l=t_v}^{l_{max}} p_l > B_w = \sum_{l=t_w}^{l_{max}} p_l$. From the above analysis, we have Observation 1 as follows:
Observation 1.
For any vertex $v$, the smaller the shortest-meeting length, the larger the upper bound $B_v$.
According to Lemma 1, vertices with the same shortest-meeting length share the same upper bound. By partitioning the vertices into distinct sets based on their shortest-meeting lengths, we ensure that vertices within the same set share the same upper bound, i.e., for any vertex $v$ in the same set $M$, $B_v = B_M$, where $B_M$ is the upper bound of SimRank for set $M$. If we identify a set $M$ satisfying $B_M \leq \bar{s}_{min}$, where $\bar{s}_{min}$ is the minimum SimRank value of the current results, we can safely prune the unpromising vertices within $M$ in batch. According to Observation 1, we should preferentially process sets with smaller shortest-meeting lengths. Such preprocessing allows the algorithm to preferentially compute SimRank for vertices with a larger SimRank upper bound, leading to the effective pruning of unpromising vertices.
Step 3: Computing $\bar{s}(u, v)$ and maintaining the top-k results. Following step 2, HitSim generates $\sqrt{c}$-walks and computes the SimRank only for the vertices within the sets whose upper bounds exceed $\bar{s}_{min}$. Simultaneously, HitSim maintains the top-k results and the current minimum SimRank $\bar{s}_{min}$ by sorting the results based on their SimRank in descending order.
Since step 1 is identical to CrashSim, we will omit its detailed description. The detailed descriptions of steps 2 (Section 4.1) and 3 (Section 4.2) are as follows.

4.1. Partitioning Vertices

We now describe the ParVer algorithm, which is used to partition vertices into distinct sets based on their shortest-meeting lengths with the source vertex. Given a graph $G = (V, E)$, a source vertex $u \in V$, a limited length $l_{max}$ of the $\sqrt{c}$-walk, and a probability matrix $U$, ParVer returns a vertex set $M_t$ and the upper bound of SimRank $B_{M_t}$ for each $t = 1, \ldots, l_{max}$, where $t$ is the shortest-meeting length of the vertices within $M_t$. Notably, $M_t$ does not include any vertices from the set $M_{t-1}$.
The pseudo-code of ParVer is depicted in Algorithm 1. It initializes a hashset $M_t$ for each $t = 1, \ldots, l_{max}$ to store the vertices with shortest-meeting lengths equal to $t$ (line 1). To avoid revisiting vertices of the same level, it requires a hashset $D_u^t$ for each $t = 1, \ldots, l_{max}$ to store the visited vertices (line 2). The probabilities $p_t$ and $B_{M_t}$ are initialized to store the maximum value of $U_t$ and the upper bound of SimRank of $M_t$ for each $t = 1, \ldots, l_{max}$ (line 3). To identify all the vertices within $M_t$ for each $t = 1, \ldots, l_{max}$, it iterates over each vertex $tpr$ in level $t$ of the reverse reachable tree of source $u$ (lines 4 to 20). For each vertex $tpr$ in level $t$ of the reverse reachable tree of source $u$, it records the current maximum probability $p_t$ in level $t$ (line 6). Then, it performs forward walks of $t$ steps from $tpr$ to obtain each vertex $v$ whose shortest-meeting length is $t$ (lines 7 to 20). Specifically, it first initializes a queue $Q$ to store the vertices that $tpr$ visits in $t$-step forward walks. Then, it puts $tpr$ into $Q$ and $D_u^i$, where $i$ represents the current level. Next, it forward walks from the vertex $tpr$ and records the size of $Q$. With each iteration, the step $i$ is reduced by 1. Then, it visits $Q$ and pops the top element $q$ of $Q$. For each out-neighbor $v$ of $q$, if the step of the current forward walk does not exceed the maximum walk length $i$ and $v$ has not been visited before, then $v$ is inserted directly into $D_u^i$ and $Q$. If $v$ does not exist in any set that stores vertices with shortest-meeting lengths from 0 to $t - 1$ and $v \neq u$, it is added to $M_t$. Then, all the vertices within $M_t$ for each $t = 1, \ldots, l_{max}$ are obtained. Afterward, it computes the upper bound $B_{M_t}$ based on Lemma 1 (lines 21 to 22). Finally, ParVer returns the vertex set $M_t$ and its upper bound $B_{M_t}$ for each $t = 1, \ldots, l_{max}$ (line 23).
Algorithm 1: ParVer
[Pseudo-code listing of Algorithm 1; image in the source, not reproduced.]
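As a rough illustration of the partitioning idea, the sketch below follows Definition 5 directly: for each level $t$ it walks forward exactly $t$ steps from the vertices in level $t$ of the source's reverse reachable tree and keeps the vertices not already assigned to an earlier set. It is not a transcription of Algorithm 1 (it omits the queue $Q$ and the visited sets), and the function name `partition_vertices`, the `tree_levels` input, and the adjacency lists are assumptions. The graph is a hypothetical one chosen to be consistent with the values quoted in the examples, and it is not claimed to be $G_2$.

```python
def partition_vertices(out_nbrs, u, tree_levels, l_max):
    """Return {t: M_t}, where tree_levels[t] is the vertex set at level t of u's tree."""
    assigned = set()
    sets = {}
    for t in range(1, l_max + 1):
        frontier = set(tree_levels[t])
        for _ in range(t):  # walk forward exactly t steps
            frontier = {w for x in frontier for w in out_nbrs.get(x, ())}
        # Keep vertices that are not the source and not in an earlier set.
        sets[t] = {v for v in frontier if v != u and v not in assigned}
        assigned |= sets[t]
    return sets


# Hypothetical adjacency (vertex -> out-neighbors), assumed for illustration only.
toy_out_nbrs = {
    "v0": ["v1", "v2"], "v1": ["v0", "v2", "v3", "v4"],
    "v2": ["v0", "v3", "v5", "v6"], "v3": ["v2", "v7"],
    "v4": ["v1", "v7"], "v5": [], "v6": [], "v7": ["v4"],
}
# Level contents of the source's tree, as listed in Example 1.
toy_tree_levels = {1: {"v1", "v2"}, 2: {"v4", "v1", "v3"}, 3: {"v7", "v0", "v4", "v1"}}
toy_sets = partition_vertices(toy_out_nbrs, "v0", toy_tree_levels, l_max=3)
```

Processing levels in increasing order is what makes the "not already assigned" test sufficient: a vertex lands in the set of the smallest $t$ at which it can meet the source.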
Example 2.
Considering the graph $G_2$ shown in Figure 2, we illustrate step 2, which involves the partitioning of vertices. Assume that the source vertex is $v_0$. For simplicity, we set $l_{max} = 3$ and the decay factor $c = 0.25$ ($\sqrt{c} = 0.5$). In step 1, it computes the reverse reachable tree $U$, as shown in Example 1. Continuing from Example 1, in step 2, it partitions vertices of $G_2$ into distinct sets based on their shortest-meeting lengths with source $v_0$. Starting with level 1 of the reverse reachable tree, it forward walks one step from the first vertex $v_1$. Consequently, it inserts $v_1$'s out-neighbors $v_2$, $v_3$, and $v_4$ into set $M_1$. It omits $v_0$ since $v_0$ is the source vertex. Next, it forward walks one step from the second vertex $v_2$ in level 1. It inserts $v_2$'s out-neighbors $v_5$ and $v_6$ into set $M_1$ and omits $v_3$ since $v_3$ is already in $M_1$. Thus, we have $M_1 = \{v_2, v_3, v_4, v_5, v_6\}$, and the maximum probability is $p_1 = \max\{U_1(v_1), U_1(v_2)\} = 0.25$. Moving to level 2, it forward walks two steps from $v_4$, $v_1$, and $v_3$, respectively. This leads to $M_2 = \{v_1, v_7\}$. Note that $v_2$, $v_3$, $v_4$, $v_5$, and $v_6$ are not inserted into $M_2$ since they are already in $M_1$. The maximum probability is computed as $p_2 = \max\{U_2(v_4), U_2(v_1), U_2(v_3)\} = 0.0625$. Lastly, in level 3, it forward walks three steps from $v_7$, $v_0$, $v_4$, and $v_1$, respectively. As a result, $M_3 = \emptyset$, indicating that no new vertices are added to $M_3$. The maximum probability is computed as $p_3 = \max\{U_3(v_7), U_3(v_0), U_3(v_4), U_3(v_1)\} = 0.0156$. Finally, according to Lemma 1, the SimRank upper bounds of sets $M_1$ and $M_2$ are computed as $B_{M_1} = \sum_{l=1}^{3} p_l = 0.25 + 0.0625 + 0.0156 = 0.3281$ and $B_{M_2} = \sum_{l=2}^{3} p_l = 0.0625 + 0.0156 = 0.0781$, respectively.
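The bounds at the end of Example 2 are suffix sums of the per-level maxima, so all of them can be produced in one backward pass. The short sketch below reproduces that arithmetic with the $p_l$ values quoted in the example; the function name `set_upper_bounds` and the list layout (index 0 unused) are assumptions made for illustration.

```python
def set_upper_bounds(p):
    """p[l] is the maximum probability at level l (index 0 unused); returns {t: B_{M_t}}."""
    bounds, running = {}, 0.0
    for t in range(len(p) - 1, 0, -1):  # accumulate from the deepest level upward
        running += p[t]
        bounds[t] = running
    return bounds


p = [0.0, 0.25, 0.0625, 0.0156]  # p_1, p_2, p_3 as quoted in Example 2
bounds = set_upper_bounds(p)
print(bounds)
```

Because every $p_l$ is non-negative, the resulting bounds never increase with $t$, which is the property Observation 1 relies on.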

4.2. Computing s ¯ ( u , v ) and Maintaining the Top-k Results

The complete pseudo-code of HitSim is illustrated in Algorithm 2. Given a graph $G = (V, E)$, a source $u \in V$, an integer $k$, a decay factor $c$, a limited length $l_{max}$ of the $\sqrt{c}$-walk, an average absolute error $\varepsilon$, and a failure probability $\delta$, HitSim returns the queue $Q$ of the top-k results. It first invokes the revReach algorithm (from CrashSim [13]) to construct the reverse reachable tree of source $u$ and return a matrix $U$ (line 1). Then, it invokes the ParVer algorithm (Algorithm 1) to partition vertices into distinct sets (line 2). Step 3, which involves computing $\bar{s}(u, v)$ and maintaining the top-k results, starts from line 3 of Algorithm 2.
In step 3, the algorithm initializes a priority queue $Q$ to store the results (line 4). Based on Observation 1, for any vertex $v$, a smaller shortest-meeting length corresponds to a larger $B_v$. Thus, HitSim processes the sets in ascending order of their shortest-meeting lengths $t$ in order to obtain the top-k results as early as possible (lines 5 to 20). Specifically, for each vertex $v$ in $M_t$, it runs $n_r = \frac{3c}{\varepsilon^2} \log \frac{n}{\delta}$ independent trials (lines 6 to 16). The value of $n_r$, which represents the minimum number of iterations that guarantees an error less than $\varepsilon$ with probability at least $1 - \delta$, is defined by [13]. During each iteration, it generates a $\sqrt{c}$-walk starting from $v$ and limits the length of the walk to $l_{max}$ (lines 8 to 9). Then, for the $\sqrt{c}$-walk with length $i$, where $i \in [1, l_{max}]$, it accumulates the total first-meeting probability of this walk meeting the $\sqrt{c}$-walk starting from $u$ at the $i$th element of $W(v)$ (lines 10 to 12). After completing the $n_r$ trials, it computes the average of the results to obtain the final SimRank $\bar{s}(u, v) = \frac{1}{n_r} \sum_{x=1}^{n_r} \bar{s}_x(u, v)$ and inserts $(v, \bar{s}(u, v))$ into $Q$ (lines 14 to 15). When the size of the current results is $k$, it checks whether the upper bound $B_{M_{t+1}}$ of the next set to be processed is larger than $\bar{s}_{min}$. If it is, HitSim continues generating $\sqrt{c}$-walks for the vertices within $M_{t+1}$. Otherwise, it terminates and returns the current top-k results (lines 17 to 19).
Algorithm 2: HitSim
[Pseudo-code listing of Algorithm 2; image in the source, not reproduced.]
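The control flow of step 3 can be sketched with a min-heap of size $k$, whose root plays the role of $\bar{s}_{min}$. This is an illustrative sketch and not a transcription of Algorithm 2: the function name `hitsim_top_k` and the `estimate` callback (standing in for the $n_r$ sampled trials) are assumptions, and the scores in the usage part are made-up numbers except for the value 0.074 and the bounds 0.3281 and 0.0781, which are quoted in the text.

```python
import heapq


def hitsim_top_k(sets, bounds, k, estimate):
    """Process sets in ascending t; stop once the next bound cannot beat the current minimum."""
    heap = []  # min-heap of (score, vertex); heap[0][0] is the current minimum
    for t in sorted(sets):
        if len(heap) == k and bounds[t] <= heap[0][0]:
            break  # batch-prune M_t; later sets have bounds that are no larger
        for v in sorted(sets[t]):
            score = estimate(v)
            if len(heap) < k:
                heapq.heappush(heap, (score, v))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, v))
    return sorted(heap, reverse=True)


toy_sets = {1: {"v2", "v3", "v4", "v5", "v6"}, 2: {"v1", "v7"}}
toy_bounds = {1: 0.3281, 2: 0.0781}
# Hypothetical scores; only 0.074 is taken from the text.
toy_scores = {"v2": 0.03, "v3": 0.12, "v4": 0.074, "v5": 0.02,
              "v6": 0.02, "v1": 0.02, "v7": 0.01}
evaluated = []


def toy_estimate(v):
    evaluated.append(v)  # record which vertices were actually scored
    return toy_scores[v]


top1 = hitsim_top_k(toy_sets, toy_bounds, 1, toy_estimate)
```

With $k = 1$ the minimum after the first set exceeds the bound of the second set, so the second set is never scored. With $k = 2$ the minimum drops to 0.074, which is below 0.0781, and the second set has to be processed as well.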
Example 3.
Continuing from Example 2, we set $k$ to 1; the objective of HitSim is to return the top-1 result. The algorithm begins by processing the first vertex $v_2$ within the set $M_1 = \{v_2, v_3, v_4, v_5, v_6\}$, whose upper bound is $B_{M_1} = 0.3281$. Suppose that at the $x$th trial, it generates a $\sqrt{c}$-walk $W(v_2) = (v_2, v_3, v_1, v_0)$ starting from $v_2$. It computes the SimRank as $\bar{s}_x(v_0, v_2) = U_0(v_2) + U_1(v_3) + U_2(v_1) + U_3(v_0) = 0 + 0 + 0.0417 + 0.0104 = 0.0521$. After conducting a total of $n_r$ iterations, it computes the average value $\bar{s}(v_0, v_2) = \frac{1}{n_r} \sum_{x=1}^{n_r} \bar{s}_x(v_0, v_2)$. Once all the vertices in $M_1$ have been processed, it compares the minimum value of SimRank $\bar{s}_{min}$ with the upper bound $B_{M_2} = 0.0781$ of the set $M_2$. If $B_{M_2} \leq \bar{s}_{min}$, HitSim batch-prunes all the vertices in $M_2$.
The SimRanks between $v_0$ and any other vertices are listed in Table 2, and they have been computed by the power method [6] to within an error of $10^{-5}$. From Table 2 we can see that, due to the branch and bound strategy, HitSim is able to efficiently batch-prune vertices such as $v_1$ and $v_7$ within $M_2$, whose SimRanks are notably smaller than the rest.
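The per-trial score in Example 3 is a lookup-and-sum over the matrix $U$. The sketch below reproduces that arithmetic with the $U$ values listed in Example 1; the list-of-dictionaries layout and the function name `trial_score` are assumptions made for illustration.

```python
# U[l][v] = probability listed in Example 1 for vertex v at level l of the tree of v0.
U = [
    {"v0": 1.0},
    {"v1": 0.25, "v2": 0.167},
    {"v4": 0.0625, "v1": 0.0417, "v3": 0.0417},
    {"v7": 0.0156, "v0": 0.0104, "v4": 0.0104, "v1": 0.0104},
]


def trial_score(U, walk):
    """Sum U_i(v_i) along one sampled walk, as in Equation (5); absent entries count as 0."""
    return sum(U[i].get(v, 0.0) for i, v in enumerate(walk) if i < len(U))


walk = ["v2", "v3", "v1", "v0"]  # the walk W(v2) used in Example 3
print(round(trial_score(U, walk), 4))
```

The first two terms vanish because $v_2$ and $v_3$ do not appear at levels 0 and 1 of the tree, leaving $0.0417 + 0.0104 = 0.0521$. The final estimate would be the mean of such scores over the $n_r$ trials.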

4.3. Analysis

The time and space complexity of HitSim can be analyzed in steps 1, 2, and 3.
  • Step 1 (computing the reverse reachable tree $U$): In the worst case, it requires traversing each edge once and storing each vertex, resulting in a time complexity of $O(m)$ and a space complexity of $O(n)$.
  • Step 2 (partitioning vertices): This step involves traversing each vertex of the reverse reachable tree and storing all the vertex sets. The worst case is that we need to traverse every edge in the graph by visiting the out-neighbors of each vertex, and the worst-case time complexity is $O(m)$. The space complexity is $O(n)$.
  • Step 3 (computing $\bar{s}(u, v)$ and maintaining the top-k results): In the worst case, it requires running $n_r$ trials of generating a $\sqrt{c}$-walk for each vertex, with a limited length of $l_{max}$. Additionally, it requires maintaining a priority queue $Q$ of size $k$. Thus, the time complexity is $O(n \cdot l_{max} \cdot \frac{3c}{\varepsilon^2} \log \frac{n}{\delta} + n \log k)$, and the space complexity is $O(n + k)$.
In summary, the total time complexity of HitSim is $O(2m + n \cdot l_{max} \cdot \frac{3c}{\varepsilon^2} \log \frac{n}{\delta} + n \log k)$, and the total space complexity is $O(n + k)$.

5. Optimization

The batch-pruning method employed in HitSim aims to prune unpromising vertices within the sets whose upper bounds of SimRank are less than the minimum SimRank of the current results. However, in certain cases, for instance if $k$ is set to 2 in Example 3, the batch-pruning strategy may fail. In that case, after computing the SimRank for all vertices in $M_1$, the current top-2 results are $v_3$ and $v_4$, with a minimum SimRank $\bar{s}_{min}$ of approximately 0.074, as shown in Table 2. Since $\bar{s}_{min}$ is smaller than the upper bound $B_{M_2} = 0.0781$, HitSim continues to compute the SimRank for vertices $v_1$ and $v_7$ within $M_2$, indicating a failure of the batch-pruning method.
The reason for this failure is that in dense graphs, vertices may have a large number of in-neighbors. Vertices with the same shortest-meeting length but significantly different SimRanks can be partitioned into the same set, leading to imprecise pruning. For instance, in $G_2$, the reverse reachable tree of $v_7$ (shown in Figure 4) reveals that $v_7$ cannot meet source $v_0$ at $v_4$ in level 2, where $U_2(v_4)$ is the maximum value in level 2. Similarly, $v_7$ cannot meet source $v_0$ at $v_7$ in level 3, where $U_3(v_7)$ is the maximum value in level 3. Consequently, although $v_7$ is partitioned into $M_2$, its upper bound $B_{M_2}$, computed by accumulating the maximum values from level 2 and level 3, significantly exceeds its actual upper bound $B_{v_7}$. When processing such sets (e.g., $M_2$), it becomes inevitable to compute the SimRank for the vertices with a small SimRank (e.g., $v_7$), making batch pruning ineffective.
To address the issue, we propose an optimized algorithm called HitSim-OPT. HitSim-OPT aims to perform fine-grained and effective vertex pruning by computing a tighter upper bound of SimRank for each individual vertex, instead of that for each set. The optimized algorithm consists of the following three steps.
Step 1: Computing the reverse reachable tree U. The details of this step are omitted as they are the same as step 1 of HitSim.
Step 2: Computing the upper bound of SimRank for each vertex. According to Lemma 1, the upper bound of SimRank for any vertex $v$ is computed by $B_v = \sum_{l=t}^{l_{max}} p_l$, where $t$ represents the shortest-meeting length and $p_l$ denotes the maximum probability in level $l$ of the reverse reachable tree of $u$. To refine the upper bound $B_v$, we replace $p_l$ with $p_l(v)$, which represents the maximum probability in level $l$ of the reverse reachable tree of $v$. Since each vertex $t_{pr}$ in the reverse reachable tree of $v$ is a potential vertex within the $\sqrt{c}$-walks $W(v)$ starting from $v$, $\sum_{l=t}^{l_{max}} p_l(v)$ provides a tighter bound than $\sum_{l=t}^{l_{max}} p_l$.
By utilizing the inequality $\sum_{l=t}^{l_{max}} p_l(v) \le \sum_{l=0}^{l_{max}} p_l(v)$, we can ensure $B_v \le \sum_{l=0}^{l_{max}} p_l(v)$. Consequently, we can modify the upper bound for each vertex $v$ as
$$B_v = \sum_{l=0}^{l_{max}} p_l(v). \tag{10}$$
Step 3: Maintaining the top-k results. Similar to step 3 of HitSim, HitSim-OPT maintains the top-k results and the current minimum SimRank $\bar{s}_{min}$ by sorting the results based on their SimRank in descending order. For each vertex $v$, HitSim-OPT determines whether to further compute its actual SimRank by comparing $B_v$ with $\bar{s}_{min}$. If $B_v$ is less than $\bar{s}_{min}$, HitSim-OPT prunes $v$.
Since step 1 is identical to that of HitSim, we will omit its detailed description. The detailed descriptions of the algorithms in steps 2 (Section 5.1) and 3 (Section 5.2) are as follows.

5.1. Computing Upper Bound of SimRank for Each Vertex

A straightforward approach is to construct a reverse reachable tree for each vertex $v$ and compute the maximum probability of $v$ meeting the source vertex $u$ at each level of $u$'s reverse reachable tree, as computed by Equation (10). Taking $v_7$ from Figure 2 as an example, we construct the reverse reachable tree of $v_7$, as shown in Figure 4. At level 1, the maximum probability is 0 because $v_7$ cannot meet the source vertex $v_0$ within one step (see the reverse reachable tree of source $v_0$ in Figure 3). In level 2, the maximum probability is 0.0417, as $v_7$ can meet $v_0$ at $v_1$ in two steps. Similarly, in level 3, the maximum probability is 0.0104, as $v_7$ can meet $v_0$ at either $v_1$ or $v_0$ in three steps. By accumulating the maximum probabilities of the three levels, we obtain $B_{v_7} = 0 + 0.0417 + 0.0104 = 0.0521$.
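The effect of the tighter bound can be checked with the numbers quoted so far. The following minimal sketch only redoes the arithmetic of the running example; the helper name `can_prune` is hypothetical, and the "no larger than" pruning test is taken from the abstract's description rather than from the authors' code.

```python
# Values quoted in the text for graph G2 with source v0 and k = 2.
s_min = 0.074           # minimum SimRank among the current top-2 results
bound_set_m2 = 0.0781   # set-level upper bound B_{M_2} used by HitSim

# Level-wise maximum meeting probabilities of v7 at levels 1, 2 and 3.
p_levels_v7 = [0.0, 0.0417, 0.0104]
bound_v7 = sum(p_levels_v7)  # per-vertex upper bound B_{v_7}


def can_prune(upper_bound, threshold):
    """Hypothetical helper: a vertex (or set) is pruned when its upper
    bound is no larger than the current minimum top-k SimRank."""
    return upper_bound <= threshold


print(round(bound_v7, 4))
print(can_prune(bound_set_m2, s_min))  # set-level bound: M_2 is not pruned
print(can_prune(bound_v7, s_min))      # per-vertex bound: v7 is pruned
```

With the set-level bound, $M_2$ survives the test, whereas the per-vertex bound lets $v_7$ be discarded without generating any walk.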
However, constructing a reverse reachable tree for each vertex incurs prohibitive time and space costs. To address this challenge, we further propose a more efficient method called UBV (Upper Bound of Vertex), based on dynamic programming, to compute the upper bound of SimRank for each vertex $v$. It maintains an array $dp_w$ of length $l_{max}$ for each potential meeting point $w$ to store the maximum probabilities at levels 0 to $l_{max}$. This allows the algorithm to traverse the reverse reachable tree of source $u$ only once instead of constructing a reverse reachable tree for each vertex. For each potential meeting point $w$ in level $t$ $(0 \le t \le l_{max})$, $dp_w(i)$ $(0 \le i \le l_{max})$ represents the maximum probability of meeting $u$ at level $i$. UBV dynamically computes $dp_w(i)$ for each $w$, starting from the bottom of $u$'s reverse reachable tree. By the end of the traversal of level 0, a set of $dp_w$ arrays is obtained, where the vertices $w$ in level 0 are those that can meet $u$ within $l_{max}$ steps. The upper bounds of SimRank for these vertices are computed as $\sum_{i=0}^{l_{max}} dp_w(i)$.
Lemma 2.
For each potential meeting point $w$, the state array $dp_w$ in level $t$ is computed by the following equation:
$$dp_w(i) = \begin{cases} 0, & i < t \\ U_t(w), & i = t \\ \max\{dp_{t_{pr}}(i)\}, & i > t \end{cases} \tag{11}$$
where $U_t(w)$ is the probability of source $u$ traversing and stopping at vertex $w$, and $t_{pr}$ is an in-neighbor of vertex $w$.
Proof of Lemma 2. 
(1) When $i < t$, the $\sqrt{c}$-walk starting from $w$ at level $t$ cannot meet any vertex at level $i$ in the reverse reachable tree of source vertex $u$; thus, $dp_w(i) = 0$. (2) When $i = t$, the $\sqrt{c}$-walk starting from $w$ at level $t$ meets $w$ itself; thus, $dp_w(i) = U_t(w)$. (3) When $i > t$, the $\sqrt{c}$-walk starting from $w$ at level $t$ must pass through one of its in-neighbors, denoted as $t_{pr}$. In this case, $dp_w(i)$ is equal to the maximum probability among its in-neighbors meeting at level $i$, i.e., $dp_w(i) = \max\{dp_{t_{pr}}(i)\}$.    □
The pseudo-code of UBV is illustrated in Algorithm 3. UBV initializes an array $dp_{t_{pr}}$ for each $t_{pr}$ in the reverse reachable tree of source vertex $u$ from the bottom to the top (lines 4 to 9). Subsequently, it updates $dp_v$ for each out-neighbor of $t_{pr}$ based on Equation (11) (lines 10 to 20). Note that the vertices $v$ at level 0 represent the vertices that can meet $u$ within $l_{max}$ steps, and $\sum_{i=0}^{l_{max}} dp_v(i)$ denotes their upper bounds of SimRank (lines 22 to 26).
Algorithm 3: UBV
Example 4.
We use $G_2$ in Figure 2 to illustrate step 2 of HitSim-OPT, i.e., the computation of the SimRank upper bound for each vertex. Given source $v_0$, for simplicity, we set $l_{max} = 4$ and the decay factor $c = 0.25$ $(\sqrt{c} = 0.5)$. Continuing with Example 2, HitSim-OPT computes the upper bound of SimRank for each vertex. Specifically, UBV initializes a hash map $H_t$ for each level $t$ to store the $dp$ arrays. Starting from the bottom of the source's reverse reachable tree, it sets $dp_{v_7}(3) = U_3(v_7)$, $dp_{v_0}(3) = U_3(v_0)$, $dp_{v_4}(3) = U_3(v_4)$, and $dp_{v_1}(3) = U_3(v_1)$ based on Equation (11). After the first iteration $(t = l_{max})$, the result $H_3$ is shown in Figure 5. When $t = 2$, UBV first sets $dp_{v_4}(2) = U_2(v_4)$, $dp_{v_1}(2) = U_2(v_1)$, and $dp_{v_3}(2) = U_2(v_3)$. Then, UBV computes $dp$ for all the out-neighbors of each $t_{pr} \in H_3$. After this iteration, $H_2$ is obtained. Similarly, when $t = 1$ and $t = 0$, $H_1$ and $H_0$ are obtained. Finally, UBV computes $\sum_{i=0}^{l_{max}} dp_v(i)$ for each $v \in H_0$. The upper bounds of all vertices are $B_{v_2} = 0.3281$, $B_{v_3} = 0.3281$, $B_{v_4} = 0.3281$, $B_{v_5} = 0.2451$, $B_{v_6} = 0.2191$, $B_{v_1} = 0.0521$, and $B_{v_7} = 0.0521$.
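The recurrence of Lemma 2 can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' Algorithm 3: the function name `ubv_bounds`, the input layout (`U` as a list of per-level dictionaries, `out_neighbors` as an adjacency dictionary), and the toy probabilities are all assumptions made for this example and are not taken from $G_2$.

```python
def ubv_bounds(U, out_neighbors):
    """Sketch of the bottom-up dynamic programme described in Lemma 2.

    U[t] maps a vertex w to U_t(w), the probability that the source's
    reverse walk is at w on level t (assumed to be precomputed).
    out_neighbors maps a vertex to the list of its out-neighbors.
    Returns a dict vertex -> sum_i dp(i), the per-vertex upper bound.
    """
    deepest = len(U) - 1
    previous = {}                      # dp arrays of level t + 1
    for t in range(deepest, -1, -1):
        current = {}
        # Case i = t: w itself is a meeting point on level t.
        for w, prob in U[t].items():
            current.setdefault(w, [0.0] * (deepest + 1))[t] = prob
        # Case i > t: inherit the maximum over in-neighbors t_pr, i.e.
        # push each level-(t + 1) array to the out-neighbors of t_pr.
        for t_pr, dp_tpr in previous.items():
            for w in out_neighbors.get(t_pr, ()):
                dp_w = current.setdefault(w, [0.0] * (deepest + 1))
                for i in range(t + 1, deepest + 1):
                    dp_w[i] = max(dp_w[i], dp_tpr[i])
        # Case i < t: entries stay 0.0 by construction.
        previous = current
    return {v: sum(dp) for v, dp in previous.items()}


# Toy input (made up): edges b->a, b->c, a->u, a->v, c->v, source "u".
U = [{"u": 1.0}, {"a": 0.5, "c": 0.2}, {"b": 0.25}]
out_neighbors = {"a": ["u", "v"], "b": ["a", "c"], "c": ["v"]}
bounds = ubv_bounds(U, out_neighbors)
print(bounds["v"])
```

In the toy input, `v` can be reached from both `a` and `c`, so its level-1 entry is the larger of the two probabilities, which is exactly the `max` in the third case of Equation (11). Only the source's tree is traversed; no tree is built for `v`.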

5.2. Maintaining the Top-k Results

To obtain the top-k results, the algorithm must compute SimRank for vertices whose upper bounds of SimRank are larger than the current minimum SimRank $\bar{s}_{min}$. However, the computation of SimRank involves a large number of iterations that generate $\sqrt{c}$-walks to obtain an average value. Ref. [13] defines the minimum number of iterations, denoted as $n_r$, that guarantees an error less than $\varepsilon$ with probability at least $1 - \delta$. As $n_r$ increases, the number of iterations for generating $\sqrt{c}$-walks also increases, which can lead to significant time consumption when computing SimRank for vertices with negligible SimRank. Here, we introduce an effective method to avoid the redundant generation for such vertices. Based on Lemma 3, at the $x$th iteration, we assume that the SimRanks of the remaining $n_r - x$ iterations all equal the upper bound $B_v$. If the resulting average over all $n_r$ iterations is still not larger than the current minimum SimRank $\bar{s}_{min}$, we can safely skip the remaining $n_r - x$ iterations and prune this vertex. This reduces computational overhead and improves efficiency.
Lemma 3.
During the computation of SimRank for vertex $v$, at the $x$th iteration for generating a $\sqrt{c}$-walk, if
$$\frac{\sum_{i=1}^{x} \bar{s}_i(u, v) + (n_r - x) B_v}{n_r} \le \bar{s}_{min} \tag{12}$$
then the remaining $n_r - x$ iterations can be safely skipped.
Proof of Lemma 3. 
Since $B_v$ is the upper bound of the SimRank of $v$, for any $y$th trial, we have
$$\bar{s}_y(u, v) \le B_v \tag{13}$$
Furthermore,
$$\bar{s}(u, v) = \frac{\sum_{i=1}^{n_r} \bar{s}_i(u, v)}{n_r} \tag{14}$$
Then, we have
$$\bar{s}(u, v) = \frac{\sum_{i=1}^{x} \bar{s}_i(u, v) + \sum_{j=x+1}^{n_r} \bar{s}_j(u, v)}{n_r} \le \frac{\sum_{i=1}^{x} \bar{s}_i(u, v) + (n_r - x) B_v}{n_r} \tag{15}$$
Based on Equation (15), if Equation (12) is satisfied, then $\bar{s}(u, v)$ is no larger than $\bar{s}_{min}$. Therefore, the remaining $n_r - x$ trials can be safely skipped.    □
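The early-termination test of Lemma 3 can be illustrated as follows. This is a minimal sketch under stated assumptions: `run_trial` is a hypothetical callable standing in for one $\sqrt{c}$-walk trial that returns the per-trial estimate $\bar{s}_i(u, v)$, and the function name and numeric values are invented for the example.

```python
def estimate_or_prune(run_trial, n_r, upper_bound, s_min):
    """Run up to n_r trials; stop early once the Lemma 3 condition holds.

    Returns (estimate, trials_used); estimate is None when the vertex is
    pruned because even the optimistic average cannot exceed s_min.
    """
    partial_sum = 0.0
    for x in range(1, n_r + 1):
        partial_sum += run_trial()
        # Assume every remaining trial attains the upper bound B_v.
        optimistic_avg = (partial_sum + (n_r - x) * upper_bound) / n_r
        if optimistic_avg <= s_min:
            return None, x
    return partial_sum / n_r, n_r


calls = []


def zero_trial():
    """Stand-in trial that never meets the source (assumption for the demo)."""
    calls.append(1)
    return 0.0


estimate, used = estimate_or_prune(zero_trial, n_r=100, upper_bound=0.1, s_min=0.0745)
print(estimate, used)
```

In this demo the walks never meet, so the optimistic average falls to the threshold after 26 of the 100 trials and the remaining 74 are skipped.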
The complete pseudo-code of HitSim-OPT is illustrated in Algorithm 4. It first invokes the revReach algorithm (referenced as CrashSim [13]) to construct the reverse reachable tree of source u and return a matrix U (line 1). Subsequently, it invokes the UBV algorithm to compute the upper bound B v for each vertex v (line 2).
Algorithm 4: HitSim-OPT
In step 3, it maintains a priority queue $Q$ of size $k$ to store the top-k results (lines 4 to 22). Specifically, it determines whether to prune $v$, and thus avoid generating $\sqrt{c}$-walks for $v$, by comparing $B_v$ with the top element $\bar{s}_{min}$ in $Q$. If $B_v \le \bar{s}_{min}$, $v$ can be pruned (lines 6 to 8). If $v$ is not pruned, the algorithm generates $\sqrt{c}$-walks for $n_r$ trials to compute $\bar{s}(u, v)$ for $v$ (lines 9 to 20). During the $n_r$ trials, if the current $\sum_{i=1}^{x} \bar{s}_i(u, v)$ satisfies Lemma 3, the generation process breaks (lines 15 to 17).
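The interplay between the size-$k$ priority queue and the per-vertex bound can be sketched with a min-heap. This is an illustrative sketch, not the authors' Algorithm 4: the function name is hypothetical, the processing order of vertices is an assumption (the text does not state it), and the Table 2 scores are used as a stand-in for the Monte Carlo estimate. The bounds are those of Example 4.

```python
import heapq


def top_k_with_bound_pruning(candidates, k, estimate):
    """Keep the k best (score, vertex) pairs, skipping hopeless vertices.

    candidates: list of (vertex, upper_bound) pairs
    estimate:   callable mapping a vertex to its estimated SimRank
    """
    heap = []       # min-heap; heap[0][0] plays the role of s_min
    evaluated = []  # vertices whose SimRank was actually estimated
    for vertex, bound in candidates:
        if len(heap) == k and bound <= heap[0][0]:
            continue  # the bound cannot beat the current k-th best score
        score = estimate(vertex)
        evaluated.append(vertex)
        if len(heap) < k:
            heapq.heappush(heap, (score, vertex))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, vertex))
    return sorted(heap, reverse=True), evaluated


# Upper bounds from Example 4 and SimRank values from Table 2.
bounds = [("v2", 0.3281), ("v3", 0.3281), ("v4", 0.3281), ("v5", 0.2451),
          ("v6", 0.2191), ("v1", 0.0521), ("v7", 0.0521)]
table2 = {"v1": 0.0064, "v2": 0.048, "v3": 0.132, "v4": 0.074,
          "v5": 0.04, "v6": 0.048, "v7": 0.0064}
results, evaluated = top_k_with_bound_pruning(bounds, 2, table2.get)
print(results, evaluated)
```

For $k = 2$ the sketch returns $v_3$ and $v_4$ and never evaluates $v_1$ or $v_7$, whose bound of 0.0521 is below the threshold of 0.074 by the time they are reached.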

5.3. Analysis

The time and space complexity of HitSim-OPT can be analyzed in steps 1, 2, and 3.
  • Step 1 (computing the reverse reachable tree U). It is the same as step 1 of HitSim.
  • Step 2 (computing the upper bound of SimRank for each vertex). An array of length $l_{max}$ must be computed for every vertex, resulting in a time complexity of $O(m \cdot l_{max})$ and a space complexity of $O(n \cdot l_{max})$.
  • Step 3 (maintaining the top-k results). It is the same as step 3 of HitSim.
Based on the above analysis, the total time complexity of HitSim-OPT is $O\left(m + m \cdot l_{max} + n \cdot l_{max} \cdot \frac{3c}{\varepsilon^2} \log \frac{n}{\delta} + n \log k\right)$, and the space complexity is $O(2n + k + n \cdot l_{max})$.

6. Experiments

6.1. Experimental Setup

We conduct extensive experiments to evaluate the performance of our algorithms. The algorithms evaluated in our experiments are summarized as follows:
  • MC [14];
  • CrashSim [13];
  • HitSim: Algorithm 2;
  • HitSim-OPT: Algorithm 4.
We conduct all the experiments on an Ubuntu machine with an Intel(R) Core(TM) i7-12700 CPU at 2.10 GHz and 64 GB of memory. All of the algorithms are implemented in C++.
Datasets. We use six datasets to evaluate the performance of all the algorithms. The dataset soc-Epinions (http://konect.cc/networks/ (accessed on 20 September 2022)) represents the trust network from the online social network Epinions; email-EuAll (http://snap.stanford.edu/data/ (accessed on 1 December 2021)) is the email communication network of a large European institution; amazon (http://konect.cc/networks/ (accessed on 20 September 2022)) is the network of items on Amazon; wiki-topcats (http://snap.stanford.edu/data/ (accessed on 1 December 2021)) is a web graph of Wikipedia hyperlinks; soc-LiveJournal (http://snap.stanford.edu/data/ (accessed on 1 December 2021)) is the network of a free online community with almost 10 million members; wikipedia-link-en (http://konect.cc/networks/ (accessed on 20 September 2022)) consists of the wikilinks of Wikipedia in the English language. These datasets cover social, communication, e-commerce, and web-link networks. Detailed statistics of these datasets are summarized in Table 3, where $|V|$, $|E|$, and $\bar{d}$ denote the number of vertices, the number of edges, and the average degree, respectively.
Parameters. Consistent with previous studies [3,7,13,17], we set the decay factor $c$ to 0.6 and $l_{max}$ to 5. To evaluate the performance under different error guarantees, we vary the parameter $\varepsilon$ to achieve overall absolute error guarantees of 0.1, 0.05, 0.025, and 0.0125, while maintaining a failure probability of $\delta = 0.01$.
Metrics. To evaluate the quality of results for each single-source and top-k query from the source vertex $u$, we employ two metrics: $AE@k$ and $Precision@k$. The $AE@k$ metric, representing the average absolute error, is computed as $AE@k = \frac{1}{k} \sum_{i=1}^{k} |s(u, v_i) - \bar{s}(u, v_i)|$, where $s(u, v_i)$ is the true SimRank score between vertices $u$ and $v_i$, and $\bar{s}(u, v_i)$ is the SimRank estimation. The $Precision@k$ metric measures the proportion of correctly identified results among the top-k results. It is computed as $Precision@k = |V_k \cap V_k'| / k$, where $V_k$ represents the list of top-k vertices returned by the algorithm being evaluated, and $V_k'$ represents the ground-truth top-k results.
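Both metrics are simple to compute once the returned list and the ground truth are available. The following minimal sketch follows the two definitions above; the function names, the dictionary-based inputs, and the toy scores are assumptions made for illustration.

```python
def ae_at_k(true_scores, est_scores, returned):
    """Average absolute error over the k returned vertices."""
    errors = [abs(true_scores[v] - est_scores[v]) for v in returned]
    return sum(errors) / len(returned)


def precision_at_k(returned, ground_truth):
    """Share of the k returned vertices that are in the true top-k."""
    return len(set(returned) & set(ground_truth)) / len(returned)


# Toy scores (made up for the example).
true_scores = {"a": 0.5, "b": 0.25, "c": 0.125}
est_scores = {"a": 0.5, "b": 0.5, "c": 0.125}

print(ae_at_k(true_scores, est_scores, ["a", "b"]))
print(precision_at_k(["a", "c"], ["a", "b"]))
```

Here $k$ is taken as the length of the returned list, so the sketch assumes the algorithm always returns exactly $k$ vertices.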

6.2. Performance of Algorithms

Absolute error in querying (AE@50). In this experiment, we set $k$ to 50 and vary $\varepsilon$ from 0.1 to 0.05, 0.025, and 0.0125. Figure 6 illustrates the trade-offs between $AE@50$ and the query time of each algorithm. We can observe that as $\varepsilon$ decreases, the value of $AE@50$ decreases, whereas the query time of each algorithm increases. The curves in the graph show a near-linear relationship, suggesting that the algorithms achieve a faster query time when $\varepsilon$ is large.
Additionally, when comparing algorithms with the same parameters, both HitSim and HitSim-OPT demonstrate average $AE$ values similar to those of CrashSim, while requiring less query time. This can be attributed to the fact that CrashSim computes SimRank for all vertices to obtain the top-k results, whereas HitSim only computes SimRank for vertices within sets whose upper bounds of SimRank exceed $\bar{s}_{min}$, and HitSim-OPT computes SimRank only for vertices with larger upper bounds.
It is worth noting that the results of MC may be unstable when $n_r$ (the number of iterations) is not sufficiently large, as MC requires generating a large number of $\sqrt{c}$-walks for each vertex $v$ to determine if it meets source $u$.
Precision in querying (Precision@50). In this experiment, we fix the value of $k$ to 50 and vary the parameter $\varepsilon$ from 0.1 to 0.05, 0.025, and 0.0125. The trade-off between $Precision@50$ and query time is illustrated in Figure 7. We observe that, as $\varepsilon$ decreases, both $Precision@50$ and the query time increase. When $\varepsilon = 0.0125$, all algorithms achieve a precision of nearly 1. The curves in the graph show a near-linear relationship, indicating that the algorithms achieve a faster query time when $\varepsilon$ is large.
Moreover, when comparing algorithms with the same parameters, both HitSim and HitSim-OPT demonstrate precision values similar to those of CrashSim, while requiring less query time. The reason is the same as in the previous experiment. The results of MC may be unstable when $n_r$ (the number of iterations) is not sufficiently large.
Running time. In this experiment, we compare the running time of querying the top-250 results using the same parameters for each algorithm. Figure 8 shows the running time of each algorithm. The results demonstrate that on all datasets HitSim and HitSim-OPT exhibit faster performance compared to MC and CrashSim. Specifically, HitSim is approximately 7 times faster than MC on average and 3 times faster than CrashSim. Similarly, HitSim-OPT is approximately 30 times faster than MC on average and 11 times faster than CrashSim.
The main reason is that CrashSim processes all vertices, whereas HitSim only processes the vertices within sets whose upper bounds of SimRank exceed $\bar{s}_{min}$, and HitSim-OPT processes only vertices with larger upper bounds. As mentioned in the previous analysis, the running time of generating $\sqrt{c}$-walks for vertices dominates the overall performance. Therefore, processing fewer vertices results in less time cost for HitSim and HitSim-OPT compared to MC and CrashSim.
To validate this explanation, we also record the number of vertices processed by each algorithm, as shown in Table 4. The results confirm that MC and CrashSim process all the vertices, while HitSim and HitSim-OPT process significantly fewer vertices due to their pruning methods. On average, MC and CrashSim process approximately 4 times as many vertices as HitSim and 16 times as many as HitSim-OPT. The results demonstrate that HitSim and HitSim-OPT can efficiently answer the single-source and top-k query.
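The two average ratios can be reproduced directly from Table 4. The short check below assumes that "on average" means the arithmetic mean of the per-dataset ratios, which is an interpretation rather than something the text states; the variable names are chosen for this example.

```python
# Per dataset: (all vertices, processed by HitSim, processed by HitSim-OPT),
# copied from Table 4.
table4 = {
    "SE": (75_879, 41_659, 21_847),
    "EE": (265_214, 32_504, 18_919),
    "AZ": (403_394, 59_484, 5_434),
    "WC": (1_791_489, 1_290_877, 1_057_051),
    "SL": (4_847_571, 1_863_063, 3_841_961),
    "WE": (13_593_032, 2_683_698, 5_147_351),
}


def average_ratio(column):
    """Mean over datasets of (all vertices) / (vertices in `column`)."""
    ratios = [row[0] / row[column] for row in table4.values()]
    return sum(ratios) / len(ratios)


avg_vs_hitsim = average_ratio(1)
avg_vs_opt = average_ratio(2)
print(f"{avg_vs_hitsim:.2f} {avg_vs_opt:.2f}")
```

Under this interpretation the means come out at roughly 4.3 and 16.2, in line with the rounded factors of 4 and 16. The per-dataset ratios vary widely, and on SL and WE HitSim-OPT processes more vertices than HitSim.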
Performance of HitSim and HitSim-OPT. In this experiment, we compare the performance of the preprocessing and query steps of HitSim and HitSim-OPT under the parameters $k = 250$ and $\varepsilon = 0.025$. As described in Section 4, HitSim and HitSim-OPT return the top-k results in three steps. We consider the first two steps as preprocessing and the third step as the query phase.
Figure 9 illustrates the performance comparison between the preprocessing and query steps of HitSim and HitSim-OPT on all datasets. The results indicate that HitSim-OPT outperforms HitSim. Notably, the preprocessing step of HitSim-OPT is slower than that of HitSim. This is because HitSim-OPT must construct $dp$ arrays of length $l_{max}$ for all vertices on the reverse reachable tree to obtain their SimRank upper bounds, whereas HitSim only needs to visit all vertices on the reverse reachable tree. The query performance of HitSim-OPT, in contrast, improves significantly over HitSim, owing to its finer-grained pruning rules, which allow it to process fewer vertices. However, in large-scale dense graphs such as SL and WE, many vertices' upper bounds may be quite similar, resulting in a large number of vertices meeting the processing condition (upper bound $B > \bar{s}_{min}$). Nonetheless, during the processing phase, HitSim-OPT leverages its pruning rule (Lemma 3) to terminate the generation of $\sqrt{c}$-walks early, which is the most time-consuming operation. As a result, many vertices do not need to complete all $n_r$ iterations of $\sqrt{c}$-walks, thereby reducing the computational overhead. For instance, on SL, only 597 vertices complete all $n_r$ iterations of $\sqrt{c}$-walks, while 3,841,961 − 597 = 3,841,364 vertices execute 1 to $n_r - 1$ iterations. Similarly, on WE, only 849 vertices complete all $n_r$ iterations, while 5,147,351 − 849 = 5,146,502 vertices execute 1 to $n_r - 1$ iterations.
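The counts quoted for SL and WE follow from the last two columns of Table 4. The snippet below merely redoes that subtraction and adds the share of processed vertices that run all $n_r$ iterations; the variable names are chosen for this example.

```python
# (vertices processed by HitSim-OPT, vertices completing all n_r iterations),
# copied from Table 4.
processed = {"SL": (3_841_961, 597), "WE": (5_147_351, 849)}

early_stopped = {}
full_share = {}
for name, (total, full) in processed.items():
    early_stopped[name] = total - full   # terminated early via Lemma 3
    full_share[name] = full / total      # fraction running all n_r trials

for name in processed:
    print(name, early_stopped[name], f"{full_share[name]:.4%}")
```

On both graphs, far fewer than 0.1% of the processed vertices run the full set of trials, which is why the larger number of processed vertices does not translate into a slower query phase.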
Scalability. In this experiment, we test the scalability of HitSim and HitSim-OPT. We generate four subgraphs by randomly sampling 20%, 40%, 60%, and 80% of the edges from the WE dataset. We test the running time of HitSim and HitSim-OPT on these subgraphs and on the full WE graph with fixed parameters $c = 0.6$, $\varepsilon = 0.025$, and $\delta = 0.1$. Figure 10 shows the results of the scalability test. We can see that both HitSim and HitSim-OPT scale nearly linearly as the number of edges $|E|$ increases from 20% to 100%, and as the value of $k$ ranges from 20 to 2000. Additionally, we note that as $k$ increases, the running time of both HitSim and HitSim-OPT also increases. This is because the pruning strategy is triggered later as $k$ increases.
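The subgraphs for such a test can be produced by sampling a fixed fraction of the edge list. The sketch below is one plausible way to do this, assuming uniform sampling without replacement and a fixed seed; the text does not specify the sampling procedure, and the function name and the toy edge list are hypothetical.

```python
import random


def sample_edges(edges, fraction, seed=0):
    """Return a uniformly sampled subset holding `fraction` of the edges."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return rng.sample(edges, round(len(edges) * fraction))


# Toy edge list standing in for the real graph (assumption for the demo).
edges = [(i, (i * 7 + 1) % 1000) for i in range(1000)]
subgraphs = {pct: sample_edges(edges, pct / 100) for pct in (20, 40, 60, 80)}
print({pct: len(sub) for pct, sub in subgraphs.items()})
```

Sampling edges rather than vertices keeps the vertex set comparable across subgraphs while varying density, which is the quantity the pruning bounds are sensitive to.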

7. Conclusions

In this paper, we study the single-source and top-k SimRank query problem. We first propose an efficient algorithm called HitSim that utilizes a branch-and-bound strategy. HitSim partitions vertices into distinct sets based on their shortest-meeting lengths to the source vertex. Subsequently, it computes the upper bound of SimRank for each set. By batch-pruning vertices within the same set whose upper bound is less than the minimum SimRank of the current results, HitSim significantly enhances computational efficiency. Furthermore, we propose an optimized algorithm called HitSim-OPT, which employs a fine-grained pruning strategy. HitSim-OPT computes the upper bound of SimRank for each vertex, thereby improving pruning efficiency. Our experimental results on six real-world datasets demonstrate that, while maintaining comparable precision and absolute error to CrashSim, HitSim achieves an average speedup of 3 times compared to CrashSim, and HitSim-OPT achieves an average speedup of 11 times.

Author Contributions

Conceptualization, J.B. and J.Z.; methodology, J.B. and M.M.; software, M.M. and S.C.; validation, J.B. and J.Z.; formal analysis, M.D. and M.M.; investigation, J.B., M.M. and S.C.; data curation, J.B.; writing—original draft preparation, M.M. and J.B.; writing—review and editing, J.B. and J.Z.; supervision, J.Z., M.D. and Z.C.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by grants from the Natural Science Foundation of China (No.: 62372101, 61873337, 62272097).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the first author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jin, R.; Lee, V.E.; Hong, H. Axiomatic ranking of network role similarity. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 21–24 August 2011; pp. 922–930.
  2. Liben-Nowell, D.; Kleinberg, J.M. The link-prediction problem for social networks. J. Assoc. Inf. Sci. Technol. 2007, 58, 1019–1031.
  3. Antonellis, I.; Garcia-Molina, H.; Chang, C. Simrank++: Query rewriting through link analysis of the click graph. Proc. VLDB Endow. 2008, 1, 408–421.
  4. Spirin, N.; Han, J. Survey on web spam detection: Principles and algorithms. SIGKDD Explor. 2011, 13, 50–64.
  5. Rothe, S.; Schütze, H. CoSimRank: A Flexible & Efficient Graph-Theoretic Similarity Measure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, Baltimore, MD, USA, 22–27 June 2014; Volume 1: Long Papers, pp. 1392–1402.
  6. Jeh, G.; Widom, J. SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; pp. 538–543.
  7. Lizorkin, D.; Velikhov, P.E.; Grinev, M.N.; Turdakov, D. Accuracy estimate and optimization techniques for SimRank computation. VLDB J. 2010, 19, 45–66.
  8. Tao, W.; Yu, M.; Li, G. Efficient Top-K SimRank-based Similarity Join. Proc. VLDB Endow. 2014, 8, 317–328.
  9. Fogaras, D.; Rácz, B. Scaling link-based similarity search. In Proceedings of the 14th International Conference on World Wide Web, WWW 2005, Chiba, Japan, 10–14 May 2005; pp. 641–650.
  10. Lee, P.; Lakshmanan, L.V.S.; Yu, J.X. On Top-k Structural Similarity Search. In Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA, 1–5 April 2012; pp. 774–785.
  11. Tian, B.; Xiao, X. SLING: A Near-Optimal Index Structure for SimRank. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, 26 June–1 July 2016; pp. 1859–1874.
  12. Liu, Y.; Zheng, B.; He, X.; Wei, Z.; Xiao, X.; Zheng, K.; Lu, J. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs. Proc. VLDB Endow. 2017, 11, 14–26.
  13. Li, M.; Choudhury, F.M.; Borovica-Gajic, R.; Wang, Z.; Xin, J.; Li, J. CrashSim: An Efficient Algorithm for Computing SimRank over Static and Temporal Graphs. In Proceedings of the 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, 20–24 April 2020; pp. 1141–1152.
  14. Liu, Y.; Zou, L.; Ge, Q.; Wei, Z. SimTab: Accuracy-Guaranteed SimRank Queries through Tighter Confidence Bounds and Multi-Armed Bandits. Proc. VLDB Endow. 2020, 13, 2202–2214.
  15. Shao, Y.; Cui, B.; Chen, L.; Liu, M.; Xie, X. An Efficient Similarity Search Framework for SimRank over Large Dynamic Graphs. Proc. VLDB Endow. 2015, 8, 838–849.
  16. Jiang, M.; Fu, A.W.; Wong, R.C.; Wang, K. READS: A Random Walk Approach for Efficient and Accurate Dynamic SimRank. Proc. VLDB Endow. 2017, 10, 937–948.
  17. Wei, Z.; He, X.; Xiao, X.; Wang, S.; Liu, Y.; Du, X.; Wen, J. PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1042–1059.
Figure 1. Directed graph $G_1$.
Figure 2. Directed graph $G_2$.
Figure 3. Reverse reachable tree of $v_0$.
Figure 4. Reverse reachable tree of $v_7$.
Figure 5. Array $dp$ of each vertex.
Figure 6. The performance of $AE@50$.
Figure 7. The performance of $Precision@50$.
Figure 8. Running time of top-250 results query under the same parameters.
Figure 9. Preprocessing and query of HitSim and HitSim-OPT.
Figure 10. Scalability tests of HitSim and HitSim-OPT by varying $|E|$ from 20% to 100% and $k$ from 20 to 2000 on WE.
Table 1. Summary of notation.
| Notation | Description |
|---|---|
| $G$ | Directed graph |
| $V$, $E$ | The vertex and edge sets of $G$ |
| $n$, $m$ | The number of vertices and edges in $G$, $n = |V|$, $m = |E|$ |
| $I(u)$, $Out(u)$ | Sets of in-neighbors and out-neighbors of vertex $u$ |
| $l_{max}$ | Limited length of $\sqrt{c}$-walk |
| $c$ | Decay factor in the definition of SimRank |
| $s(u, v)$ | SimRank of two vertices $u$ and $v$ |
| $\bar{s}(u, v)$ | A SimRank estimation of $s(u, v)$ |
| $\bar{s}_{min}$ | The minimum value of SimRank of current results |
| $B_v$, $B_M$ | The upper bound of SimRank of vertex $v$, set $M$ |
| $W(u)$ | A reverse $\sqrt{c}$-walk starting from $u$ |
| $\varepsilon$ | The difference between $\bar{s}(u, v)$ and $s(u, v)$ |
| $\delta$ | The failure probability |
Table 2. SimRank with respect to $v_0$.
| Vertex | $v_0$ | $v_1$ | $v_2$ | $v_3$ | $v_4$ | $v_5$ | $v_6$ | $v_7$ |
|---|---|---|---|---|---|---|---|---|
| $s(v_0, \cdot)$ | 1.0 | 0.0064 | 0.048 | 0.132 | 0.074 | 0.04 | 0.048 | 0.0064 |
Table 3. Statistics of datasets.
| Dataset | $|V|$ | $|E|$ | $\bar{d}$ |
|---|---|---|---|
| soc-Epinions (SE) | 75,879 | 508,837 | 6.71 |
| email-EuAll (EE) | 265,214 | 420,045 | 1.58 |
| amazon (AZ) | 403,394 | 3,387,388 | 8.40 |
| wiki-topcats (WC) | 1,791,489 | 28,511,087 | 15.92 |
| soc-LiveJournal (SL) | 4,847,571 | 68,993,773 | 14.23 |
| wikipedia-link-en (WE) | 13,593,032 | 437,217,424 | 32.16 |
Table 4. The number of vertices processed. The column "Total" represents the total number of vertices processed by HitSim-OPT, while the column "$n_r$ Iteration" represents the number of vertices that completed all $n_r$ iterations of $\sqrt{c}$-walks.
| Dataset | MC | CrashSim | HitSim | HitSim-OPT (Total) | HitSim-OPT ($n_r$ Iteration) |
|---|---|---|---|---|---|
| soc-Epinions (SE) | 75,879 | 75,879 | 41,659 | 21,847 | 833 |
| email-EuAll (EE) | 265,214 | 265,214 | 32,504 | 18,919 | 695 |
| amazon (AZ) | 403,394 | 403,394 | 59,484 | 5434 | 423 |
| wiki-topcats (WC) | 1,791,489 | 1,791,489 | 1,290,877 | 1,057,051 | 872 |
| soc-LiveJournal (SL) | 4,847,571 | 4,847,571 | 1,863,063 | 3,841,961 | 597 |
| wikipedia-link-en (WE) | 13,593,032 | 13,593,032 | 2,683,698 | 5,147,351 | 849 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bai, J.; Zhou, J.; Chen, S.; Du, M.; Chen, Z.; Min, M. HitSim: An Efficient Algorithm for Single-Source and Top-k SimRank Computation. Information 2024, 15, 348. https://doi.org/10.3390/info15060348