1. Introduction
In many real-world scenarios, measuring the similarity between entities is crucial. For instance, recommendation systems predict potential friendships based on the similarity between individuals in a social network [1,2] or recommend items to users based on their behavior and preferences [3]. Security systems analyze email similarity to detect spam messages [4] or account similarity to identify fraudulent transactions [5]. Among existing similarity computation methods, those based on entity linkage relationships are most commonly employed, and SimRank is a widely used model of this kind that computes vertex similarity from the topology of a directed graph. SimRank was introduced by Jeh and Widom [6] in 2002 and is formulated on two intuitive statements: (1) two entities (vertices) are similar if they are referenced by similar entities (i.e., in a directed graph, if the in-neighbors of two different vertices are similar or identical), and (2) an entity is most similar to itself. The classical algorithm, named the power method [6], along with its variations [7], serves as the foundation for computing SimRank between every pair of vertices. However, these algorithms suffer from a time and space complexity of $O(n^2)$, where $n$ is the total number of vertices in the graph $G$, as there exist $n^2$ vertex pairs in $G$. To tackle this challenge, Ref. [8] proposed the single-source and top-k query problem, which focuses on efficiently retrieving the $k$ vertices with the largest SimRank with respect to a specific source vertex. Existing approaches [8,9,10,11,12,13,14,15,16] for the single-source and top-k query problem primarily use the random walk method, which notably improves both time and space efficiency. Ref. [10] proposed two heuristic algorithms employing truncated random walks and prioritized propagation strategies. Ref. [11] devised an index for SimRank computation based on the $\sqrt{c}$-walk, at the cost of notable space and preprocessing overhead. Ref. [12] proposed an index-free algorithm named ProbeSim that outperforms the index-based approaches. The state-of-the-art algorithm CrashSim [13] improves on the computational efficiency of ProbeSim by truncating walk lengths.
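As a point of reference for the quadratic cost of all-pairs computation, the following minimal Python sketch iterates the standard SimRank recurrence over every vertex pair. It is a naive illustration rather than the optimized power method of [6,7]; the function name `simrank_power`, the in-neighbor dictionary input, and the default decay factor are assumptions made for this sketch.

```python
from itertools import product


def simrank_power(in_nbrs, c=0.6, iters=10):
    """Naive all-pairs SimRank by fixed-point iteration.

    in_nbrs maps every vertex to the list of its in-neighbors (assumed input
    format). The table `sim` holds one entry per ordered vertex pair, which
    is where the quadratic space cost comes from.
    """
    nodes = list(in_nbrs)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iters):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0          # an entity is most similar to itself
            elif not in_nbrs[a] or not in_nbrs[b]:
                new[(a, b)] = 0.0          # no in-neighbors, no evidence of similarity
            else:
                total = sum(sim[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                new[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim
```

For a graph in which a single vertex `w` points to both `a` and `b`, the sketch assigns the pair `(a, b)` a similarity equal to the decay factor, because their only in-neighbors are identical.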
Challenges. In the single-source and top-k query problem, existing approaches compute the SimRank of every vertex and then sort the vertices in descending order of SimRank to obtain the top-k results. However, computing SimRank requires sampling a large number of random walks, which is time-consuming. Moreover, in real-world applications, the result size k specified by users is often much smaller than the number of vertices n in the network. Therefore, computing SimRank for vertices with a negligible SimRank value is redundant.
Our approach. Motivated by the above observations, we propose a novel algorithm called HitSim for the single-source and top-k query problem, which employs a branch and bound strategy. Specifically, by leveraging the inherent property that the SimRank of a vertex will decrease as its meeting length increases, HitSim partitions vertices into distinct sets based on their shortest-meeting lengths (Definition 5) and computes the upper bound of SimRank for each set. To reduce the redundant computation for the vertices with negligible SimRank, HitSim preferentially processes the vertices within the set with a larger upper bound. If the upper bound of a set is less than the minimum SimRank of the current results, the vertices within the set can be efficiently pruned in batch.
However, when the graph becomes dense, the number of vertices within the same set can grow rapidly. As a result, vertices with the same shortest-meeting length but significantly different SimRank may be partitioned into the same set. Efficiency then decreases, since the algorithm may preferentially process vertices that have a small shortest-meeting length but also a small SimRank. To address this issue of inefficient pruning in HitSim, we propose an optimized algorithm called HitSim-OPT. HitSim-OPT computes the upper bound of SimRank for each vertex, allowing for fine-grained pruning of vertices. Similar to HitSim, HitSim-OPT maintains the minimum SimRank of the current top-k results and prunes unpromising vertices by comparing their upper bounds with the minimum value. Our contributions are as follows:
We propose an efficient algorithm, named HitSim, based on a branch and bound strategy to answer the single-source and top-k query. HitSim can efficiently prune the unpromising vertices within a set in batch, reducing redundant computation and improving overall efficiency.
We further propose an optimized algorithm, named HitSim-OPT. By computing the upper bound of SimRank for each vertex, HitSim-OPT performs a fine-grained pruning strategy, resulting in further improvements in efficiency.
We conduct experiments on six real-world datasets. The experimental results show that our algorithms can efficiently answer the single-source and top-k query.
Organization. The rest of this paper is organized as follows. Section 2 provides some preliminaries. Section 3 reviews existing works. Section 4 proposes an efficient algorithm based on a branch and bound strategy for the single-source and top-k query. Section 5 further proposes an optimized algorithm. Section 6 shows the experimental results. Section 7 concludes this paper.
3. Related Works
In this section, we review the state-of-the-art algorithms for the single-source and top-k query problem, which are based on the widely used random walk method.
Ref. [11] first proposed the SLING algorithm, which utilizes the $\sqrt{c}$-walk to compute SimRank. Subsequently, several studies [9,11,14] adopted the Monte Carlo (MC) method to sample $\sqrt{c}$-walks for each vertex $v$ and the source $u$ a certain number of times. The SimRank is then estimated as the number of times the walks meet divided by the total number of walks sampled. After computing the SimRanks with source $u$ for all vertices and sorting them in descending order, the algorithms return the top-k results. Ref. [15] proposed the TSF algorithm based on the MC method. TSF constructs an index comprising a collection of one-way graphs that contain the coupling of random walks of length $T$ from each vertex, which helps reduce storage space requirements. To address the computational cost of simulations, Ref. [16] proposed an index-based algorithm called READS, whose index consists of compressed $\sqrt{c}$-walks and significantly improves the efficiency of queries. MC-based algorithms share a drawback: when the SimRank of many vertices is negligible, their $\sqrt{c}$-walks of limited length do not meet the walk from the source $u$, so processing these vertices is redundant. To tackle this issue, Ref. [12] proposed an index-free algorithm called ProbeSim. Instead of sampling $\sqrt{c}$-walks from each vertex $v$ to determine whether $v$ and the source $u$ can meet at any vertex on the $\sqrt{c}$-walk from $u$, ProbeSim conducts a graph traversal from each vertex on that walk to identify the vertices that have a non-negligible probability of walking to it. This process is repeated for a prescribed number of iterations to obtain results with approximate guarantees. By avoiding the processing of unpromising vertices, this approach significantly reduces computational overhead. However, it requires generating a large number of probing trees to determine, at each step of the $\sqrt{c}$-walk starting from $u$, whether the vertex at that step can meet every $v$.
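The Monte Carlo estimate described above can be sketched as follows. This is a minimal illustration, assuming that a $\sqrt{c}$-walk continues to a uniformly chosen in-neighbor with probability $\sqrt{c}$ at each step and stops otherwise, and that two walks meet when they occupy the same vertex at the same step. The names `sqrt_c_walk` and `mc_simrank`, the truncation length, and the trial count are hypothetical choices for this sketch, not values taken from the cited algorithms.

```python
import math
import random


def sqrt_c_walk(in_nbrs, start, c, max_len, rng):
    """Sample one sqrt(c)-walk of at most max_len vertices starting at `start`."""
    walk = [start]
    node = start
    while len(walk) < max_len:
        # Stop at a vertex without in-neighbors, or with probability 1 - sqrt(c).
        if not in_nbrs[node] or rng.random() >= math.sqrt(c):
            break
        node = rng.choice(in_nbrs[node])
        walk.append(node)
    return walk


def mc_simrank(in_nbrs, u, v, c=0.6, max_len=10, trials=20000, seed=0):
    """Estimate SimRank(u, v) as (#trials in which the walks meet) / (#trials)."""
    rng = random.Random(seed)
    meets = 0
    for _ in range(trials):
        walk_u = sqrt_c_walk(in_nbrs, u, c, max_len, rng)
        walk_v = sqrt_c_walk(in_nbrs, v, c, max_len, rng)
        # zip stops at the shorter walk, so both walks are still alive at each step compared.
        if any(a == b for a, b in zip(walk_u, walk_v)):
            meets += 1
    return meets / trials
```

Every vertex needs its own batch of walks in this scheme, which is the redundancy that ProbeSim and its successors try to avoid.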
To overcome this issue, Ref. [13] proposed an improved algorithm called CrashSim that builds upon the principles of ProbeSim. The main idea behind CrashSim is to treat the SimRank as the average probability of two $\sqrt{c}$-walks meeting within a limited length. CrashSim first computes a reverse reachable tree of the source vertex $u$ up to a limited $\sqrt{c}$-walk length. It then iteratively generates a $\sqrt{c}$-walk for each vertex $v$ and determines whether it can meet the limited $\sqrt{c}$-walk path from $u$ with a non-negligible probability. This process is repeated for a prescribed number of iterations to obtain approximate results with certain guarantees. By traversing the reverse reachable tree for only the source vertex $u$ instead of exploring the entire graph for each vertex, CrashSim significantly reduces redundant computations.
However, the above-mentioned algorithms compute the SimRank of every vertex, sort the vertices in descending order, and return the top-k results. This incurs significant redundant computation when k is small, and k is typically much smaller than the total number of vertices n in real-world applications.
4. HitSim
Although the state-of-the-art algorithm CrashSim reduces redundant computation by traversing the reverse reachable tree for only the source vertex u, it still generates $\sqrt{c}$-walks for each vertex v to determine whether it can meet the $\sqrt{c}$-walks from u within a limited length. However, since the parameter k is typically much smaller than the total number of vertices n, processing unpromising vertices with negligible SimRank values is redundant. To overcome this issue, we propose an efficient algorithm called HitSim, which uses a branch and bound strategy to batch-prune unpromising vertices and avoid the redundant processing. The algorithm is implemented in three steps, as follows.
Step 1: Computing the reverse reachable tree U. In the first step, HitSim follows the same approach as CrashSim [13]: it computes the reverse reachable tree from the source vertex $u$ and generates a matrix $U$. Each element of $U$ is the probability that the $\sqrt{c}$-walk from $u$ stops at a vertex $v$ after walking a given number of steps.
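A level-by-level computation of such a matrix might look like the sketch below. It is a simplified illustration and not the revReach procedure of CrashSim: it assumes that the source holds probability 1 at level 0 and that each step passes a fraction $\sqrt{c}$ of a vertex's probability evenly to its in-neighbors, and it applies no cycle-handling refinement. The name `reverse_reach` and the list-of-dictionaries layout are hypothetical.

```python
import math
from collections import defaultdict


def reverse_reach(in_nbrs, u, c, max_len):
    """Return U as a list of dicts, one per level 0 .. max_len - 1.

    U[l][w] is the probability that a sqrt(c)-walk from u sits at w after
    exactly l steps (assumed semantics for this sketch).
    """
    sqrt_c = math.sqrt(c)
    U = [{u: 1.0}]
    for _ in range(1, max_len):
        level = defaultdict(float)
        for node, prob in U[-1].items():
            nbrs = in_nbrs.get(node, ())
            for w in nbrs:
                # Continue with probability sqrt(c), then pick an in-neighbor uniformly.
                level[w] += sqrt_c * prob / len(nbrs)
        U.append(dict(level))
    return U
```

With a decay factor of 0.25, each level multiplies the incoming probability by 0.5 before splitting it among in-neighbors, so the level maxima shrink quickly as the level grows.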
Example 1. We illustrate the computation of the reverse reachable tree U using the graph shown in Figure 2. Given the source vertex, we set the limited length of the $\sqrt{c}$-walk to 4 and the decay factor $c$ to 0.25 (so $\sqrt{c} = 0.5$). The algorithm computes the reverse reachable tree of the source, which is shown in Figure 3. Note that an in-neighbor of a vertex $v$ that is equal to the parent of $v$ is ignored, to avoid recomputing the probability due to cycles in the graph [13]. Simultaneously, the algorithm computes the probability of the $\sqrt{c}$-walk stopping at different vertices with different lengths, level by level, from level 0 through level 3.
Step 2: Partitioning vertices. Based on CrashSim, in the $x$th sampling, if a $\sqrt{c}$-walk starting from $v$ first meets the $\sqrt{c}$-walk starting from the source $u$ at some step, then the sample contributes the corresponding first-meeting probability recorded in $U$ to the SimRank estimate of $v$.
Lemma 1. Given a graph $G$ and a source vertex $u$, if there exists a vertex $v$ whose shortest-meeting length is $t$, then the SimRank of $v$ is bounded above by the sum, over every level $l$ from $t$ up to the last level of the reverse reachable tree of the source $u$, of the maximum probability in level $l$.
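Proof of Lemma 1. Since the shortest-meeting length of $v$ is $t$, any $\sqrt{c}$-walk starting from $v$ will not meet the $\sqrt{c}$-walk starting from the source $u$ before step $t$, so the levels below $t$ contribute nothing. Furthermore, at each level $l \ge t$, the contribution of a meeting is at most the maximum probability in level $l$ of the reverse reachable tree of source $u$. Based on Equations (5)–(7), the value obtained from any single $\sqrt{c}$-walk starting from $v$ is therefore at most the sum of these level maxima from level $t$ onward. Since every trial satisfies this bound, so does the average value over all trials. Thus, for any vertex $v$, the upper bound of SimRank is the sum of the level maxima from level $t$ onward, where $t$ is the shortest-meeting length, i.e., Lemma 1 holds. □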
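According to Lemma 1, for any two different vertices $v$ and $w$, the upper bound of each is the sum of the level maxima starting from its own shortest-meeting length. If the shortest-meeting length of $v$ is smaller than that of $w$, the bound of $v$ accumulates every term in the bound of $w$ plus additional non-negative terms, so the upper bound of $v$ is no smaller than that of $w$. From the above analysis, we have Observation 1 as follows: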
Observation 1. For any vertex v, the smaller the shortest-meeting length, the larger the upper bound.
According to Lemma 1, vertices with the same shortest-meeting length share the same upper bound. By partitioning the vertices into distinct sets based on their shortest-meeting lengths, we ensure that the SimRank of every vertex v in a set M is at most the upper bound of SimRank for M. If we identify a set M whose upper bound is less than the minimum SimRank value of the current results, we can safely prune the unpromising vertices within M in batch. According to Observation 1, we should preferentially process sets with smaller shortest-meeting lengths. Such an ordering allows the algorithm to compute SimRank first for vertices with a larger SimRank upper bound, leading to the effective pruning of unpromising vertices.
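The set-level bounds implied by Lemma 1 can be computed as suffix sums of the per-level maxima. The sketch below assumes the reverse reachable probabilities are available as a list of dictionaries, one per level, as a hypothetical layout; the name `level_upper_bounds` is likewise an assumption.

```python
def level_upper_bounds(U):
    """bounds[t] = sum over levels l >= t of the maximum probability in level l.

    U is a list of dicts mapping vertices to probabilities, one dict per level
    (assumed layout). An empty level contributes 0.
    """
    level_max = [max(level.values(), default=0.0) for level in U]
    bounds = [0.0] * len(U)
    running = 0.0
    for t in range(len(U) - 1, -1, -1):   # accumulate from the deepest level upward
        running += level_max[t]
        bounds[t] = running
    return bounds
```

Because every term is non-negative, the resulting list is non-increasing in t, which matches Observation 1: a smaller shortest-meeting length never yields a smaller bound.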
Step 3: Computing and maintaining the top-k results. Following step 2, HitSim generates $\sqrt{c}$-walks and computes the SimRank only for the vertices within the sets whose upper bounds exceed the minimum SimRank of the current results. Simultaneously, HitSim maintains the top-k results and the current minimum SimRank by sorting the results based on their SimRank in descending order.
Since step 1 is identical to CrashSim, we omit its detailed description. Steps 2 and 3 are described in detail in Section 4.1 and Section 4.2, respectively.
4.1. Partitioning Vertices
We now describe the ParVer algorithm, which is used to partition vertices into distinct sets based on their shortest-meeting lengths with the source vertex. Given a graph $G$, a source vertex $u$, a limited length of the $\sqrt{c}$-walk, and a probability matrix $U$, ParVer returns, for each shortest-meeting length $t$, the set of vertices with that shortest-meeting length together with the upper bound of SimRank for the set. Notably, no set contains the source vertex or a vertex that already belongs to a set with a smaller shortest-meeting length.
The pseudo-code of ParVer is depicted in Algorithm 1. It initializes a hashset for each shortest-meeting length $t$ to store the vertices whose shortest-meeting length equals $t$ (line 1). To avoid revisiting vertices of the same level, it also keeps a hashset of visited vertices for each level (line 2). For each $t$, two values are initialized: the maximum probability in level $t$ and the upper bound of SimRank of the corresponding set (line 3). To identify all the vertices of each set, it iterates over each vertex in level $t$ of the reverse reachable tree of source $u$ (lines 4 to 20). For each such vertex, it updates the current maximum probability in level $t$ (line 6). Then, it performs forward walks of $t$ steps from that vertex to obtain each vertex $v$ whose shortest-meeting length is $t$ (lines 7 to 20). Specifically, it first initializes a queue $Q$ to store the vertices visited during the $t$-step forward walks and puts the starting vertex into $Q$ and into the visited set of the current level $i$. Next, it walks forward from the starting vertex, recording the size of $Q$ and reducing the step $i$ by 1 with each iteration. It then pops the top element $q$ of $Q$. For each out-neighbor $v$ of $q$, if the current forward walk does not exceed the maximum walk length $i$ and $v$ has not been visited before, $v$ is inserted into the visited set and $Q$. If $v$ is not the source and does not exist in any set that stores vertices with shortest-meeting lengths from 0 to $t-1$, it is added to the set for $t$. Once all the vertices of each set are obtained, the algorithm computes the upper bound of each set based on Lemma 1 (lines 21 to 22). Finally, ParVer returns each vertex set together with its upper bound (line 23).
Algorithm 1: ParVer
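The partitioning idea can be approximated in a few lines of Python. This sketch is not the ParVer pseudo-code: it replaces the queue and visited sets with a plain set-based frontier, and it assumes that a vertex belongs to the set for $t$ when it reaches some level-$t$ vertex of the reverse reachable tree in exactly $t$ forward steps and has not been assigned to an earlier set. The name `partition_vertices` and the dictionary inputs are hypothetical.

```python
def partition_vertices(out_nbrs, u, U):
    """Return {t: set of vertices whose shortest-meeting length is t} for t >= 1.

    out_nbrs maps a vertex to its out-neighbors; U is a list of dicts whose
    keys at index t are the vertices in level t of the reverse reachable tree
    of the source u (assumed layout).
    """
    assigned = set()
    sets = {}
    for t in range(1, len(U)):
        members = set()
        for w in U[t]:
            frontier = {w}
            for _ in range(t):                       # exactly t forward steps
                frontier = {x for q in frontier for x in out_nbrs.get(q, ())}
            members |= frontier
        members -= assigned                          # keep only first (shortest) meetings
        members.discard(u)                           # the source itself is never a candidate
        sets[t] = members
        assigned |= members
    return sets
```

Processing the levels in increasing order is what makes each vertex land in the set of its shortest meeting length: by the time level $t$ is expanded, every vertex that can meet the source earlier has already been assigned.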
Example 2. Considering the graph shown in Figure 2, we illustrate step 2, which involves the partitioning of vertices. The source vertex, the limited walk length, and the decay factor are those of Example 1, so step 1 yields the reverse reachable tree U shown there. Continuing from Example 1, in step 2 the algorithm partitions the vertices of the graph into distinct sets based on their shortest-meeting lengths with the source. Starting with level 1 of the reverse reachable tree, it forward walks one step from the first vertex of that level and inserts that vertex's out-neighbors into the set for shortest-meeting length 1, omitting the source vertex. Next, it forward walks one step from the second vertex in level 1 and inserts its out-neighbors into the same set, omitting any vertex that is already there. This fixes the set for shortest-meeting length 1 and the maximum probability in level 1. Moving to level 2, it forward walks two steps from each vertex of that level, which yields the set for shortest-meeting length 2; vertices that are already in the set for length 1 are not inserted again. The maximum probability in level 2 is 0.0625. Lastly, in level 3, it forward walks three steps from each vertex of that level. No new vertices are added, so the set for shortest-meeting length 3 is empty, and the maximum probability in level 3 is recorded. Finally, according to Lemma 1, the SimRank upper bounds of the sets for shortest-meeting lengths 1 and 2 are computed.
4.2. Computing and Maintaining the Top-k Results
The complete pseudo-code of HitSim is illustrated in Algorithm 2. Given a graph $G$, a source $u$, an integer $k$, a decay factor $c$, a limited length of the $\sqrt{c}$-walk, an average absolute error, and a failure probability, HitSim returns the queue $Q$ of the top-k results. It first invokes the revReach algorithm of CrashSim [13] to construct the reverse reachable tree of source $u$ and return a matrix $U$ (line 1). Then, it invokes the ParVer algorithm (Algorithm 1) to partition vertices into distinct sets (line 2). Step 3, which involves computing and maintaining the top-k results, starts from line 3 of Algorithm 2.
In step 3, the algorithm initializes a priority queue $Q$ to store the results (line 4). Based on Observation 1, for any vertex $v$, a smaller shortest-meeting length corresponds to a larger upper bound. Thus, HitSim processes the sets in ascending order of their shortest-meeting lengths $t$ in order to obtain the top-k results as early as possible (lines 5 to 20). Specifically, for each vertex $v$ in the current set, it runs a number of independent trials (lines 6 to 16). The number of trials, which is the minimum number of iterations that keeps the error below the given average absolute error except with at most the given failure probability, is defined in [13]. During each iteration, it generates a $\sqrt{c}$-walk starting from $v$ and limits the length of the walk to the given limited length (lines 8 to 9). Then, for each position $i$ of this $\sqrt{c}$-walk, it accumulates the first-meeting probability of the walk meeting the $\sqrt{c}$-walk starting from $u$ at the $i$th element (lines 10 to 12). After completing all the trials, it computes the average of the results to obtain the final SimRank of $v$ and inserts $v$ into $Q$ (lines 14 to 15). When the size of the current results is $k$, it checks whether the upper bound of the next set to be processed is larger than the minimum SimRank of the current results. If it is, HitSim continues generating $\sqrt{c}$-walks for the vertices within that set. Otherwise, it terminates and returns the current top-k results (lines 17 to 19).
Algorithm 2: HitSim
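The control flow of step 3 can be sketched with a min-heap, as below. This is an illustrative outline rather than Algorithm 2 itself: the per-vertex estimator is passed in as a callable (standing in for the averaged $\sqrt{c}$-walk trials), and the names `top_k_branch_and_bound`, `sets`, `bounds`, and `estimate` are assumptions of this sketch.

```python
import heapq


def top_k_branch_and_bound(sets, bounds, k, estimate):
    """Return the top-k (score, vertex) pairs in descending order of score.

    sets[t]   : vertices whose shortest-meeting length is t
    bounds[t] : shared SimRank upper bound of sets[t], non-increasing in t
    estimate  : callable returning a SimRank estimate for one vertex
    """
    heap = []  # min-heap, so heap[0] is the minimum SimRank of the current results
    for t in sorted(sets):
        if len(heap) == k and bounds[t] <= heap[0][0]:
            # The next set cannot beat the current minimum, and later sets
            # have even smaller bounds, so everything left is pruned in batch.
            break
        for v in sets[t]:
            score = estimate(v)
            if len(heap) < k:
                heapq.heappush(heap, (score, v))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, v))
    return sorted(heap, reverse=True)
```

The early `break` is only safe because the bounds are non-increasing in the shortest-meeting length; with unordered bounds, each remaining set would have to be checked individually.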
Example 3. Continuing from Example 2, we set k to 1, so the objective of HitSim is to return the top-1 result. The algorithm begins by processing the first vertex within the set with the smallest shortest-meeting length. Suppose that at the xth trial, it generates a $\sqrt{c}$-walk starting from that vertex and computes the corresponding SimRank estimate. After conducting all the iterations, it computes the average value of the estimates. Once all the vertices in this set have been processed, it compares the minimum SimRank of the current results with the upper bound of the next set. If the upper bound is not larger than the minimum, HitSim batch-prunes all the vertices in that set.
The SimRanks between the source and all other vertices are listed in Table 2; they have been computed by the method of [6] to within a fixed error tolerance. From Table 2 we can see that, due to the branch and bound strategy, HitSim is able to efficiently batch-prune the vertices of the subsequent set, whose SimRanks are notably smaller than the rest.
5. Optimization
The batch-pruning method employed in HitSim prunes unpromising vertices within the sets whose SimRank upper bounds are less than the minimum SimRank of the current results. However, in certain cases, such as Example 3 with $k$ set to 2, the batch-pruning strategy may fail. In that case, after computing the SimRank for all vertices in the first set, the current top-2 results have a minimum SimRank of approximately 0.074, as shown in Table 2. Because the upper bound of the next set is larger than this minimum, HitSim continues to compute the SimRank for the vertices within that set, indicating a failure of the batch-pruning method.
The reason for this failure is that in dense graphs, vertices may have a large number of in-neighbors. Vertices with the same shortest-meeting length but significantly different SimRanks can be partitioned into the same set, leading to imprecise pruning. For instance, the reverse reachable tree of one vertex in the set with shortest-meeting length 2 (shown in Figure 4) reveals that this vertex cannot meet the source at the vertex holding the maximum probability in level 2. Similarly, it cannot meet the source at the vertex holding the maximum probability in level 3. Consequently, although the vertex is partitioned into that set, the set-level upper bound, computed by accumulating the maximum values from level 2 and level 3, significantly exceeds the vertex's actual upper bound. When processing such sets, it becomes inevitable to compute the SimRank for vertices with a small SimRank, which makes the batch pruning ineffective.
To address the issue, we propose an optimized algorithm called HitSim-OPT. HitSim-OPT aims to perform fine-grained and effective vertex pruning by computing a tighter upper bound of SimRank for each individual vertex, instead of that for each set. The optimized algorithm consists of the following three steps.
Step 1: Computing the reverse reachable tree U. The details of this step are omitted as they are the same as step 1 of HitSim.
Step 2: Computing the upper bound of SimRank for each vertex. According to Lemma 1, the upper bound of SimRank for any vertex $v$ is computed by summing, from level $t$ onward, the maximum probability in each level $l$ of the reverse reachable tree of $u$, where $t$ represents the shortest-meeting length. To refine this upper bound, we replace each of these level maxima with the maximum probability in level $l$ of the reverse reachable tree of $v$. Since each vertex in the reverse reachable tree of $v$ is a potential vertex within the $\sqrt{c}$-walks starting from $v$, the latter provides a more precise bound than the former.
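Because the maximum taken over the reverse reachable tree of $v$ never exceeds the maximum taken over the reverse reachable tree of $u$ at the same level, the refined bound never exceeds the bound of Lemma 1. Consequently, the upper bound for each vertex $v$ is modified to be the sum of the refined level maxima from level $t$ onward.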
Step 3: Maintaining the top-k results. Similar to step 3 of HitSim, HitSim-OPT maintains the top-k results and the current minimum SimRank by sorting the results based on their SimRank in descending order. For each vertex v, HitSim-OPT determines whether to further compute its actual SimRank by comparing the upper bound of v with the current minimum SimRank. If the upper bound is less than the minimum, HitSim-OPT prunes v.
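The per-vertex pruning test can be illustrated as follows. The sketch takes the per-vertex upper bounds as given and does not show how HitSim-OPT derives them. Visiting vertices in descending order of their bounds is an assumption of this sketch, as are the names `top_k_per_vertex`, `vertex_bounds`, and `estimate`.

```python
import heapq


def top_k_per_vertex(vertex_bounds, k, estimate):
    """Return (top-k (score, vertex) pairs in descending order, pruned vertices).

    vertex_bounds : dict mapping each candidate vertex to its own upper bound
    estimate      : callable returning a SimRank estimate for one vertex
    """
    heap = []      # min-heap; heap[0] is the minimum SimRank of the current results
    pruned = []
    # Assumed visiting order: most promising bound first.
    for v, bound in sorted(vertex_bounds.items(), key=lambda kv: -kv[1]):
        if len(heap) == k and bound < heap[0][0]:
            pruned.append(v)           # the bound rules v out without sampling any walk
            continue
        score = estimate(v)
        if len(heap) < k:
            heapq.heappush(heap, (score, v))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, v))
    return sorted(heap, reverse=True), pruned
```

Compared with a set-level test, a vertex can be discarded here even when other vertices with the same shortest-meeting length still need to be evaluated.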
Since step 1 is identical to that of HitSim, we omit its detailed description. The algorithms for steps 2 and 3 are described in detail in Section 5.1 and Section 5.2, respectively.