HitSim: An Efficient Algorithm for Single-Source and Top-k SimRank Computation

Bai, Jing; Zhou, Junfeng; Chen, Shuotong; Du, Ming; Chen, Ziyang; Min, Mengtao

doi:10.3390/info15060348

by Jing Bai ^1,2,†, Junfeng Zhou ^2,*,†, Shuotong Chen ^2,†, Ming Du ^2,†, Ziyang Chen ^3,† and Mengtao Min ^2,†

¹ College of Information Technology, Shanghai Jian Qiao University, 1111 Hucheng Ring Road, Pudong New Area, Shanghai 201306, China

² School of Computer Science and Technology, Donghua University, 2999 Renmin North Road, Songjiang District, Shanghai 201620, China

³ School of Information Management, Shanghai Lixin University of Accounting and Finance, 2800 Wenxiang Road, Songjiang District, Shanghai 201209, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Information 2024, 15(6), 348; https://doi.org/10.3390/info15060348

Submission received: 19 March 2024 / Revised: 1 May 2024 / Accepted: 4 June 2024 / Published: 12 June 2024

Abstract

SimRank is a widely used metric for evaluating vertex similarity based on graph topology, with diverse applications such as large-scale graph mining and natural language processing. The objective of the single-source and top-k SimRank query problem is to retrieve the k vertices with the largest SimRank to the source vertex. However, existing algorithms are inefficient because they compute SimRank for all vertices in order to retrieve the top-k results. To address this issue, we propose an algorithm named HitSim that utilizes a branch and bound strategy for the single-source and top-k query. HitSim initially partitions vertices into distinct sets based on their shortest-meeting lengths to the source vertex. Subsequently, it computes an upper bound of SimRank for each set. If the upper bound of a set is no larger than the minimum value of the current top-k results, HitSim efficiently batch-prunes the unpromising vertices within the set. However, when the graph becomes dense, certain sets with large upper bounds may contain numerous vertices with small SimRank, leading to redundant overhead when processing these vertices. To address this issue, we propose an optimized algorithm named HitSim-OPT that computes the upper bound of SimRank for each vertex instead of each set, resulting in a fine-grained and efficient pruning process. Experimental results on six real-world datasets demonstrate that our algorithms efficiently answer the single-source and top-k query.

1. Introduction

In many real-world scenarios, measuring the similarity between entities is crucial. For instance, recommendation systems predict potential friendships from the similarity between individuals in a social network [1,2] or recommend items to users based on their behavior and preferences [3]. Security systems analyze email similarity to detect spam messages [4] or account similarity to identify fraudulent transactions [5]. Among existing similarity computation methods, those based on entity linkage relationships are most commonly employed. Among these, SimRank is a widely used model for computing vertex similarity based on the topology of a directed graph. SimRank was introduced by Jeh and Widom [6] in 2002 and is formulated on two intuitive statements: (1) if two entities (vertices) are referenced by similar entities (i.e., in a directed graph, the in-neighbors of two different vertices are similar or identical), then those two entities are also considered similar, and (2) an entity is most similar to itself. The classical algorithm, named the power method [6], along with its variations [7], serves as the foundation for computing SimRank between every pair of vertices. However, these algorithms suffer from a time and space complexity of $\Omega(n^{2})$, where $n$ represents the total number of vertices in the graph $G$, as there exist $\Omega(n^{2})$ vertex pairs in $G$. To tackle this challenge, Ref. [8] proposed the single-source and top-k query problem. This problem focuses on efficiently retrieving the k vertices with the largest SimRank to a specific source vertex. Existing approaches [8,9,10,11,12,13,14,15,16] for the single-source and top-k query problem primarily utilize the random walk method, resulting in notable improvements in both time and space efficiency. Ref. [10] proposed two heuristic algorithms employing truncated random walk and prioritized propagation strategies. Ref. [11] devised an index for SimRank computation based on the $\sqrt{c}$-walk, at the cost of notable space and preprocessing overhead. Ref. [12] proposed an index-free algorithm named ProbeSim that outperforms the index-based approaches. The state-of-the-art algorithm CrashSim [13] improves the computational efficiency of SimRank in ProbeSim by truncating walk lengths.

Challenges. In the single-source and top-k query problem, existing approaches require computing the SimRank for all the vertices and subsequently sorting the top-k results in descending order based on their SimRank. However, computing SimRank for vertices requires sampling a large number of random walks, which is a time-consuming operation. Moreover, in real-world applications, the desired result scale k specified by users is often much smaller than the vertex scale n of the network. Therefore, computing SimRank for vertices with negligible SimRank value is redundant.

Our approach. Motivated by the above observations, we propose a novel algorithm called HitSim for the single-source and top-k query problem, which employs a branch and bound strategy. Specifically, by leveraging the inherent property that the SimRank of a vertex will decrease as its meeting length increases, HitSim partitions vertices into distinct sets based on their shortest-meeting lengths (Definition 5) and computes the upper bound of SimRank for each set. To reduce the redundant computation for the vertices with negligible SimRank, HitSim preferentially processes the vertices within the set with a larger upper bound. If the upper bound of a set is less than the minimum SimRank of the current results, the vertices within the set can be efficiently pruned in batch.

However, in scenarios where the graph becomes dense, the number of vertices within the same set can grow rapidly. This may result in a scenario where vertices with the same shortest-meeting lengths but significantly different SimRank are partitioned into the same set. Consequently, the efficiency will decrease since the algorithm may preferentially process the vertices with the shortest-meeting lengths but with a small SimRank. To address this issue of inefficient pruning in HitSim, we propose an optimized algorithm called HitSim-OPT. HitSim-OPT computes the upper bound of SimRank for each vertex, allowing for fine-grained pruning of vertices. Similar to HitSim, HitSim-OPT maintains the minimum SimRank of the current top-k results and prunes unpromising vertices by comparing their upper bounds with the minimum value. Our contributions are as follows:

We propose an efficient algorithm, named HitSim, based on a branch and bound strategy to answer the single-source and top-k query. HitSim can efficiently prune the unpromising vertices within a set in batch, reducing redundant computation and improving overall efficiency.
We further propose an optimized algorithm, named HitSim-OPT. By computing the upper bound of SimRank for each vertex, HitSim-OPT performs a fine-grained pruning strategy, resulting in further improvements in efficiency.
We conduct experiments on six real-world datasets. The experimental results show that our algorithms can efficiently answer the single-source and top-k query.

Organization. The rest of this paper is organized as follows. Section 2 provides some preliminaries. In Section 3, we review existing works. In Section 4, we propose an efficient algorithm based on a branch and bound strategy for the single-source and top-k query. In Section 5, we further propose an optimized algorithm. Section 6 shows the experimental results. Section 7 concludes this paper.

2. Preliminaries

In this section, we formally introduce the notation and definitions. Mathematical notations used throughout this paper are summarized in Table 1.

Definition 1 (SimRank). Given two vertices $u$ and $v$ in a directed graph $G = (V, E)$, the SimRank of $u$ and $v$, denoted as $s(u, v)$, is defined as

$$s(u, v) = \begin{cases} 1, & u = v \\ \sum_{p \in I(u)} \sum_{q \in I(v)} \frac{c \times s(p, q)}{|I(u)| \times |I(v)|}, & u \neq v \end{cases} \tag{1}$$

where $I(u)$ denotes the set of in-neighbors of $u$, and $c \in (0, 1)$ is a decay factor [12,13].

For example, in $G_{1}$ of Figure 1, if the decay factor $c = 0.64$, according to Equation (1), we have $s(v_{0}, v_{0}) = 1$. The SimRank between $v_{1}$ and $v_{2}$ is

$$s(v_{1}, v_{2}) = \sum_{p \in I(v_{1})} \sum_{q \in I(v_{2})} \frac{c \times s(p, q)}{|I(v_{1})| \times |I(v_{2})|} = 0.64.$$

Similarly, $s(v_{1}, v_{3}) = s(v_{2}, v_{3}) = 0.64$, $s(v_{4}, v_{5}) = s(v_{5}, v_{6}) = 0.486$, $s(v_{4}, v_{6}) = 0.467$, and the SimRank between other vertex pairs is 0.
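To make Equation (1) concrete, the following is a minimal sketch of the naive fixed-point iteration behind the power method. The function name `simrank_power_method`, the iteration count, and the four-vertex toy graph are assumptions made for illustration; the toy graph is not $G_{1}$ from Figure 1. The sketch also assumes the usual convention that a pair scores 0 when either vertex has no in-neighbors.

```python
from itertools import product

def simrank_power_method(in_nbrs, c=0.64, iterations=10):
    """Iterate Equation (1), starting from s(a, a) = 1 and s(a, b) = 0."""
    vertices = list(in_nbrs)
    s = {(a, b): 1.0 if a == b else 0.0 for a, b in product(vertices, repeat=2)}
    for _ in range(iterations):
        nxt = {}
        for a, b in product(vertices, repeat=2):
            if a == b:
                nxt[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                # Assumed convention: no in-neighbors means similarity 0.
                nxt[(a, b)] = 0.0
            else:
                total = sum(s[(p, q)] for p in in_nbrs[a] for q in in_nbrs[b])
                nxt[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        s = nxt
    return s

# Hypothetical toy graph: v0 points to v1, v2 and v3 (in-neighbor lists).
toy_in_nbrs = {"v0": [], "v1": ["v0"], "v2": ["v0"], "v3": ["v0"]}
scores = simrank_power_method(toy_in_nbrs, c=0.64)
print(scores[("v1", "v2")])
```

Because every pair is stored, the dictionary holds $n^{2}$ entries, which is the quadratic cost that motivates the single-source formulation.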

Definition 2 (Reverse random walk). Given vertex $u$ in directed graph $G$, a reverse random walk from $u$ is a sequence of vertices $W(u) = (u_{0}, u_{1}, u_{2}, \ldots)$, such that $u_{i+1}$ ($i \geq 0$) is selected uniformly at random from the in-neighbors of $u_{i}$.

Definition 3 ($\sqrt{c}$-walk [11]). Let $c$ denote the decay factor. A $\sqrt{c}$-walk in $G$ is defined such that (1) in each step of the reverse random walk, the walk stops with probability $1 - \sqrt{c}$; (2) with the remaining probability $\sqrt{c}$, one of the in-neighbors of the current vertex is selected uniformly at random as the next step. We denote a $\sqrt{c}$-walk starting from $u$ as $W(u) = (u_{0}, u_{1}, \ldots, u_{i}, \ldots)$, where $u_{0} = u$.

According to Definition 3, Ref. [11] also defined the SimRank estimation $\bar{s}(u, v)$ as the total probability that the $\sqrt{c}$-walk $W(u)$ starting from $u$ meets the $\sqrt{c}$-walk $W(v)$ starting from $v$, i.e., $\bar{s}(u, v) = \Pr[W(u) \text{ and } W(v) \text{ meet}]$.
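The meeting-probability view can be illustrated with a small Monte Carlo sketch. It samples pairs of $\sqrt{c}$-walks and counts how often they occupy the same vertex at the same step. The helper names, the `max_len` cap, the fixed seed, and the three-vertex toy graph are assumptions for illustration and are not taken from Refs. [11,12].

```python
import random

def sqrt_c_walk(in_nbrs, start, sqrt_c, rng, max_len=50):
    """Sample one sqrt(c)-walk: stop with probability 1 - sqrt(c) at each step."""
    walk = [start]
    while len(walk) <= max_len:          # assumed cap on the number of steps
        nbrs = in_nbrs.get(walk[-1], [])
        if not nbrs or rng.random() >= sqrt_c:
            break                         # dead end or the walk decides to stop
        walk.append(rng.choice(nbrs))     # uniform choice among in-neighbors
    return walk

def estimate_meeting_probability(in_nbrs, u, v, c, samples, seed=0):
    """Fraction of sampled walk pairs that meet (same vertex, same step)."""
    rng = random.Random(seed)
    sqrt_c = c ** 0.5
    cnt = 0
    for _ in range(samples):
        wu = sqrt_c_walk(in_nbrs, u, sqrt_c, rng)
        wv = sqrt_c_walk(in_nbrs, v, sqrt_c, rng)
        if any(a == b for a, b in zip(wu, wv)):
            cnt += 1
    return cnt / samples

# Hypothetical toy graph: v1 and v2 share the single in-neighbor v0.
toy_in_nbrs = {"v0": [], "v1": ["v0"], "v2": ["v0"]}
estimate = estimate_meeting_probability(toy_in_nbrs, "v1", "v2", c=0.64, samples=20000)
print(estimate)
```

On this toy graph, the two walks meet exactly when both take their first step, which happens with probability $\sqrt{c} \cdot \sqrt{c} = c$, so the estimate should be close to 0.64.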

Definition 4 (First-meeting probability [12]). Given a reverse $\sqrt{c}$-walk path $W(u) = (u_{0}, \ldots, u_{i})$ starting from $u$ ($u_{0} = u$) and a vertex $v \in V$ ($v \neq u_{0}$), the first-meeting probability of $v$ with respect to $W(u)$ is defined as

$$P(v, W(u)) = \Pr_{W(v)}[v_{i} = u_{i}, v_{i-1} \neq u_{i-1}, \ldots, v_{0} \neq u_{0}], \tag{2}$$

where $W(v) = (v_{0}, \ldots, v_{i}, \ldots)$ is a random $\sqrt{c}$-walk starting from $v_{0} = v$.

According to Definitions 3 and 4, Ref. [12] defined $\bar{s}(u, v)$ as the total probability that the $\sqrt{c}$-walk $W(u)$ starting from $u$ and the $\sqrt{c}$-walk $W(v)$ starting from $v$ first meet at each vertex $u_{i}$, i.e.,

$$\bar{s}(u, v) = \Pr[W(u) \text{ and } W(v) \text{ meet}] = \sum_{i} \Pr[W(u) \text{ and } W(v) \text{ first meet at } u_{i}]. \tag{3}$$

Definition 5 (Shortest-meeting length). Given any two $\sqrt{c}$-walks $W(v) = (v_{0}, \ldots, v_{t}, \ldots)$ starting from $v$ ($v_{0} = v$) and $W(u) = (u_{0}, \ldots, u_{t}, \ldots)$ starting from $u$ ($u_{0} = u$), where $u \in V$ and $v \in V$ ($v \neq u$), the shortest-meeting length $t$ between $u$ and $v$ is defined as the least $t$ such that the probability of meeting after $t$ steps is nonzero.

For example, in Figure 1, any two $\sqrt{c}$-walks $W(v_{4})$ starting from $v_{4}$ and $W(v_{5})$ starting from $v_{5}$ may first meet at $v_{1}$ or $v_{2}$ after walking one step, or at $v_{0}$ after walking two steps. Then, the shortest-meeting length between them is 1.
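Definition 5 can be checked mechanically: two walks can meet after $t$ steps with nonzero probability exactly when the sets of vertices reachable by $t$ reverse steps from $u$ and from $v$ intersect. The sketch below follows that reading. The function name and the toy graph are hypothetical, and the graph is not $G_{1}$ from Figure 1.

```python
def shortest_meeting_length(in_nbrs, u, v, l_max):
    """Least t <= l_max at which t-step reverse walks from u and v can coincide."""
    reach_u, reach_v = {u}, {v}
    for t in range(1, l_max + 1):
        # Advance both frontiers by exactly one reverse step.
        reach_u = {p for x in reach_u for p in in_nbrs.get(x, ())}
        reach_v = {p for x in reach_v for p in in_nbrs.get(x, ())}
        if reach_u & reach_v:
            return t
    return None  # the walks cannot meet within l_max steps

# Hypothetical toy graph given as in-neighbor lists.
toy_in_nbrs = {
    "a": ["x"], "b": ["x", "y"], "c": ["y"],
    "x": ["r"], "y": ["r"], "r": [],
}
print(shortest_meeting_length(toy_in_nbrs, "a", "b", l_max=3))  # share in-neighbor x
print(shortest_meeting_length(toy_in_nbrs, "a", "c", l_max=3))  # first meet at r
```

Vertices `a` and `b` share the in-neighbor `x`, so their length is 1, whereas `a` and `c` can first coincide only at `r`, two steps back.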

Definition 6 (Approximate single-source SimRank query). In the graph $G = (V, E)$, given source $u$, the average absolute error $ε$ allowed in SimRank computation, and the failure probability $δ$, an approximate single-source SimRank query returns, for any vertex $v$ ($v \neq u$), an estimated SimRank $\bar{s}(u, v)$ of the ground-truth SimRank $s(u, v)$ that satisfies

$$\Pr\{|s(u, v) - \bar{s}(u, v)| \leq ε\} \geq 1 - δ. \tag{4}$$

Problem Statement. (Approximate single-source and top-k query) In graph $G = (V, E)$, given source vertex $u$, decay factor $c$, and integer $k$, return the top-k vertices $V_{k}(u) = \{v_{i} \neq u, 1 \leq i \leq k\}$ with the largest SimRank to $u$.

3. Related Works

In this section, we review the state-of-the-art algorithms for the single-source and top-k query problem, which are based on the widely used random walk method.

Ref. [11] first proposed the SLING algorithm, which utilizes the $\sqrt{c}$-walk to compute SimRank. Subsequently, several studies [9,11,14] adopted the Monte Carlo (MC) method to sample $\sqrt{c}$-walks for each vertex $v$ and source $u$ a certain number of times. By counting the number of times they meet ($cnt$) out of the total number of walks sampled ($num$), the SimRank is obtained as $\bar{s}(u, v) = cnt / num$. After computing the SimRanks with source $u$ for all vertices and sorting them in descending order, the algorithms return the top-k results. Ref. [15] proposed the TSF algorithm based on the MC method. TSF constructs an index comprising one-way graphs that contain the coupling of random walks of length $T$ from each vertex. This approach helps reduce storage space requirements. To address the computational cost of simulations, Ref. [16] proposed an index-based algorithm called READS. The index consists of compressed $\sqrt{c}$-walks, which significantly improves the efficiency of queries. MC-based algorithms share a problem: when the SimRank of many vertices is negligible, their $\sqrt{c}$-walks of limited length do not meet the path of the source $u$, so processing these vertices is redundant. To tackle this issue, an index-free algorithm called ProbeSim was proposed by Ref. [12]. Instead of sampling $\sqrt{c}$-walks from each vertex $v$ to determine whether $v$ and source $u$ can meet at any $u_{i}$ within the $\sqrt{c}$-walk $W(u)$ from $u$, ProbeSim conducts a graph traversal from each $u_{i}$ to identify vertices that have a non-negligible probability of walking to $u_{i}$. This process is repeated for $n_{r}$ iterations to obtain results with approximate guarantees. By avoiding the processing of unpromising vertices, this approach significantly reduces computational overhead. However, it requires generating a large number of probing trees to determine whether $W(u)$ can meet every $v$ at each step of the $\sqrt{c}$-walk $W(u)$ starting from $u$.

To overcome the mentioned issue, Ref. [13] proposed an improved algorithm called CrashSim that builds upon the principles of ProbeSim. The main idea behind CrashSim is to consider the SimRank as the average probability of two $\sqrt{c}$-walks meeting within a limited length. CrashSim first computes a reverse reachable tree of source vertex $u$ with a limited length of $\sqrt{c}$-walk, denoted as $l_{max}$. It then iteratively generates a $\sqrt{c}$-walk for each vertex $v$ and determines whether it can meet the limited $\sqrt{c}$-walk path from $u$ with a non-negligible probability. This process is repeated for $n_{r}$ iterations to obtain approximate results with certain guarantees. By traversing the reverse reachable tree for only the source vertex $u$ instead of exploring the entire graph for each vertex $u_{i} \in V \setminus \{u\}$, CrashSim significantly reduces redundant computations.

However, the above-mentioned algorithms require computing the SimRank for each vertex $u_{i} \in V \setminus \{u\}$, sorting them in descending order, and returning the top-k results. Such a method results in significant redundant computational overhead when querying for a small value of k, considering that k is typically much smaller than the total number of vertices $n$ in real-world applications.

4. HitSim

Although the state-of-the-art algorithm CrashSim reduces redundant computation by traversing the reverse reachable tree for only source vertex $u$, it still generates $\sqrt{c}$-walks for each vertex $v$ to determine whether it can meet the $\sqrt{c}$-walks from $u$ within a limited length. However, since the parameter k is typically much smaller than the total number of vertices $n$, processing unpromising vertices with negligible SimRank values is redundant. To overcome this issue, we propose an efficient algorithm called HitSim. It utilizes a branch and bound strategy to batch-prune unpromising vertices and thus avoid the redundant processing. The algorithm is implemented in three steps, as follows.

Step 1: Computing the reverse reachable tree U. In the first step of HitSim, we follow the same approach as CrashSim [13]. This involves computing the reverse reachable tree from the source vertex $u$ and generating a matrix $U$. Each element $U_{step}(v)$ in the matrix represents the probability of the $\sqrt{c}$-walk $W(u)$ stopping at vertex $v$ by walking $step$ steps.

Example 1. We illustrate the computation of the reverse reachable tree $U$ using the graph $G_{2}$ shown in Figure 2. Given the source $v_{0}$, we set the limited length $l_{max}$ of the $\sqrt{c}$-walk $W(v_{0})$ to 4 and the decay factor $c$ to 0.25 ($\sqrt{c} = 0.5$). The algorithm computes the reverse reachable tree of $v_{0}$, which is shown in Figure 3. Note that the in-neighbor of vertex $v$ that is equal to the parent of $v$ is ignored to avoid recomputing the probability due to the cycle in the graph [13]. Simultaneously, it computes the probability of the $\sqrt{c}$-walk stopping at different vertices with different lengths. For level 0, it sets the probability $U_{0}(v_{0}) = 1$. Next, for level 1, $U_{1}(v_{1}) = U_{0}(v_{0}) \cdot \frac{\sqrt{c}}{|I(v_{1})|} = 1 \cdot \frac{0.5}{2} = 0.25$ and $U_{1}(v_{2}) = U_{0}(v_{0}) \cdot \frac{\sqrt{c}}{|I(v_{2})|} = 1 \cdot \frac{0.5}{3} = 0.167$. Similarly, for level 2, $U_{2}(v_{4}) = 0.0625$, $U_{2}(v_{1}) = 0.0417$, and $U_{2}(v_{3}) = 0.0417$, and for level 3, $U_{3}(v_{7}) = 0.0156$, $U_{3}(v_{0}) = 0.0104$, $U_{3}(v_{4}) = 0.0104$, and $U_{3}(v_{1}) = 0.0104$.
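The level-by-level arithmetic of Example 1 can be sketched as follows. The code applies the recurrence exactly as it is written in the example, $U_{l+1}(w) = U_{l}(\text{parent}) \cdot \sqrt{c} / |I(w)|$, and skips the in-neighbor equal to the tree parent. It is not the revReach implementation of CrashSim [13]. The in-neighbor lists are an assumption, chosen so that the output reproduces the probabilities quoted above; Figure 2 is not reproduced here, so the actual $G_{2}$ may differ.

```python
from collections import defaultdict

def rev_reach_levels(in_nbrs, source, c, l_max):
    """Return a list of dicts: levels[l][v] is U_l(v) for tree level l."""
    sqrt_c = c ** 0.5
    levels = [{source: 1.0}]
    frontier = [(source, None, 1.0)]  # (vertex, tree parent, probability)
    for _ in range(l_max):
        level = defaultdict(float)
        nxt = []
        for v, parent, prob in frontier:
            for w in in_nbrs.get(v, []):
                if w == parent:
                    continue  # ignore the in-neighbor equal to the parent
                # Assumption: treat a missing in-degree as 1 to avoid dividing by 0.
                p = prob * sqrt_c / max(len(in_nbrs.get(w, [])), 1)
                level[w] += p
                nxt.append((w, v, p))
        levels.append(dict(level))
        frontier = nxt
    return levels

# Hypothetical in-neighbor lists (an assumption, not read from Figure 2).
g2_in_nbrs = {
    "v0": ["v1", "v2"], "v1": ["v0", "v4"], "v2": ["v0", "v1", "v3"],
    "v3": ["v1", "v2"], "v4": ["v1", "v7"], "v5": ["v2"],
    "v6": ["v2"], "v7": ["v3", "v5"],
}
levels = rev_reach_levels(g2_in_nbrs, "v0", c=0.25, l_max=3)
for l, row in enumerate(levels):
    print(l, {v: round(p, 4) for v, p in row.items()})
```

Storing one dictionary per level keeps the lookup $U_{step}(v)$ cheap and returns 0 implicitly for vertices that do not appear at that level.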

Step 2: Partitioning vertices. Based on CrashSim, assuming the $x$th sampling, if a $\sqrt{c}$-walk starting from $v$, i.e., $W(v) = (v_{0}, \ldots, v_{t}, \ldots, v_{i})$ ($v = v_{0}$, $t \in [1, i]$, $1 \leq i \leq l_{max}$), first meets the $\sqrt{c}$-walk $W(u)$ starting from source $u$ at $v_{t}$, then

$$\bar{s}_{x}(u, v) = U_{0}(v_{0}) + \cdots + U_{t}(v_{t}) + \cdots + U_{i}(v_{i}). \tag{5}$$

Lemma 1. Given a graph $G = (V, E)$ and a source vertex $u$, if there exists a vertex $v$ ($v \neq u$) whose shortest-meeting length is $t$ ($1 \leq t \leq l_{max}$), then the upper bound of SimRank for $v$, denoted as $B_{v}$, is $\sum_{l=t}^{l_{max}} p_{l}$, where $p_{l}$ represents the maximum probability in level $l$ of the reverse reachable tree of the source $u$.

Proof of Lemma 1. Since the shortest-meeting length of $v$ is $t$ ($1 \leq t \leq l_{max}$), any $\sqrt{c}$-walk $W_{x}(v) = (v_{0}, \ldots, v_{t}, \ldots, v_{i})$ ($v_{0} = v$, $t \in [1, i]$, $1 \leq i \leq l_{max}$) starting from $v$ will not meet the $\sqrt{c}$-walk $W_{x}(u)$ starting from source $u$ before $v_{t}$. Then, we have

$$U_{0}(v_{0}) = \cdots = U_{t-1}(v_{t-1}) = 0. \tag{6}$$

Furthermore, $p_{l}$ is the maximum probability in level $l$ of the reverse reachable tree of source $u$. Then,

$$U_{t}(v_{t}) \leq p_{t}, \ldots, U_{i}(v_{i}) \leq p_{i}, \quad (t \in [1, i], 1 \leq i \leq l_{max}). \tag{7}$$

Based on Equations (5)–(7), for any $\sqrt{c}$-walk $W_{x}(v) = (v_{0}, \ldots, v_{t}, \ldots, v_{i})$ ($v_{0} = v$, $t \in [1, i]$, $1 \leq i \leq l_{max}$) starting from $v$, we have

$$\bar{s}_{x}(u, v) \leq \sum_{l=t}^{i} p_{l}, \quad (t \in [1, i], 1 \leq i \leq l_{max}). \tag{8}$$

Since $i \leq l_{max}$, we have $\sum_{l=t}^{i} p_{l} \leq \sum_{l=t}^{l_{max}} p_{l}$. The average value after $n_{r}$ trials therefore satisfies

$$\bar{s}(u, v) \leq \sum_{l=t}^{l_{max}} p_{l}. \tag{9}$$

Thus, for any vertex $v$, the upper bound of SimRank is $B_{v} = \sum_{l=t}^{l_{max}} p_{l}$, where $t$ is the shortest-meeting length, i.e., Lemma 1 holds. □

According to Lemma 1, for any two different vertices $v$ and $w$ ($v \neq u$, $w \neq u$), if the shortest-meeting lengths of $v$ and $w$ are $t_{v}$ and $t_{w}$, respectively, then $B_{v} = \sum_{l=t_{v}}^{l_{max}} p_{l}$ and $B_{w} = \sum_{l=t_{w}}^{l_{max}} p_{l}$. If $t_{v} < t_{w}$, then $B_{v}$ contains every term of $B_{w}$ plus the additional terms $p_{t_{v}}, \ldots, p_{t_{w}-1}$, so $B_{v} = \sum_{l=t_{v}}^{l_{max}} p_{l} > B_{w} = \sum_{l=t_{w}}^{l_{max}} p_{l}$. From the above analysis, we have Observation 1 as follows:

Observation 1. For any vertex $v$, the smaller the shortest-meeting length, the larger the upper bound $B_{v}$.

According to Lemma 1, vertices with the same shortest-meeting length share the same upper bound. By partitioning the vertices into distinct sets based on their shortest-meeting lengths, we ensure that vertices within the same set share the same upper bound, i.e., for any vertex $v$ in the same set $M$, $B_{v} = B_{M}$, where $B_{M}$ is the upper bound of SimRank for set $M$. If we identify a set $M$ satisfying $B_{M} \leq \bar{s}_{min}$, where $\bar{s}_{min}$ is the minimum SimRank value of the current results, we can safely prune the unpromising vertices within $M$ in batch. According to Observation 1, we should preferentially process sets with smaller shortest-meeting lengths. Such preprocessing allows the algorithm to preferentially compute SimRank for vertices with a larger SimRank upper bound, leading to the effective pruning of unpromising vertices.

Step 3: Computing $\bar{s}(u, v)$ and maintaining the top-k results. Following step 2, HitSim generates $\sqrt{c}$-walks and computes the SimRank only for the vertices within the sets whose upper bounds exceed $\bar{s}_{min}$. Simultaneously, HitSim maintains the top-k results and the current minimum SimRank $\bar{s}_{min}$ by sorting the results based on their SimRank in descending order.

Since step 1 is identical to CrashSim, we will omit its detailed description. The detailed descriptions of steps 2 (Section 4.1) and 3 (Section 4.2) are as follows.

4.1. Partitioning Vertices

We now describe the ParVer algorithm, which is used to partition vertices into distinct sets based on their shortest-meeting lengths with the source vertex. Given a graph $G = (V, E)$, a source vertex $u \in V$, a limited length $l_{max}$ of $\sqrt{c}$-walk, and a probability matrix $U$, ParVer returns a vertex set $M_{t}$ and the upper bound of SimRank $B_{M_{t}}$ for each $t = 1, \ldots, l_{max}$, where $t$ is the shortest-meeting length of the vertices within $M_{t}$. Notably, $M_{t}$ does not include any vertices from the set $M_{t-1}$.

The pseudo-code of ParVer is depicted in Algorithm 1. It initializes a hashset $M_{t}$ for each $t = 1, \ldots, l_{max}$ to store the vertices with shortest-meeting lengths equal to $t$ (line 1). To avoid revisiting vertices of the same level, it requires a hashset $Du_{t}$ for each $t = 1, \ldots, l_{max}$ to store the visited vertices (line 2). The probabilities $p_{t}$ and $B_{M_{t}}$ are initialized to store the maximum value of $U_{t}$ and the upper bound of SimRank of $M_{t}$ for each $t = 1, \ldots, l_{max}$ (line 3). To identify all the vertices within $M_{t}$ for each $t = 1, \ldots, l_{max}$, it iterates over each vertex $tpr$ in level $t$ of the reverse reachable tree of source $u$ (lines 4 to 20). For each vertex $tpr$ in level $t$ of the reverse reachable tree of source $u$, it records the current maximum probability $p_{t}$ in level $t$ (line 6). Then, it performs forward walks of $t$ steps from $tpr$ to obtain each vertex $v$ whose shortest-meeting length is $t$ (lines 7 to 20). Specifically, it first initializes a queue $Q$ to store the vertices that $tpr$ visits in $t$-step forward walks. Then, it puts $tpr$ into $Q$ and $Du_{i}$, where $i$ represents the current level. Next, it forward walks from the vertex $tpr$ and records the size of $Q$. With each iteration, the step $i$ is reduced by 1. Then, it visits $Q$ and pops the top element $q$ of $Q$. For each out-neighbor $v$ of $q$, if the step of the current forward walk does not exceed the maximum walk length $i$ and $v$ has not been visited before, then $v$ is inserted directly into $Du_{i}$ and $Q$. If $v$ does not exist in any set that stores vertices with shortest-meeting lengths from 0 to $t - 1$ and $v \neq u$, it is added to $M_{t}$. Then, all the vertices within $M_{t}$ for each $t = 1, \ldots, l_{max}$ are obtained. Afterward, it computes the upper bound $B_{M_{t}}$ based on Lemma 1 (lines 21 to 22). Finally, ParVer returns the vertex set $M_{t}$ and its upper bound $B_{M_{t}}$ for each $t = 1, \ldots, l_{max}$ (line 23).

Algorithm 1: ParVer
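The partitioning idea can be sketched in a few lines of Python. This is a simplified set-based variant, not a transcription of Algorithm 1. It replaces the queue and the visited sets $Du_{t}$ with whole-frontier expansion, and the function name `par_ver` is illustrative. The out-neighbor lists are an assumption, chosen so that the result matches the sets reported in Example 2, and the matrix `U` holds the rounded values quoted in Example 1.

```python
def par_ver(out_nbrs, source, U, l_max):
    """Partition vertices by shortest-meeting length and bound each set (Lemma 1)."""
    M, p, assigned = {}, {}, set()
    for t in range(1, l_max + 1):
        p[t] = max(U[t].values(), default=0.0)  # maximum probability in level t
        frontier = set(U[t])                    # tree vertices at level t
        for _ in range(t):                      # walk t forward steps
            frontier = {w for q in frontier for w in out_nbrs.get(q, ())}
        M[t] = frontier - assigned - {source}   # keep only newly reached vertices
        assigned |= M[t]
    B, suffix = {}, 0.0
    for t in range(l_max, 0, -1):               # B_{M_t} = p_t + ... + p_{l_max}
        suffix += p[t]
        B[t] = suffix
    return M, B

# Hypothetical out-neighbor lists (an assumption, not read from Figure 2).
g2_out_nbrs = {
    "v0": ["v1", "v2"], "v1": ["v0", "v2", "v3", "v4"],
    "v2": ["v0", "v3", "v5", "v6"], "v3": ["v2", "v7"],
    "v4": ["v1"], "v5": ["v7"], "v6": [], "v7": ["v4"],
}
# Level probabilities as quoted in Example 1.
U = [
    {"v0": 1.0},
    {"v1": 0.25, "v2": 0.167},
    {"v4": 0.0625, "v1": 0.0417, "v3": 0.0417},
    {"v7": 0.0156, "v0": 0.0104, "v4": 0.0104, "v1": 0.0104},
]
M, B = par_ver(g2_out_nbrs, "v0", U, l_max=3)
print(M)
print({t: round(b, 4) for t, b in B.items()})
```

Accumulating the bounds from the last level backwards gives every $B_{M_{t}}$ in a single pass over the $p_{t}$ values.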

Example 2. Considering the graph $G_{2}$ shown in Figure 2, we illustrate step 2, which involves the partitioning of vertices. Assume that the source vertex is $v_{0}$. For simplicity, we set $l_{max} = 3$ and the decay factor $c = 0.25$ ($\sqrt{c} = 0.5$). In step 1, it computes the reverse reachable tree $U$, as shown in Example 1. Continuing from Example 1, in step 2, it partitions vertices of $G_{2}$ into distinct sets based on their shortest-meeting lengths with source $v_{0}$. Starting with level 1 of the reverse reachable tree, it forward walks one step from the first vertex $v_{1}$. Consequently, it inserts $v_{1}$'s out-neighbors $v_{2}$, $v_{3}$, and $v_{4}$ into set $M_{1}$. It omits $v_{0}$ since $v_{0}$ is the source vertex. Next, it forward walks one step from the second vertex $v_{2}$ in level 1. It inserts $v_{2}$'s out-neighbors $v_{5}$ and $v_{6}$ into set $M_{1}$ and omits $v_{3}$ since $v_{3}$ is already in $M_{1}$. Thus, we have $M_{1} = \{v_{2}, v_{3}, v_{4}, v_{5}, v_{6}\}$, and the maximum probability is $p_{1} = \max\{U_{1}(v_{1}), U_{1}(v_{2})\} = 0.25$. Moving to level 2, it forward walks two steps from $v_{4}$, $v_{1}$, and $v_{3}$, respectively. This leads to $M_{2} = \{v_{1}, v_{7}\}$. Note that $v_{2}$, $v_{3}$, $v_{4}$, $v_{5}$, and $v_{6}$ are not inserted into $M_{2}$ since they are already in $M_{1}$. The maximum probability is computed as $p_{2} = \max\{U_{2}(v_{4}), U_{2}(v_{1}), U_{2}(v_{3})\} = 0.0625$. Lastly, in level 3, it forward walks three steps from $v_{7}$, $v_{0}$, $v_{4}$, and $v_{1}$, respectively. As a result, $M_{3} = \emptyset$, indicating that no new vertices are added to $M_{3}$. The maximum probability is computed as $p_{3} = \max\{U_{3}(v_{7}), U_{3}(v_{0}), U_{3}(v_{4}), U_{3}(v_{1})\} = 0.0156$. Finally, according to Lemma 1, the SimRank upper bounds of sets $M_{1}$ and $M_{2}$ are computed as $B_{M_{1}} = \sum_{l=1}^{3} p_{l} = 0.25 + 0.0625 + 0.0156 = 0.3281$ and $B_{M_{2}} = \sum_{l=2}^{3} p_{l} = 0.0625 + 0.0156 = 0.0781$, respectively.

4.2. Computing $\bar{s} (u, v)$ and Maintaining the Top-k Results

The complete pseudo-code of HitSim is illustrated in Algorithm 2. Given a graph $G = (V, E)$, a source $u \in V$, an integer $k$, a decay factor $c$, a limited length $l_{max}$ of $\sqrt{c}$-walk, an average absolute error $ε$, and a failure probability $δ$, HitSim returns the queue $Q$ of the top-k results. It first invokes the revReach algorithm of CrashSim [13] to construct the reverse reachable tree of source $u$ and return a matrix $U$ (line 1). Then, it invokes the ParVer algorithm (Algorithm 1) to partition vertices into distinct sets (line 2). Step 3, which involves computing $\bar{s}(u, v)$ and maintaining the top-k results, starts from line 3 of Algorithm 2.

In step 3, the algorithm initializes a priority queue $Q$ to store the results (line 4). Based on Observation 1, for any vertex $v$, a smaller shortest-meeting length corresponds to a larger $B_{v}$. Thus, HitSim processes the sets in ascending order of their shortest-meeting lengths $t$ in order to obtain the top-k results as early as possible (lines 5 to 20). Specifically, for each vertex $v$ in $M_{t}$, it runs $n_{r} = \frac{3c}{ε^{2}} \log \frac{n}{δ}$ independent trials (lines 6 to 16). The value of $n_{r}$, which is the minimum number of iterations that guarantees an error less than $ε$ with probability at least $1 - δ$, is defined by [13]. During each iteration, it generates a $\sqrt{c}$-walk starting from $v$ and limits the length of the walk to $l_{max}$ (lines 8 to 9). Then, for the $\sqrt{c}$-walk with length $i$, where $i \in [1, l_{max}]$, it accumulates the total first-meeting probability of this walk meeting the $\sqrt{c}$-walk starting from $u$ at the $i$th element of $W(v)$ (lines 10 to 12). After completing the $n_{r}$ trials, it computes the average of the results to obtain the final SimRank $\bar{s}(u, v) = \frac{1}{n_{r}} \sum_{x=1}^{n_{r}} \bar{s}_{x}(u, v)$ and inserts $(v, \bar{s}(u, v))$ into $Q$ (lines 14 to 15). When the size of the current results is $k$, it checks whether the upper bound $B_{M_{t+1}}$ of the next set to be processed is larger than $\bar{s}_{min}$. If it is, HitSim continues generating $\sqrt{c}$-walks for the vertices within $M_{t+1}$. Otherwise, it terminates and returns the current top-k results (lines 17 to 19).

Algorithm 2: HitSim
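Step 3 can be sketched as below, under several stated assumptions. The per-trial score follows Equation (5) as written (a sum of $U_{i}(v_{i})$ along the sampled walk). The inputs `U`, `M`, and `B` are the values quoted in Examples 1 and 2. The in-neighbor lists, the function names, the seed, and the small trial count `n_r=200` are hypothetical choices for illustration. A bounded min-heap stands in for the priority queue $Q$. This is not a transcription of Algorithm 2.

```python
import heapq
import random

def trial_score(U, walk):
    """Equation (5): sum U_i(v_i) over the walk (0 where level i has no entry)."""
    return sum(U[i].get(v, 0.0) for i, v in enumerate(walk) if i < len(U))

def sample_walk(in_nbrs, v, sqrt_c, l_max, rng):
    """One sqrt(c)-walk from v, truncated to at most l_max steps."""
    walk = [v]
    while len(walk) <= l_max:
        nbrs = in_nbrs.get(walk[-1], [])
        if not nbrs or rng.random() >= sqrt_c:
            break
        walk.append(rng.choice(nbrs))
    return walk

def hitsim_topk(in_nbrs, U, M, B, k, c, l_max, n_r, seed=0):
    rng = random.Random(seed)
    sqrt_c = c ** 0.5
    heap = []  # min-heap of (estimate, vertex); heap[0][0] plays the role of s_min
    for t in sorted(M):
        if len(heap) == k and B[t] <= heap[0][0]:
            break  # batch-prune M_t; later sets have even smaller bounds
        for v in sorted(M[t]):
            total = sum(trial_score(U, sample_walk(in_nbrs, v, sqrt_c, l_max, rng))
                        for _ in range(n_r))
            est = total / n_r
            if len(heap) < k:
                heapq.heappush(heap, (est, v))
            elif est > heap[0][0]:
                heapq.heapreplace(heap, (est, v))
    return sorted(heap, reverse=True)

# Hypothetical in-neighbor lists (an assumption, not read from Figure 2).
g2_in_nbrs = {
    "v0": ["v1", "v2"], "v1": ["v0", "v4"], "v2": ["v0", "v1", "v3"],
    "v3": ["v1", "v2"], "v4": ["v1", "v7"], "v5": ["v2"],
    "v6": ["v2"], "v7": ["v3", "v5"],
}
U = [
    {"v0": 1.0},
    {"v1": 0.25, "v2": 0.167},
    {"v4": 0.0625, "v1": 0.0417, "v3": 0.0417},
    {"v7": 0.0156, "v0": 0.0104, "v4": 0.0104, "v1": 0.0104},
]
M = {1: {"v2", "v3", "v4", "v5", "v6"}, 2: {"v1", "v7"}, 3: set()}
B = {1: 0.3281, 2: 0.0781, 3: 0.0156}
top1 = hitsim_topk(g2_in_nbrs, U, M, B, k=1, c=0.25, l_max=3, n_r=200)
print(top1)
```

Keeping only $k$ entries in the heap makes $\bar{s}_{min}$ available in constant time as the heap root, so the pruning test before each set costs a single comparison.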

Example 3. Continuing from Example 2, we set $k$ to 1; the objective of HitSim is to return the top-1 result. The algorithm begins by processing the first vertex $v_{2}$ within the set $M_{1} = \{v_{2}, v_{3}, v_{4}, v_{5}, v_{6}\}$, whose upper bound is $B_{M_{1}} = 0.3281$. Suppose that at the $x$th trial, it generates a $\sqrt{c}$-walk $W(v_{2}) = (v_{2}, v_{3}, v_{1}, v_{0})$ starting from $v_{2}$. It computes the SimRank as $\bar{s}_{x}(v_{0}, v_{2}) = U_{0}(v_{2}) + U_{1}(v_{3}) + U_{2}(v_{1}) + U_{3}(v_{0}) = 0 + 0 + 0.0417 + 0.0104 = 0.0521$. After conducting a total of $n_{r}$ iterations, it computes the average value $\bar{s}(v_{0}, v_{2}) = \frac{1}{n_{r}} \sum_{x=1}^{n_{r}} \bar{s}_{x}(v_{0}, v_{2})$. Once all the vertices in $M_{1}$ have been processed, it compares the minimum value of SimRank $\bar{s}_{min}$ with the upper bound $B_{M_{2}} = 0.0781$ of the set $M_{2}$. If $B_{M_{2}} \leq \bar{s}_{min}$, HitSim batch-prunes all the vertices in $M_{2}$.

The SimRanks between $v_{0}$ and any other vertices are listed in Table 2, and they have been computed by [6] within a $10^{-5}$ error. From Table 2 we can see that, due to the branch and bound strategy, HitSim is able to efficiently batch-prune vertices such as $v_{1}$ and $v_{7}$ within $M_{2}$, whose SimRanks are notably smaller than the rest.

4.3. Analysis

The time and space complexity of HitSim can be analyzed in steps 1, 2, and 3.

Step 1 (computing the reverse reachable tree U): In the worst case, it requires traversing each edge once and storing each vertex, resulting in a time complexity of $O(m)$ and a space complexity of $O(n)$.
Step 2 (partitioning vertices): This step involves traversing each vertex of the reverse reachable tree and storing all the vertex sets. In the worst case, we need to traverse every edge in the graph by visiting the out-neighbors of each vertex, so the worst-case time complexity is $O(m)$. The space complexity is $O(n)$.
Step 3 (computing $\bar{s}(u, v)$ and maintaining the top-k results): In the worst case, it requires running $n_{r}$ trials of generating a $\sqrt{c}$-walk for each vertex, with a limited length of $l_{max}$. Additionally, it requires maintaining a priority queue $Q$ of size $k$. Thus, the time complexity is $O(n \cdot l_{max} \cdot \frac{3c}{ε^{2}} \log \frac{n}{δ} + n \log k)$, and the space complexity is $O(n + k)$.

In summary, the total time complexity of HitSim is $O(2m + n \cdot l_{max} \cdot \frac{3c}{ε^{2}} \log \frac{n}{δ} + n \log k)$, and the total space complexity is $O(n + k)$.
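To give a feel for the dominant term, the following sketch evaluates $n_{r}$ and the worst-case number of walk steps $n \cdot l_{max} \cdot n_{r}$ for one parameter setting. The helper names are illustrative, the logarithm is assumed to be natural, the result is rounded up to an integer, and the values $ε = 0.1$ and $δ = 0.01$ are illustrative choices rather than settings reported in the paper.

```python
import math

def num_trials(c, eps, n, delta):
    """n_r = (3c / eps^2) * log(n / delta), rounded up (natural log assumed)."""
    return math.ceil(3 * c / eps ** 2 * math.log(n / delta))

def walk_step_budget(n, l_max, n_r):
    """Worst-case number of walk steps in step 3: n * l_max * n_r."""
    return n * l_max * n_r

# Illustrative setting: the 8-vertex scale and l_max = 3 of the running example.
n_r = num_trials(c=0.25, eps=0.1, n=8, delta=0.01)
budget = walk_step_budget(n=8, l_max=3, n_r=n_r)
print(n_r, budget)
```

Since $n_{r}$ grows with $1/ε^{2}$, halving the allowed error roughly quadruples the per-vertex sampling cost, which is why skipping whole sets of vertices pays off.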

5. Optimization

The batch-pruning method employed in HitSim aims to prune unpromising vertices within the sets whose upper bounds of SimRank are less than the minimum SimRank of the current results. However, in certain cases, such as Example 3 with $k$ set to 2 instead of 1, the batch-pruning strategy may fail. In that case, after computing the SimRank for all vertices in $M_{1}$, the current top-2 results are $v_{3}$ and $v_{4}$, with a minimum SimRank $\bar{s}_{min}$ of approximately 0.074, as shown in Table 2. Since $\bar{s}_{min}$ is smaller than the upper bound $B_{M_{2}} = 0.0781$, HitSim continues to compute the SimRank for vertices $v_{1}$ and $v_{7}$ within $M_{2}$, indicating a failure of the batch-pruning method.

The reason for this failure is that in dense graphs, vertices may have a large number of in-neighbors. Vertices with the same shortest-meeting length but significantly different SimRanks can be partitioned into the same set, leading to imprecise pruning. For instance, in

G_{2}

, the reverse reachable tree of

v_{7}

(shown in Figure 4) reveals that

v_{7}

cannot meet source

v_{0}

at

v_{4}

in level 2, where

U_{2} (v_{4})

is the maximum value in level 2. Similarly,

v_{7}

cannot meet source

v_{0}

at

v_{7}

in level 3, where

U_{3} (v_{7})

is the maximum value in level 3. Consequently, although

v_{7}

is partitioned into

M_{2}

, its upper bound

B_{M_{2}}

, computed by accumulating the maximum values from level 2 and level 3, significantly exceeds its actual upper bound

B_{v_{7}}

. When processing such sets (e.g.,

M_{2}

), it becomes inevitable to compute the SimRank for the vertices with a small SimRank (e.g.,

v_{7}

), which renders batch pruning ineffective.

To address the issue, we propose an optimized algorithm called HitSim-OPT. HitSim-OPT aims to perform fine-grained and effective vertex pruning by computing a tighter upper bound of SimRank for each individual vertex, instead of that for each set. The optimized algorithm consists of the following three steps.

Step 1: Computing the reverse reachable tree U. The details of this step are omitted as they are the same as step 1 of HitSim.

Step 2: Computing the upper bound of SimRank for each vertex. According to Lemma 1, the upper bound of SimRank for any vertex v is computed by

B_{v} = \sum_{l = t}^{l_{m a x}} p_{l}

, where t represents the shortest-meeting length and

p_{l}

denotes the maximum probability in level l of the reverse reachable tree of u. To refine the upper bound

B_{v}

, we introduce a modification by replacing

p_{l}

with

p_{l} (v)

, which represents the maximum probability in level l of the reverse reachable tree of v. Since each vertex

t p r

in the reverse reachable tree of v is a potential vertex within the

\sqrt{c}

-walks

W (v)

starting from v,

\sum_{l = t}^{l_{m a x}} p_{l} (v)

provides a more precise bounding than

\sum_{l = t}^{l_{m a x}} p_{l}

.

By utilizing the inequality

\sum_{l = t}^{l_{m a x}} p_{l} (v) \leq \sum_{l = 0}^{l_{m a x}} p_{l} (v)

, we can ensure

B_{v} \leq \sum_{l = 0}^{l_{m a x}} p_{l} (v)

. Consequently, we redefine the upper bound for each vertex v as

B_{v} = \sum_{l = 0}^{l_{m a x}} p_{l} (v) .

(10)

Step 3: Maintaining the top-k results. Similar to step 3 of HitSim, HitSim-OPT maintains the top-k results and the current minimum SimRank

{\bar{s}}_{m i n}

by sorting the results based on their SimRank in descending order. For each vertex v, HitSim-OPT determines whether to further compute its actual SimRank by comparing

B_{v}

with

{\bar{s}}_{m i n}

. If

B_{v}

is no larger than

{\bar{s}}_{m i n}

, HitSim-OPT prunes v.

Since step 1 is identical to that of HitSim, we will omit its detailed description. The detailed descriptions of the algorithms in steps 2 (Section 5.1) and 3 (Section 5.2) are as follows.

5.1. Computing Upper Bound of SimRank for Each Vertex

A straightforward approach is to construct a reverse reachable tree for each vertex v, determine the maximum probability of v meeting the source vertex u at each level of u’s reverse reachable tree, and accumulate these maxima according to Equation (10). Taking

v_{7}

from Figure 2 as an example, we construct the reverse reachable tree of

v_{7}

, as shown in Figure 4. At level 1, the maximum probability is 0 because

v_{7}

cannot meet the source vertex

v_{0}

in one step (see the reverse reachable tree of source

v_{0}

in Figure 3). In level 2, the maximum probability is 0.0417 as

v_{7}

can meet

v_{0}

at

v_{1}

in two steps. Similarly, in level 3, the maximum probability is 0.0104 as

v_{7}

can meet

v_{0}

at either

v_{1}

or

v_{0}

in three steps. By accumulating the maximum probabilities of the three levels, we obtain

B_{v_{7}} = 0 + 0.0417 + 0.0104 = 0.0521

.
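The effect of the tighter bound can be checked with a few lines of Python. This is only an illustrative sketch: the numbers are those reported in this section (the per-level maxima of $v_{7}$, the set-level bound $B_{M_{2}} = 0.0781$, and ${\bar{s}}_{m i n} \approx 0.074$), and the helper `can_prune` is a hypothetical name for the comparison against the current minimum.

```python
# Per-level maxima for v7 as reported in the example (levels 1 to 3).
level_maxima_v7 = [0.0, 0.0417, 0.0104]
b_v7 = sum(level_maxima_v7)      # vertex-level bound B_{v7}

b_m2 = 0.0781                    # set-level bound B_{M2} reported earlier in this section
s_min = 0.074                    # current minimum SimRank of the top-2 results

def can_prune(upper_bound: float, s_min: float) -> bool:
    """A candidate is pruned when its upper bound cannot exceed s_min."""
    return upper_bound <= s_min

print(can_prune(b_m2, s_min))    # set-level check for M2
print(can_prune(b_v7, s_min))    # vertex-level check for v7
```

The set-level check fails because 0.0781 exceeds 0.074, whereas the vertex-level bound of about 0.0521 falls below it, so $v_{7}$ can be discarded without generating any walks.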

However, constructing a reverse reachable tree for every vertex is expensive in both time and space. To address this challenge, we propose a more efficient method called UBV (Upper Bound of Vertex), which uses dynamic programming to compute the upper bound of SimRank for each vertex v. It utilizes an array

d p_{w}

of length

l_{m a x}

for each potential meeting point w to store the maximum probabilities at levels 0 to

l_{m a x}

. This allows the algorithm to traverse source u’s reverse reachable tree only once instead of constructing a reverse reachable tree for each vertex. For each potential meeting point w in level

t (0 \leq t \leq l_{m a x})

,

d p_{w} (i) (0 \leq i \leq l_{m a x})

represents the maximum probability that v can meet u at level i. UBV dynamically computes

d p_{w} (i)

for each w starting from the bottom of u’s reverse reachable tree. By the end of the traversal of level 0, a set of

d p_{w}

values can be obtained, where the vertices w in level 0 are those that can meet u within

l_{m a x}

steps. The upper bounds of SimRank for these vertices are computed as

\sum_{i = 0}^{l_{m a x}} d p_{w} (i)

.

Lemma 2.

For each potential meeting point w, the state array

d p_{w}

in level t is computed by the following equation.

d p_{w} (i) = \begin{cases} 0, & i < t \\ U_{t} (w), & i = t \\ \max \{d p_{t p r} (i)\}, & i > t \end{cases}

(11)

where

U_{t} (w)

is the probability of source u traversing and stopping at vertex w, and

t p r

ranges over the in-neighbors of vertex w.

Proof of Lemma 2.

(1) When

i < t

, the

\sqrt{c}

-walk starting from w at level t cannot meet any vertex at level i in the reverse reachable tree of source vertex u, thus

d p_{w} (i) = 0

; (2) When

i = t

, the

\sqrt{c}

-walk starting from w at level t happens to meet w itself, thus

d p_{w} (i) = U_{t} (w)

; (3) When

i > t

, the

\sqrt{c}

-walk starting from w at level t must pass through one of its in-neighbors, denoted as

t p r

. In this case,

d p_{w} (i)

is equal to the maximum probability among its in-neighbors meeting at level i, i.e.,

d p_{w} (i) = \max \{d p_{t p r} (i)\}

. □

The pseudo-code of UBV is illustrated in Algorithm 3. UBV initializes an array

d p_{t p r}

for each

t p r

in the reverse reachable tree of source vertex u from the bottom to the top (lines 4 to 9). Subsequently, it updates

d p_{v}

for each out-neighbor of

t p r

based on Equation (11) (lines 10 to 20). Note that the vertices v at level 0 represent the vertices that can meet u within

l_{m a x}

steps, and

\sum_{i = 0}^{l_{m a x}} d p_{v} (i)

denotes the upper bounds of SimRank (lines 22 to 26).

Algorithm 3: UBV
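Equation (11) and the bottom-up traversal described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-level probabilities U are taken as given (for instance, as produced by revReach), each array holds $l_{m a x} + 1$ entries so that levels 0 to $l_{m a x}$ are all addressable, and the names `ubv_bounds`, `U`, and `out_neighbors` as well as the toy graph are hypothetical.

```python
def ubv_bounds(U, out_neighbors, l_max):
    """Upper bounds via the recurrence of Equation (11).

    U[t] maps each vertex w in level t of the source's reverse
    reachable tree to U_t(w). out_neighbors[x] lists the vertices
    that have x as an in-neighbor.
    """
    prev = {}  # dp arrays of level t + 1 (empty below the bottom level)
    for t in range(l_max, -1, -1):
        cur = {}
        # Case i == t: w itself is a potential meeting point in level t.
        for w, prob in U[t].items():
            cur.setdefault(w, [0.0] * (l_max + 1))[t] = prob
        # Case i > t: take the maximum over in-neighbors tpr, by pushing
        # each dp_tpr from level t + 1 to the out-neighbors of tpr.
        for tpr, dp_tpr in prev.items():
            for v in out_neighbors.get(tpr, ()):
                dp_v = cur.setdefault(v, [0.0] * (l_max + 1))
                for i in range(t + 1, l_max + 1):
                    dp_v[i] = max(dp_v[i], dp_tpr[i])
        prev = cur
    # Level 0 holds every vertex that can meet the source within l_max steps.
    return {v: sum(dp) for v, dp in prev.items()}

# Toy graph (made up): c is the only in-neighbor of both a and b; a is the source.
U = [{"a": 1.0}, {"c": 0.5}]
out_neighbors = {"c": ["a", "b"]}
bounds = ubv_bounds(U, out_neighbors, l_max=1)
print(bounds)
```

In the toy graph, b inherits the level-1 entry of c, so its bound is 0.5. Entries for i < t are never written and therefore keep their initial value of 0, which matches the first case of Equation (11).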

Example 4.

We use

G_{2}

in Figure 2 to illustrate step 2 of HitSim-OPT, i.e., the computation of the SimRank upper bound for each vertex. Given source

v_{0}

, for simplicity, we set

l_{m a x} = 4

and the decay factor

c = 0.25 (\sqrt{c} = 0.5)

. Continuing with Example 2, HitSim-OPT computes the upper bound of SimRank for each vertex. Specifically, UBV initializes a hash map

H^{t}

for each level t to store the arrays

d p

. Starting from the bottom of u’s reverse reachable tree, it sets

d p_{v_{7}} (3) = U_{3} (v_{7}), d p_{v_{0}} (3) = U_{3} (v_{0}), d p_{v_{4}} (3) = U_{3} (v_{4})

, and

d p_{v_{1}} (3) = U_{3} (v_{1})

based on Equation (11). After the first iteration

(t = l_{m a x})

, the result

H^{3}

is shown in Figure 5. When

t = 2

, UBV first sets

d p_{v_{4}} = U_{2} (v_{4})

,

d p_{v_{1}} = U_{2} (v_{1})

, and

d p_{v_{3}} = U_{2} (v_{3})

. Then, UBV computes

d p

for all the out-neighbors of each

t p r \in H^{3}

. After this iteration,

H^{2}

is obtained. Similarly, when

t = 1

and 0,

H^{1}

and

H^{0}

can be obtained. Finally,UBVcomputes

\sum_{i = 0}^{l_{m a x}} d p_{v} (i)

for each

v \in H^{0}

. The upper bounds of all vertices are

B_{v_{2}} = 0.3281, B_{v_{3}} = 0.3281, B_{v_{4}} = 0.3281, B_{v_{5}} = 0.2451, B_{v_{6}} = 0.2191, B_{v_{1}} = 0.0521, B_{v_{7}} = 0.0521

.

5.2. Maintaining the Top-k Results

To obtain the top-k results, the algorithm requires computing SimRank for vertices whose upper bounds of SimRank are larger than the current minimum SimRank

{\bar{s}}_{m i n}

. However, the computation of SimRank involves a large number of iterations for generating

\sqrt{c}

-walks to obtain an average value. Ref. [13] defines the minimum number of iterations, denoted as

n_{r}

, that guarantees an error less than

ε

with probability at least

1 - δ

. However, when

n_{r}

increases, the number of iterations for generating

\sqrt{c}

-walks also increases. This can lead to significant time consumption when computing SimRank for vertices with negligible SimRank. Here, we introduce a method to avoid the redundant walk generation for such vertices. Based on Lemma 3, at the xth iteration, we optimistically assume that each of the remaining

n_{r} - x

iterations yields the upper bound

B_{v}

. If the resulting average over all

n_{r}

iterations is still not larger than the current minimum SimRank

{\bar{s}}_{m i n}

, we can safely skip the remaining

n_{r} - x

iterations and prune this vertex. This reduces the computational overhead.

Lemma 3.

During the computation of SimRank for vertex v, at the xth iteration for generating a

\sqrt{c}

-walk, if

\frac{\sum_{i = 1}^{x} {\bar{s}}_{i} (u, v) + (n_{r} - x) * B_{v}}{n_{r}} \leq {\bar{s}}_{m i n}

(12)

then the remaining

n_{r} - x

iterations can be safely skipped.

Proof of Lemma 3.

Since

B_{v}

is the upper bound of the SimRank of v, for any yth trial, we have

{\bar{s}}_{y} (u, v) \leq B_{v}

(13)

Furthermore,

\bar{s} (u, v) = \frac{\sum_{i = 1}^{n_{r}} {\bar{s}}_{i} (u, v)}{n_{r}}

(14)

Then, we have

\bar{s} (u, v) = \frac{\sum_{i = 1}^{x} {\bar{s}}_{i} (u, v) + \sum_{j = x + 1}^{n_{r}} {\bar{s}}_{j} (u, v)}{n_{r}} \leq \frac{\sum_{i = 1}^{x} {\bar{s}}_{i} (u, v) + (n_{r} - x) * B_{v}}{n_{r}}

(15)

Based on Equation (15), if Equation (12) is satisfied, it implies that

\bar{s} (u, v)

is no larger than

{\bar{s}}_{m i n}

. Therefore, the remaining

n_{r} - x

trials can be safely skipped. □
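The early-termination test of Lemma 3 can be sketched as follows. This is an illustrative Python sketch, not the authors' code: `run_trial` is a hypothetical callable that generates one $\sqrt{c}$ -walk and returns the per-trial estimate ${\bar{s}}_{i} (u, v)$, which is assumed to satisfy Equation (13); in the usage example it simply replays fixed numbers.

```python
def estimate_or_prune(run_trial, n_r, b_v, s_min):
    """Average n_r trials, stopping as soon as Equation (12) holds.

    Returns (estimate, trials_used); estimate is None when the vertex
    is pruned.
    """
    partial = 0.0
    for x in range(1, n_r + 1):
        partial += run_trial()
        # Optimistic average: every remaining trial is assumed to reach b_v.
        if (partial + (n_r - x) * b_v) / n_r <= s_min:
            return None, x
    return partial / n_r, n_r

# Replay fixed per-trial values instead of generating real walks.
trials = iter([0.0, 0.0, 0.0, 0.0])
print(estimate_or_prune(trials.__next__, n_r=4, b_v=0.1, s_min=0.074))
```

With these made-up values, the optimistic average after the first trial is 0.075, which still exceeds 0.074, but after the second trial it drops to 0.05, so the remaining two trials are skipped.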

The complete pseudo-code of HitSim-OPT is illustrated in Algorithm 4. It first invokes the revReach algorithm (from CrashSim [13]) to construct the reverse reachable tree of source u and return a matrix U (line 1). Subsequently, it invokes the UBV algorithm to compute the upper bound

B_{v}

for each vertex v (line 2).

Algorithm 4: HitSim-OPT

In step 3, it maintains a priority queue Q of size k to store the top-k results (lines 4 to 22). Specifically, it determines whether to prune v to avoid generating

\sqrt{c}

-walks for v by comparing

B_{v}

with the top element

{\bar{s}}_{m i n}

in Q. If

B_{v} \leq {\bar{s}}_{m i n}

, v can be pruned (lines 6 to 8). If v is not pruned, the algorithm generates a

\sqrt{c}

-walk for

n_{r}

trials to compute

\bar{s} (u, v)

for v (lines 9 to 20). During the

n_{r}

trials, if the current

\sum_{i = 1}^{x} {\bar{s}}_{i} (u, v)

satisfies Lemma 3, the generation process breaks (lines 15 to 17).
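The queue maintenance described above can be sketched with a min-heap. The sketch below is a simplified Python illustration under assumptions: vertices are visited in the order in which their bounds are listed in Example 4 (the processing order is not specified here), the hypothetical callable `estimate` stands in for the $n_{r}$ trials, and the values of Table 2 are used in place of real estimates.

```python
import heapq

def top_k_with_pruning(bounds, estimate, k):
    """Keep the k largest scores; skip vertices whose bound is <= s_min."""
    heap = []    # min-heap of (score, vertex); heap[0][0] is s_min once full
    pruned = []
    for v, b_v in bounds.items():
        if len(heap) == k and b_v <= heap[0][0]:
            pruned.append(v)          # no walks are generated for v
            continue
        score = estimate(v)
        if len(heap) < k:
            heapq.heappush(heap, (score, v))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, v))
    return sorted(heap, reverse=True), pruned

# Upper bounds from Example 4 and SimRank values from Table 2.
bounds = {"v2": 0.3281, "v3": 0.3281, "v4": 0.3281, "v5": 0.2451,
          "v6": 0.2191, "v1": 0.0521, "v7": 0.0521}
simrank = {"v1": 0.0064, "v2": 0.048, "v3": 0.132, "v4": 0.074,
           "v5": 0.04, "v6": 0.048, "v7": 0.0064}
result, pruned = top_k_with_pruning(bounds, simrank.__getitem__, k=2)
print(result, pruned)
```

For k = 2, the heap ends up holding $v_{3}$ and $v_{4}$, and both $v_{1}$ and $v_{7}$ are pruned because their bound of 0.0521 is below the minimum of 0.074.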

5.3. Analysis

The time and space complexity of HitSim-OPT can be analyzed in steps 1, 2, and 3.

Step 1 (computing the reverse reachable tree U). It is the same as step 1 of HitSim.
Step 2 (computing the upper bound of SimRank for each vertex). An array of length $l_{m a x}$ must be computed for every vertex, resulting in a time complexity of $O (m \cdot l_{m a x})$ and a space complexity of $O (n \cdot l_{m a x})$ .
Step 3 (maintaining the top-k results). It is the same as step 3 of HitSim.

Based on the above analysis, the total time complexity of HitSim-OPT is

O (m + m \cdot l_{m a x} + n \cdot l_{m a x} \cdot \frac{3 c}{ε^{2}} \log \frac{n}{δ} + n \log k)

, and the total space complexity is

O (2 n + k + n \cdot l_{m a x})

.

6. Experiments

6.1. Experimental Setup

We conduct extensive experiments to evaluate the performance of our algorithms. The algorithms evaluated in our experiments are summarized as follows:

MC [14];
CrashSim [13];
HitSim: Algorithm 2;
HitSim-OPT: Algorithm 4.

We conduct all the experiments on an Ubuntu machine with an Intel(R) Core(TM) i7-12700 CPU (2.10 GHz) and 64 GB of memory. All of the algorithms are implemented in C++.

Datasets. We used six datasets to evaluate the performance of all the algorithms. The dataset soc-Epinions (http://konect.cc/networks/ (accessed on 20 September 2022)) represents the trust network from the online social network Epinions; email-EuAll (http://snap.stanford.edu/data/ (accessed on 1 December 2021)) is the email communication network of a large European institution; amazon (http://konect.cc/networks/ (accessed on 20 September 2022)) is the network of items on Amazon; wiki-topcats (http://snap.stanford.edu/data/ (accessed on 1 December 2021)) is a web graph of Wikipedia hyperlinks; soc-LiveJournal (http://snap.stanford.edu/data/ (accessed on 1 December 2021)) is a free online community with almost 10 million members; wikipedia-link-en (http://konect.cc/networks/ (accessed on 20 September 2022)) consists of the wikilinks of Wikipedia in the English language. These datasets represent relevant knowledge in various fields. Detailed statistics of these datasets are summarized in Table 3, where

| V |

,

| E |

, and

\bar{d}

denote the number of vertices, the number of edges, and the average degree, respectively.

Parameters. Consistent with previous studies [3,7,13,17], we set the decay factor c to 0.6 and

l_{m a x}

to 5. To evaluate the performance under different error guarantees, we vary the parameter

ε

to achieve overall absolute error guarantees of 0.1, 0.05, 0.025, and 0.0125, while maintaining a failure probability of

δ = 0.01

.

Metrics. For evaluating the quality of results for each single-source and top-k query from the source vertex u, we employ two metrics:

A E @ k

and

P r e c i s i o n @ k

. The

A E @ k

metric, representing the average absolute error, is computed as

A E @ k = \frac{1}{k} \sum_{i = 1}^{k} ∣ s (u, v_{i}) - \bar{s} (u, v_{i}) ∣

, where

s (u, v)

is the true SimRank score between vertices u and v, and

\bar{s} (u, v)

is the SimRank estimation. The

P r e c i s i o n @ k

metric measures the proportion of correctly identified results among the top-k results. It is computed as

P r e c i s i o n @ k = ∣ V_{k} \cap V_{k}^{'} ∣ / k

, where

V_{k}

represents the list of top-k vertices returned by the algorithm being evaluated, and

V_{k}^{'}

represents the ground-truth top-k results.
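Both metrics follow directly from their definitions. The following Python sketch shows one way to compute them; the function names and the small dictionaries of scores are made up for illustration, and the returned list is assumed to contain exactly k vertices.

```python
def ae_at_k(returned, estimates, truth):
    """Average absolute error over the k returned vertices."""
    return sum(abs(truth[v] - estimates[v]) for v in returned) / len(returned)

def precision_at_k(returned, ground_truth_top_k):
    """Fraction of returned vertices that are in the ground-truth top-k."""
    return len(set(returned) & set(ground_truth_top_k)) / len(returned)

# Made-up scores for a query with k = 2.
truth = {"a": 0.5, "b": 0.3, "c": 0.1}
estimates = {"a": 0.4, "b": 0.35}
print(ae_at_k(["a", "b"], estimates, truth))
print(precision_at_k(["a", "b"], ["a", "c"]))
```

In this toy query, the absolute errors are 0.1 and 0.05, giving an AE@2 of about 0.075, and one of the two returned vertices is correct, giving a Precision@2 of 0.5.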

6.2. Performance of Algorithms

Absolute error in querying (AE@50). In this experiment, we set k to 50 and vary

ε

from 0.1 to 0.05, 0.025, and 0.0125. Figure 6 illustrates the trade-offs between

A E @ 50

and the query time of each algorithm. We can observe that as

ε

decreases the value of

A E @ 50

decreases. However, the query time increases for each algorithm. The curves in the graph show a near-linear relationship, suggesting that the algorithms achieve a faster query time when

ε

is large.

Additionally, when comparing algorithms with the same parameters, both HitSim and HitSim-OPT demonstrate similar average

A E

values compared to CrashSim, while requiring less query time. This can be attributed to the fact that CrashSim computes SimRank for all vertices to obtain the top-k results, whereas HitSim only computes SimRank for vertices within sets whose upper bounds of SimRank exceed

{\bar{s}}_{m i n}

, and HitSim-OPT computes SimRank only for vertices with larger upper bounds.

It is worth noting that the results of MC may be unstable when

n_{r}

(the number of iterations) is not sufficiently large, as MC requires generating a large number of

\sqrt{c}

-walks for each vertex v to determine if it meets source u.

Precision in querying (Precision@50). In this experiment, we fix the value of k to 50 and vary the parameter

ε

from 0.1 to 0.05, 0.025, and 0.0125. The trade-off between

P r e c i s i o n @ 50

and query time is illustrated in Figure 7. We observe that, as

ε

decreases, both

P r e c i s i o n @ 50

and the query time increase. When

ε = 0.0125

, all algorithms achieve a precision of nearly 1. The curves in the graph show a near-linear relationship, indicating that algorithms achieve a faster query time when

ε

is large.

Moreover, when comparing algorithms with the same parameters, both HitSim and HitSim-OPT demonstrate similar precision values compared to CrashSim, while requiring less query time. The reason is the same as in the previous experiment. The results of MC may be unstable when

n_{r}

(the number of iterations) is not sufficiently large.

Running time. In this experiment, we compare the running time of querying the top-250 results using the same parameters for each algorithm. Figure 8 shows the running time of each algorithm. The results demonstrate that on all datasets HitSim and HitSim-OPT exhibit faster performance compared to MC and CrashSim. Specifically, HitSim is approximately 7 times faster than MC on average and 3 times faster than CrashSim. Similarly, HitSim-OPT is approximately 30 times faster than MC on average and 11 times faster than CrashSim.

The main reason is that CrashSim processes all vertices, whereas HitSim only processes the vertices within sets whose upper bounds of SimRank exceed

{\bar{s}}_{m i n}

, and HitSim-OPT processes only vertices with larger upper bounds. As mentioned in the previous analysis, the running time of generating

\sqrt{c}

-walks for vertices dominates the overall performance. Therefore, processing fewer vertices results in less time cost for HitSim and HitSim-OPT compared to MC and CrashSim.

To validate this explanation, we also test the number of vertices processed by each algorithm, as shown in Table 4. The results confirm that MC and CrashSim process all the vertices, while HitSim and HitSim-OPT process significantly fewer vertices due to their pruning methods. Additionally, the number of vertices processed by MC or CrashSim is approximately 4 times more than HitSim on average and 16 times more than HitSim-OPT. The results demonstrate that HitSim and HitSim-OPT can efficiently answer single-source and top-k queries.

Performance of HitSim and HitSim-OPT. In this experiment, we compare the performance of the preprocessing and query steps of HitSim and HitSim-OPT under the parameters

k = 250

and

ε = 0.025

. As described in Section 4, HitSim and HitSim-OPT return the top-k results in three steps. We consider the first two steps as preprocessing and the third step as the query phase.

Figure 9 illustrates the performance comparison between the preprocessing and query steps of HitSim and HitSim-OPT on all datasets. The results indicate that HitSim-OPT outperforms HitSim. Notably, the preprocessing step of HitSim-OPT is slower than that of HitSim. This is attributed to the fact that HitSim-OPT requires constructing

d p

arrays of length

l_{m a x}

for all vertices on the reverse reachable tree to obtain their SimRank upper bounds, whereas HitSim only requires visiting all vertices on the reverse reachable tree. The query performance of HitSim-OPT demonstrates a significant improvement over HitSim. This enhancement can be attributed to the finer-grained pruning rules of HitSim-OPT, allowing it to achieve efficiency by processing fewer vertices. However, in large-scale dense graphs such as SL and WE, many vertices’ upper bounds may be quite similar, resulting in a large number of vertices meeting the processing condition (upper bound

B > {\bar{s}}_{m i n}

). Nonetheless, during the processing phase, HitSim-OPT leverages its efficient pruning rule (Lemma 3) to early-terminate

\sqrt{c}

-walks, which is the most time-consuming operation. As a result, many vertices do not need to complete all the

n_{r}

iterations of

\sqrt{c}

-walks, thereby reducing the computational overhead. For instance, on SL, only 597 vertices complete all

n_{r}

iterations of

\sqrt{c}

-walks, while 3,841,961 − 597 = 3,841,364 vertices execute 1 to

n_{r} - 1

iterations of

\sqrt{c}

-walks. Similarly, on WE, only 849 vertices complete all

n_{r}

iterations of

\sqrt{c}

-walks, while 5,147,351 − 849 = 5,146,502 vertices execute 1 to

n_{r} - 1

iterations of

\sqrt{c}

-walks.

Scalability. In this experiment, we test the scalability of HitSim and HitSim-OPT. We generate four subgraphs by randomly sampling 20–80% of the edges from the WE dataset. We test the running time of HitSim and HitSim-OPT on WE with fixed parameters

c = 0.6, ε = 0.025

, and

δ = 0.1

. Figure 10 shows the results of the scalability test. We can see that both HitSim and HitSim-OPT show near-linear scalability as the number of edges

∣ E ∣

increases from 20% to 100%, and as the value of k ranges from 20 to 2000. Additionally, we note that as k increases, the running time of both HitSim and HitSim-OPT also increases. This is because the batch-pruning strategy is triggered later as k increases.
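The subgraphs used in this test can be produced by sampling the edge list. The sketch below is one possible way to do so in Python; it assumes uniform sampling without replacement and a fixed seed, neither of which is specified above, and the function name `sample_edges` and the toy edge list are hypothetical.

```python
import random

def sample_edges(edges, fraction, seed=0):
    """Return a uniformly sampled subset containing `fraction` of the edges."""
    rng = random.Random(seed)            # fixed seed for repeatable subgraphs
    size = round(len(edges) * fraction)
    return rng.sample(edges, size)

# Toy edge list standing in for a real dataset.
edges = [(i, i + 1) for i in range(10)]
subgraphs = {f: sample_edges(edges, f) for f in (0.2, 0.4, 0.6, 0.8)}
print({f: len(sub) for f, sub in subgraphs.items()})
```

Sampling edges rather than vertices keeps the vertex set potentially unchanged while lowering the average degree, so the running time is measured against the edge count.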

7. Conclusions

In this paper, we study the single-source and top-k SimRank query problem. We first propose an efficient algorithm called HitSim that utilizes a branch-and-bound strategy. HitSim partitions vertices into distinct sets based on their shortest-meeting lengths to the source vertex. Subsequently, it computes the upper bound of SimRank for each set. By batch-pruning vertices within the same set whose upper bound is less than the minimum SimRank of the current results, HitSim significantly enhances computational efficiency. Furthermore, we propose an optimized algorithm called HitSim-OPT, which employs a fine-grained pruning strategy. HitSim-OPT computes the upper bound of SimRank for each vertex, thereby improving pruning efficiency. Our experimental results on six real-world datasets demonstrate that, while maintaining comparable precision and absolute error to CrashSim, HitSim achieves an average speedup of 3 times compared to CrashSim, and HitSim-OPT achieves an average speedup of 11 times.

Author Contributions

Conceptualization, J.B. and J.Z.; methodology, J.B. and M.M.; software, M.M. and S.C.; validation, J.B. and J.Z.; formal analysis, M.D. and M.M.; investigation, J.B., M.M. and S.C.; data curation, J.B.; writing—original draft preparation, M.M. and J.B.; writing—review and editing, J.B. and J.Z.; supervision, J.Z., M.D. and Z.C.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by grants from the Natural Science Foundation of China (No.: 62372101, 61873337, 62272097).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the first author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Jin, R.; Lee, V.E.; Hong, H. Axiomatic ranking of network role similarity. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 21–24 August 2011; pp. 922–930.
2. Liben-Nowell, D.; Kleinberg, J.M. The link-prediction problem for social networks. J. Assoc. Inf. Sci. Technol. 2007, 58, 1019–1031.
3. Antonellis, I.; Garcia-Molina, H.; Chang, C. Simrank++: Query rewriting through link analysis of the click graph. Proc. VLDB Endow. 2008, 1, 408–421.
4. Spirin, N.; Han, J. Survey on web spam detection: Principles and algorithms. SIGKDD Explor. 2011, 13, 50–64.
5. Rothe, S.; Schütze, H. CoSimRank: A Flexible & Efficient Graph-Theoretic Similarity Measure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, Baltimore, MD, USA, 22–27 June 2014; Volume 1: Long Papers, pp. 1392–1402.
6. Jeh, G.; Widom, J. SimRank: A measure of structural-context similarity. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; pp. 538–543.
7. Lizorkin, D.; Velikhov, P.E.; Grinev, M.N.; Turdakov, D. Accuracy estimate and optimization techniques for SimRank computation. VLDB J. 2010, 19, 45–66.
8. Tao, W.; Yu, M.; Li, G. Efficient Top-K SimRank-based Similarity Join. Proc. VLDB Endow. 2014, 8, 317–328.
9. Fogaras, D.; Rácz, B. Scaling link-based similarity search. In Proceedings of the 14th International Conference on World Wide Web, WWW 2005, Chiba, Japan, 10–14 May 2005; pp. 641–650.
10. Lee, P.; Lakshmanan, L.V.S.; Yu, J.X. On Top-k Structural Similarity Search. In Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE 2012), Washington, DC, USA, 1–5 April 2012; pp. 774–785.
11. Tian, B.; Xiao, X. SLING: A Near-Optimal Index Structure for SimRank. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, 26 June–1 July 2016; pp. 1859–1874.
12. Liu, Y.; Zheng, B.; He, X.; Wei, Z.; Xiao, X.; Zheng, K.; Lu, J. ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs. Proc. VLDB Endow. 2017, 11, 14–26.
13. Li, M.; Choudhury, F.M.; Borovica-Gajic, R.; Wang, Z.; Xin, J.; Li, J. CrashSim: An Efficient Algorithm for Computing SimRank over Static and Temporal Graphs. In Proceedings of the 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, 20–24 April 2020; pp. 1141–1152.
14. Liu, Y.; Zou, L.; Ge, Q.; Wei, Z. SimTab: Accuracy-Guaranteed SimRank Queries through Tighter Confidence Bounds and Multi-Armed Bandits. Proc. VLDB Endow. 2020, 13, 2202–2214.
15. Shao, Y.; Cui, B.; Chen, L.; Liu, M.; Xie, X. An Efficient Similarity Search Framework for SimRank over Large Dynamic Graphs. Proc. VLDB Endow. 2015, 8, 838–849.
16. Jiang, M.; Fu, A.W.; Wong, R.C.; Wang, K. READS: A Random Walk Approach for Efficient and Accurate Dynamic SimRank. Proc. VLDB Endow. 2017, 10, 937–948.
17. Wei, Z.; He, X.; Xiao, X.; Wang, S.; Liu, Y.; Du, X.; Wen, J. PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1042–1059.

Figure 1. Directed graph

G_{1}

.

Figure 2. Directed graph

G_{2}

.

Figure 3. Reverse reachable tree of

v_{0}

.

Figure 4. Reverse reachable tree of

v_{7}

.

Figure 5. Array

d p

of each vertex.

Figure 6. The performance of

A E @ 50

.

Figure 7. The performance of

P r e c i s i o n @ 50

.

Figure 8. Running time of top-250 results query under the same parameters.

Figure 9. Preprocessing and query of HitSim and HitSim-OPT.

Figure 10. Scalability tests of HitSim and HitSim-OPT by varying

∣ E ∣

from 20% to 100% and k from 20 to 2000 on WE.

Table 1. Summary of notation.

Notation	Description
G	Directed graph
$V, E$	The vertex and edge sets of G
$n, m$	The number of vertices and edges in G, $n = | V |$ , $m = | E |$
$I (u), O u t (u)$	Sets of in-neighbors and out-neighbors of vertex u
$l_{m a x}$	Limited length of $\sqrt{c}$ -walk
c	Decay factor in the definition of SimRank
$s (u, v)$	SimRank of two vertices u and v
$\bar{s} (u, v)$	A SimRank estimation of $s (u, v)$
${\bar{s}}_{m i n}$	The minimum value of SimRank of current results
$B_{v}, B_{M}$	The upper bound of SimRank of vertex v, set M
$W (u)$	A reverse $\sqrt{c}$ -walk starting from u
$ε$	The difference between $\bar{s} (u, v)$ and $s (u, v)$
$δ$	The failure probability

Table 2. SimRank with respect to

v_{0}

.

Vertex	$v_{0}$	$v_{1}$	$v_{2}$	$v_{3}$	$v_{4}$	$v_{5}$	$v_{6}$	$v_{7}$
$s (v_{0}, *)$	1.0	0.0064	0.048	0.132	0.074	0.04	0.048	0.0064

Table 3. Statistics of datasets.

Dataset	$| V |$	$| E |$	$\bar{d}$
`soc-Epinions`(`SE`)	75,879	508,837	6.71
`email-EuAll`(`EE`)	265,214	420,045	1.58
`amazon`(`AZ`)	403,394	3,387,388	8.40
`wiki-topcats`(`WC`)	1,791,489	28,511,087	15.92
`soc-LiveJournal`(`SL`)	4,847,571	68,993,773	14.23
`wikipedia-link-en`(`WE`)	13,593,032	437,217,424	32.16

Table 4. The number of vertices processed. The column total represents the total number of vertices processed by HitSim-OPT, while the column

n_{r}

iteration represents the number of vertices that completed all

n_{r}

iterations of

\sqrt{c}

-walks.

Dataset	MC	CrashSim	HitSim	HitSim-OPT (Total)	HitSim-OPT ($n_{r}$ Iteration)
`soc-Epinions` (`SE`)	75,879	75,879	41,659	21,847	833
`email-EuAll` (`EE`)	265,214	265,214	32,504	18,919	695
`amazon` (`AZ`)	403,394	403,394	59,484	5434	423
`wiki-topcats` (`WC`)	1,791,489	1,791,489	1,290,877	1,057,051	872
`soc-LiveJournal` (`SL`)	4,847,571	4,847,571	1,863,063	3,841,961	597
`wikipedia-link-en` (`WE`)	13,593,032	13,593,032	2,683,698	5,147,351	849

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
