Efficient Reachability Ratio Computation for 2-Hop Labeling Scheme

Xian Tang; Junfeng Zhou; Yunyu Shi; Xiang Liu; Lihong Kong

doi:10.3390/electronics12051178

¹ School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

² School of Computer Science and Technology, Donghua University, Shanghai 201620, China

* Author to whom correspondence should be addressed.

Electronics 2023, 12(5), 1178; https://doi.org/10.3390/electronics12051178

This article belongs to the Special Issue Intelligent Analysis and Security Calculation of Multisource Data

Abstract

Reachability query processing has been extensively studied during the past decades. Many approaches have followed the line of designing 2-hop labels to accelerate query answering. Because the index size of 2-hop labels cannot be bounded, researchers have proposed using a subset of the nodes to construct partial 2-hop labels (p2HLs) that cover as much reachability information as possible. Using p2HLs, better query performance could be achieved with a limited index size and index construction time. However, the adoption of p2HLs was based on intuition, and the number of nodes used to generate p2HLs was fixed in advance blindly, without knowing its applicability. In this paper, we focused on the problem of efficiently computing a reachability ratio (RR) in order to obtain RR-aware p2HLs. Here, RR denotes the ratio of the number of reachable queries that can be answered by p2HLs to the total number of reachable queries in a given graph. Based on the RR, users can determine whether p2HLs should be used to answer reachability queries on a given graph and how many nodes should be chosen to generate p2HLs. We discussed the difficulties of RR computation and proposed an incremental-partition algorithm for RR computation. Our extensive experimental results showed that our algorithm could efficiently obtain the RR and the overall effects of different p2HLs on query performance. Based on the experimental results, we provide our findings on the use of p2HLs for processing reachability queries on a given graph.

Keywords:

processing reachability queries; 2-hop labeling scheme; reachability ratio

1. Introduction

Reachability query processing is a fundamental graph operation that has been extensively studied in the literature. Given a directed graph, a reachability query

u ? ⇝ v

inquires whether there exists a directed path from node u to v. It could be used for the Semantic Web, online social networks, biological networks, ontology, transportation networks, etc., to answer whether two nodes have a certain connection. It could also be used as a building block for answering structured queries, such as XQuery (https://www.w3.org/TR/2017/REC-xquery-31-20170321, accessed on 1 January 2020) or SPARQL (https://www.w3.org/TR/rdf-sparql-query, accessed on 1 January 2020). For applications where answering reachability queries is intensively involved, any substantial progress in query time could significantly affect the performance of all applications using it. Over the past decades, researchers have proposed many efficient labeling schemes to facilitate processing reachability queries. According to [1,2], the existing approaches could be classified into two categories: Label-Only and Label+G. Label-Only means that the index conveys the complete reachability information, and the given query

u ? ⇝ v

could be answered by comparing the labels of u and v, without graph traversal. Label+G means that the index covers partial reachability information, and we may need to perform graph traversal to answer a query.

For Label-Only, the state-of-the-art approaches in [3,4,5] were based on 2-hop labeling schemes [6], and they generated 2-hop labels based on all the nodes to maintain the whole transitive closure (TC). The problem with Label-Only approaches has been that the index size cannot be bounded with respect to the size of the input graph and that minimizing the size of the 2-hop labels is NP-hard [6], which makes index construction a difficult task in practice for dense graphs. For example, it was shown in [2] that when the number of nodes was 10 million and the average degree of the input graph was greater than 6, the state-of-the-art Label-Only approaches, such as TF [5], DL [3], and TOL [4], could not construct the index successfully because they exceeded the memory limit. Compared with Label-Only, the main advantage of Label+G approaches has been that the index could be easily constructed, which meant that Label+G approaches were indispensable when Label-Only approaches were not appropriate for the given graph. Because the index could not cover all the reachability information, the Label+G approaches usually used two types of labels for pruning in order to terminate the graph traversal early. The first was No-Label, which was used to determine whether a query was an unreachable query. The second was Yes-Label, which was used to determine whether a query was a reachable query, including the tree interval [7,8,9] and p2HLs, which were constructed based on a portion of, rather than all of, the nodes. In this paper, we referred to nodes that were used to construct p2HLs as hop-nodes. It was shown in [2] that the state-of-the-art No-Label could correctly answer more than 95% of unreachable queries. However, this was not enough. Although a workload with completely random queries would be heavily skewed towards unreachable queries [9,10], in real scenarios, it would be highly unlikely that most queries would be unreachable, as the node pair in a query would tend to have a certain connection [11]. For Label+G approaches, therefore, processing reachable queries has been regarded as a worst-case scenario due to the need for graph traversal [7], and the query performance has been dominated by the pruning power of Yes-Label when the number of reachable queries increased.

The pruning power of Yes-Label depends on the reachability ratio (RR), i.e., the ratio of the number of reachable queries that can be answered by Yes-Label to the size of the TC. The higher the RR, the larger the probability that a given reachable query can be answered using Yes-Label without graph traversal. Our statistics showed that even using five randomly generated tree intervals, as performed in [10], the RR was less than 10% for most graphs, resulting in poor query performance when the number of reachable queries increased [7]. As a comparison, it was shown in [7] that p2HLs could improve the query performance significantly for many graphs. For example, Figure 1 shows the RR of p2HLs on three graphs, from which we found that even if we constructed p2HLs using 1 hop-node, the RR was greater than 95% on web-uk, meaning the probability that a given reachable query q could be answered by p2HLs without graph traversal was greater than 95%. In practice, however, the RR of p2HLs varied across graphs, and the query performance was not always as efficient as expected [7]. As shown in Figure 1, the RR was close to 0 on patent, meaning the probability that q could be answered by p2HLs was close to 0. In this case, using p2HLs resulted in only additional costs.

Figure 1. Reachability ratio (RR) of p2HLs based on k nodes with a large degree.

Considering that Label+G approaches are indispensable and that their query performance is mainly affected by the pruning power of the most important Yes-Label, i.e., p2HLs, the key problem that needed to be solved was the following: whether we should use p2HLs for a given graph. The answer depended on the RR. For example, given the RR in Figure 1, we could decide to use p2HLs on web-uk but not on patent, as by using the same number of hop-nodes, the RR would be greater than 95% on web-uk and close to 0 on patent. Furthermore, once the RR was determined, we could further determine how many hop-nodes should be chosen to construct the p2HLs. For example, according to Figure 1, we could decide to use four hop-nodes for human, but for web-uk, one hop-node would be sufficient due to its larger RR and smaller index size.

To the best of our knowledge, this was the first work to address RR-aware p2HLs. Computing the RR for given p2HLs was not a simple task, as it involved two operations. One was computing the size of the TC, and the other was computing the exact number of reachable queries that could be answered by p2HLs, which we referred to as the coverage size. Considering that the TC size computation could be efficiently solved by the existing method in [12], the difficulty of RR computation lay in efficiently computing the coverage size. The naive way was to first generate p2HLs based on the selected k hop-nodes and then obtain the coverage size by checking all the node pairs using p2HLs. In this way, the cost of coverage size computation was

O (k | V |^{2})

and could not be scaled to large graphs, where V was the set of nodes in the input graph. Furthermore, if the RR was too small to fulfill the requirement, we may need to increase the value of k and repeat the above operation, which would make the RR computation even more expensive.

We proposed the computation of the coverage size incrementally, so that when the value of k changed, we could avoid the costly coverage size re-computation to support a more efficient RR computation. The basic concept was, given the coverage size with respect to the k nodes, when we computed the coverage size with respect to

k + 1

hop-nodes, we would not compute it from scratch but only compute the increased coverage size. However, the increased coverage size could not be easily computed. To obtain the increased coverage size with respect to the

(k + 1)

th hop-node u, we had to firstly traverse from u to obtain a set of nodes

D_{u}

that u could reach, then traverse from u in reverse to obtain the second set of nodes

A_{u}

that could reach u. Given

A_{u}

and

D_{u}

, we needed to determine for each pair of nodes

(a, d)

, whether a could reach d and whether that could be determined by the current p2HLs without u, where

a \in A_{u}, d \in D_{u}

. If so, the query from a to d had already been covered by the p2HLs without u and, therefore, should not be counted when computing the increased coverage size with respect to u. The cost of processing one hop-node u was as high as

O (k | A_{u} | | D_{u} |)

. Obviously, with an increase in the number of hop-nodes for p2HLs construction, the cost could become unaffordable. For this problem, we proposed dividing both

A_{u}

and

D_{u}

into a set of disjoint subsets based on an equivalence relation (as defined later), so that for each pair of subsets

A_{1} \subseteq A_{u}

and

D_{1} \subseteq D_{u}

, we only needed to test one reachability query, rather than

| A_{1} | \times | D_{1} |

queries. The cost of the RR computation was, therefore, reduced significantly even when processing large graphs. We made the following contributions.

To the best of our knowledge, this was the first work to focus on RR-aware p2HLs.
We proposed a set of algorithms for RR computation. We showed that according to the properties of 2-hop labels, the two sets of nodes that could reach, and be reached by, a certain hop-node could be divided into a set of disjoint subsets, so that the computation cost could be reduced significantly. We proved the correctness and efficiency of our approach.
We conducted rich experiments on real datasets. The experimental results showed that when compared with the baseline approach, our algorithm operated much more efficiently on RR computations. We also showed the overall query performance was affected by p2HLs with different numbers of hop-nodes, based on which we provided our findings as to whether and how to use p2HLs for a given graph for processing reachability queries.

The remainder of the paper is organized as follows. We discuss the preliminaries and the related work in Section 2. In Section 3, we provide the baseline algorithm for RR computations and propose the first incremental algorithm in Section 4. After that, we propose the optimized incremental algorithm in Section 5. We report our experimental results in Section 6 and conclude our paper in Section 7.

2. Background and Related Work

2.1. Preliminaries

Given a directed graph

\mathcal{G}

, we constructed a directed acyclic graph (DAG) G from

\mathcal{G}

in linear time [13] by coalescing each strongly connected component (SCC) of

\mathcal{G}

into a node in G. Then, the reachability query on

\mathcal{G}

could be answered equivalently on G. We followed the convention and assumed that the input graph was a DAG

G = (V, E)

, where V is the set of nodes and E the set of edges. We defined

in(u) = {v | (v, u) \in E}

as the set of in-neighbors of u in G and

out(u) = {v | (u, v) \in E}

, the set of out-neighbors of u. We used

u ⇝ v

to denote that node u could reach node v in G. The transitive closure (TC) of node u was denoted as

T C (u)

, which was the set of all nodes that u could reach. The TC size of G was

| T C (G) | = \sum_{u \in V} | T C (u) |

. We used

T C^{- 1} (u)

to denote the set of all nodes that could reach u.
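
The definitions above can be made concrete with a minimal Python sketch. The six-node DAG `EDGES` and the helper names are hypothetical, and the sketch assumes that TC(u) excludes u itself, which the text does not state explicitly.

```python
from collections import deque

# Hypothetical six-node DAG, given as node -> list of out-neighbors.
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [], 6: []}


def reverse_graph(adj):
    """Return node -> list of in-neighbors (every node must be a key of adj)."""
    radj = {v: [] for v in adj}
    for u, outs in adj.items():
        for v in outs:
            radj[v].append(u)
    return radj


def reachable_from(adj, u):
    """Return the set of nodes reachable from u by BFS, excluding u (assumption)."""
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    seen.discard(u)
    return seen


REVERSED = reverse_graph(EDGES)
TC = {u: reachable_from(EDGES, u) for u in EDGES}          # TC(u)
TC_INV = {u: reachable_from(REVERSED, u) for u in EDGES}   # TC^{-1}(u)
TC_SIZE = sum(len(s) for s in TC.values())                 # |TC(G)|
```

Running one BFS per node costs O(|V|(|V| + |E|)) in total, so this is only a reference for small graphs.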

Given a set of k nodes

S_{k} \subseteq V

, we used

L_{k}

to denote the 2-hop labels constructed based on the nodes of

S_{k}

, where each node in

S_{k}

was called a hop-node. If

v \in T C (u)

and

L_{k}

could correctly affirm that

u ⇝ v

, we say that

L_{k}

(or

S_{k}

) could cover the reachable query

u ⇝ v

. Let

N_{k}

be the number of distinct reachable queries that are covered by

L_{k}

. Then, the RR

α

with respect to

S_{k}

is defined as Equation (1).

α = N_{k} / | T C (G) |

(1)

Problem Statement:

Given a DAG

G = (V, E)

, its TC size

| T C (G) |

, and a hop-node set

S_{k} \subseteq V

, return the RR of

S_{k}

.

2.2. Related Work

As no existing works addressed the RR computation, we only discussed existing works on processing reachability queries and the TC size computation.

Label-Only:

The Label-Only algorithms [3,4,5,6,14] have attempted to compress TC in order to obtain a smaller index size and facilitate answering queries. Recent research includes TF [5], DL [3], PLL [14], and TOL [4]. The main concept was based on the 2-hop labeling [6], where each node u was assigned two labels: one was an in-label

L_{in}(u)

, and the other was an out-label

L_{out}(u)

. The label

L_{in}(u) (L_{out}(u))

consists of a set of nodes v that could reach (could be reached by) u. Based on this labeling scheme,

u ? ⇝ v

could be answered by computing the result of

L_{out}(u) \cap L_{in}(v)

. If

L_{out}(u) \cap L_{in}(v) \neq \emptyset

, then

u ⇝ v

; otherwise,

u \neg ⇝ v

.
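
As an illustration of the label test, the following Python sketch checks whether the out-label of u and the in-label of v share a node. The labels are hand-written for a hypothetical path a → h → d with h as the only labeled node, and the function name `covers` is an assumption.

```python
def covers(l_out, l_in, u, v):
    """True if the labels affirm that u reaches v, i.e., L_out(u) and L_in(v) intersect."""
    return not l_out.get(u, set()).isdisjoint(l_in.get(v, set()))


# Hypothetical labels for the path a -> h -> d; h also appears in its own labels.
L_OUT = {"a": {"h"}, "h": {"h"}}
L_IN = {"h": {"h"}, "d": {"h"}}

A_TO_D = covers(L_OUT, L_IN, "a", "d")   # affirmed through h
A_TO_H = covers(L_OUT, L_IN, "a", "h")   # affirmed because h is in L_in(h)
D_TO_A = covers(L_OUT, L_IN, "d", "a")   # no common node
```

With complete 2-hop labels, an empty intersection means that u cannot reach v. With partial labels, it only means that the labels cannot decide the query.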

Considering that minimizing the 2-hop label size would be NP-hard [6], Cohen et al. proposed a

(log | V |)

-approximate solution. However, the index construction cost was

O (| V | | E | \log (| V |^{2} / | E |))

, which made it difficult to scale to large graphs. Motivated by this, the subsequent works [3,4,5,14] discarded the approximation guarantee and focused on finding better ordering strategies to rank nodes in order to improve the efficiency of 2-hop label construction. In [5], TF folded the given DAG recursively based on the topological level (topo-level). Assuming that all nodes were sorted by certain ranking values, both DL [3] and PLL [14] adopted the same concept to compute 2-hop labels: in each iteration, they processed one node u with a forward (backward) breadth-first search (BFS) to add u to the in-label (out-label) of each node that u could reach (that could reach u). The TOL [4] framework summarized TF, DL, and PLL and then computed 2-hop labels based on a newly proposed total order.

Recently, researchers proposed computing 2-hop labels in parallel [15,16] to accelerate the construction of 2-hop labels. However, the index size still could not be bounded with respect to the size of the input graph.

Label+G:

The Label+G methods answered reachability queries using both Yes-Label and No-Label covering partial reachability information. The main advantage, when compared with Label-Only, was that the index size could be bounded. Recent Label+G approaches include those proposed in GRAIL [9,10], FERRARI-G [7], FELINE [8], IP

^{+}

[1] and BFL

^{+}

[2]. For these approaches, No-Label was used to test whether the given query was an unreachable one, including topo-order [7,8], topo-level [7,8,9], graph interval [7,9,10], i.e., intervals covering all reachable nodes, IP label [1], and Bloom filter label [2]. The Yes-Label was used to test whether the given query was a reachable query, including the tree interval [7,8,9] and p2HLs [7,17].

It was shown by the existing research that No-Label could prune most unreachable queries [2], though the performance of Yes-Label could fluctuate unpredictably [7,17]. Our statistics showed that the RR of the tree interval was usually less than 10%. As a comparison, the RR of p2HLs could be much higher even with very few hop-nodes, as shown by Figure 1. Therefore, adopting p2HLs could improve the query performance of the Label+G approaches significantly for some graphs [7,17]. However, for some other graphs, p2HLs may not work as efficiently as expected [7,17]. Therefore, when considering p2HLs for processing reachability queries, it was necessary to obtain the RR quickly in order to correctly determine whether we should use p2HLs and, furthermore, to decide the number of hop-nodes that should be chosen to construct p2HLs.

Other Reachability Approaches:

Ref. [18] proposed an index-free approach to process reachability queries on dynamic graphs with an approximate answer. Refs. [19,20] discussed processing label-constrained reachability queries on large graphs. Ref. [21] adopted the concept of 2-hop coverage and proposed an index-based approach to answer span-reachability queries in large temporal graphs.

TC-Size Computation:

Given a DAG

G = (V, E)

, a path

p = v_{0}, v_{1}, . . ., v_{s}

satisfies

\forall i \in [0, s - 1]

,

(v_{i}, v_{i + 1}) \in E

, where s denotes the length of p. The path decomposition of G is a partition

S_{p} = {p_{1}, p_{2}, . . ., p_{k}}

, satisfying

V = ⋃_{i \in [1, | S_{p} |]} p_{i}

and

\forall i, j \in [1, | S_{p} |], i \neq j, p_{i} ⋂ p_{j} = \emptyset

. PTR [22] proposed a linear greedy algorithm to obtain the path partition, based on which [12] proposed an efficient algorithm buTC

^{+}

for a TC size computation with time complexity

O (r | E |)

, where

r = | S_{p} |

. This approach processed nodes of each path in a bottom-up method to achieve high efficiency. It was inspired by an observation that for two nodes, u and v, in a path, if

(u, v) \in E

, then

T C (v) \subset T C (u)

. Therefore, by processing the nodes of a path in a bottom-up approach with the nodes marked carefully, the nodes in

T C (v)

did not need to be revisited when computing

| T C (u) |

.

In addition, [4,23,24] proposed estimating the approximate TC size for all nodes in linear time

O (| V | + | E |)

.

Note that the TC size computation was different from the TC computation. The former addressed the size of

T C (v)

, i.e.,

| T C (v) |

, while the latter returned all nodes in each

T C (v)

.
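
For checking |TC(G)| on small graphs, a simple reference computation is sketched below in Python. It is not buTC+ [12]: it materializes every TC(v) as a bitset while visiting nodes in reverse topological order, which is exactly the work that a dedicated TC size algorithm tries to avoid. The graph `EDGES` and the function name `tc_sizes` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical six-node DAG, given as node -> list of out-neighbors.
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [], 6: []}


def tc_sizes(adj):
    """Return {u: |TC(u)|} for a DAG, with TC(u) excluding u (assumption)."""
    # TopologicalSorter treats the mapped values as predecessors, so passing
    # out-neighbors makes static_order() emit each node after its out-neighbors.
    order = list(TopologicalSorter(adj).static_order())
    bit = {u: 1 << i for i, u in enumerate(order)}
    reach = {}
    for u in order:
        bits = 0
        for v in adj.get(u, ()):
            bits |= reach[v] | bit[v]   # v itself plus everything v reaches
        reach[u] = bits
    return {u: bin(b).count("1") for u, b in reach.items()}


SIZES = tc_sizes(EDGES)
TOTAL = sum(SIZES.values())   # |TC(G)|
```

The bitsets need O(|V|^2) bits in the worst case, so the sketch serves only as a correctness reference.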

3. The Baseline Algorithm

In this section, we first analyze the construction of 2-hop labels and then discuss the RR computation and the baseline algorithm.

Step-1: 2-hop Label Construction.

To construct 2-hop labels, we followed [3,14] and sorted all nodes by

(| in(u) | + 1) \times (| out(u) | + 1)

. The sorting result was

v_{1}, v_{2}, . . ., v_{| V |}

, where the first node has the largest rank value. We selected the first k nodes to obtain the hop-node set

S_{k} = {v_{1}, v_{2}, . . ., v_{k}}

. We had the following result: (Equations (2) and (3)).

\emptyset = S_{0} \subset S_{1} \subset S_{2} \subset \dots \subset S_{| V |} = V

(2)

S_{i} \ S_{i - 1} = {v_{i}}, where 0 < i \leq | V |

(3)

Given

S_{i}

, the p2HL

L_{i}

could be generated by processing

v_{i}

, based on

L_{i - 1}

, according to Equations (2) and (3). Specifically, we performed forward (backward) BFS from

v_{i}

to obtain a set

D_{i} (A_{i})

of nodes that

v_{i}

could reach (that could reach

v_{i}

), as shown in Figure 2a. For each node

a \in A_{i}

, we added

v_{i}

to a’s out-label, i.e.,

L_{o u t}^{i} (a) = L_{o u t}^{i - 1} (a) \cup {v_{i}}

, denoting that a could reach

v_{i}

. For each node

d \in D_{i}

, we added

v_{i}

to d’s in-label, i.e.,

L_{i n}^{i} (d) = L_{i n}^{i - 1} (d) \cup {v_{i}}

, denoting that

v_{i}

could reach d. After processing

v_{i}

, we obtained 2-hop labels

L_{i}

. The superscript i in

L_{o u t}^{i} (a) (L_{i n}^{i} (a))

denotes that both 2-hop labels

L_{o u t}^{i} (a)

and

L_{i n}^{i} (a)

, with respect to node a, are subsets of

S_{i}

. Here, we could use

L_{i - 1}

to reduce the size of both

A_{i}

and

D_{i}

. For example, in Figure 2c, when processing

v_{i}

, the backward BFS traversal from

v_{i}

could be terminated at

v_{i - 1}

, because

v_{i - 1}

reaching

v_{i}

could be determined by

L_{i - 1}

, and

\forall a \in T C^{- 1} (v_{i - 1})

reaching

v_{i}

through

v_{i - 1}

could also be answered by

L_{i - 1}

. Therefore, in practice,

A_{i} \subseteq T C^{- 1} (v_{i}) \land D_{i} \subseteq T C (v_{i})

.
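
Step-1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the six-node DAG `EDGES` is hypothetical, the rank value is assumed to be (|in(u)| + 1) × (|out(u)| + 1) with ties broken by node id, and the helper names are not from the paper. Each BFS tests nodes against the labels built from earlier hop-nodes and writes the new labels only afterwards.

```python
from collections import deque

# Hypothetical six-node DAG, given as node -> list of out-neighbors.
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [], 6: []}


def reverse_graph(adj):
    """Return node -> list of in-neighbors (every node must be a key of adj)."""
    radj = {v: [] for v in adj}
    for u, outs in adj.items():
        for v in outs:
            radj[v].append(u)
    return radj


def rank_nodes(adj, radj):
    """Sort nodes by decreasing (|in| + 1) * (|out| + 1); ties by node id (assumption)."""
    return sorted(adj, key=lambda v: (-(len(radj[v]) + 1) * (len(adj[v]) + 1), v))


def pruned_bfs(start, nbrs, is_covered):
    """BFS from start; a node already covered by earlier labels is skipped and not expanded."""
    kept, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        if is_covered(v):
            continue
        kept.append(v)
        for w in nbrs[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return kept


def build_p2hl(adj, k):
    """Build partial 2-hop labels from the k top-ranked nodes; also return (v_i, A_i, D_i)."""
    radj = reverse_graph(adj)
    l_in = {v: set() for v in adj}
    l_out = {v: set() for v in adj}
    steps = []
    for h in rank_nodes(adj, radj)[:k]:
        # Both searches read the labels of the previous hop-nodes only.
        d_i = pruned_bfs(h, adj, lambda v: not l_out[h].isdisjoint(l_in[v]))
        a_i = pruned_bfs(h, radj, lambda v: not l_out[v].isdisjoint(l_in[h]))
        for d in d_i:
            l_in[d].add(h)    # h can reach d
        for a in a_i:
            l_out[a].add(h)   # a can reach h
        steps.append((h, set(a_i), set(d_i)))
    return l_in, l_out, steps


L_IN, L_OUT, STEPS = build_p2hl(EDGES, 2)
```

On this graph, the second hop-node 2 stops its forward search at node 4, because the labels of the first hop-node 4 already affirm that 2 can reach 4.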

Figure 2. The relationship between different hop-nodes, where (a)

v_{i}

is the first node, (b)

v_{i - 1} \neg ⇝ v_{i}

and

v_{i} \neg ⇝ v_{i - 1}

and (c)

v_{i - 1} ⇝ v_{i}

.

Example 1.

Consider G in Figure 3. Assume that the sorting result is

v_{1}, v_{2}, v_{3}, . . ., v_{15}

. To obtain

L_{2}

, we first performed bidirectional BFS of

v_{1}

to obtain

A_{1} = {v_{1}, v_{4}, v_{6}, v_{11}}

and

D_{1} = {v_{1}, v_{2}, v_{7}, v_{9}, v_{10},

v_{13}, v_{15}}

. After that, we added 1 to the out-label of nodes in

A_{1}

and the in-label of nodes in

D_{1}

, in order to obtain

L_{1}

. The next processed node was

v_{2}

. Similarly, we obtained

A_{2} = {v_{2}, v_{3}, v_{5}, v_{12}}

and

D_{2} = {v_{2}, v_{10}, v_{13}, v_{15}}

. Note that all nodes in

A_{1}

could reach

v_{2}

, but some of them were not in

A_{2}

, because for nodes that were in

A_{1}

but not in

A_{2}

, whether they could reach

v_{2}

could be answered by

L_{1}

. Then, we added 2 to the out- and in-labels of nodes in

A_{2}

and

D_{2}

, as shown by Table 1.

Figure 3. A sample DAG G.

Table 1. The p2HLs

L_{1}, L_{2}, L_{3}

constructed based on

S_{1} = {v_{1}}, S_{2} = {v_{1}, v_{2}}

, and

S_{3} = {v_{1}, v_{2}, v_{3}}

, respectively.

Step-2: RR Computation.

Given

L_{k}

with respect to

S_{k}

, the baseline approach computed the RR, as follows. First, it obtained the set of nodes that could reach either one of the set of hop-nodes, as shown by Equation (4). Second, it obtained the set of nodes that could be reached by either one of the set of hop-nodes, as shown by Equation (5). It computed the number of reachable queries that could be answered by

L_{k}

, as shown by Equation (6). Finally, we returned the RR of

S_{k}

based on Equation (1).

A = ⋃_{i \in [1, k]} A_{i}

(4)

D = ⋃_{i \in [1, k]} D_{i}

(5)

N_{k} = | {(a, d) | a \in A, d \in D, a \neq d, L_{out}^{k} (a) \cap L_{in}^{k} (d) \neq \emptyset} |

(6)

Example 2.

Continuing Example 1. To compute the RR of

S_{2} = {v_{1}, v_{2}}

, we first computed

A = A_{1} \cup A_{2} = {v_{1}, v_{2}, v_{3}, v_{4}, v_{5}, v_{6}, v_{11}, v_{12}}

,

D = D_{1} \cup D_{2} = {v_{1}, v_{2}, v_{7}, v_{9}, v_{10}, v_{13}, v_{15}}

, according to Equations (4) and (5), respectively. Then, we reviewed each pair of nodes

a \in A

and

d \in D (a \neq d)

as to whether a reaching d could be answered by

L_{2}

. Furthermore, we computed the number of answered reachable queries according to Equation (6), which was 42 for G in Figure 3 and

L_{2}

in Table 1. Given

| T C (G) | = 70

, we knew that the RR of

S_{2}

was

42 / 70 = 60 %

.

The Algorithm:

The baseline algorithm (Algorithm 1) computes the RR in two steps. Step-1 (lines 1–18) first calls the buTC^{+} algorithm [12] to compute the TC size in line 1, and then constructs 2-hop labels

L_{k}

and obtains the two sets of nodes,

A

and

D

. Specifically, it first sorts all nodes in a certain order in line 3 and then selects k hop-nodes in line 4. In lines 5–16, it performs forward and backward BFS for each hop-node

v_{i}

to construct 2-hop labels

L_{i}

. During the process, only if the reachability relationship between

v_{i}

and the visited node v cannot be answered by 2-hop labels

L_{i - 1}

, then it adds

v_{i}

to v’s in-label (line 9) or out-label (line 14), and adds v to

D_{i}

(line 10) or

A_{i}

(line 15); otherwise, it terminates the process because the reachability relationship has already been covered by

L_{i - 1}

. In lines 17–18, it obtains the two sets

A

and

D

according to Equations (4) and (5). Step-2 (lines 19–21) computes the number of reachable queries covered by

L_{k}

according to Equation (6). Finally, it returns the RR in line 22.

Algorithm 1: blRR(G, k)

1  compute | T C (G) | by the buTC^{+} algorithm [12]
2  N_{k} \leftarrow 0, S_{k} \leftarrow \emptyset, A \leftarrow \emptyset, D \leftarrow \emptyset
3  rank all nodes v in G by (| in(v) | + 1) \times (| out(v) | + 1)
4  put the first k nodes into S_{k} as hop-nodes
5  foreach (v_{i} \in S_{k}) do
6      A_{i} \leftarrow \emptyset; D_{i} \leftarrow \emptyset
7      perform forward BFS from v_{i}, and for each visited v
8          if (L_{out}^{i - 1} (v_{i}) \cap L_{in}^{i - 1} (v) = \emptyset) then /* L_{i - 1} */
9              L_{in}^{i} (v) \leftarrow L_{in}^{i - 1} (v) \cup {v_{i}} /* compute L_{i} */
10             D_{i} \leftarrow D_{i} \cup {v}
11         else stop expansion from v
12     perform backward BFS from v_{i}, and for each visited v
13         if (L_{out}^{i - 1} (v) \cap L_{in}^{i - 1} (v_{i}) = \emptyset) then /* L_{i - 1} */
14             L_{out}^{i} (v) \leftarrow L_{out}^{i - 1} (v) \cup {v_{i}} /* compute L_{i} */
15             A_{i} \leftarrow A_{i} \cup {v}
16         else stop expansion from v
17 A = \bigcup_{i \in [1, k]} A_{i}
18 D = \bigcup_{i \in [1, k]} D_{i}
19 foreach (a \in A, d \in D, a \neq d) do
20     if (L_{out}^{k} (a) \cap L_{in}^{k} (d) \neq \emptyset) then /* L_{k} */
21         N_{k} \leftarrow N_{k} + 1
22 return α \leftarrow N_{k} / | T C (G) | as RR of S_{k}
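
Step-2 of the baseline (lines 19–21 and Equation (6)) can be sketched in Python as follows. The labels, the sets A and D, and the TC size are hand-written for a hypothetical six-node DAG with edges 1→2, 1→3, 2→4, 3→4, 4→5, and 4→6 and with hop-nodes 4 and 2; they are assumptions for illustration, not data from the paper.

```python
def baseline_coverage(a_set, d_set, l_out, l_in):
    """Equation (6): count pairs (a, d), a != d, whose labels share a hop-node."""
    return sum(
        1
        for a in a_set
        for d in d_set
        if a != d and not l_out[a].isdisjoint(l_in[d])
    )


# Hypothetical labels L_2 built from hop-nodes 4 and 2.
L_OUT = {1: {2, 4}, 2: {2, 4}, 3: {4}, 4: {4}, 5: set(), 6: set()}
L_IN = {1: set(), 2: {2}, 3: set(), 4: {4}, 5: {4}, 6: {4}}
A = {1, 2, 3, 4}      # union of A_1 and A_2
D = {2, 4, 5, 6}      # union of D_1 and D_2
TC_SIZE = 13          # |TC(G)| of the hypothetical graph

N_K = baseline_coverage(A, D, L_OUT, L_IN)
RR = N_K / TC_SIZE
```

Every pair in A × D is tested, and each test intersects two label sets of size at most k, which matches the O(k |A| |D|) cost of Step-2. In this example, the only reachable query that the labels miss is the one from node 1 to node 3.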

Analysis:

The time cost of line 1 was

O (r | E |)

(see Section 2.2) and was

O (| V |)

for line 3 by counting sort. The time cost of BFS from each hop-node

v_{i}

was

O (| V | + | E |)

(lines 6–16). During the two BFS traversals, the cost of processing each visited node v was

O (k)

(lines 8 and 13). Therefore, the time cost of processing k hop-nodes was

O (k^{2} (| V | + | E |))

. As a result, the time cost of Step-1 was

O (k^{2} (| V | + | E |) + r | E |)

. For Step-2, the time cost of computing

N_{k}

was

O (k | A | | D |)

. Therefore, the time complexity of Algorithm 1 was

O (k^{2} (| V | + | E |) + r | E | + k | A | | D |)

.

During the processing, we did not need to actually maintain every

A_{i}

and

D_{i}

; instead, we only needed to maintain

A

and

D

. Furthermore, we needed to maintain the 2-hop labels with respect to k hop-nodes, so the space cost was

O (k | V |)

. As

S_{k}, A

, and

D

were bounded by V, the space complexity of Algorithm 1 was

O (k | V |)

.

In practice, if the RR was too low to meet the requirement, then we needed to use additional hop-nodes. Therefore, Algorithm 1 would be called once more to compute the new RR, for which all reachability relationships tested for

S_{k}

would be tested again for the new hop-node set.

Example 3.

Continuing Example 2. Given

A = {v_{1}, v_{2}, v_{3}, v_{4}, v_{5}, v_{6}, v_{11}, v_{12}}

and

D = {v_{1}, v_{2}, v_{7}, v_{9}, v_{10}, v_{13}, v_{15}}

, we needed to test 56 queries in lines 19–21, due to

| A | = 8

and

| D | = 7

. By line 22, we knew the RR was 60%. If the RR was required to be equal to or greater than 80%, we would need to enlarge the hop-node set and re-compute the RR from scratch. As a result, the 56 queries tested for

S_{2}

would be tested again for the new hop-node set.

4. The Incremental Approach

Considering that Algorithm 1 would be called again when the hop-node set was enlarged, a natural question was: Could we compute the RR incrementally? That is, given the RR with respect to

S_{i - 1}

, when we decided to compute it with respect to

S_{i} = S_{i - 1} \cup {v_{i}}

, we did not start from scratch; instead, we only computed the number of reachable queries that could not be covered by

L_{i - 1}

but could be covered by

L_{i}

.

However, the increased RR could not be easily computed. On the one hand, by constructing 2-hop labels using hop-node

v_{i}

, we captured 3 types of reachability relationships: (1)

v_{i}

that could reach every node in

D_{i} \ {v_{i}}

could be determined by 2-hop labels with respect to

v_{i}

, and the number of covered reachable queries was

| D_{i} | - 1

; (2) Every node in

A_{i} \ {v_{i}}

that could reach

v_{i}

could be determined by 2-hop labels with respect to

v_{i}

, and the number of covered reachable queries was

| A_{i} | - 1

; and (3) each node in

A_{i} \ {v_{i}}

that could reach every node in

D_{i} \ {v_{i}}

could be determined by 2-hop labels with respect to

v_{i}

, and the number of covered reachable queries was

(| A_{i} | - 1) \times (| D_{i} | - 1)

. Therefore, the number of covered queries by 2-hop labels with respect to

v_{i}

was

(| A_{i} | - 1) \times (| D_{i} | - 1) + (| A_{i} | - 1) + (| D_{i} | - 1) = | A_{i} | \times | D_{i} | - 1

.
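
The identity follows by expanding the product; only the quantities defined above are used:

```latex
\begin{aligned}
(|A_i| - 1)(|D_i| - 1) + (|A_i| - 1) + (|D_i| - 1)
  &= |A_i|\,|D_i| - |A_i| - |D_i| + 1 + |A_i| - 1 + |D_i| - 1 \\
  &= |A_i|\,|D_i| - 1 .
\end{aligned}
```

For the first hop-node of Example 1, with |A_1| = 4 and |D_1| = 7, the three terms are 3 × 6 + 3 + 6 = 27, which equals 4 × 7 − 1.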

On the other hand, 2-hop labels with respect to different hop-nodes could cover the same reachable queries. For example, consider Figure 2b. After processing

v_{i - 1}

, every node

a_{1} \in A_{i 1}

that could reach every node

d_{1} \in D_{i 1}

could be covered by 2-hop labels with respect to

v_{i - 1}

, because

v_{i - 1} \in L_{o u t}^{i - 1} (a_{1}) \cap L_{i n}^{i - 1} (d_{1})

. After processing

v_{i}

, we knew that

a_{1}

could reach

d_{1}

, which could also be covered by 2-hop labels with respect to

v_{i}

, because

v_{i} \in L_{o u t}^{i} (a_{1}) \cap L_{i n}^{i} (d_{1})

. Then, the increased number of reachable queries with respect to

v_{i}

could be computed as Equations (7) and (8), and the total number of reachable queries

N_{k}

covered by

L_{k}

could be computed as Equation (9).

n_{i} = | A_{i} | \times | D_{i} | - 1 - λ

(7)

λ = | {(a, d) | a \in A_{i}, d \in D_{i}, a \neq d, L_{out}^{i - 1} (a) \cap L_{in}^{i - 1} (d) \neq \emptyset} |

(8)

N_{k} = \sum_{i \in [1, k]} n_{i}

(9)

Therefore, the intuitive approach was to first obtain the two sets of nodes,

A_{i}

and

D_{i}

, and then test, for each pair of nodes

a \in A_{i}

and

d \in D_{i}

, whether a could reach d could be answered by

L_{i - 1}

. If the answer was true, then it indicated that

a ⇝ d

had already been covered by

S_{i - 1}

; otherwise, it was a new covered query and needed to be considered, as shown by Algorithm 2.

In Algorithm 2, we computed the RR for each

S_{i}

when

v_{i} (i \in [1, k])

had been added into

S_{i - 1}

. In lines 15–18, we computed the number of queries that could be covered by

L_{i - 1}

. We obtained the increased number of queries in line 19 according to Equation (7). Then, we obtained the RR of

S_{i}

in line 21. We computed

L_{i}

based on

L_{i - 1}

in lines 22–25. Finally, we returned the RR of

S_{k}

in line 26.
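
The incremental computation of Equations (7)–(9) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the six-node DAG `EDGES`, the hop-node order `HOPS`, and all function names are hypothetical, and the TC size is passed in rather than computed by buTC+.

```python
from collections import deque

# Hypothetical six-node DAG, given as node -> list of out-neighbors.
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [], 6: []}
HOPS = [4, 2, 3]    # assumed hop-node order v_1, v_2, v_3
TC_SIZE = 13        # |TC(G)| of this graph


def reverse_graph(adj):
    """Return node -> list of in-neighbors (every node must be a key of adj)."""
    radj = {v: [] for v in adj}
    for u, outs in adj.items():
        for v in outs:
            radj[v].append(u)
    return radj


def pruned_bfs(start, nbrs, is_covered):
    """BFS from start; a node already covered by earlier labels is skipped and not expanded."""
    kept, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        if is_covered(v):
            continue
        kept.append(v)
        for w in nbrs[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return kept


def incremental_rr(adj, hops, tc_size):
    """Return the RR after each hop-node, counting only newly covered queries."""
    radj = reverse_graph(adj)
    l_in = {v: set() for v in adj}
    l_out = {v: set() for v in adj}
    covered, ratios = 0, []
    for h in hops:
        d_i = pruned_bfs(h, adj, lambda v: not l_out[h].isdisjoint(l_in[v]))
        a_i = pruned_bfs(h, radj, lambda v: not l_out[v].isdisjoint(l_in[h]))
        # Equation (8): pairs that the labels of earlier hop-nodes already cover.
        lam = sum(1 for a in a_i for d in d_i
                  if a != d and not l_out[a].isdisjoint(l_in[d]))
        covered += len(a_i) * len(d_i) - 1 - lam     # Equations (7) and (9)
        ratios.append(covered / tc_size)
        for d in d_i:                                # extend the labels with h
            l_in[d].add(h)
        for a in a_i:
            l_out[a].add(h)
    return ratios


RATIOS = incremental_rr(EDGES, HOPS, TC_SIZE)
```

Because the RR is recorded after every hop-node, a caller can stop adding hop-nodes as soon as a required ratio is reached, without repeating the tests of earlier iterations.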

Analysis:

Algorithm 2 computed the two sets

A_{i}

and

D_{i}

in lines 2–14, and then it computed

L_{i}

in lines 22–25. The overall cost of Step-1 was the same as that of Algorithm 1, i.e.,

O (k^{2} (| V | + | E |) + r | E |)

. The cost of Step-2 for each hop-node was

O (i | A_{i} | | D_{i} |)

. For k hop-nodes, the cost was

O (\sum_{i \in [1, k]} i | A_{i} | | D_{i} |)

. Therefore, the time complexity of Algorithm 2 was

O (k^{2} (| V | + | E |) + r | E | + \sum_{i \in [1, k]} i | A_{i} | | D_{i} |)

. Similar to Algorithm 1, the space complexity of Algorithm 2 was

O (k | V |)

.

Example 4.

Consider G in Figure 3. Assume that we want to construct p2HLs

L_{3}

. The first processed node is

v_{1}

. As

A_{1} = {v_{1}, v_{4}, v_{6}, v_{11}}

,

D_{1} = {v_{1}, v_{2}, v_{7}, v_{9}, v_{10}, v_{13}, v_{15}}

, we know

N_{1} = n_{1} = | A_{1} | \times | D_{1} | - 1 = 27

. The second processed node is

v_{2}

. The p2HLs are shown in Table 1 and

A_{2} = {v_{2}, v_{3}, v_{5}, v_{12}}

,

D_{2} = {v_{2}, v_{10}, v_{13}, v_{14}}

. Then, in lines 16–18, we need to test

| A_{2} | \times | D_{2} | = 16

queries. As

λ = 0

,

n_{2} = | A_{2} | \times | D_{2} | - 1 - 0 = 15

, and

N_{2} = n_{1} + n_{2} = 27 + 15 = 42

. The third processed node is

v_{3}

and

A_{3} = {v_{3}, v_{4}, v_{5}, v_{6}, v_{11}}

,

D_{3} = {v_{3}, v_{7}, v_{8}, v_{9}, v_{14}}

. Then, in lines 16–18, we need to test

| A_{3} | \times | D_{3} | = 25

queries. As

λ = 6

,

n_{3} = | A_{3} | \times | D_{3} | - 1 - 6 = 18

, and

N_{3} = N_{2} + n_{3} = 42 + 18 = 60

. After processing

v_{3}

, we had

L_{3}

, as shown in Table 1, and the RR was

60 / 70 = 85.7

% by testing

16 + 25 = 41

queries in Algorithm 2.

As a comparison, for Algorithm 1,

| A_{3} | = 8

,

| D_{3} | = 10

, and we needed to test 80 queries to obtain the RR.
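The arithmetic of Example 4 can be replayed with a few lines of Python. The set sizes, the values of λ, and the TC size of 70 are taken from the example; the variable names are only illustrative.

```python
# (|A_i|, |D_i|, lambda_i) for v_1, v_2, v_3, as stated in Example 4
steps = [(4, 7, 0), (4, 4, 0), (5, 5, 6)]
tc_size = 70  # |TC(G)| used in Example 4

N = 0        # running total N_i of covered reachable queries
tested = 0   # queries tested by Algorithm 2
for i, (size_a, size_d, lam) in enumerate(steps, start=1):
    n_i = size_a * size_d - 1 - lam   # Equation (7)
    N += n_i                          # Equation (9)
    if i > 1:                         # Example 4 counts no tests for v_1
        tested += size_a * size_d

baseline_tests = 8 * 10  # Algorithm 1 with |A_3| = 8 and |D_3| = 10
print(N, tested, baseline_tests, round(100 * N / tc_size, 1))  # 60 41 80 85.7
```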

Algorithm 2: incRR (G, k)

1  compute | T C (G) | by the buTC^{+} algorithm [12]
2  N_{0} \leftarrow 0
3  rank all nodes v in G by (| i n (v) | + 1) \times (| o u t (v) | + 1)
4  put the first k nodes into S_{k} as hop-nodes
5  foreach (v_{i} \in S_{k}) do
6    A_{i} \leftarrow \emptyset, D_{i} \leftarrow \emptyset
7    perform forward BFS from v_{i}, and for each visited v
8      if (L_{o u t}^{i - 1} (v_{i}) \cap L_{i n}^{i - 1} (v) = \emptyset) then /* L_{i - 1} */
9        D_{i} \leftarrow D_{i} \cup {v}
10     else stop expansion from v
11   perform backward BFS from v_{i}, for each visited v
12     if (L_{o u t}^{i - 1} (v) \cap L_{i n}^{i - 1} (v_{i}) = \emptyset) then /* L_{i - 1} */
13       A_{i} \leftarrow A_{i} \cup {v}
14     else stop expansion from v
15   λ \leftarrow 0
16   foreach (a \in A_{i}, d \in D_{i}, a \neq d) do
17     if (L_{o u t}^{i - 1} (a) \cap L_{i n}^{i - 1} (d) \neq \emptyset) then /* L_{i - 1} */
18       λ \leftarrow λ + 1
19   n_{i} \leftarrow | A_{i} | \times | D_{i} | - 1 - λ
20   N_{i} \leftarrow N_{i - 1} + n_{i}
21   α \leftarrow N_{i} / | T C (G) | /* RR of S_{i} */
22   foreach (a \in A_{i}) do /* compute L_{i} */
23     L_{o u t}^{i} (a) \leftarrow L_{o u t}^{i - 1} (a) \cup {v_{i}}
24   foreach (d \in D_{i}) do /* compute L_{i} */
25     L_{i n}^{i} (d) \leftarrow L_{i n}^{i - 1} (d) \cup {v_{i}}
26 return α as RR of S_{k}
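The control flow of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' C++ implementation: the hop-nodes are passed in already ranked, the exact TC size is supplied by the caller instead of being computed by buTC^{+}, the graph is assumed to be a DAG stored as a successor dictionary, and the names `pruned_bfs` and `inc_rr` as well as the toy graph are hypothetical.

```python
from collections import deque


def pruned_bfs(adj, start, covered):
    """BFS from `start` that keeps uncovered nodes and never expands covered ones."""
    seen, kept, queue = {start}, set(), deque([start])
    while queue:
        v = queue.popleft()
        if covered(v):
            continue  # "stop expansion from v" (lines 10 and 14)
        kept.add(v)
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return kept


def inc_rr(succ, hop_nodes, tc_size):
    """Return the RR after processing `hop_nodes` in the given order."""
    pred = {}
    for u, targets in succ.items():
        for v in targets:
            pred.setdefault(v, []).append(u)
    L_out, L_in = {}, {}   # node -> set of hop-nodes
    covered_total = 0      # N_i
    for v_i in hop_nodes:
        out_i = L_out.get(v_i, set())
        in_i = L_in.get(v_i, set())
        # Lines 7-14: descendants and ancestors not yet covered by L_{i-1}.
        D = pruned_bfs(succ, v_i, lambda v: bool(out_i & L_in.get(v, set())))
        A = pruned_bfs(pred, v_i, lambda v: bool(L_out.get(v, set()) & in_i))
        # Lines 15-18: pairs already answered by L_{i-1}.
        lam = sum(1 for a in A for d in D
                  if a != d and L_out.get(a, set()) & L_in.get(d, set()))
        covered_total += len(A) * len(D) - 1 - lam   # lines 19-20
        for a in A:                                   # lines 22-25
            L_out.setdefault(a, set()).add(v_i)
        for d in D:
            L_in.setdefault(d, set()).add(v_i)
    return covered_total / tc_size


# Hypothetical toy DAG with 9 reachable pairs in total.
succ = {1: [2], 5: [2], 2: [3, 4], 6: [7]}
TC_SIZE = 9
rr_one = inc_rr(succ, [2], TC_SIZE)
rr_two = inc_rr(succ, [2, 6], TC_SIZE)
rr_redundant = inc_rr(succ, [2, 1], TC_SIZE)
print(rr_one, rr_two, rr_redundant)
```

With hop-node 2 alone, 8 of the 9 reachable pairs are covered; adding node 6 covers the remaining pair, whereas adding node 1 contributes nothing because the pruned BFS stops at node 2.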

5. The Incremental-Partition Approach

5.1. The Equivalence Relationship

Though Algorithm 2 did not need to compute RR from scratch when the hop-node set became large, it still needed to test

| A_{i} | \times | D_{i} |

queries in line 16 with cost

O (i | A_{i} | | D_{i} |)

. Given a large hop-node set, the cost could still become unaffordable.

Definition 1.

[Equivalence Relationship] Given a hop-node

v_{i}

, its ancestor set was

A_{i}

, and its descendant set was

D_{i}

. We said that two nodes

a_{1}, a_{2} (a_{1} \neq a_{2})

of

A_{i}

were forward-equivalent to each other, denoted as

a_{1} \equiv_{F} a_{2}

, if they had the same out-labels, i.e.,

L_{o u t}^{i} (a_{1}) = L_{o u t}^{i} (a_{2})

. Similarly, the two nodes

d_{1}, d_{2} (d_{1} \neq d_{2})

of

D_{i}

were backward-equivalent to each other, denoted as

d_{1} \equiv_{B} d_{2}

, if they had the same in-label, i.e.,

L_{i n}^{i} (d_{1}) = L_{i n}^{i} (d_{2})

.

By Definition 1, we could obtain for

A_{i}

a partition

A (i) = {A_{i 1}, A_{i 2}, . . ., A_{i m}}

, which consisted of m disjoint subsets satisfying that (1)

\forall l, j \in [1, m], l \neq j, A_{i l} \cap A_{i j} = \emptyset

and

\cup_{l \in [1, m]} A_{i l} = A_{i}

; and (2)

\forall a_{l}, a_{j}

belonging to the same subset,

a_{l} \equiv_{F} a_{j}

. For

D_{i}

, we also had a partition

D (i) = {D_{i 1}, D_{i 2}, . . ., D_{i n}}

satisfying that (1)

\forall l, j \in [1, n], l \neq j, D_{i l} \cap D_{i j} = \emptyset

and

\cup_{j \in [1, n]} D_{i j} = D_{i}

; and (2)

\forall d_{l}, d_{j}

belonging to the same subset,

d_{l} \equiv_{B} d_{j}

. We had the following result.

Theorem 1.

Given a hop-node

v_{i}

, its ancestor set

A_{i}

, and its descendant set

D_{i}

, the number of tested queries for RR computation was

| A (i) | \times | D (i) |

, which was bounded by

min {| A_{i} | \times | D_{i} |, 4^{i - 1}}

.

Proof.

Let

A (i) (D (i))

be the partition of

A_{i} (D_{i})

based on the equivalence relationship and

P_{A} (i) (P_{D} (i))

the partition of V with respect to hop-node set

S_{i}

and the forward (backward) equivalence relationship. Initially,

A (1) = {A_{1}} (D (1) = {D_{1}})

,

P_{A} (1) = {A_{1}, V \ A_{1}} (P_{D} (1) = {D_{1}, V \ D_{1}})

.

On the one hand, for each subset

A_{i l} \in A (i)

, the results of testing the reachability relationships from the nodes of

A_{i l}

to any other node were identical; thus, we only needed to select an arbitrary node as the representative node of

A_{i l}

to perform the reachability tests. Similarly, for each subset

D_{i j} \in D (i)

, we could also select an arbitrary node as the representative node of

D_{i j}

to test the reachability relationships from any node to the nodes of

D_{i j}

. As a result, the number of tested queries from the nodes of

A_{i}

to the nodes of

D_{i}

was

| A (i) | \times | D (i) |

. Since

A (i) (D (i))

was the partition of

A_{i} (D_{i})

, we knew that

| A (i) | \times | D (i) | \leq | A_{i} | \times | D_{i} |

.

On the other hand, given the partition

P_{A} (i - 1) (P_{D} (i - 1))

of V, the size of

P_{A} (i) (P_{D} (i))

was, at most, twice the size of

P_{A} (i - 1) (P_{D} (i - 1))

, because the nodes in each subset of

P_{A} (i - 1) (P_{D} (i - 1))

could be further divided into, at most, two disjoint subsets: one consisted of nodes that could reach (or be reached by)

v_{i}

, and the other contained nodes that could not reach (or be reached by)

v_{i}

. Then, the size of

P_{A} (i) (P_{D} (i))

was bounded by

2^{i}

, and the size of

A (i) (D (i))

was bounded by

2^{i - 1}

. Therefore, the number of tested reachability queries was bounded by

2^{i - 1} \times 2^{i - 1} = 4^{i - 1}

.

Hence, the number of tested queries for RR with respect to hop-node

v_{i}

was bounded by

min {| A_{i} | \times | D_{i} |, 4^{i - 1}}

. □
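As a small numeric illustration of the bound in Theorem 1, the following Python snippet evaluates it for the three hop-nodes of Example 4, using the set sizes stated there. The function name `max_tested_queries` is an assumption made for this sketch.

```python
def max_tested_queries(size_a, size_d, i):
    """Upper bound of Theorem 1 on the number of tested queries for hop-node v_i."""
    return min(size_a * size_d, 4 ** (i - 1))


# (|A_i|, |D_i|) for v_1, v_2, v_3 as given in Example 4
sizes = [(4, 7), (4, 4), (5, 5)]
bounds = [max_tested_queries(a, d, i) for i, (a, d) in enumerate(sizes, start=1)]
print(bounds)  # [1, 4, 16]
```

For small i the term 4^{i - 1} dominates the bound, while for large i the product of the set sizes takes over.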

According to Theorem 1, we could reduce the number of tested reachability queries when processing hop-node

v_{i}

. As shown by Equation (10), for each pair of subsets

(A_{i l} \in A (i), D_{i j} \in D (i))

, we only needed to test the reachability relationship between their representative nodes

a \in A_{i l}

and

d \in D_{i j}

. To accomplish this, we first determined the partitions of both

A_{i}

and

D_{i}

, according to their equivalence relationship.

λ = \sum_{\substack{a \in A_{i l} \in A (i) \\ d \in D_{i j} \in D (i) \\ L_{o u t}^{i - 1} (a) \cap L_{i n}^{i - 1} (d) \neq \emptyset}} | A_{i l} | \times | D_{i j} |

(10)
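Equation (10) can be sketched in Python as follows. This is a minimal sketch, not the authors' code: nodes are grouped by their labels under L_{i - 1} (for nodes of A_{i} or D_{i} this yields the same classes as Definition 1, because every member gains v_{i}), and one test is performed per pair of classes. The function name `lam_by_partition` is hypothetical, and the label values below are assumptions chosen to reproduce the partitions A (3) and D (3) described for v_{3} in the examples.

```python
from collections import defaultdict


def lam_by_partition(A_i, D_i, L_out, L_in):
    """Equation (10): one label test per pair of equivalence classes."""
    classes_a = defaultdict(list)
    for a in A_i:
        classes_a[frozenset(L_out.get(a, ()))].append(a)
    classes_d = defaultdict(list)
    for d in D_i:
        classes_d[frozenset(L_in.get(d, ()))].append(d)

    lam, tests = 0, 0
    for out_label, members_a in classes_a.items():
        for in_label, members_d in classes_d.items():
            tests += 1
            if out_label & in_label:  # already covered by L_{i-1}
                lam += len(members_a) * len(members_d)
    return lam, tests


# Labels before processing v_3 (assumed values matching the examples).
L_out = {"v3": {"v2"}, "v5": {"v2"}, "v4": {"v1"}, "v6": {"v1"}, "v11": {"v1"}}
L_in = {"v7": {"v1"}, "v9": {"v1"}}
A_3 = ["v3", "v4", "v5", "v6", "v11"]
D_3 = ["v3", "v7", "v8", "v9", "v14"]
lam, tests = lam_by_partition(A_3, D_3, L_out, L_in)
print(lam, tests)  # 6 4
```

Only 4 class pairs are tested instead of 25 node pairs, and the resulting λ is the same as the one obtained by the pairwise test.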

5.2. Partitions Computation

To obtain the partitions of

A_{i}

and

D_{i}

, the intuitive approach was to sort all nodes in

A_{i} (D_{i})

by comparing their out-labels (in-labels) in lexicographic order. After the sorting operation, all equivalent nodes were clustered together. As the size of each label was bounded by i, the cost of computing the partition

A (i) (D (i))

of

A_{i} (D_{i})

was

O (i \times | A_{i} | \times log | A_{i} |) (O (i \times | D_{i} | \times log | D_{i} |))

.
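The sort-based partitioning described above can be sketched in Python with `sorted` and `itertools.groupby`. This is only an illustration: the function name `partition_by_sorting` is hypothetical, hop-nodes are represented by their integer index, and the labels are assumed values for the ancestor set of v_{3}.

```python
from itertools import groupby


def partition_by_sorting(nodes, label):
    """Cluster nodes with identical labels by sorting them lexicographically."""
    def key(v):
        return sorted(label.get(v, ()))  # label as a sorted list of hop-node ids

    ordered = sorted(nodes, key=key)     # O(i * n * log n) comparisons overall
    return [list(group) for _, group in groupby(ordered, key=key)]


# Out-labels before processing v_3 (hop-node ids 1 and 2, assumed values).
L_out = {3: {2}, 5: {2}, 4: {1}, 6: {1}, 11: {1}}
parts = partition_by_sorting([3, 4, 5, 6, 11], L_out)
print(parts)  # [[4, 6, 11], [3, 5]]
```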

Let

P_{A} (i)

be the partition of V with respect to hop-node set

S_{i}

and the forward equivalence relationship,

P_{D} (i)

the partition of V with respect to

S_{i}

and backward equivalence relationship. We obtained the following result.

Theorem 2.

Given the hop-node

v_{i}

and its ancestor (descendant) set

A_{i} (D_{i})

, for

\forall v_{1}, v_{2} \in A_{i} (D_{i})

,

v_{1} \equiv_{F} v_{2} (v_{1} \equiv_{B} v_{2})

, if they belong to the same subset of

P_{A} (i - 1) (P_{D} (i - 1))

.

Proof.

We proved this result from two aspects. First, we proved the correctness when both

v_{1}

and

v_{2}

belonged to

A_{i}

(Case-1), and then we proved the correctness when both

v_{1}

and

v_{2}

belonged to

D_{i}

(Case-2).

Case-1 where

v_{1}, v_{2} \in A_{i}

and

v_{i} \in L_{o u t}^{i} (v_{1}) \cap L_{o u t}^{i} (v_{2})

.

On the one hand, if

v_{1} \equiv_{F} v_{2}

, it indicated that

L_{o u t}^{i} (v_{1}) = L_{o u t}^{i} (v_{2})

according to Definition 1. Hence,

L_{o u t}^{i} (v_{1}) \ {v_{i}} = L_{o u t}^{i} (v_{2}) \ {v_{i}}

, i.e., they belonged to the same subset of

P_{A} (i - 1)

.

On the other hand, if both

v_{1}

and

v_{2}

belonged to the same subset of

P_{A} (i - 1)

, it indicated that before processing hop-node

v_{i}

,

L_{o u t}^{i - 1} (v_{1}) = L_{o u t}^{i - 1} (v_{2})

, according to the definition of

P_{A} (i - 1)

. As

v_{1}, v_{2} \in A_{i}

, we knew that after processing

v_{i}

,

v_{i} \in L_{o u t}^{i} (v_{1}) \cap L_{o u t}^{i} (v_{2})

and

L_{o u t}^{i} (v_{1}) = L_{o u t}^{i} (v_{2})

still held. According to Definition 1,

v_{1} \equiv_{F} v_{2}

.

Therefore, we determined that

v_{1} \equiv_{F} v_{2}

, if they belonged to the same subset of

P_{A} (i - 1)

.

Case-2 where

v_{1}, v_{2} \in D_{i}

and

v_{i} \in L_{i n}^{i} (v_{1}) \cap L_{i n}^{i} (v_{2})

.

Similar to the proof of Case-1, we knew that

v_{1} \equiv_{B} v_{2}

, if they belonged to the same subset of

P_{D} (i - 1)

.

Therefore, for

\forall v_{1}, v_{2} \in A_{i} (D_{i})

,

v_{1} \equiv_{F} v_{2} (v_{1} \equiv_{B} v_{2})

, if they belonged to the same subset of

P_{A} (i - 1) (P_{D} (i - 1))

. □

According to Theorem 2, we assigned each node v two set IDs, denoted as

i d_{A} (v)

and

i d_{D} (v)

, which were used to determine which subset it belonged to in

P_{A} (i)

and

P_{D} (i)

, respectively. Then, given the ancestor (descendant) set

A_{i} (D_{i})

of

v_{i}

, we only needed to scan all nodes of

A_{i} (D_{i})

once to know immediately that for two nodes

v_{1}

and

v_{2}

: If

i d_{A} (v_{1}) = i d_{A} (v_{2}) (i d_{D} (v_{1}) = i d_{D} (v_{2}))

in

P_{A} (i - 1) (P_{D} (i - 1))

, then

v_{1} \equiv_{F} v_{2} (v_{1} \equiv_{B} v_{2})

, and they definitely belonged to the same subset of

P_{A} (i) (P_{D} (i))

. Therefore,

P_{A} (i) (P_{D} (i))

was a refinement of

P_{A} (i - 1) (P_{D} (i - 1))

, i.e., each element of

P_{A} (i) (P_{D} (i))

was a subset of a unique element of

P_{A} (i - 1) (P_{D} (i - 1))

.

Recall that when processing the hop-node

v_{i}

, we first had its ancestor (descendant) set

A_{i} (D_{i})

and then obtained the partition

A (i) (D (i))

of

A_{i} (D_{i})

, based on the equivalence relationship. Since

P_{A} (i) (P_{D} (i))

was the partition of V with respect to the equivalence relationship, we knew that

A (i) \subset P_{A} (i) (D (i) \subset P_{D} (i))

, and the relationship between

P_{A} (i) (P_{D} (i))

,

P_{A} (i - 1) (P_{D} (i - 1))

and

A (i) (D (i))

was shown as Equations (11) and (12).

P_{A} (i) = {P \ A_{i} | P \in P_{A} (i - 1)} \cup A (i)

(11)

P_{D} (i) = {P \ D_{i} | P \in P_{D} (i - 1)} \cup D (i)

(12)
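Equations (11) and (12) can be illustrated with plain Python sets. This is a hedged sketch: the function name `refine` is hypothetical, empty remainders are dropped (an assumption, since a partition has no empty blocks), and V is assumed to consist of the 15 nodes v_{1}, ..., v_{15} that appear in the examples, written here as integers.

```python
def refine(P_prev, X_i, X_parts):
    """P(i) = {P \\ X_i | P in P(i-1)} united with the partition X(i) of X_i."""
    remainders = [block - X_i for block in P_prev]
    return [block for block in remainders if block] + [set(p) for p in X_parts]


V = set(range(1, 16))  # assumed node set v_1 .. v_15
A1, A2, A3 = {1, 4, 6, 11}, {2, 3, 5, 12}, {3, 4, 5, 6, 11}

P1 = refine([V], A1, [A1])                    # P_A(1)
P2 = refine(P1, A2, [A2])                     # P_A(2)
P3 = refine(P2, A3, [{3, 5}, {4, 6, 11}])     # P_A(3), with A(3) from Example 5
blocks = sorted(sorted(b) for b in P3)
print(blocks)
# [[1], [2, 12], [3, 5], [4, 6, 11], [7, 8, 9, 10, 13, 14, 15]]
```

Each block of P3 is a subset of exactly one block of P2, which matches the refinement property stated above.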

When processing hop-node

v_{i}

, we used a hash table

H_{A}^{} (H_{D}^{})

to achieve linear-time complexity. Each element of

H_{A}^{} (H_{D}^{})

was a tuple

(i d_{o}, e_{n})

, denoting a subset

A_{i l} (D_{i l})

of

A (i) (D (i))

, where

i d_{o}

was, for all nodes of

A_{i l} (D_{i l})

, their old set ID in

P_{A} (i - 1) (P_{D} (i - 1))

, and

e_{n} = (i d_{n}, v_{1}, s)

was a triple denoting the new set ID for all nodes of

A_{i l} (D_{i l})

, the representative node of

A_{i l} (D_{i l})

, and the size of

A_{i l} (D_{i l})

, respectively.
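The single-scan use of the hash table can be sketched in Python with a dictionary. This is a minimal sketch under assumptions: the function name `partition_with_hash` is hypothetical, each node receives the new set ID stored for its class, and the scan orders below are chosen so that the representatives match those mentioned in Example 5 (the BFS order of the actual algorithm may differ).

```python
def partition_with_hash(nodes, set_id, next_id):
    """Group the nodes of A_i (or D_i) by their old set ID in one scan.

    set_id  : dict node -> current set ID, updated in place to the new IDs
    next_id : largest set ID handed out so far
    Returns (H, next_id), where H maps old_id -> [new_id, representative, size].
    """
    H = {}
    for v in nodes:
        old = set_id[v]
        if old not in H:
            next_id += 1
            H[old] = [next_id, v, 1]   # first node seen becomes the representative
        else:
            H[old][2] += 1             # one more member in this class
        set_id[v] = H[old][0]          # new set ID of the class of v
    return H, next_id


id_A = {v: 0 for v in range(1, 16)}    # all nodes start in one class (ID 0)
H1, n_A = partition_with_hash([4, 1, 6, 11], id_A, 0)      # A_1
H2, n_A = partition_with_hash([3, 2, 5, 12], id_A, n_A)    # A_2
H3, n_A = partition_with_hash([3, 4, 5, 6, 11], id_A, n_A) # A_3
print(H1, H2, H3)
# {0: [1, 4, 4]} {0: [2, 3, 4]} {2: [3, 3, 2], 1: [4, 4, 3]}
```

The first two tables reproduce the entries described for Figure 4a and Figure 4b; the new set IDs in the third table depend on the scan order and are only illustrative.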

Example 5.

Consider G in Figure 3. Before processing

v_{1}

,

P_{A} (0) = P_{D} (0) = {V}

,

A (0) = D (0) = \emptyset

, and for all nodes v,

i d_{A} (v) = i d_{D} (v) = 0

.

For the first node

v_{1}

, we have

A_{1} = {v_{1}, v_{4}, v_{6}, v_{11}}

,

D_{1} = {v_{1}, v_{2}, v_{7}, v_{9}, v_{10}, v_{13}, v_{15}}

. Since all nodes in

A_{1} (D_{1})

have the same

i d_{A} (v) (i d_{D} (v))

, we know that

A (1) = {A_{1}}

and

D (1) = {D_{1}}

.

P_{A} (1) = {A_{1}, V \ A_{1}}

, and

P_{D} (1) = {D_{1}, V \ D_{1}}

. In Table 2, the two columns under

v_{1}

denote

P_{A} (1)

and

P_{D} (1)

, where each 1 in the second (third) column corresponds to a node in

A_{1} (D_{1})

. Figure 4a shows the two hash tables denoting

A (1)

and

D (1)

, respectively. For

H_{A}^{}

, there is one (key, value) pair, denoting that

A (1)

contains one subset

A_{1}

, and for all nodes in

A_{1}

, their set ID is 0 in

P_{A} (0)

. Therefore, they all belong to the same subset in

A (1)

, i.e.,

A (1) = {A_{1}}

. By

H_{A}^{}

in Figure 4a, we know that all nodes in

A_{1}

now have the new set ID 1, the representative node of

A_{1}

is

v_{4}

, and

| A_{1} | = 4

.

Table 2. The status of set IDs for all nodes.

Figure 4. Running status of the two hash tables

H_{A}^{}

and

H_{D}^{}

.

For the second processed node

v_{2}

,

A_{2} = {v_{2}, v_{3}, v_{5}, v_{12}}

,

D_{2} = {v_{2}, v_{10}, v_{13}, v_{14}}

. As all nodes in

A_{2}

have the same set ID 0 in

P_{A} (1)

,

A (2) = {A_{2}}

. As shown by Figure 4b, the key is 0, and the triple

(2, v_{3}, 4)

denotes that the new set ID for all nodes in

A_{2}

is 2, the representative node is

v_{3}

, and

| A_{2} | = 4

. Similarly, all nodes in

D_{2}

have the same set ID 1 in

P_{D} (1)

,

D (2) = {D_{2}}

, which is denoted as

H_{D}^{}

in Figure 4b.

For the third processed node

v_{3}

,

A_{3} = {v_{3}, v_{4}, v_{5}, v_{6}, v_{11}}

,

D_{3} = {v_{3}, v_{7}, v_{8}, v_{9}, v_{14}}

. As

v_{3}

and

v_{5}

have the same set ID 2 in

P_{A} (2)

, they form the first subset in

A (3)

. Furthermore, as

v_{4}, v_{6}, v_{11}

have the same set ID 1 in

P_{A} (2)

, they form the second subset in

A (3)

. Therefore,

A (3) = {{v_{3}, v_{5}}, {v_{4}, v_{6}, v_{11}}}

. Similarly,

D (3) = {{v_{3}, v_{8}, v_{14}}, {v_{7}, v_{9}}}

. Both

A (3)

and

D (3)

are denoted by

H_{A}^{}

and

H_{D}^{}

in Figure 4c, respectively.

The Algorithm:

As shown by Algorithm 3, for each hop-node

v_{i}

, we first performed forward and backward BFS to obtain

D_{i}

(lines 6–15) and

A_{i}

(lines 16–25). At the same time, we generated their partitions

D (i)

and

A (i)

, for which each subset was recorded in

H_{D}^{}

and

H_{A}^{}

, respectively. In lines 26–29, we computed

λ

according to Equation (10), which was the number of reachable queries that were covered by

L_{i - 1}

. In line 30, we obtained the number of reachable queries that could be covered by

L_{i}

but not by

L_{i - 1}

. After that, we had the total number of covered reachable queries in line 31 and the RR in line 33. Finally, we generated

L_{i}

in lines 34–37 and returned the RR of

S_{k}

in line 38.

It is worth noting that for Algorithm 3, the improvement was related not only to the coverage size computation but also to the TC size computation. In line 32, we obtained an estimated TC size by the approach in [23], which operated in linear time

O (| V | + | E |)

and was more efficient than the exact approach in [12]. Since the denominator in Equation (1) did not change for

α_{i} (i \in [1, k])

, we could achieve the same effect using an approximate TC size for the RR computation.

Algorithm 3: incRR^{+} (G = (V, E), k, T C (G))

1  N_{0} \leftarrow 0; n_{A} \leftarrow 0; n_{D} \leftarrow 0;
2  rank all nodes v in G by (| i n (v) | + 1) \times (| o u t (v) | + 1)
3  put the first k nodes into S_{k} as hop-nodes
4  foreach (v_{i} \in S_{k}) do
5    A_{i} \leftarrow \emptyset; D_{i} \leftarrow \emptyset; H_{A} \leftarrow \emptyset; H_{D} \leftarrow \emptyset
6    perform forward BFS from v_{i}, and for each visited v
7      if (L_{o u t}^{i - 1} (v_{i}) \cap L_{i n}^{i - 1} (v) = \emptyset) then /* L_{i - 1} */
8        if (i d_{D} (v) \notin H_{D}) then
9          n_{D} \leftarrow n_{D} + 1
10         H_{D} [i d_{D} (v)] \leftarrow (n_{D}, v, 1)
11       else
12         H_{D} [i d_{D} (v)] . s \leftarrow H_{D} [i d_{D} (v)] . s + 1
13       D_{i} \leftarrow D_{i} \cup {v}
14       i d_{D} (v) \leftarrow H_{D} [i d_{D} (v)] . i d_{n}
15     else stop expansion from v
16   perform backward BFS from v_{i}, for each visited v
17     if (L_{o u t}^{i - 1} (v) \cap L_{i n}^{i - 1} (v_{i}) = \emptyset) then /* L_{i - 1} */
18       if (i d_{A} (v) \notin H_{A}) then
19         n_{A} \leftarrow n_{A} + 1
20         H_{A} [i d_{A} (v)] \leftarrow (n_{A}, v, 1)
21       else
22         H_{A} [i d_{A} (v)] . s \leftarrow H_{A} [i d_{A} (v)] . s + 1
23       A_{i} \leftarrow A_{i} \cup {v}
24       i d_{A} (v) \leftarrow H_{A} [i d_{A} (v)] . i d_{n}
25     else stop expansion from v
26   λ \leftarrow 0
27   foreach ((i d, a, s_{A}) \in H_{A}, (i d, d, s_{D}) \in H_{D}) do
28     if (L_{o u t}^{i - 1} (a) \cap L_{i n}^{i - 1} (d) \neq \emptyset) then /* L_{i - 1} */
29       λ \leftarrow λ + s_{A} \times s_{D}
30   n_{i} \leftarrow | A_{i} | \times | D_{i} | - 1 - λ
31   N_{i} \leftarrow N_{i - 1} + n_{i}
32   estimate the TC size by Formula 3 in [23]
33   α \leftarrow N_{i} / | T C (G) | /* RR of S_{i} */
34   foreach (a \in A_{i}) do /* compute L_{i} */
35     L_{o u t}^{i} (a) \leftarrow L_{o u t}^{i - 1} (a) \cup {v_{i}}
36   foreach (d \in D_{i}) do /* compute L_{i} */
37     L_{i n}^{i} (d) \leftarrow L_{i n}^{i - 1} (d) \cup {v_{i}}
38 return α as RR of S_{k}

Example 6.

Consider G in Figure 3. Assume that we want to compute the RR of

S_{3} = {v_{1}, v_{2}, v_{3}}

.

For

v_{1}

, as there is no covered reachability relationship, we do not need to test any queries in line 28. As

| A_{1} | = 4

,

| D_{1} | = 7

,

n_{1} = | A_{1} | \times | D_{1} | - 1 = 27

in line 30 of Algorithm 3.

For

v_{2}

, as both

| A (2) | = | D (2) | = 1

, we only need to test one reachable query, i.e., whether

v_{3} ? ⇝ v_{10}

can be answered by

L_{1}

. As

L_{o u t}^{1} (v_{3}) \cap L_{i n}^{1} (v_{10}) = \emptyset

, we know that

n_{2} = | A_{2} | \times | D_{2} | - 1 = 15

in line 30 of Algorithm 3.

For

v_{3}

, as shown by Figure 4c, we know that

A (3) = {{v_{3}, v_{5}}, {v_{4}, v_{6}, v_{11}}}

and

D (3) = {{v_{3}, v_{8}, v_{14}}, {v_{7}, v_{9}}}

. In line 28, we only need to test

| A (3) | \times | D (3) | = 2 \times 2 = 4

reachable queries. As whether

v_{4}

can reach

v_{7}

can be answered by

L_{2}

, we know that whether all nodes in

{v_{4}, v_{6}, v_{11}}

can reach every node in

{v_{7}, v_{9}}

can be answered by

L_{2}

; thus,

λ = 6

for

v_{3}

. In line 30, we know that

n_{3} = | A_{3} | \times | D_{3} | - 1 - λ = 5 \times 5 - 1 - 6 = 18

.

Then, we know

N_{3} = n_{1} + n_{2} + n_{3} = 27 + 15 + 18 = 60

, and the RR is

α = N_{3} / | T C (G) | = 60 / 70 = 85.7

%. During the process, the total number of reachability queries tested by Algorithm 3 was 1 + 4 = 5.

As a comparison, the total number of tested queries was 41 for Algorithm 2 and 80 for Algorithm 1 to obtain the RR.

Analysis:

As Algorithm 3 computed the approximate TC size in linear time, the time cost of Step-1 was

O (k^{2} (| V | + | E |))

. The difference was found in Step-2. The cost of Step-2 for each hop-node was

O (i | A (i) | | D (i) |)

. For k hop-nodes, the cost was

O (\sum_{i \in [1, k]} i | A (i) | | D (i) |)

for coverage computation. The time complexity of Algorithm 3 was, therefore,

O (k^{2} (| V | + | E |) + \sum_{i \in [1, k]} i | A (i) | | D (i) |)

. The space complexity of Algorithm 3 was

O (k | V |)

.

6. Experiment

In this section, we show the experimental results of the RR computation. The baseline algorithms included blRR, incRR, and incRR

^{+}

. Moreover, we show the impacts of p2HLs on processing reachability queries based on the state-of-the-art algorithms FL [8] and BFL

^{+}

[2], in terms of index size, index construction time, and query time. We implemented all algorithms using C++ and compiled them using G++ 6.2.0. All experiments were conducted on a desktop computer with an Intel(R) Core(TM) i5-1135G7 @ 2.4 GHz CPU, 16 GB memory, and Ubuntu 18.04.1 Linux OS. For algorithms that ran for

\geq 3

h or exceeded the memory limit (16 GB), we indicated their results as “–” in the tables.

Datasets:

Table 3 shows the statistics of 18 real datasets, where the first 6 were small datasets (|V| ≤ 100,000) downloaded from the same web page (https://code.google.com/archive/p/grail/downloads, accessed on 1 January 2020). The following 12 datasets were large (|V| > 100,000). These datasets have been widely used in recent works on processing reachability queries [1,2,3,4,5,7,8,9,14,23]. Among these datasets, human, anthra, agrocyc, ecoo, and vchocyc were graphs describing the genome and biochemical machinery of E. coli K-12 MG1655. The dataset email (http://snap.stanford.edu/data/index.html, accessed on 1 January 2020) is an email network. LJ is the online social network soc-LiveJournal1 (http://snap.stanford.edu/data/index.html, accessed on 1 January 2020). The dataset web is the web graph web-Google (https://code.google.com/p/ferrari-index/downloads/list, accessed on). In addition, arxiv, 10cit-Patent (http://pan.baidu.com/s/1bpHkFJx, accessed on 1 January 2020), 10citeseerx (http://pan.baidu.com/s/1bpHkFJx, accessed on 1 January 2020), 05cit-Patent (http://pan.baidu.com/s/1bpHkFJx, accessed on 1 January 2020), 05citeseerx (http://pan.baidu.com/s/1bpHkFJx, accessed on 1 January 2020), citeseerx (https://code.google.com/archive/p/grail/downloads, accessed on 1 January 2020), and patent (https://code.google.com/archive/p/grail/downloads, accessed on 1 January 2020) (cit-Patents) are all citation networks. The dataset dbpedia (http://pan.baidu.com/s/1c00Jq5E, accessed on 1 January 2020) is the knowledge graph Dbpedia. The dataset twitter (https://code.google.com/p/ferrari-index/downloads/list, accessed on 1 January 2020) is a DAG transformed from a large-scale social network with 55 million users and 1.96 billion edges [25]. The dataset web-uk (https://code.google.com/p/ferrari-index/downloads/list, accessed on 1 January 2020) is a DAG transformed from a web graph dataset with 133 million nodes and 5.5 billion edges. The statistics in Table 3 are those of the DAGs.

Table 3. Statistics of datasets, where

d = 2 | E | / | V |

is the average degree of G,

| T C (\cdot) |

is the average number of reachable nodes for nodes of G, and

n_{t}

is the number of topological levels (the length of the longest path) of G.

6.1. RR Computation

RR and Index Size:

We show RR and the index size ratio (ISR) of these datasets in Figure 5, where ISR denotes the ratio of the size of p2HLs over the size of the 2-hop labels, with respect to all nodes. From Figure 5, we observed the following.

Figure 5. Comparison of the Reachability Ratio (RR) and Index-Size Ratio (ISR), where ISR is the ratio of the index size of p2HLs, with respect to k hop-nodes, over the total 2-hop label size with respect to all nodes.

First, all datasets could be classified into three categories according to their RRs. The first (D1) included email, LJ, web, citeseerx, dbpedia, twitter, and web-uk, for which the RR was more than 99%, even when

k = 1

, and both the RR and ISR did not change significantly with the increase in k. The second type (D2) included human, anthra, agrocyc, ecoo, vchocyc, and arxiv, for which both the RR and ISR increased with the increase in k. The third type (D3) included 10cit-Patent, 10citeseerx, 05cit-Patent, 05citeseerx, and patent, for which both the RR and ISR were very small or even approached zero and did not change significantly with the increase in k. The value of k, therefore, only affected the second type of dataset, and the RR was more than 80% when

k \geq 16

for all datasets of the second type, which indicated that reachability query processing could benefit from using p2HLs on datasets of both the first and second types.

Second, the storage space used to maintain p2HLs was small relative to the RR achieved. For example, for the first type of dataset, p2HLs used about

1 / 4

of the space required by the full 2-hop labels (ISR

\approx 25

%) to maintain more than 99% (RR

> 99

%) of the reachability information.

Operational Time:

Figure 6 shows the comparison of the operational time for the RR computation, from which we knew that incRR

^{+}

was much faster than both blRR and incRR, on all datasets, and incRR operated faster than blRR on most datasets. For example, incRR

^{+}

was faster than blRR by more than two or three orders of magnitude on most datasets, and incRR was ten times faster than blRR on email, LJ, web, citeseerx, and dbpedia. According to the second-to-last column of Table 3, the average number of reachable nodes was usually high. Therefore, the number of reachability queries tested by blRR was very high. Although incRR could reduce the number of tested reachability queries, it still needed to test significantly more reachability queries than incRR

^{+}

. Neither blRR nor incRR could obtain the value of the RR on both twitter and web-uk for

k \geq 2

in a limited time frame (3 h), due to testing too many reachability queries.

Figure 6. Operational time of different algorithms for RR computation (ms).

Based on the above results, we knew that our incRR

^{+}

algorithm could be used to efficiently compute the RR for a given dataset, which allowed us to determine whether we could use p2HLs to facilitate processing reachability queries.

6.2. Processing Reachability Queries

We combined p2HLs with FELINE [8] (abbr. FL) and BFL

^{+}

[2] to show the impact of p2HLs on processing reachability queries in Table 4, Table 5 and Table 6, where FL-k (BFL

^{+}

-k) denotes the FL (BFL

^{+}

) algorithm combined with the p2HLs generated using k hop-nodes by Algorithm 3. Note that we did not set

k = 1, 2, 4, 8

, because when

k = 16

, we only needed to use one integer as a bit-vector for each node v to represent both

L_{o u t}^{16} (v)

and

L_{i n}^{16} (v)

.
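The bit-vector idea for k = 16 can be sketched in Python as follows. This is an illustration under assumptions, not the layout used in the authors' implementation: the low 16 bits of one 32-bit word are assumed to hold the out-label and the high 16 bits the in-label, hop-nodes are numbered 0 to 15, and the names `pack` and `covered` are hypothetical.

```python
K = 16
MASK = (1 << K) - 1  # selects the low 16 bits (the out-label part)


def pack(out_hops, in_hops):
    """Encode L_out(v) and L_in(v) for up to 16 hop-nodes in one 32-bit integer."""
    word = 0
    for h in out_hops:
        word |= 1 << h          # bit h: hop-node h is in L_out(v)
    for h in in_hops:
        word |= 1 << (K + h)    # bit 16 + h: hop-node h is in L_in(v)
    return word


def covered(word_u, word_v):
    """True if L_out(u) and L_in(v) share a hop-node, i.e., u reaches v via it."""
    return ((word_u & MASK) & (word_v >> K)) != 0


u = pack(out_hops={0, 3}, in_hops=set())
v = pack(out_hops=set(), in_hops={3})
print(covered(u, v), covered(v, u))  # True False
```

A query is then answered with one bitwise AND, which is why a single machine word per node suffices when k = 16.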

Table 4. Comparison of the index size (MB).

Table 5. Comparison of the index construction time (ms).

Table 6. Comparison of the query time (ms).

Index Size:

Table 4 shows the impact of k on index size, from which we found that with an increase in k, the index size increased accordingly. For example, for the web-uk dataset, the index size of FL-128 was more than two times the size of FL-0. The reason was that the larger the value of k, the more space was required to maintain p2HLs.

Index Construction Time:

Table 5 shows the impact of k on the index construction time, from which we found that with an increase in k, we required more time for index construction. The time for generating p2HLs was usually much less than the index construction time for FL, i.e., FL-0, and the increased time for index construction was negligible. As a comparison, the index of BFL

^{+}

, i.e., BFL

^{+}

-0, could be constructed very efficiently, so the time used for p2HLs construction was usually one to two times longer than the index construction time of BFL

^{+}

. However, the whole index construction was still less than 5 s for all the tested graphs.

Query Time:

We reported the query time on an equal workload, which contained 1,000,000 reachability queries for each dataset. The equal workload consisted of 50% reachable queries and 50% unreachable queries. We used equal workloads because completely random queries would be heavily skewed towards unreachable queries [9,10], which is unlikely in a real workload, as the node pair in a query tends to have a certain connection [11]. Here, unreachable queries were generated by sampling node pairs with the same probability and testing each pair using the FL algorithm, until we reached the required number of unreachable queries. For reachable queries, we could not choose them randomly by sampling the TC because, unlike the TC size computation, the TC computation suffered from high time and space complexity, and we could not obtain the TC within the limited time frame and available memory size for large graphs. To address this problem, we randomly selected a node u in each iteration, and then repeatedly moved to a randomly selected out-neighbor v until v had no out-neighbors available. Then, we had a path p from u. Finally, we randomly selected a node

v \neq u

from p to obtain a reachable query

u ⇝ v

. This operation was continued until we reached the required number of reachable queries.
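The workload generation described above can be sketched in Python as follows. This is a minimal sketch under assumptions: the graph is a DAG stored as a successor dictionary, a plain DFS stands in for the FL index when testing unreachable candidates, and the function names and the toy graph are hypothetical.

```python
import random


def sample_reachable(succ, nodes, rng):
    """Random walk from a random u down to a sink, then pick v on the walk."""
    while True:
        u = rng.choice(nodes)
        path, v = [], u
        while succ.get(v):               # stop when v has no out-neighbors
            v = rng.choice(succ[v])
            path.append(v)
        if path:                         # u had at least one out-neighbor
            return u, rng.choice(path)   # in a DAG, every node on the walk differs from u


def reaches(succ, u, v):
    """Plain DFS, standing in for the FL index used in the paper."""
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in succ.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def sample_unreachable(succ, nodes, rng):
    """Rejection sampling of uniformly chosen node pairs."""
    while True:
        u, v = rng.choice(nodes), rng.choice(nodes)
        if not reaches(succ, u, v):
            return u, v


succ = {1: [2], 5: [2], 2: [3, 4], 6: [7]}   # hypothetical toy DAG
nodes = [1, 2, 3, 4, 5, 6, 7]
rng = random.Random(0)
reachable = [sample_reachable(succ, nodes, rng) for _ in range(50)]
unreachable = [sample_unreachable(succ, nodes, rng) for _ in range(50)]
workload = reachable + unreachable           # equal workload: 50% / 50%
```

Note that the random walk does not sample reachable pairs uniformly from the TC; it only avoids materializing the TC.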

We show the comparison of query time from FL-0 to FL-128 and from BFL

^{+}

-0 to BFL

^{+}

-128 in Table 6, from which we observed the following.

First, FL-16 (BFL

^{+}

-16) and FL-32 (BFL

^{+}

-32) usually required the least amount of time on the first type of datasets D1, including email, LJ, web, citeseerx, dbpedia, twitter, and web-uk, where the RR was more than 99% even when

k = 1

. For these datasets, although the index size increased along with the index construction time, as compared to FL-0 (BFL

^{+}

-0), only a small amount of additional resources was needed to achieve significant improvements. For example, when compared with FL-0, FL-16 used about 1.2-times the index size and 1.4-times the index construction time to achieve a more than 30-fold improvement in query time on the citeseerx dataset. Furthermore, when compared with BFL

^{+}

-0, BFL

^{+}

-16 consumed 1.1-times the index size and 2.3-times the index construction time to achieve a more than 15-fold improvement in query time on the citeseerx dataset.

Second, FL-128 (BFL

^{+}

-128) suffered from the largest index size (about 2–3-times larger than FL-0 (BFL

^{+}

-0)) and longest index construction time (about 1–3-times longer than FL-0 (BFL

^{+}

-0)), but it usually achieved a better query performance on the second type of datasets D2, including human, anthra, agrocyc, ecoo, vchocyc, and arxiv. On these datasets, the RR increased along with the increase in k. From Figure 5, we found that for human, anthra, agrocyc, ecoo, and vchocyc, the RR was greater than 95% when

k \geq 16

, and thus, both BFL

^{+}

-16 and BFL

^{+}

-32 usually performed better. For arxiv, when

k = 32

, the RR was still less than 90%, indicating that FL-128 and BFL

^{+}

-128 performed the best.

Third, for the third type of datasets D3, including 10cit-Patent, 10citeseerx, 05cit-Patent, 05citeseerx, and patent, the RR was very small or even approached zero, and it did not significantly change with an increase in k. For these datasets, FL-0 and BFL

^{+}

-0 usually performed better, and the use of p2HLs did not yield positive results. For example, the index size of FL-128 (BFL

^{+}

-128) was 2.6 (2.1)-times larger than that of FL-0 (BFL

^{+}

-0) on 10cit-Patent, and the index construction time of FL-128 (BFL

^{+}

-128) was 1.3 (2.5)-times longer than that of FL-0 (BFL

^{+}

-0) on 10cit-Patent. This extra cost, however, resulted in longer query time. The query time of FL-128 (BFL

^{+}

-128) was 1.3 (5.5)-times longer than that of FL-0 (BFL

^{+}

-0) on 10cit-Patent.

Finally, we selected one dataset from each type and showed the trend of its query time with respect to k in Figure 7. Based on these results, we provide the following suggestions for applying p2HLs: (1) For the first type of datasets, D1, we highly recommend using p2HLs with $k = 16$ to process reachability queries, because this could speed up reachability queries significantly with a minimal increase in index size and index construction time. (2) For the second type of datasets, D2, we also recommend using p2HLs, because they could speed up reachability queries. However, the choice of k depends on the resources available for the increases in index size and index construction time. In general, the larger the value of k, the shorter the query time, but the longer the index construction time and the larger the index size. (3) For the third type of datasets, D3, we do not recommend using p2HLs to process reachability queries.
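The three suggestions above can be read as a simple decision rule on the measured RR. The sketch below is only an illustration of that reading: the function name `recommend_k` and the thresholds `high_rr` and `low_rr` are assumptions made here, since the text does not state numeric cut-offs between D1, D2, and D3.

```python
def recommend_k(rr_by_k, high_rr=0.95, low_rr=0.05):
    """Suggest a number of hop-nodes k from measured reachability ratios.

    rr_by_k maps k to the RR (in [0, 1]) obtained with k hop-nodes.
    The thresholds are hypothetical; the paper gives no numeric cut-offs.

    Returns 16 for D1-like graphs, 0 (no p2HLs) for D3-like graphs, and
    None for D2-like graphs, where k must be tuned against the available
    budget for index size and index construction time.
    """
    if not rr_by_k:
        raise ValueError("rr_by_k must not be empty")
    if max(rr_by_k.values()) <= low_rr:
        return 0      # RR stays near zero for every k: skip p2HLs
    if rr_by_k.get(16, 0.0) >= high_rr:
        return 16     # RR is already large at k = 16
    return None       # RR keeps growing with k: choose k by budget


# Illustrative RR values, not measurements from the paper.
print(recommend_k({16: 0.99, 32: 0.995}))   # 16
print(recommend_k({16: 0.01, 128: 0.02}))   # 0
print(recommend_k({16: 0.50, 128: 0.80}))   # None
```

Returning `None` for the middle case keeps the rule honest: for such graphs the text leaves the final value of k to the user's resource constraints.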

Figure 7. Impacts of k on query time (ms) over different datasets.

7. Conclusions

Using p2HLs for processing reachability queries can be useful for graphs with a large RR, where p2HLs answer most reachable queries. For graphs with a small RR, however, p2HLs may not be as powerful as expected and could even degrade query performance. In this paper, we addressed the important problem of deciding how to use p2HLs for processing reachability queries in a given graph. To solve this problem, we proposed RR-aware p2HLs, formally defined the RR problem, and proposed a set of algorithms for efficient RR computation. Our initial experimental results demonstrated that our optimized algorithm could efficiently compute the RR for a given graph, making it possible for users to determine whether p2HLs should be used. Our subsequent experimental results revealed that the query performance of combining p2HLs with state-of-the-art algorithms varied with the number of hop-nodes k. Based on these results, we provided recommendations to guide the use of p2HLs for processing reachability queries: (1) for datasets with a large RR, p2HLs should be used with $k = 16$; (2) for datasets with a small RR, we do not recommend using p2HLs; and (3) for the remaining datasets, p2HLs can be used, and users can determine the value of k based on their requirements for index size, index construction time, and query time.

Author Contributions

Conceptualization, X.T. and J.Z.; methodology, X.T. and J.Z.; software, X.T.; validation, J.Z. and Y.S.; formal analysis, J.Z.; investigation, X.T.; resources, X.T.; data curation, X.L.; writing (original draft preparation), X.T. and J.Z.; writing (review and editing), X.T., X.L. and L.K.; funding acquisition, X.T. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by grants from the Natural Science Foundation of Shanghai (No. 20ZR1402700) and from the Natural Science Foundation of China (Nos. 61472339 and 61873337).

Data Availability Statement

All datasets used in this study are publicly available and are discussed in Section 6. They are also available from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Wei, H.; Yu, J.X.; Lu, C.; Jin, R. Reachability querying: An independent permutation labeling approach. Proc. VLDB Endow. 2014, 7, 1191–1202.
Su, J.; Zhu, Q.; Wei, H.; Yu, J.X. Reachability querying: Can it be even faster? IEEE Trans. Knowl. Data Eng. 2017, 29, 683–697.
Jin, R.; Wang, G. Simple, fast, and scalable reachability oracle. Proc. VLDB Endow. 2013, 6, 1978–1989.
Zhu, A.D.; Lin, W.; Wang, S.; Xiao, X. Reachability queries on large dynamic graphs: A total order approach. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, 22–27 June 2014; pp. 1323–1334.
Cheng, J.; Huang, S.; Wu, H.; Fu, A.W. TF-label: A topological-folding labeling scheme for reachability querying in a large graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 193–204.
Cohen, E.; Halperin, E.; Kaplan, H.; Zwick, U. Reachability and distance queries via 2-hop labels. SIAM J. Comput. 2003, 32, 1338–1355.
Seufert, S.; Anand, A.; Bedathur, S.J.; Weikum, G. FERRARI: Flexible and efficient reachability range assignment for graph indexing. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, QLD, Australia, 8–12 April 2013; pp. 1009–1020.
Veloso, R.R.; Cerf, L.; Junior, W.M.; Zaki, M.J. Reachability queries in very large graphs: A fast refined online search approach. In Proceedings of the 17th International Conference on Extending Database Technology (EDBT), Athens, Greece, 24–28 March 2014; pp. 511–522.
Yildirim, H.; Chaoji, V.; Zaki, M.J. GRAIL: A scalable index for reachability queries in very large graphs. VLDB J. 2012, 21, 509–534.
Yildirim, H.; Chaoji, V.; Zaki, M.J. GRAIL: Scalable reachability index for large graphs. Proc. VLDB Endow. 2010, 3, 276–284.
Jin, R.; Ruan, N.; Dey, S.; Yu, J.X. SCARAB: Scaling reachability computation on large graphs. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, 20–24 May 2012; pp. 169–180.
Tang, X.; Chen, Z.; Li, K.; Liu, X. Efficient computation of the transitive closure size. Clust. Comput. 2019, 22, 6517–6527.
Tarjan, R.E. Depth-first search and linear graph algorithms. SIAM J. Comput. 1972, 1, 146–160.
Yano, Y.; Akiba, T.; Iwata, Y.; Yoshida, Y. Fast and scalable reachability queries on graphs by pruned labeling with landmarks and paths. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013; pp. 1601–1606.
Jin, R.; Peng, Z.; Wu, W.; Dragan, F.F.; Agrawal, G.; Ren, B. Parallelizing pruned landmark labeling: Dealing with dependencies in graph algorithms. In Proceedings of the ICS ’20: 2020 International Conference on Supercomputing, Barcelona, Spain, 29 June–2 July 2020; Ayguadé, E., Hwu, W.W., Badia, R.M., Hofstee, H.P., Eds.; ACM: New York, NY, USA, 2020; pp. 11:1–11:13.
Li, W.; Qiao, M.; Qin, L.; Zhang, Y.; Chang, L.; Lin, X. Scaling distance labeling on small-world networks. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1060–1077.
Du, M.; Yang, A.; Zhou, J.; Tang, X.; Chen, Z.; Zuo, Y. HT: A novel labeling scheme for k-hop reachability queries on DAGs. IEEE Access 2019, 7, 172110–172122.
Sengupta, N.; Bagchi, A.; Ramanath, M.; Bedathur, S. ARROW: Approximating reachability using random walks over web-scale graphs. In Proceedings of the 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, 8–11 April 2019; pp. 470–481.
Peng, Y.; Lin, X.; Zhang, Y.; Zhang, W.; Qin, L. Answering reachability and k-reach queries on large graphs with label constraints. VLDB J. 2022, 31, 101–127.
Peng, Y.; Zhang, Y.; Lin, X.; Qin, L.; Zhang, W. Answering billion-scale label-constrained reachability queries within microsecond. Proc. VLDB Endow. 2020, 13, 812–825.
Wen, D.; Yang, B.; Zhang, Y.; Qin, L.; Cheng, D.; Zhang, W. Span-reachability querying in large temporal graphs. VLDB J. 2022, 31, 629–647.
Simon, K. An improved algorithm for transitive closure on acyclic digraphs. Theor. Comput. Sci. 1988, 58, 325–346.
Zhou, J.; Zhou, S.; Yu, J.X.; Wei, H.; Chen, Z.; Tang, X. DAG reduction: Fast answering reachability queries. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, 14–19 May 2017; Salihoglu, S., Zhou, W., Chirkova, R., Yang, J., Suciu, D., Eds.; ACM: New York, NY, USA, 2017; pp. 375–390.
Zhou, J.; Yu, J.X.; Li, N.; Wei, H.; Chen, Z.; Tang, X. Accelerating reachability query processing based on DAG reduction. VLDB J. 2018, 27, 271–296.
Cha, M.; Haddadi, H.; Benevenuto, F.; Gummadi, P.K. Measuring user influence in Twitter: The million follower fallacy. In Proceedings of the ICWSM, Washington, DC, USA, 23–26 May 2010.

Figure 1. Reachability ratio (RR) of p2HLs based on k nodes with a large degree.

Figure 2. The relationship between different hop-nodes, where (a) $v_{i}$ is the first node, (b) $v_{i-1} \not\rightsquigarrow v_{i}$ and $v_{i} \not\rightsquigarrow v_{i-1}$, and (c) $v_{i-1} \rightsquigarrow v_{i}$.

Figure 3. A sample DAG G.

Figure 4. Running status of the two hash tables $H_{A}$ and $H_{D}$.

Figure 5. Comparison of the Reachability Ratio (RR) and the Index-Size Ratio (ISR), where the ISR is the ratio of the index size of the p2HLs built from k hop-nodes to the total 2-hop label size built from all nodes.

Figure 6. Operational time of different algorithms for RR computation (ms).

Table 1. The p2HLs $L_{1}, L_{2}, L_{3}$ constructed based on $S_{1} = \{v_{1}\}$, $S_{2} = \{v_{1}, v_{2}\}$, and $S_{3} = \{v_{1}, v_{2}, v_{3}\}$, respectively.

Node	$L_{1}$		$L_{2}$		$L_{3}$
Node	$L_{out}^{1}(v)$	$L_{in}^{1}(v)$	$L_{out}^{2}(v)$	$L_{in}^{2}(v)$	$L_{out}^{3}(v)$	$L_{in}^{3}(v)$
$v_{1}$	1	1	1	1	1	1
$v_{2}$		1	2	1, 2	2	1, 2
$v_{3}$			2		2, 3	3
$v_{4}$	1		1		1, 3
$v_{5}$			2		2, 3
$v_{6}$	1		1		1, 3
$v_{7}$		1		1		1, 3
$v_{8}$						3
$v_{9}$		1		1		1, 3
$v_{10}$		1		1, 2		1, 2
$v_{11}$	1		1		1, 3
$v_{12}$			2		2
$v_{13}$		1		1, 2		1, 2
$v_{14}$						3
$v_{15}$		1		1, 2		1, 2
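To illustrate how the labels in Table 1 are used, the following minimal sketch applies the standard 2-hop test (u reaches v if the out-label of u and the in-label of v share a hop-node) to a few $L_{3}$ entries copied from the table. The function name `covered` and the dictionary layout are assumptions made for this example.

```python
# L_out^3(v) and L_in^3(v) for a few nodes, copied from Table 1
# (hop-nodes v1, v2, v3 are stored as the integers 1, 2, 3).
L_OUT = {"v3": {2, 3}, "v4": {1, 3}, "v12": {2}}
L_IN = {"v7": {1, 3}, "v8": {3}, "v9": {1, 3}, "v10": {1, 2}}


def covered(u, v):
    """True if the partial labels prove that u reaches v via a shared hop-node.

    False only means the p2HLs cannot confirm reachability; the query must
    then be answered by another method (the paper combines p2HLs with FL
    and BFL+).
    """
    return bool(L_OUT.get(u, set()) & L_IN.get(v, set()))


print(covered("v4", "v7"))    # True: shared hop-nodes 1 and 3
print(covered("v3", "v8"))    # True: shared hop-node 3
print(covered("v12", "v10"))  # True: shared hop-node 2
print(covered("v12", "v9"))   # False: {2} and {1, 3} are disjoint
```

The RR counts exactly the reachable pairs for which this test returns True, relative to all reachable pairs in the graph.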

Table 2. The status of set IDs for all nodes.

Node	$v_{1}$		$v_{2}$		$v_{3}$
Node	$id_{A}(v)$	$id_{D}(v)$	$id_{A}(v)$	$id_{D}(v)$	$id_{A}(v)$	$id_{D}(v)$
$v_{1}$	1	1	1	1	1	1
$v_{2}$		1	2	2	2	2
$v_{3}$			2		3	3
$v_{4}$	1		1		4
$v_{5}$			2		3
$v_{6}$	1		1		4
$v_{7}$		1		1		4
$v_{8}$						3
$v_{9}$		1		1		4
$v_{10}$		1		2		2
$v_{11}$	1		1		4
$v_{12}$			2		2
$v_{13}$		1		2		2
$v_{14}$						3
$v_{15}$		1		2		2

Table 3. Statistics of datasets, where $d = 2|E|/|V|$ is the average degree of G, $|TC(\cdot)|$ is the average number of nodes reachable from a node of G, and $n_{t}$ is the number of topological levels (the length of the longest path) of G.

Dataset	$|V|$	$|E|$	$d$	$|TC(\cdot)|$	$n_{t}$
human	38,811	39,576	2.04	9	18
anthra	12,499	13,104	2.10	12	16
agrocyc	12,684	13,408	2.11	13	16
ecoo	12,620	13,350	2.12	14	22
vchocyc	9491	10,143	2.14	14	21
arxiv	6000	66,707	22.24	928	167
email	231,000	223,004	1.93	11,698	7
LJ	971,232	1,024,140	2.11	206,907	24
web	371,764	517,805	2.79	55,055	34
10cit-Patent	1,097,775	1,651,894	3.01	3	7
10citeseerx	770,539	1,501,126	3.90	70	36
05cit-Patent	1,671,488	3,303,789	3.95	8	12
05citeseerx	1,457,057	3,002,252	4.12	116	36
citeseerx	6,540,401	15,011,260	4.59	15,510	59
dbpedia	3,365,623	7,989,191	4.75	83,659	146
patent	3,774,768	16,518,947	8.75	1544	32
twitter	18,121,168	18,359,487	2.03	1,346,820	22
web-uk	22,753,644	38,184,039	3.36	3,417,930	2793
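As a quick sanity check of the definition $d = 2|E|/|V|$ in the caption of Table 3, the sketch below recomputes the average degree for four rows of the table. The function name `average_degree` is illustrative.

```python
# (|V|, |E|) pairs copied from Table 3.
datasets = {
    "human": (38_811, 39_576),
    "arxiv": (6_000, 66_707),
    "email": (231_000, 223_004),
    "patent": (3_774_768, 16_518_947),
}


def average_degree(num_nodes, num_edges):
    """Average degree d = 2|E| / |V|, counting both endpoints of each edge."""
    return 2 * num_edges / num_nodes


for name, (num_nodes, num_edges) in datasets.items():
    print(f"{name}: d = {average_degree(num_nodes, num_edges):.2f}")
# human: d = 2.04
# arxiv: d = 22.24
# email: d = 1.93
# patent: d = 8.75
```

The recomputed values match the d column of Table 3 for these rows.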

Table 4. Comparison of the index size (MB).

Dataset	FL-0	FL-16	FL-32	FL-64	FL-128	BFL$^{+}$-0	BFL$^{+}$-16	BFL$^{+}$-32	BFL$^{+}$-64	BFL$^{+}$-128
human	0.74	0.89	1.04	1.33	1.92	1.09	1.23	1.38	1.68	2.27
anthra	0.24	0.29	0.33	0.43	0.62	0.36	0.41	0.45	0.55	0.74
agrocyc	0.24	0.29	0.34	0.43	0.63	0.36	0.41	0.46	0.56	0.75
ecoo	0.24	0.29	0.34	0.43	0.62	0.36	0.41	0.46	0.56	0.75
vchocyc	0.18	0.22	0.25	0.32	0.47	0.27	0.31	0.35	0.42	0.56
arxiv	0.11	0.14	0.16	0.20	0.30	0.25	0.27	0.29	0.34	0.43
email	4.41	5.29	6.17	7.93	11.45	6.39	7.27	8.15	9.92	13.44
LJ	18.52	22.23	25.93	33.34	48.16	27.26	30.96	34.67	42.08	56.90
web	7.09	8.51	9.93	12.76	18.43	11.52	12.94	14.35	17.19	22.86
10cit-Patent	20.94	25.13	29.31	37.69	54.44	32.16	36.35	40.54	48.91	65.66
10citeseerx	14.70	17.64	20.58	26.45	38.21	22.34	25.28	28.22	34.10	45.85
05cit-Patent	31.88	38.26	44.63	57.38	82.89	51.56	57.94	64.32	77.07	102.57
05citeseerx	27.79	33.35	38.91	50.02	72.25	42.11	47.67	53.23	64.35	86.58
citeseerx	124.75	149.70	174.65	224.55	324.34	185.10	210.05	235.00	284.90	384.70
dbpedia	64.19	77.03	89.87	115.55	166.90	109.16	122.00	134.84	160.52	211.87
patent	72.00	86.40	100.80	129.60	187.19	132.91	147.31	161.71	190.51	248.11
twitter	345.63	414.76	483.89	622.14	898.65	475.78	544.91	614.03	752.29	1028.79
web-uk	433.99	520.79	607.59	781.18	1128.38	654.25	741.05	827.84	1001.44	1348.63

Table 5. Comparison of the index construction time (ms).

Dataset	FL-0	FL-16	FL-32	FL-64	FL-128	BFL$^{+}$-0	BFL$^{+}$-16	BFL$^{+}$-32	BFL$^{+}$-64	BFL$^{+}$-128
human	9.01	11.63	13.12	13.27	13.29	0.92	3.52	3.64	4.47	4.54
anthra	2.89	3.53	4.20	4.26	4.39	0.31	1.08	1.16	1.18	1.21
agrocyc	2.96	3.80	4.15	4.72	4.48	0.25	1.11	1.21	1.26	1.47
ecoo	3.08	3.78	4.42	6.56	4.51	0.25	1.11	1.24	1.31	1.61
vchocyc	2.22	2.79	3.58	3.89	4.07	0.21	0.86	0.89	0.93	0.99
arxiv	4.71	5.93	7.65	8.57	9.76	1.05	3.45	3.66	4.32	5.00
email	81.3	87.9	101.8	97.2	97.9	8.7	27.6	27.9	29.8	30.4
LJ	325.1	381.7	438.3	430.7	442.2	43.5	135.7	138.1	138.3	139.0
web	178.5	215.8	241.7	240.6	245.5	33.2	90.2	93.4	91.2	96.5
10cit-Patent	801.3	1026.2	1084.9	1055.0	1085.0	189.4	448.9	451.7	454.2	466.7
10citeseerx	376.3	497.7	527.9	531.4	550.0	90.2	233.7	236.4	243.0	248.6
05cit-Patent	1517.8	1966.1	1989.4	2037.9	2048.2	417.8	937.8	954.9	975.9	974.0
05citeseerx	775.6	1032.4	1087.6	1076.4	1146.7	197.5	509.2	517.3	525.6	552.4
citeseerx	4063.9	5615.0	6061.8	6002.4	6123.3	1371.5	3179.8	3182.9	3221.6	3234.7
dbpedia	2264.3	2982.4	3200.0	3218.6	3209.9	595.8	1391.9	1396.7	1410.9	1436.2
patent	5022.8	7642.4	7818.6	7890.3	7862.5	2266.9	4725.9	4786.5	4833.9	4984.7
twitter	6287.6	7497.8	8287.8	8285.5	8770.7	1014.2	2760.4	2761.9	2791.2	2812.9
web-uk	8689.7	9968.6	11,138.9	11,185.5	11,559.9	1116.6	3204.8	3236.5	3374.6	3375.9

Table 6. Comparison of the query time (ms).

Dataset	FL-0	FL-16	FL-32	FL-64	FL-128	BFL$^{+}$-0	BFL$^{+}$-16	BFL$^{+}$-32	BFL$^{+}$-64	BFL$^{+}$-128
human	24.6	4.4	4.1	4.4	5.0	15.5	7.9	8.5	8.4	10.1
anthra	26.2	5.4	4.6	5.0	5.3	13.3	6.8	6.9	6.9	7.5
agrocyc	24.5	5.5	5.3	5.0	5.6	16.3	7.7	7.5	6.9	8.2
ecoo	28.0	5.7	5.1	5.4	5.6	13.4	7.2	7.1	7.2	7.6
vchocyc	26.9	5.0	4.8	4.8	5.5	13.3	6.9	7.7	7.0	7.5
arxiv	479.1	153.4	144.2	140.7	140.3	177.4	37.3	36.3	35.6	32.6
email	82.2	12.4	13.2	13.9	20.5	47.6	23.9	23.0	23.4	27.7
LJ	108.7	27.7	25.0	27.4	33.5	58.0	31.9	30.7	34.5	37.9
web	142.9	34.2	31.7	34.9	41.0	76.5	32.7	32.3	33.3	37.0
10cit-Patent	155.8	169.2	175.0	189.2	196.4	96.2	112.7	135.3	147.1	145.4
10citeseerx	206.2	190.3	182.7	177.3	188.9	113.2	114.7	115.0	113.8	117.8
05cit-Patent	410.4	381.2	367.2	352.3	363.7	197.5	199.3	192.0	185.7	184.7
05citeseerx	250.8	218.8	217.2	207.0	220.1	130.9	136.1	131.2	130.3	138.1
citeseerx	2207.8	76.3	73.5	77.4	85.7	988.1	64.2	67.4	67.1	73.7
dbpedia	346.6	55.8	54.8	57.6	62.5	205.0	49.0	49.3	48.8	52.6
patent	13,716.5	12,584.4	12,629.1	12,026.3	12,021.3	3415.1	3388.7	3248.2	3204.3	3114.2
twitter	168.9	42.8	43.0	47.8	54.5	113.0	64.9	64.2	72.6	83.0
web-uk	256.9	115.0	127.5	136.6	157.6	126.0	86.6	91.2	87.5	100.9

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
