Graph Compression by BFS

The Web Graph is a large-scale graph that does not fit in main memory, so lossless compression methods have been proposed for it. This paper introduces a compression scheme that combines efficient storage with fast retrieval of the information in a node. The scheme exploits the properties of the Web Graph without assuming an ordering of the URLs, so that it may be applied to more general graphs. Tests on some datasets in common use achieve space savings of about 10% over existing methods.


Introduction
In recent years, many applications have been developed for retrieving information over the World Wide Web and for analyzing the structure of the underlying Web Graph, which currently contains more than 1 trillion different URLs (http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html). This large-scale graph has too many links to be stored in main memory, which forces several random seeks to a disk. Since disk access is about five orders of magnitude slower than main memory access, this leads to unacceptable retrieval times. To mitigate this problem, several compression techniques have been proposed for large graphs, aimed at reducing the number of bits per link required by the graph representation [1][2][3][4][5][6][7].
In this paper, we focus on the efficient storage of, and rapid access to, compressed graphs. In contrast to other techniques that rely on the lexicographic ordering of URLs, and are thus specifically tailored to the Web Graph, the scheme presented here does not need to refer to the URLs and may therefore be applied to graphs of a more general nature. The specific aim of our method is to produce a compressed graph supporting queries of the kind:
• For two input pages X and Y, does X have a hyperlink to page Y?
• For input page X, list the neighbours of X.
In summary, our method produces a compressed Web Graph featuring a high compression ratio, a short retrieval time for the adjacency list of a node, and fast testing of whether or not two pages share a hyperlink. The paper is organized as follows. Section 2 stipulates some notation and reviews previous work. Our method is presented in Section 3. Section 4 outlines the structure of a new universal code inspired by our method, to be further analyzed in a forthcoming paper. Finally, Section 5 documents the performance achieved.

Preliminaries
The Web Graph over some subset of URLs or pages is a directed graph in which each node u is a URL in the subset and an edge or link is directed from u to v whenever there is a hyperlink from u to v. Formally, the Web Graph is a directed graph $G = (V, E)$, where $V$ is the set of URL identifiers or indices and $E$ is the set of links between them. For any node $v \in V$, $A_v = \{u_1, \ldots, u_k \mid u_i \in V\}$ will denote the adjacency list of $v$. We will assume that the identifiers in each list appear sorted in some linear order.
From the standpoint of compression, one convenient way to assign indices is to sort the URLs lexicographically and then to give each page its corresponding rank in the ordering. This induces two properties, namely:
• Locality: For a node with index i, most of its neighbours may be expected to have an index close to i; and
• Similarity: Pages with a lexicographically close index (for example, pages residing on the same host) may be expected to have many neighbours in common (that is, they are likely to reference each other).
These properties imply that the gap between the index $i$ of a page and the index $j$ of one of its neighbours is typically small. The approach followed in this paper orders nodes by a Breadth First Search (BFS) of the graph instead of the lexicographic order, while still retaining these features. In fact, Figure 1 (top) shows the corresponding distribution of the gaps between neighbours. This distribution follows a power law similar to the node degree distribution displayed in Figure 1 (bottom).
The problem of graph compression has been approached by several authors over the years, perhaps beginning with the paper [9]. Among the most recent works, Feder and Motwani [10] looked at graph compression from the algorithmic standpoint, with the goal of carrying out algorithms on compressed versions of a graph. Among the earliest works specifically devoted to web graph compression one finds papers by Adler and Mitzenmacher [1], and Suel and Yuan [6]. For quite some time the best compression performance was that achieved by the WebGraph algorithm by Boldi and Vigna (BV in the following) [3], which uses two parameters to compress the Web Graph $G = (V, E)$: the refer range $W$ and the maximum reference count $R$. For each node $v_i \in V$, BV represent a modified version of $A_{v_i}$ obtained from the adjacency list $A_{v_j}$ of another node $v_j \in V$ ($i - W \le j < i$), called the reference list (the value $i - j$ is called the reference number). The representation is composed of a sequence of bits (the copy list), which tells whether each neighbour of $v_j$ is also a neighbour of $v_i$, and the description of the extra nodes. The parameter $R$ is the maximum size of a reference chain. In fact, BV do not consider all the nodes $v_j$ with $i - W \le j < i$ to encode $v_i$, but only those that produce a chain not longer than $R$. BV also developed ζ codes [11], a family of universal codes used to compress power law distributions with small exponent.
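The copy-list idea can be illustrated with a minimal sketch. It only shows how an adjacency list splits into copy bits over a reference list plus extra nodes; it is not the actual BV format, which further compresses both parts. The function names and the sample lists are hypothetical.

```python
def copy_list_encode(reference, current):
    """Describe `current` as copy bits over `reference` plus the extra nodes."""
    current_set = set(current)
    # One bit per neighbour in the reference list: 1 if it is also in `current`.
    copy_bits = [1 if u in current_set else 0 for u in reference]
    reference_set = set(reference)
    # Neighbours of the current node that the reference list cannot supply.
    extras = [u for u in current if u not in reference_set]
    return copy_bits, extras


def copy_list_decode(reference, copy_bits, extras):
    """Rebuild the sorted adjacency list from the reference list and the code."""
    copied = [u for u, bit in zip(reference, copy_bits) if bit]
    return sorted(copied + extras)


# Hypothetical adjacency lists of two nearby nodes.
reference = [3, 5, 8, 13, 21]
current = [3, 8, 9, 21, 30]
bits, extras = copy_list_encode(reference, current)
assert copy_list_decode(reference, bits, extras) == current
```

The more the two lists overlap, the fewer extra nodes need to be described, which is why the similarity property matters for this kind of scheme.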
Claude and Navarro (CN) [4] proposed a modified version of Re-Pair [12] to compress the Web Graph. Re-Pair is an algorithm that builds a generative grammar for a string by hierarchically grouping frequent pairs into variables, frequent variable pairs into more variables, and so on. Along these lines, CN essentially applies Re-Pair to the string that results from the concatenation of the adjacency lists of the vertices. The data plots presented in [4] show that their method achieves a compression comparable to BV (at more than 4 bits per link), while retrieving the neighbours of a node faster.
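For readers unfamiliar with Re-Pair, the following is a minimal and deliberately naive sketch of the plain algorithm (not CN's modified version, and without the data structures that make Re-Pair efficient). It assumes integer symbols and a caller-supplied `next_symbol` larger than any node identifier; both names and the sample sequence are hypothetical.

```python
from collections import Counter


def repair(seq, next_symbol):
    """Repeatedly replace the most frequent adjacent pair with a new symbol."""
    seq = list(seq)
    rules = {}  # new symbol -> (left, right)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats any more
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules


def expand(seq, rules):
    """Invert `repair` by expanding grammar symbols left to right."""
    out, stack = [], list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return out


# Hypothetical concatenation of three adjacency lists.
original = [1, 2, 3, 1, 2, 3, 4, 2, 3, 4]
compressed, rules = repair(original, next_symbol=100)
assert expand(compressed, rules) == original
```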
Buehrer and Chellapilla [7] (BC) proposed a compression based on a method presented in [10] by Feder and Motwani; they search for recurring dense bipartite graphs (communities) and for each occurrence found they generate a new node, called a virtual node, that replaces the intra-links of the community (see Figure 2). In [13] Karande et al. showed that this method performs competitively on well-known algorithms, including PageRank [14,15]. In [2], Asano et al. obtained better compression results than BV and BC, but their technique does not permit a comparably fast access to the neighbours of a node. They compressed the intra-host links (that is, links between pages residing in the same host) by identifying through local indices six different types of blocks in the adjacency matrix, respectively dubbed: isolated 1-element, horizontal block, vertical block, L-shaped block, rectangular block, and diagonal block. Each block is represented by its first element, its type, and its size. The inter-host links are compressed by the same method used for the intra-host links, by resorting to ad-hoc "new local indices" (refer to [2] for details).
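The saving obtained from a virtual node is easy to see on a small example: a complete bipartite community with m sources and n targets needs m·n links, while routing them through one virtual node needs only m + n. The sketch below shows only this replacement step, under the assumption that the community is already known; finding communities is the hard part of BC and is not shown. All names and data are hypothetical.

```python
def add_virtual_node(edges, sources, targets, virtual):
    """Replace the complete bipartite links sources x targets by a star through `virtual`."""
    community = {(s, t) for s in sources for t in targets}
    if not community <= edges:
        raise ValueError("sources x targets is not a complete bipartite subgraph")
    star = {(s, virtual) for s in sources} | {(virtual, t) for t in targets}
    return (edges - community) | star


# Hypothetical graph: 3 sources all linking to 4 targets, plus one unrelated link.
sources, targets = [0, 1, 2], [3, 4, 5, 6]
edges = {(s, t) for s in sources for t in targets} | {(0, 7)}
compressed = add_virtual_node(edges, sources, targets, virtual=99)
assert len(edges) == 13 and len(compressed) == 8  # 12 links become 3 + 4
```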

Encoding by BFS
Our compression method is based on the topological structure of the Web Graph rather than on the underlying URLs. Instead of assigning indices to nodes based on the lexicographical ordering of their URLs, we perform a breadth-first traversal of G and index each node according to the order in which it is expanded. We refer to this process, and the compression it induces, as Phase 1. Following this, we compress in Phase 2 all of the remaining links.
During the traversal of Phase 1, when expanding a node $v_i \in V$, we assign consecutive integer indices to its $k_i$ (not yet expanded) neighbours, and also store the value of $k_i$. Once the traversal is over, all the links that belong to the breadth-first tree are encoded in the sequence $\{k_1, k_2, \ldots, k_{|V|}\}$, which we call the traversal list. In our experiments, the traversal allows us to remove almost $|V| - 1$ links from the graph. Figure 3 shows an example of Phase 1: the graph with the indices assigned to nodes is displayed in Figure 3(a), while Figure 3(b) shows the links remaining after the BFS and, below them, the traversal list.
The compression ratio achieved by the present method is affected by the index assignment of the BFS. In [16], Chierichetti et al. showed that finding an optimal assignment that minimizes [...]
We now separately compress consecutive chunks of $l$ nodes, where $l$ is a prudently chosen value for what we call the compression level. Each compressed chunk is prefixed with the items of the traversal list that pertain to the nodes in the chunk: that is, assuming that the chunk $C$ consists of the nodes $v_i, \ldots, v_{i+l-1}$, the compressed representation of $C$ is prefixed by the sequence $\{k_i, \ldots, k_{i+l-1}\}$.
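Phase 1 can be sketched in a few lines. This is an illustrative reading of the description above, not the authors' implementation: indices start at 0 rather than 1, the graph is given as a dictionary `adj` of out-neighbour lists, and the traversal restarts from the next unvisited node when the queue empties (an assumption made here to cover graphs that are not reachable from a single root).

```python
from collections import deque


def bfs_phase1(adj):
    """Return (index, traversal, remaining) for a graph given as node -> out-neighbours.

    index:     original node -> BFS index (0-based in this sketch)
    traversal: k_i for every node, in index order
    remaining: BFS index -> sorted BFS indices of the links left after removing tree links
    """
    index, traversal, tree = {}, [], set()
    for root in adj:
        if root in index:
            continue
        index[root] = len(index)
        queue = deque([root])
        while queue:
            v = queue.popleft()
            k = 0
            for u in adj[v]:
                if u not in index:       # first time u is reached: (v, u) is a tree link
                    index[u] = len(index)
                    tree.add((v, u))
                    queue.append(u)
                    k += 1
            traversal.append(k)          # nodes are expanded in index order
    remaining = {
        index[v]: sorted(index[u] for u in adj[v] if (v, u) not in tree)
        for v in adj
    }
    return index, traversal, remaining


# Hypothetical four-node graph.
adj = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": ["c"]}
index, traversal, remaining = bfs_phase1(adj)
print(traversal)   # tree links, one count per node
print(remaining)   # links left for Phase 2
```

In this example the traversal list accounts for three of the six links, that is, |V| − 1 of them, and only the other three remain to be encoded in Phase 2.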
In Phase 2, we encode the adjacency list $A_i$ of each node $v_i \in V$ in a chunk $C$ in increasing order. Each encoding consists of the integer gap between adjacent elements in the list and a type indicator chosen in the set {α, β, χ, φ}, needed in decoding. With $A_i^j$ denoting the jth element in $A_i$, we distinguish three main cases as follows. [...]
The types α and φ encode the gap with respect to the previous element in the list ($A_i^{j-1}$), while β and χ are given with respect to the element in the same position of the adjacency list of the previous node ($A_{i-1}^j$). When $A_{i-1}^j$ does not exist it is replaced by $A_k^j$, where $k$ ($k < i - 1$ and $v_k \in C$) is the closest index to $i$ for which the degree of $v_k$ is not smaller than $j$, or by a φ-type code in the event that even such a node does not exist. In the following, we will refer to an encoding by its type. Table 3 displays the encoding that results under these conventions for the adjacency list of Table 2.
As mentioned, our encoding ensures that two nodes connected by a link are likely to be assigned close index values. Moreover, since two adjacent nodes in the Web Graph typically share many neighbours, the adjacency lists will feature similar consecutive lines. This leads to the emergence of four types of "redundancies", the exploitation of which is described with the help of Table 4 and the list that accompanies it. We exploit the redundancies in the order in which they are listed, and if there is more than one box beginning at the same entry we choose the largest one.
In order to signal the third or fourth redundancy to the decoder we introduce a special character Σ, to be followed by a flag $\Sigma_F$ denoting whether the redundancy starting with this element is of type 3 ($\Sigma_F = 2$), 4 ($\Sigma_F = 3$), or both ($\Sigma_F = 1$).
For the third redundancy in our example we write "φ Σ 2 1 1", where φ identifies a φ-type encoding, 2 is the value of $\Sigma_F$, the first 1 is the gap, and the second 1 is the number of times that the element appears minus $L_{min}$ (2 in this example).
To represent the fourth redundancy both the width and the height of the box need encoding; thus in our example we can write "β Σ 3 0 7 5", where β is the code type, 3 is the value of $\Sigma_F$, 0 is the gap, 7 is the width minus 1, and 5 is the height minus 2.
When a third and a fourth type redundancy originate at the same entry, both are encoded in the format "type Σ 1 gap $l$ $w_b$ $h_b$", where the 1 is the value of $\Sigma_F$, $l$ is the number of identical elements on the same line starting from this element, and $w_b$ and $h_b$ are, respectively, the width and the height of the box.
Table 5 shows the encoding resulting from this treatment.
Table 5. An example of adjacency list encoding exploiting redundancies (columns: Lines, Degree, Links).
We observe that we do not need to explicitly write φ characters, which are implicit in [...], a condition easily testable at the decoder. We encode the characters α, β and χ, as well as $\Sigma_F$, by a Huffman code. Gaps, the special character Σ (an integer that does not appear as a gap) and other integers are encoded using the ad-hoc π-code described in the next section. When a gap $g$ could be negative (as with degrees), we encode $2g$ if $g$ is non-negative, and $2|g| - 1$ when $g < 0$.
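The mapping of possibly negative gaps onto non-negative integers can be checked with a small sketch. The function names are hypothetical; the mapping itself is the one stated above, with zero treated as non-negative.

```python
def to_unsigned(g):
    """Map a signed gap to a non-negative integer: 2g if g >= 0, else 2|g| - 1."""
    return 2 * g if g >= 0 else 2 * abs(g) - 1


def to_signed(m):
    """Invert `to_unsigned`: even values are non-negative gaps, odd values negative ones."""
    return m // 2 if m % 2 == 0 else -((m + 1) // 2)


# Small gaps of either sign map to small codes, and the mapping is reversible.
assert [to_unsigned(g) for g in (0, 1, -1, 2, -2)] == [0, 2, 1, 4, 3]
assert all(to_signed(to_unsigned(g)) == g for g in range(-50, 51))
```

Because the mapping interleaves positive and negative values, gaps of small magnitude receive short codewords regardless of sign.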

A Universal Code
In this section we briefly introduce π-codes, a new family of universal codes for the integers. This family is better suited than the δ- and ζ-codes [11,17] to the case of an entropy characterized by a power law distribution with an exponent close to 1. Let $n$ be a positive integer, $b$ its binary representation and $h = 1 + \lfloor \log_2 n \rfloor$. Having fixed a positive integer $k$, we represent $n$ using $k + h + \lceil h/2^k \rceil - 1$ bits. Specifically, say $h = 2^k l - c$ ($l > 0$ and $0 \le c < 2^k$), then the $\pi_k$-encoding of $n$ is produced by writing the unary representation of $l$, which is followed by the $k$ bits needed to encode $c$, and finally by the rightmost $h - 1$ bits of $b$.
For instance, the $\pi_2$-encoding of 21 is 01 11 0101: since the binary representation $b$ of 21 is 10101 and we can write $h = 2^2 \cdot 2 - 3 = 5$, the prefix of the encoding is the unary representation of $l = 2$, the 2 bits that follow indicate the value of $c = 3$, and the suffix is formed by the $h - 1$ least significant digits of $b$. In the following we use the approximation $H_P = E_P(\log_2 n)$ described in [17], where $H_P$ is the entropy of a distribution with probability $P$. The expected length for $n$ is: [...]
A code is universal [17] if the expected codeword length is bounded in the interval $0 < H_P < \infty$. Each π-code is universal according to: [...] and, for $k \to \infty$, it is also asymptotically optimal [17], that is: [...]
In the context of Web Graph compression we use a modified version of π-codes in which 0 is encoded by 1 and any other positive integer $n$ is encoded with a 0 followed by the π-code of $n$.
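The definition can be exercised with a short sketch that works on bit strings rather than packed bits. It assumes, as the example of 21 suggests, that the unary representation of l is written as l − 1 zeros followed by a one; the function names are hypothetical.

```python
def pi_encode(n, k):
    """pi_k-encoding of a positive integer n, returned as a string of '0'/'1'."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    b = bin(n)[2:]                    # binary representation of n
    h = len(b)                        # h = 1 + floor(log2 n)
    l = -(-h // 2 ** k)               # l = ceil(h / 2^k)
    c = 2 ** k * l - h                # so that h = 2^k * l - c with 0 <= c < 2^k
    unary = "0" * (l - 1) + "1"       # assumed unary convention
    return unary + f"{c:0{k}b}" + b[1:]


def pi_decode(bits, k):
    """Decode one pi_k codeword from the front of `bits`; return (n, bits consumed)."""
    l = bits.index("1") + 1
    c = int(bits[l:l + k], 2)
    h = 2 ** k * l - c
    end = l + k + h - 1
    return int("1" + bits[l + k:end], 2), end


def pi_encode_with_zero(n, k):
    """Modified code used for the graph: 0 -> '1', n > 0 -> '0' + pi_k(n)."""
    return "1" if n == 0 else "0" + pi_encode(n, k)


assert pi_encode(21, 2) == "01110101"          # the worked example: 01 11 0101
assert pi_decode("01110101", 2) == (21, 8)
```

The length of each codeword produced this way equals k + h + ⌈h/2^k⌉ − 1, in agreement with the bit count given in the definition.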

Experiments
Table 8 reports the sizes, expressed in bits per link, of our compressed graphs. We used datasets (datasets and WebGraph can be downloaded from http://webgraph.dsi.unimi.it/) collected by Boldi and Vigna [3] (salient statistics in Table 7). Many of these datasets were gathered using UbiCrawler [8] by different laboratories.
With a compression level $l = 10^4$ the present method yielded consistently better results than BV [3], BC [7] and Asano et al. [2]. The BV highest compression scores ($R = \infty$) are comparable to those we obtain at level 8, while those for general usage ($R = 3$) are comparable to our level 4. Table 8 also displays the results of the BV method using an ordering of the URLs induced by the BFS. This indicates that BV does not take advantage of the BFS ordering.
To randomly access the graph we need to store the offset of the first element of each chunk, but the results shown here do not account for these offsets. In fact, we do need $N/l$ offsets. BV use $l = 1$ in their tests, which are slowed down by about 50% when setting $l = 4$. In BC an offset per node is required. Asano et al. do not provide information about offsets. In order to recover the links of the BFS tree, we also need to store for the first node $u$ of each chunk the smallest index of a node $v$ such that $(u, v)$ belongs to the BFS tree. In total, this charges an extra $(b + k)/l$ bits per node, where $b$ bits are charged by the offset and $k$ by the index of the node. With $r$ the bits per link and $d$ the average degree of a graph, $b$ requires at most $2 + \log_2(lrd)$ bits and $k$ at most $2 + \log_2 l$ bits by Elias-Fano encoding [18,19]. Using the same encoding, BC and BV require $2 + \log_2(rd)$ bits per node to represent any offset. Since $(4 + \log_2(l^2 rd))/l < 2 + \log_2(rd)$, we need less memory to store this information. Currently, in our implementation we use 64 bits per node.
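The comparison of the two per-node overheads can be made concrete with a small numerical sketch. The bounds are the ones quoted above; the sample values of r and d are hypothetical round numbers chosen for easy arithmetic, not figures taken from the tables.

```python
import math


def chunk_overhead_bits_per_node(l, r, d):
    """Upper bound (b + k) / l on the extra bits per node with chunks of l nodes."""
    b = 2 + math.log2(l * r * d)   # offset of the chunk
    k = 2 + math.log2(l)           # index of the first BFS-tree child in the chunk
    return (b + k) / l


def per_node_offset_bits(r, d):
    """Bound 2 + log2(r d) on the bits per node when every node has its own offset."""
    return 2 + math.log2(r * d)


# Hypothetical graph: r = 2 bits per link, average degree d = 16, chunks of l = 8 nodes.
ours = chunk_overhead_bits_per_node(l=8, r=2, d=16)   # (10 + 5) / 8
theirs = per_node_offset_bits(r=2, d=16)              # 2 + 5
assert abs(ours - 1.875) < 1e-9 and abs(theirs - 7.0) < 1e-9
```

With these sample values the chunked layout needs under 2 bits per node against 7, and the gap widens as l grows because the overhead is amortized over the whole chunk.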
Table 9 displays the average times to retrieve the adjacencies of $2 \cdot 10^7$ random nodes. The tests ran on an Intel Core 2 Duo T7300 2.00 GHz with 2 GB of main memory and 4 MB of L2 cache under Linux 2.6.27. For the tests, we used the original Java implementation of BV (running under java version 1.6.0) and a C implementation of our method compiled with gcc (version 4.3.2 with the -O3 option). The performance of Java and C is comparable [20], but we turned off the garbage collector anyway to speed things up. Table 9 shows that the BV high compression mode is slower than our method, while the BV general usage version ($R = 3$) performs with comparable speed. However, settling for $l = 4$ our method becomes faster.
Table 9. Average times to retrieve the adjacency list of a node (BV [3] versus this paper).
By virtue of the underlying BFS, we can implement a fast query to check whether or not a link $(v_i, v_j)$ exists. In fact, we know that a node $v_i$ has $k_i$ links that belong to the BFS tree, say, to $v_a, \ldots, v_{a+k_i-1}$. We also know that $v_i$ does not have any link with a node $v_b$ where $b \ge a + k_i$, so we need to generate the adjacency list of $v_i$ only if $j < a$. Figure 4 displays the pseudocode of this query. Table 10 shows the average times to test the connectivity of $2 \cdot 10^7$ pairs of random nodes. The average time is less than 60% of the retrieval time.
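The query logic can be sketched as follows. This is an illustrative reading of the description, not the pseudocode of Figure 4: indices are 0-based, the whole graph is assumed to be reachable from a single root at index 0, the position a of the first tree child is recomputed by prefix sums over the traversal list, and `remaining` stands in for the decoded non-tree links of each node. All names and the sample data are hypothetical.

```python
from itertools import accumulate


def first_child_indices(traversal):
    """Index a that the first BFS-tree child of each node receives (single root at 0)."""
    return [1 + s for s in accumulate([0] + traversal[:-1])]


def has_link(i, j, first_child, traversal, remaining):
    """Check whether the directed link (v_i, v_j) exists."""
    a = first_child[i]
    if a <= j < a + traversal[i]:
        return True                  # (v_i, v_j) is a BFS-tree link
    if j >= a + traversal[i]:
        return False                 # v_i cannot link beyond its own tree children
    return j in remaining[i]         # only now is the adjacency list needed


# Hypothetical four-node example: traversal list and remaining (non-tree) links.
traversal = [2, 1, 0, 0]
remaining = {0: [], 1: [2], 2: [0], 3: [2]}
first_child = first_child_indices(traversal)
print(has_link(0, 1, first_child, traversal, remaining))  # tree link
print(has_link(0, 3, first_child, traversal, remaining))  # ruled out without decoding
print(has_link(1, 2, first_child, traversal, remaining))  # found among remaining links
```

Only the third query touches an adjacency list; the first two are answered from the traversal list alone, which is where the speed-up over full retrieval comes from.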
Finally, Figure 5 presents the actual main memory required by BV and by our method (top), and the space-time tradeoff (bottom). As said, we do not compress the offsets. The space requirement of our compression level 8 is 80% of that of BV at $R = 3$.

Conclusion
We have proposed a new way to compress the Web Graph and other graphs of comparable structure. In fact, we assume no a priori knowledge of the graph, and in contrast with previous works based on the lexicographic ordering of URLs we use a traversal to order nodes. The size of the compressed files is smaller than that of Asano et al. [2], considered the current state of the art. The average retrieval time is comparable to that of BV [3]. We also introduced a fast query to check whether two nodes are connected, without the need to generate an entire adjacency list. Future work shall extend the set of primitive queries for compressed graphs.

Figure 1. The distribution of gaps between neighbouring nodes (top) and of node degrees (bottom) in the dataset "in-2004" as gathered by Boldi and Vigna [3] using UbiCrawler [8].

Figure 2. The method by Buehrer and Chellapilla [7] compresses a complete bipartite graph (left) by introducing a virtual node (right).
Figure 3. (a) Indices assigned to the nodes.

Figure 4. Algorithm to check if the directed link (v i , v j ) exists.

Figure 5. Main memory usage (top) by BV and the present method and the space-time tradeoff (bottom).

Table 2. An adjacency list. It is assumed that the node $v_i$ is the first node of a chunk.

Table 3. Encoding of the adjacency list of Table 2.

The four types of redundancy, described with the help of Table 4, are as follows:
1. A run of identical lines is encoded by assigning a multiplier to the first line in the sequence;
2. Since there are intervals of constant node degrees (such as, for example, the block formed by two consecutive "9" in the table), the degrees of consecutive nodes are gap-encoded;
3. Whenever, for some suitably fixed $L_{min}$, there is a sequence of at least $L_{min}$ identical elements (such as the block of φ 1's in the table), this sequence is run-length encoded;
4. Finally, a box of identical rows (such as the biggest block in the table) exceeding a pre-set threshold size $A_{min}$ is run-length encoded.

Table 7. Statistics of datasets used for tests.

Table 8. Compressed sizes in bits per link.

Table 10. Average times to test adjacency between pairs of random nodes.