Structural Entropy of Stochastic Block Models

With the rapid expansion of graphs and networks and the growing magnitude of data from all areas of science, effective treatment and compression schemes for context-dependent data are highly desirable. A particularly interesting direction is to compress the data while keeping only the "structural information" and ignoring the concrete labelings. In this direction, Choi and Szpankowski introduced structures (unlabeled graphs), which allowed them to compute the structural entropy of the Erdős–Rényi random graph model. Moreover, they also provided a compression algorithm that asymptotically achieves this entropy limit and runs in expected linear time. In this paper, we consider stochastic block models with an arbitrary number of parts. We define a partitioned structural entropy for stochastic block models, which generalizes the structural entropy for unlabeled graphs and encodes the partition information as well. We then compute the partitioned structural entropy of stochastic block models and provide a compression scheme that asymptotically achieves this entropy limit.


Introduction
Shannon's metric of "entropy" of information is a foundational concept of information theory [39,9]. Given a discrete random variable X with support set (that is, the possible outcomes) x_1, x_2, ..., x_n, which occur with probabilities p_1, p_2, ..., p_n, the entropy of X is defined as H(X) = −∑_{i=1}^n p_i log p_i, where the logarithm here and throughout this paper is of base 2. Note that the entropy of X is a function of the probability distribution of X.
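As a quick illustration (not part of the original text), the definition translates directly into code; the function below is a minimal sketch.

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum_i p_i * log2(p_i), with 0*log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly one bit; a biased coin carries less.
fair = entropy([0.5, 0.5])      # 1.0 bit
biased = entropy([0.9, 0.1])    # about 0.469 bits
```

As the text notes, the value depends only on the probability distribution, not on the outcomes themselves.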
The entropy was originally created by Shannon in [33] as part of his theory of communication, where a data communication system consists of a data source X, a channel and a receiver. The fundamental problem of communication is for the receiver to reliably recover what data was generated by the source, based on the bits it receives through the channel. Shannon proved that the entropy of the source X plays a central role: in his source coding theorem it is shown that the entropy is the mathematical limit on how well the data can be losslessly compressed.
The question then arises: how to compress data that has structure, e.g., data in social networks? In Shannon's lesser-known 1953 paper [34] he argued for an extension of information theory, where data is considered as observations of a source, to "non-conventional data" (that is, lattices). Indeed, nowadays data appears in various formats and structures (e.g., sequences, expressions, interactions) and in drastically increasing amounts. In many scenarios, data is highly context-dependent, and in particular, the structural information and the context information seem to be two conceptually different aspects. Therefore it is desirable to develop novel theory and efficient algorithms for extracting useful information from non-conventional data structures. Roughly speaking, such data consists of structural information, which might be understood as the "shape" of the data, and context information, which should be recognized as data labels.
It is well-known that complex networks (e.g., social networks) admit community structures [26]. That is, users within a group interact with each other more frequently than those outside the group. The Stochastic Block Model (SBM) [13] is a celebrated random graph model that has been widely used to study community structures in graphs and networks. It provides a good benchmark to evaluate the performance of community detection algorithms and has inspired the design of many algorithms for community detection tasks. The theoretical underpinnings of the SBM have been extensively studied, and sharp thresholds for exact recovery have been successively established [2,19,3,11]. We refer readers to [1] for a recent survey, where other interesting and important problems in the SBM are also discussed.
In addition to the SBM model discussed in [1], there are other angles from which to study compression of data with graph structure. Asadi et al. [6] investigated data compression on graphs with clusters. Zenil et al. [40] surveyed information-theoretic methods, in particular Shannon entropy and algorithmic complexity, for characterizing graphs and networks.

Compression of graphs
In recent years, graphical data and the network structures supporting them have become increasingly common and important in many branches of engineering and the sciences. To better represent and transmit graphical data, many works consider the problem of compressing a (random) graph up to isomorphism, i.e., compressing the structure of a graph. A graph G consists of a finite set V of vertices and a set E of edges, each of which connects two vertices. A graph can be represented by a binary matrix (the adjacency matrix), which can further be viewed as a binary sequence. Thus, encoding a labeled graph (that is, one where all vertices are distinguished) is equivalent to encoding a (|V| choose 2)-digit binary sequence, given a probability distribution on all (|V| choose 2) possible edges. However, such a string does not reflect internal symmetries, which are conveyed by graph automorphisms, and sometimes we are only interested in the local or global structures in the graph rather than the exact vertex labelings. The structural entropy is defined when graphs are considered unlabeled, or simply called structures, where the vertices are viewed as indistinguishable. The goal of this natural definition is to capture the information content of the structure, and it thus provides a fundamental measure for graph/structure compression schemes.
The problem actually has a strong theoretical background. Back in 1984, Turán [37] raised the question of finding an efficient coding method for general unlabeled graphs on n vertices, for which a lower bound of (n choose 2) − n log n + O(n) bits is suggested. This lower bound can be seen from the number of unlabeled graphs [12]. The question was later answered by Naor [25] in 1990, who proposed a representation that is optimal up to the first two leading terms when all unlabeled graphs are equally likely. In a recent paper, Kieffer et al. [14] studied the structural complexity of binary trees. There have also been some heuristic methods for real-world graph compression schemes; see [4,7,28,32,35]. Rather recently, Choi and Szpankowski [8] studied the structural entropy of the Erdős–Rényi random graph G(n, p). They computed the structural entropy given that p is not (very) close to 0 or 1, and also gave a compression scheme that matches their computation. Later, the structural entropies of other randomly generated graphs, e.g., preferential attachment graphs and web graphs, were also studied [18,17,31,15]. However, it is well-known that the Erdős–Rényi model is too simplistic to model real networks, in particular due to its strong homogeneity and absence of community structure. In this paper, we consider the compression of graphical structures of the SBM, which in general models real networks better and circumvents these issues of the ER model. In summary, our contributions are as follows: • We introduce the partitioned structural entropy, which generalizes the structural entropy for unlabeled graphs, and we show that it reflects the partition information of the SBM.
• We provide an explicit formula for the partitioned structural entropy of the SBM.
• We also propose a compression scheme that asymptotically achieves this entropy limit.
Semantic communications are considered a key component of future-generation networks, where a natural problem is how to efficiently extract and transmit the "semantic information". In the case of graph data, one may view the (partitioned) structures as the information that needs to be extracted, while the concrete labeling information is considered redundant. From this point of view, our result is a step toward the study of semantic compression/communication in appropriate contexts.

Related works
Finally, we would like to point out that there are some other information metrics defined on graphs, and the term "graph entropy" has historically been used for several of them. For example, the graph entropy introduced by Kőrner in [16] denotes the number of bits one has to convey to resolve the ambiguity of a vertex in a graph. This notion also turns out to be useful in other areas, including combinatorics. The chromatic entropy introduced in [5] is the lowest entropy of any coloring of a graph; it finds application in zero-error source coding. We remark that the structural entropy we consider is quite different from the Kőrner graph entropy and the chromatic entropy.
On the other hand, a concept of graph entropy (also called the topological information content of a graph) was introduced by Rashevsky [29] and Trucco [36], and later by Mowshowitz [23,20,21,22,24,10], which is defined as a function of (the structure of) a graph and an equivalence relation defined on its vertices or edges. Such a concept is a measure of the graph itself and does not involve any probability distribution.

Structural entropy of unlabeled graphs
Now let us formally define the structural entropy given a probability distribution on unlabeled graphs.
Given an integer n, define G_n as the collection of all n-vertex labeled graphs.
Definition 2.1 (Entropy of a Random Graph). Given an integer n and a probability distribution on G_n, the entropy of a random graph G ∈ G_n is defined as H_G = E[−log P(G)] = −∑_{G∈G_n} P(G) log P(G). Then the random structure model S_n, associated with the probability distribution on G_n, is defined as the unlabeled version of G_n. For a given S ∈ S_n, the probability of S can be computed as P(S) = ∑_{G≅S} P(G). Here G ≅ S means that G and S have the same structure, that is, S is isomorphic to G. Clearly, if all isomorphic labeled graphs have the same probability, then for any labeled graph G ≅ S, one has P(G) = P(S)/N(S), where N(S) stands for the number of different labeled graphs that have the same structure as S.
Definition 2.2 (Structural Entropy). The structural entropy H_S of a random graph G is defined as the entropy of the random structure S associated with G_n, that is, H_S = E[−log P(S)] = −∑_S P(S) log P(S), where the sum is over all distinct structures.
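For intuition, both entropies can be computed by brute force for very small n, grouping labeled graphs into isomorphism classes. The sketch below (exponential time, illustration only; it assumes the G(n, p) distribution, under which isomorphic graphs are equiprobable) returns the labeled entropy H_G and the structural entropy H_S; the gap between them is exactly E[log N(S)].

```python
import itertools, math

def h(p):
    """Binary entropy function h(p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def structural_entropy_bruteforce(n, p):
    """Enumerate every labeled graph on n vertices under G(n, p),
    merge isomorphic graphs into structures, and return (H_G, H_S)."""
    pairs = list(itertools.combinations(range(n), 2))
    perms = list(itertools.permutations(range(n)))
    class_prob = {}
    for bits in itertools.product((0, 1), repeat=len(pairs)):
        edges = [e for e, b in zip(pairs, bits) if b]
        prob = p ** len(edges) * (1 - p) ** (len(pairs) - len(edges))
        # canonical form: lexicographically smallest relabeled edge list
        canon = min(
            tuple(sorted(tuple(sorted((s[u], s[v]))) for u, v in edges))
            for s in perms
        )
        class_prob[canon] = class_prob.get(canon, 0.0) + prob
    H_G = len(pairs) * h(p)  # independent Bernoulli(p) edge indicators
    H_S = -sum(q * math.log2(q) for q in class_prob.values() if q > 0)
    return H_G, H_S
```

For n = 3 and p = 1/2 there are 8 labeled graphs but only 4 structures, and H_S is strictly smaller than H_G = 3 bits.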
The Erdős–Rényi random graph G(n, p), also called the binomial random graph, is a fundamental random graph model with n vertices in which each pair of vertices is connected with probability p, independently of other pairs. In 2012, Choi and Szpankowski [8] proved the following for Erdős–Rényi random graphs.
Theorem 2.3 (Choi and Szpankowski, [8]). For large n and all p satisfying n^{−1} ln n ≪ p and 1 − p ≫ n^{−1} ln n, the following holds: 1. The structural entropy H_S of G(n, p) is H_S = (n choose 2) h(p) − log n! + O((log n)/n^α) for some α > 0.
2. For a structure S on n vertices and ε > 0, P(|−log P(S) − H_S| ≤ ε n log n) ≥ 1 − o(1), where h(p) = −p log p − (1 − p) log(1 − p) is the entropy rate of a binary memoryless source.
Furthermore, they [8] also presented a compression algorithm for unlabeled graphs that asymptotically achieves the structural entropy up to an O(n) error term.

Stochastic Block Model -Our result
It is well-known that the ER model is too simplistic to model real networks, in particular due to its strong homogeneity and absence of community structure. The Stochastic Block Model was introduced on the assumption that vertices in a network connect independently but with probabilities based on their profiles, or equivalently, on their community assignments. For example, in the SBM with two communities and symmetric parameters, also known as the planted bisection model and denoted by G(n, p, q), the vertex set is partitioned into two sets V_1 and V_2; any pair of vertices inside V_1 or V_2 is connected with probability p, any pair of vertices across the two parts is connected with probability q, and all these connections are independent.
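The planted bisection model is straightforward to sample; the sketch below (an illustration, not code from the paper) takes vertices 0, ..., n/2 − 1 as V_1 and the rest as V_2.

```python
import random

def planted_bisection(n, p, q, seed=None):
    """Sample G(n, p, q): vertices 0..n/2-1 form V1, the rest V2.
    Within-part pairs are edges with probability p, cross pairs with
    probability q, all mutually independently."""
    rng = random.Random(seed)
    half = n // 2
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            same_part = (u < half) == (v < half)
            if rng.random() < (p if same_part else q):
                edges.add((u, v))
    return edges
```

With p much larger than q, a sampled graph exhibits two visibly denser communities, which is exactly the benchmark setting for community detection.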
As an illuminating example, consider a context G where there are n/2 users and n/2 devices; each pair of users and each pair of devices is connected with probability p, a user and a device are connected with probability q, and each of these connections is independent of all others. Suppose that we need to compress the information of G. However, in this context it is not appropriate to view G as an unlabeled graph: in addition to the structural information, it is also important to keep the "community" information, i.e., the compression also needs to encode who is a user and who is a device.
Definition 2.4 (Partition-respecting isomorphism, Partitioned Unlabeled Graphs). Let r ≤ n be integers. Suppose V is a set of n vertices with a partition P = {V_1, V_2, ..., V_r} into r parts. The partition-respecting isomorphism, denoted by "≅_P", is defined as follows. For any two labeled graphs G and G′, we write G ≅_P G′ if there exists a graph isomorphism φ from G to G′ such that φ(V_i) = V_i for every 1 ≤ i ≤ r. Then Γ_P is defined as the collection of n-vertex graphs on V where we ignore the labels of vertices inside each V_i, 1 ≤ i ≤ r, namely, the equivalence classes under partition-respecting isomorphism with respect to P.
Note that every labeled graph G corresponds to a unique structure S ∈ Γ_P, and we use G ≅_P S to denote this relation. Furthermore, under the above definition, general unlabeled graphs correspond to the case r = 1. In this paper we extend Theorem 2.3 to the structural entropy of the Stochastic Block Model with any given number of blocks, and provide a compression algorithm that asymptotically matches this structural entropy. We first give the result for the balanced bipartition case G(n, p, q).

Theorem 2.6. Let V = V_1 ∪ V_2 with |V_1| = |V_2| = n/2, and let G(n, p, q) be the probability distribution of graphs on V where every edge inside V_1 or V_2 is present with probability p, every edge between V_1 and V_2 is present with probability q, and these edges are mutually independent. For large even n and all p, q satisfying n^{−1} ln n ≪ p, q and 1 − p ≫ n^{−1} ln n, the following holds: (i) The partitioned structural entropy H_S of G(n, p, q) is H_S = 2(n/2 choose 2) h(p) + (n²/4) h(q) − 2 log (n/2)! + O((log n)/n^α) for some α > 0. (ii) For a balanced bipartitioned structure S and ε > 0, P(|−log P(S) − H_S| ≤ ε n log n) ≥ 1 − o(1), where h(p) = −p log p − (1 − p) log(1 − p) is the entropy rate of a binary memoryless source.
Note that the structural entropy H_S here is larger than that in Theorem 2.3 (even if p = q), which reflects the fact that the SBM with "a planted (bi-)partition" contains prefixed structures and thus has fewer symmetries than G(n, p), the pure random model.
Proof of Theorem 2.6

One key ingredient in the proof of Theorem 2.3 in [8] is the following lemma (Lemma 3.1) on the symmetry of G(n, p). A graph is called asymmetric if its automorphism group does not contain any permutation other than the identity; otherwise it is called symmetric.

Proof of Theorem 2.6. Note that every pair of vertices inside V_1 or inside V_2 should be considered indistinguishable, but not the pairs of vertices in V_1 × V_2. Recall that we write G ≅_P S for a graph G and a structure S if S represents the structure of G (with respect to the partition P).
Let G := G(n, p, q). We first compute H_G. Note that there are (n choose 2) possible edges in G ∈ G, and we can view G as a binary sequence of length (n choose 2), where each digit is a Bernoulli random variable. Moreover, for edges inside V_1 or V_2 the random variable, denoted by X_1, has expectation p, and for edges in V_1 × V_2 the random variable, denoted by X_2, has expectation q.

Thus we have H_G = 2(n/2 choose 2) h(p) + (n²/4) h(q).
Now write S_n for the probability distribution over all partitioned unlabeled graphs on V inherited from G, namely, given S ∈ Γ_P, P(S) = ∑_{G≅_P S} P(G). Let H_S be the partitioned structural entropy of S_n. Therefore, compared with our goal, it remains to show that H_S = H_G − 2 log (n/2)! + O((log n)/n^α). Note that in G(n, p, q), all labeled graphs G ∈ G such that G ≅_P S have the same probability P(G). Thus, given a (labeled) graph G ∈ G, we have P(G) = P(S)/N(S), where S ∈ S_n is such that G ≅_P S. So the graph entropy of G = G(n, p, q) can be written as H_G = −∑_S P(S) log (P(S)/N(S)) = H_S + ∑_S P(S) log N(S). Define S[W] to be S restricted to W for W ⊆ V. Now we split S into S_1 := S[V_1] and S_2 := S[V_2]. Write Aut(S_i) for the automorphism group of S_i, and we naturally have N(S) ≥ (n/2)! (n/2)! / (|Aut(S_1)| |Aut(S_2)|). Combining the last two displays, it remains to show that ∑_S P(S) log(|Aut(S_1)| |Aut(S_2)|) = O((log n)/n^α). In the summation above we only need to focus on S such that either S_1 or S_2 is symmetric, as otherwise log |Aut(S_1)||Aut(S_2)| = log 1 = 0. By Lemma 3.1, the probability that S_1 or S_2 is symmetric is O(n^{−w}) for any positive constant w, and for such S we use the trivial bound log |Aut(S_1)||Aut(S_2)| ≤ 2 log (n/2)! ≤ 2n log n. This gives the desired estimate in (i).

To show (ii), for a set V of n vertices with a balanced bipartition, define the typical set T^n_ε as the set of structures S on n vertices satisfying (a) S is asymmetric on both parts, i.e., |Aut(S_1)| = |Aut(S_2)| = 1; and (b) |−log P(G) − H_G| ≤ ε n log n for G ≅_P S. Denote by T^n_1 and T^n_2 the sets of structures satisfying properties (a) and (b), respectively, so that T^n_ε = T^n_1 ∩ T^n_2. Firstly, by the asymmetry of G(n, p) (Lemma 3.1), we conclude that P(T^n_1) > 1 − 2ε for large n. Secondly, we use a binary sequence of length (n choose 2) to represent a (labeled) instance G of G(n, p, q), where the first (n/2 choose 2) bits L_1 represent the induced subgraph on V_1, the next (n/2 choose 2) bits L_2 represent the induced subgraph on V_2, and the remaining n²/4 bits L_12 represent the bipartite graph on V_1 × V_2. Since all edges of G are generated independently, both L_1 and L_2 have in expectation (n/2 choose 2) p 1's, and the AEP property of binary sequences implies that the contribution of L_1 and L_2 to −log P(G) is within (ε/2) n log n of its expectation with probability at least 1 − 2ε. Similarly, L_12 has in expectation (n²/4) q 1's, and the AEP property of binary sequences gives that, with probability at least 1 − ε, the contribution of L_12 is within (ε/2) n log n of its expectation. Since these edges are independent, we conclude that (b) holds with probability at least 1 − 3ε. Thus, P(T^n_ε) ≥ 1 − 4ε. Now we can compute P(S) for S ∈ T^n_ε. By (a), P(S) = (n/2)!(n/2)! P(G) for any G ≅_P S. Together with (b) and a straightforward computation, the assertion of (ii) follows.
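The AEP concentration used above can be checked numerically (a simulation sketch, not part of the proof): for a Bernoulli(p) sequence of length m, −(1/m) log2 of the probability of the observed sequence concentrates around the entropy rate h(p).

```python
import math, random

def h(p):
    """Entropy rate of a binary memoryless source."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def empirical_rate(m, p, seed=0):
    """Draw one Bernoulli(p) sequence of length m and return
    -(1/m) * log2(probability of that exact sequence)."""
    rng = random.Random(seed)
    ones = sum(rng.random() < p for _ in range(m))
    log2_prob = ones * math.log2(p) + (m - ones) * math.log2(1 - p)
    return -log2_prob / m

# For large m the empirical rate is close to h(p) with high probability.
rate = empirical_rate(m=100_000, p=0.3, seed=42)
```

This is the property that makes the typical set T^n_ε carry almost all of the probability mass.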

SBM Compression Algorithm
Given the computation of the structural entropy, a natural next step is to design efficient compression schemes that come close to or even (asymptotically) achieve this entropy limit. Choi and Szpankowski [8] presented such an algorithm (which they named Szip) for (unlabeled) random graphs; it uses in expectation at most (n choose 2) h(p) − n log n + O(n) bits and asymptotically achieves the structural entropy given in Theorem 2.3. Roughly speaking, Szip greedily peels off vertices from the graph and (efficiently) stores the neighborhood information. This procedure can simply be reversed, but the labeling of the recovered graph may differ from that of the original graph, which is why a saving in codeword length is achieved. Refinements and analysis that achieve the stated performance are also provided in [8].

Figure 1: Illustration of the compression algorithm (SBM data passes through the cascade encoder and is recovered by parallel decoders).
Here we give an algorithm that optimally compresses SBMs; it uses the Szip algorithm as a building block and matches the structural entropy computation in Theorem 2.6. The algorithm consists of two stages. It first compresses S[V_1] and S[V_2] using Szip, and then compresses S[V_1, V_2] using an arithmetic compression algorithm with the help of the Szip decoding outputs.
To give a brief description of the compression algorithm, we again use the balanced bipartition V_1 ∪ V_2 as an example. The encoding and decoding procedures of the algorithm are illustrated in Figure 1. The algorithm encodes the observed structure S of G(n, p, q) into a binary string as follows. It uses Szip as a subroutine to compress S[V_1] and S[V_2] into binary sequences L_1 and L_2. Then, as part of the encoder, we run the Szip decoder on L_1 and L_2 to obtain decoded structures S′[V_1] and S′[V_2], respectively. We then compress S[V_1, V_2], viewed as a labeled bipartite graph under the vertex labeling of S′[V_1] and S′[V_2], into L_12. This "labeled encoder" can be realized by treating the bipartite graph as a binary sequence of length n²/4 and compressing it with a standard arithmetic encoder [27,30,38]. The concatenation of the Szip algorithms and the arithmetic encoder forms the cascade encoder of our algorithm and yields the codeword (L_1, L_2, L_12). Upon receiving the codeword, we decode its parts in parallel using the Szip decoder and the arithmetic decoder. This completes our algorithm.
The main challenge in the design of our algorithm is how the decoder can maintain consistency between the bipartite graph S[V_1, V_2] and the labelings of the decoded structures. A key observation here is that since Szip is a deterministic algorithm, although it may permute the vertex labelings, its output is invariant given the same input. Given this, our solution is to first run Szip (both encoding and decoding) at the encoder to obtain the structures S′[V_1] and S′[V_2], respectively. We then compress S[V_1, V_2] (as a labeled bipartite graph) under the vertex labeling of S′[V_1] and S′[V_2]. This guarantees that the decoded structures are consistent with the decoded bipartite graph, namely, S is recovered.
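The control flow of the cascade scheme can be sketched in code. Everything below is a hypothetical sketch: toy_encode/toy_decode stand in for Szip and toy_bip_encode/toy_bip_decode for the arithmetic coder (the toy versions are trivial serializations, not real compressors). The point is only the structure of the pipeline, namely that the encoder runs the deterministic structure decoder itself, so the bipartite part is stored under exactly the labeling the real decoder will later reproduce.

```python
# Hypothetical stand-ins (NOT the real Szip / arithmetic coder).
def toy_encode(adj):              # structure encoder: trivial serialization
    return sorted(adj)

def toy_decode(code):             # its (deterministic) decoder
    return frozenset(code)

def toy_bip_encode(bip, s1, s2):  # "labeled encoder" for the bipartite part
    return sorted(bip)

def toy_bip_decode(code, s1, s2):
    return frozenset(code)

def encode_sbm(S1, S2, S12):
    """Cascade encoder: compress the two within-part structures, then
    re-run the decoder locally and encode the bipartite part under the
    labeling that decoder produces."""
    L1, L2 = toy_encode(S1), toy_encode(S2)
    S1p, S2p = toy_decode(L1), toy_decode(L2)  # decoder run at the encoder
    L12 = toy_bip_encode(S12, S1p, S2p)
    return L1, L2, L12

def decode_sbm(L1, L2, L12):
    """The three codewords can be decoded in parallel."""
    S1p, S2p = toy_decode(L1), toy_decode(L2)
    return S1p, S2p, toy_bip_decode(L12, S1p, S2p)
```

Because the stand-in decoder is deterministic, the labeling computed inside encode_sbm is identical to the one decode_sbm computes, which is the consistency argument in the text.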
Before discussing the performance of the algorithm, we first describe some useful properties of the arithmetic compression algorithm in the following lemma. We omit the proof of the lemma, which follows from the analysis in [27,30,38] and the AEP properties in [39,9].
Lemma 4.1. Let L be the codeword length of the arithmetic compression algorithm when compressing a binary sequence of length m and entropy rate h. For large m, the following holds: (i) The expected codeword length asymptotically achieves the entropy of the message, i.e., E[L] = mh + O(log m). (ii) For any ε > 0, P(|L − mh| ≤ εm) ≥ 1 − o(1). (iii) The arithmetic algorithm runs in time O(m).
The following theorem characterizes the performance of our algorithm. It is immediate from Theorem 2 in [8] (the performance of Szip) and Lemma 4.1, so we omit the detailed proofs here.
Theorem 4.2. Given a partitioned unlabeled graph S on V, let L(S) be the codeword length given by our algorithm. For large n, our algorithm runs in time O(n²) and satisfies the following: (i) The algorithm asymptotically achieves the structural entropy in Theorem 2.6, i.e., E[L(S)] ≤ 2(n/2 choose 2) h(p) + (n²/4) h(q) − n log n + O(n). (ii) For any ε > 0, P(|L(S) − E[L(S)]| ≤ ε n log n) ≥ 1 − o(1).

General SBM with r ≥ 2 blocks

In previous sections, we discussed the structural entropy of the SBM and the compression algorithm that asymptotically achieves this structural entropy for the balanced bipartition case (r = 2).
The corresponding results in Theorem 2.6 and Theorem 4.2 can be easily generalized to the general r-partition case.We briefly describe the generalizations below.

Structural entropy
Our approach can deal with general SBMs similarly. In a general SBM with r ≥ 2 parts, an r × r symmetric matrix P is used to describe the connection probabilities between and within the communities, where two vertices u ∈ V_i and v ∈ V_j are connected by an edge with probability P_ij (i and j are not necessarily distinct). To simplify the presentation, we only present the results below in the special form where P_ij = p if i = j and P_ij = q if i ≠ j, and we remark that similar results hold in the general case as well. We first give the result on the computation of the partitioned structural entropy of the SBM.
Theorem 5.1. Fix r reals x_1, x_2, ..., x_r in (0, 1) whose sum is 1. Let V = V_1 ∪ V_2 ∪ ... ∪ V_r be a set of n vertices with a partition into r parts such that |V_i| = x_i n. For large n and all p, q satisfying n^{−1} ln n ≪ p, q and 1 − p ≫ n^{−1} ln n, the following holds: (i) The r-partitioned structural entropy H^r_S for a partitioned structure S on V is H^r_S = z (h(p) − h(q)) + (n choose 2) h(q) − ∑_{i=1}^r log (x_i n)! + O((log n)/n^α) for some α > 0, where z := ∑_{i=1}^r (x_i n choose 2).
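The leading terms of such an entropy expression are easy to evaluate numerically. The sketch below is illustrative only and states its formula as an assumption: it computes z (h(p) − h(q)) + (n choose 2) h(q) minus the base-2 log-factorials of the part sizes, with z = ∑_i (x_i n choose 2) as defined above, and drops any lower-order error term.

```python
import math

def h(p):
    """Entropy rate of a binary memoryless source."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_factorial(k):
    """log2(k!) computed via the log-gamma function."""
    return math.lgamma(k + 1) / math.log(2)

def z_value(n, xs):
    """z = sum_i C(x_i * n, 2): the number of within-part vertex pairs."""
    return sum(s * (s - 1) // 2 for s in (round(x * n) for x in xs))

def partitioned_entropy_leading(n, xs, p, q):
    """Leading terms (assumed form, error term dropped):
    z * (h(p) - h(q)) + C(n,2) * h(q) - sum_i log2((x_i * n)!)."""
    sizes = [round(x * n) for x in xs]
    total_pairs = n * (n - 1) // 2
    return (z_value(n, xs) * (h(p) - h(q))
            + total_pairs * h(q)
            - sum(log2_factorial(s) for s in sizes))
```

Note that when p = q the z-term vanishes and the expression reduces to the homogeneous case with a log-factorial correction per part.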

Compression algorithm
The compression algorithm for a general r with vertex partition {V_1, V_2, ..., V_r} can be viewed as a union of the compression algorithms for the S[V_i] and S[V_i, V_j] (i < j ∈ {1, 2, ..., r}). To be more precise, we describe the algorithm as follows. It first compresses each S[V_i] into L_i using Szip. Then it runs the Szip decoder with input L_i to obtain the decoded structure S′[V_i]. With the vertex labelings of S′[V_i], i = 1, 2, ..., r, we can compress S[V_1, V_2, ..., V_r] as a labeled r-partite graph into L using an arithmetic encoder. This completes the encoding procedure and gives the codewords L_1, ..., L_r, L, which we concatenate to obtain the final codeword. Decoding simply runs the Szip decoders and the labeled (arithmetic) decoders in parallel. The correctness of the decoding output can be argued as before.
The performance of the algorithm can be obtained similarly to Theorem 4.2, as follows.

Conclusion
In this paper we defined the partitioned unlabeled graphs and partitioned structural entropy, which generalize the structural entropy for unlabeled graphs introduced by Choi and Szpankowski [8].
We then computed the partitioned structural entropy for Stochastic Block Models and gave a compression algorithm that asymptotically achieves this structural entropy limit.As mentioned earlier, we believe that in appropriate contexts the structural information of a graph or network can be interpreted as a kind of semantic information, in which case, the communication schemes may benefit from structural compressions which considerably reduce the cost.

Definition 2.5 (Partitioned Structural Entropy). Let V be a set of n vertices, where n ∈ N. Suppose P = {V_1, V_2, ..., V_r} is a partition of V into r parts and S_n is a probability distribution over all partitioned unlabeled graphs on n vertices. Then the structural entropy H_S associated to S_n is defined by H_S = E[−log P(S)] = −∑_{S∈S_n} P(S) log P(S).

Lemma 3.1 (Kim, Sudakov and Vu, 2002). For all p satisfying n^{−1} ln n ≪ p and 1 − p ≫ n^{−1} ln n, a random graph G ∈ G(n, p) is symmetric with probability O(n^{−w}) for any positive constant w.
Theorem 5.2. Let V = V_1 ∪ V_2 ∪ ... ∪ V_r be a set of n vertices with a partition into r parts such that |V_i| = x_i n. Given a partitioned unlabeled graph S on V, let L(S) be the codeword length given by our algorithm. For large n, our algorithm runs in time O(n²) and satisfies the following: (i) The algorithm asymptotically achieves the structural entropy in Theorem 5.1, i.e., E[L(S)] ≤ z (h(p) − h(q)) + (n choose 2) h(q) − n log n + O(n). (ii) For any ε > 0, P(|L(S) − E[L(S)]| ≤ ε n log n) ≥ 1 − o(1).