Information Limits for Community Detection in Hypergraphs with Label Information

Abstract: In network data mining, community detection refers to the problem of partitioning the nodes of a network into clusters (communities), which is equivalent to identifying the cluster label of each node. A label estimator is said to achieve exact recovery of the true labels (communities) if it coincides with the true labels with probability converging to one. In this work, we consider the effect of label information on the exact recovery of communities in an m-uniform Hypergraph Stochastic Block Model (HSBM). We investigate two scenarios of label information: (1) a noisy label for each node is observed independently, with 1 − α n as the probability that the noisy label matches the true label; (2) the true label of each node is observed independently, with probability 1 − α n . We derive sharp boundaries for exact recovery under both scenarios from an information-theoretic point of view. The label information improves the sharp detection boundary if and only if α n = n −β+o(1) for a constant β > 0.

In addition, A i 1 ,i 2 ,...,i m (i 1 < i 2 < · · · < i m ) are assumed to be mutually independent conditional on σ. Here, +1 and −1 represent the two communities, and I + and I − denote the sets of nodes that belong to the +1 and −1 communities, respectively. The subset {i 1 , i 2 , . . . , i m } forms a hyperedge with probability p if the distinct nodes i 1 , i 2 , . . . , i m are in the same community. Otherwise, {i 1 , i 2 , . . . , i m } forms a hyperedge with probability q. Throughout this paper, the community structure is assumed to be balanced; that is, the number of nodes in each community is n/2, as in [6,22,23]. Moreover, we focus on the case p = a log(n)/n m−1 and q = b log(n)/n m−1 with constants a > b > 0. In words, exact recovery means that the estimated label σ̂ is equal to the true label σ with probability converging to one as the number of nodes goes to infinity. We say exact recovery is possible if there is an estimator σ̂ that exactly recovers σ, and exact recovery is impossible if no estimator exactly recovers σ.
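As a concrete illustration of this generative model, the following minimal Python sketch samples a balanced m-uniform HSBM. The scaling p = a log(n)/n^(m−1), q = b log(n)/n^(m−1) and the function name are our assumptions for illustration; this is not code from the paper.

```python
import itertools
import math
import random

def sample_hsbm(n, m, a, b, seed=0):
    """Sample a balanced m-uniform HSBM (illustrative sketch).

    Nodes 0..n/2-1 form community +1, the rest community -1.
    Each m-subset whose nodes all share a community becomes a hyperedge
    with probability p = a*log(n)/n**(m-1); otherwise with probability
    q = b*log(n)/n**(m-1)  (assumed scaling).
    """
    rng = random.Random(seed)
    sigma = [1] * (n // 2) + [-1] * (n - n // 2)
    p = a * math.log(n) / n ** (m - 1)
    q = b * math.log(n) / n ** (m - 1)
    A = {}
    for edge in itertools.combinations(range(n), m):
        same_community = len({sigma[i] for i in edge}) == 1
        A[edge] = 1 if rng.random() < (p if same_community else q) else 0
    return sigma, A
```

For n = 8, m = 3, a = 5, b = 1, this draws all 56 potential hyperedges, with same-community triples roughly five times as likely to appear as mixed ones.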
In practice, along with the hypergraph A, side information about node labels is usually available [23][24][25][26][27][28][29][30]. For example, in a co-authorship or co-citation network, the cluster labels of some authors are known [28]. In a Web query network, some queries are labeled [30]. In student relational networks, the dorms in which students live can serve as label information [29]. Various algorithms have been developed to incorporate label information in community recovery in hypergraphs [28][29][30], and incorporating side information has been shown to improve clustering performance [23][24][25][26][27][28][29][30]. The sharp recovery boundary with label or side information was given by [23,27] in the graph case. However, to the best of our knowledge, the sharp recovery boundary for hypergraphs is still unknown. In this paper, we study the effect of label information on the boundary of exact recovery for hypergraphs and consider two types of label information: (1) a noisy label for each node is observed independently, with 1 − α n as the probability that the noisy label matches the true label; (2) the true label of each node is observed independently, with probability 1 − α n . Let α n = n −β+o(1) with a constant β ≥ 0. From an information-theoretic point of view, we derive sharp boundaries of exact recovery in terms of m, a, b, β. Interestingly, label information is useful if and only if β > 0. The main result is summarized in Table 1, where η m,a,b (β), C m,a,b are defined in Equations (1) and (2). In both cases, for a fixed m, the region (in terms of a, b) where exact recovery is impossible shrinks as β gets larger. The label information is helpful if and only if β > 0; that is, α n has to converge to zero at the rate of n −β for β > 0. The visualization of the regions in Table 1 can be found in Figures 1 and 2 in Section 2. Table 1. Regions for the exact recovery of community structure in a hypergraph with label information.

Region Where Noisy Labels Are Observed | Recovery
Condition (4) | Exact recovery is impossible
Condition (5) | Exact recovery is possible

Region Where True Labels Are Partially Observed | Recovery
Condition (6) | Exact recovery is impossible
Condition (7) | Exact recovery is possible

Main Result
In this section, we consider community detection in hypergraphs through the observation of noisy labels or a proportion of the true labels from an information-theoretic point of view. We derive sharp boundaries of exact recovery, which provide a benchmark for developing practical community detection algorithms.

Detection with Noisy Label Information
In this subsection, we consider community detection in hypergraphs when a noisy version of the node labels is available. In the graph regime, community detection with noisy label information was proposed in [27] and extensively studied in [23]. Here, we focus on an m-uniform hypergraph with an arbitrary fixed m ≥ 2. Given a true label vector σ = (σ 1 , . . . , σ n ), for each node i, a noisy label Y i is independently observed, and Y i coincides with the true label σ i with probability 1 − α n . More specifically, P(Y i = σ i |σ i ) = 1 − α n and P(Y i = −σ i |σ i ) = α n , and Y 1 , . . . , Y n are mutually independent conditional on σ. If α n = 0, the true label for each node is fully known. If α n = 1/2, the noisy labels Y = (Y 1 , Y 2 , . . . , Y n ) do not provide any information about the true labels. The hypergraph A and Y are assumed to be independent conditional on σ. In this subsection, we focus on the effect of the noisy labels Y on community detection in a hypergraph.
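The noisy-label mechanism can be sketched in a few lines of Python (an illustrative helper, not code from the paper): each Y i equals σ i with probability 1 − α n and is flipped otherwise, independently across nodes.

```python
import random

def noisy_labels(sigma, alpha, seed=0):
    """Observe each +/-1 label correctly with probability 1 - alpha,
    flipped with probability alpha, independently across nodes."""
    rng = random.Random(seed)
    return [s if rng.random() < 1 - alpha else -s for s in sigma]
```

With alpha = 0 the labels are fully known, and with alpha = 1/2 they carry no information, matching the two extremes discussed above.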
Assume α n = n −β+o(1) with a constant β ≥ 0 and define η m,a,b (β), C m,a,b and γ m,a,b as in Equations (1) and (2). Here, C m,a,b and γ m,a,b are defined just for notational convenience, without any practical meaning. The quantity η m,a,b (β) can be considered the signal contained in the model: the larger η m,a,b (β) is, the easier exact recovery becomes. This is easiest to see in the special case β = 0: for a fixed m, a large η m,a,b (0) implies that the difference √a − √b is large, so within-community nodes are more densely connected than between-community nodes, and it becomes easier to cluster the nodes into groups. Note that η m,a,b (0) was used to characterize the sharp detection boundary in [6]. For an arbitrary β ≥ 0, we provide the necessary and sufficient conditions for the exact recovery of the community structure. To this end, we first investigate the maximum likelihood estimator (MLE) of the true labels. The region where exact recovery is impossible corresponds to the region where the MLE fails. Then, based on the noisy labels, we construct an estimator that exactly recovers the community structure. The result is summarized in the following theorem.
Theorem 1. Exact recovery is impossible if condition (4) holds, and exact recovery is possible if condition (5) holds. Here, η m,a,b (β) and C m,a,b are defined in Equations (1) and (2).
Based on Theorem 1, there is a phase transition phenomenon for exact recovery in H m,n,a,b . In the region β < C m,a,b (a − b), exact recovery is possible if η m,a,b (β) > 1 and impossible if η m,a,b (β) < 1. In the region β > C m,a,b (a − b), exact recovery is possible if β > 1 and impossible if β < 1. In this sense, the phase transition occurs at 1, and 1 is the sharp boundary for exact recovery. When α n is bounded away from zero, β = 0 and C m,a,b (a − b) > 0 trivially hold. Then, Theorem 1 recovers Theorems 4 and 5 in [6]. This shows that the noisy labels are useful if and only if α n converges to zero at a rate of n −β for β > 0. Furthermore, exact recovery with fixed m, a, b becomes easier as β increases, since η m,a,b (β) is increasing in β (see Lemma A1). When m = 2, Theorem 1 reduces to Theorems 1 and 2 in [23]. Note that C m,a,b is decreasing in m. Then, for a fixed β, the region of exact recovery for m = 2 contains the region for m ≥ 3 as a proper subset. These findings are summarized in Figure 1, where we visualize the regions characterized by (4) and (5) with m = 2, 3 and β = 0, 0.4, 0.8. In Figure 1, the red (green) region represents the region where exact recovery is impossible (possible). We point out that the time complexity of our estimator for exact recovery is O(n m ). Since our focus in this paper is to derive the sharp boundary of exact recovery, as in [23,27], and not to propose algorithms with optimal performance, our estimator may or may not outperform existing algorithms.
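The claim that C m,a,b is decreasing in m can be checked numerically; the snippet below uses the explicit form C m,a,b = (log(a) − log(b))/(2 m−1 (m − 1)!) stated in Appendix A (a small illustrative check, not the paper's code).

```python
import math

def C(m, a, b):
    """C_{m,a,b} = (log(a) - log(b)) / (2**(m-1) * (m-1)!)."""
    return (math.log(a) - math.log(b)) / (2 ** (m - 1) * math.factorial(m - 1))

# For a > b the numerator is a fixed positive constant while the denominator
# grows super-exponentially in m, so C(m, a, b) strictly decreases in m and
# the threshold beta = C_{m,a,b} * (a - b) shrinks accordingly.
```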

Detection with Partially Observed Labels
In this subsection, we consider the community detection problem with true labels partially observed. This type of side information was considered in [23][24][25][26] in the context of graphs, and in [28][29][30] for hypergraphs. Here, we focus on an m-uniform hypergraph with an arbitrary fixed m ≥ 2. Given the true labels σ = (σ 1 , . . . , σ n ), for each node i, the true label is independently observed with a probability of 1 − α n . More specifically, we define a random variable Y i with P(Y i = σ i |σ i ) = 1 − α n and P(Y i = 0|σ i ) = α n , α n ∈ [0, 1]. Here, Y i = 0 indicates that the true label for node i is not observed. If α n = 1, no label information is observed. If α n = 0, all the true labels are observed and community detection is not necessary. We study how α n changes the sharp detection boundary from the information-theoretical point of view. To this end, we investigate the maximum likelihood estimator (MLE) of true labels. The region where exact recovery is impossible corresponds to the region where MLE fails. The exact recovery estimator is constructed based on the partially observed labels. The result is summarized in the following theorem.
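Analogously to the noisy-label case, the partial-observation mechanism is easy to sketch (an illustrative helper, not code from the paper): each label is revealed with probability 1 − α n and replaced by 0 otherwise.

```python
import random

def partial_labels(sigma, alpha, seed=0):
    """Reveal each true +/-1 label with probability 1 - alpha;
    Y_i = 0 marks an unobserved label."""
    rng = random.Random(seed)
    return [s if rng.random() < 1 - alpha else 0 for s in sigma]
```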
Theorem 2. Exact recovery is impossible if condition (6) holds, and exact recovery is possible if condition (7) holds.
Theorem 2 clearly characterizes how the partially observed labels affect the boundary for exact recovery. A phase transition phenomenon of exact recovery exists at 1, since exact recovery is possible if η m,a,b (0) + β > 1, but not possible if η m,a,b (0) + β < 1. When β = 0, Theorem 2 recovers Theorems 4 and 5 in [6]. If β > 0, the region (6) where exact recovery is impossible is smaller than that in [6]. The side information of partially known labels makes exact recovery easier if and only if β > 0. When m = 2, Theorem 2 reduces to Theorems 1 and 2 in [23]. For a fixed β, the region of exact recovery for m ≥ 3 is smaller than that for m = 2. These findings can be verified in Figure 2, where we visualize the regions characterized by (6) and (7) with m = 2, 3 and β = 0, 0.4, 0.8. In Figure 2, the red (green) region represents the region where exact recovery is impossible (possible). Finally, we point out that the time complexity of our estimator achieving exact recovery is O(n m ). Again, our focus in this paper is to derive the sharp boundary of exact recovery, as in [23,27], not to propose algorithms with optimal performance; hence, our estimator may or may not outperform existing algorithms.

Proof of Main Result
In this section, we provide detailed proofs of Theorems 1 and 2.
To start, we derive the explicit expression of the likelihood function. The likelihood function of the hypergraph A given the node labels σ is the product of the hyperedge probabilities, and the log-likelihood function can then be written as the sum of two terms, I and II. Note that II is independent of σ.
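The display equations for this decomposition did not survive extraction; under the model of Section 1, they presumably take the following standard form (a reconstruction sketch, with e ranging over all m-subsets of [n] and "in"/"cross" denoting those whose m nodes do/do not share a community):

```latex
P(A\mid\sigma)
  = \prod_{e:\,\text{in}} p^{A_e}(1-p)^{1-A_e}
    \prod_{e:\,\text{cross}} q^{A_e}(1-q)^{1-A_e},
\qquad
\log P(A\mid\sigma) = I + II,
\quad\text{with}\quad
I = \sum_{e:\,\text{in}} \Bigl[A_e\log\tfrac{p(1-q)}{q(1-p)} + \log\tfrac{1-p}{1-q}\Bigr],
\qquad
II = \sum_{e} \Bigl[A_e\log\tfrac{q}{1-q} + \log(1-q)\Bigr].
```

Adding the I and II contributions of an in-community subset e gives A_e log(p/(1−p)) + log(1−p), and a cross subset contributes A_e log(q/(1−q)) + log(1−q), so the product form is recovered; only I depends on σ.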

Proof of Theorem 1
The likelihood function of the vector of noisy labels Y given the node labels σ is P(Y|σ) = ∏ n i=1 (1 − α n ) 1{Y i =σ i } α n 1{Y i ≠σ i } . Then, the log-likelihood function can be written as log P(Y|σ) = I s + II s , where I s = log((1 − α n )/α n ) ∑ n i=1 1{Y i = σ i } and II s = n log(α n ).
Noting that A and Y are independent given σ, the joint log-likelihood of A, Y given σ is log P(A, Y|σ) = I + I s + II + II s , where I + I s collects the terms that depend on σ, and II + II s consists of the terms that are independent of σ. Denote by e S 1 ,S 2 the number of edges between two sets of nodes, say S 1 and S 2 . Then, define the following events: Proof. Take i 0 ∈ F + and j 0 ∈ F − . Then, let σ̃ denote the label vector obtained from σ by swapping the labels of i 0 and j 0 . Denote Λ = P(A, Y|σ̃)/P(A, Y|σ); we need to show that P(Λ > 1) = 1 − o(1). By (8), we have the following: It is clear that Plugging into (10) yields In the last inequality, we assumed that e i 0 ,I − \{j 0 } ≥ e j 0 ,I + \{i 0 } without loss of generality.
Next, we will show that E(e i 0 ,I − \{j 0 } ) = o(1). Rewrite e i 0 ,I − \{j 0 } as: For any i ∈ H, define the following events: Then, the multiplicative Chernoff bound (see (iii) in Lemma A1) gives the following: By the union bound, we have the following: since τ > 1/(m − 1) by the assumption on |H|. Proof. Under the condition in (4), Lemma 3 holds. That is, for any δ ∈ (0, 1) and for a sufficiently large n. By the union bound, we have the following: Here, we used (11) in the first inequality. Proof. By Lemmas 2 and 4, there exists a sufficiently small δ > 0 such that P(∆ c ) ≥ 1 − δ and P(F H ) ≥ 1 − δ simultaneously hold. Clearly, ∆ c ∩ F H ⇒ F + . Hence, P(F + ) ≥ 1 − 2δ. By symmetry, we have P(F − ) ≥ 1 − 2δ.
Then, it follows from Lemma 1 that which completes the proof.
The impossible part of Theorem 1 follows directly from Lemma 5.
B. Proof of the possible part of Theorem 1. Let H 1 be an independently generated random hypergraph, built on the same node set V = [n], with an edge probability of d n /log(n). Denote its complement by H 2 = H 1 c . A weak recovery algorithm [12] is applied to H 1 to return a partition into two communities Ĩ + and Ĩ − , which agree with the true communities I + and I − on at least (1 − δ)n nodes. Here, δ = δ(d n ) depends on d n such that δ → 0 as d n → ∞, and d n can be taken as O(log log(n)) [12].
In the next step, we will use H 2 to decide whether to flip a node's membership. More specifically, for a node i ∈ Ĩ + , its membership is flipped if it has more edges in H 2 going to Ĩ − , plus the scaled side information, than edges going to Ĩ + . To show (5), the possible part of Theorem 1, we first introduce the following definitions. For any node i ∈ [n], it is mis-classified if and only if it belongs to the following: Without loss of generality, we assume that i ∈ Ĩ + . Then, the mis-classification probability of node i is given by the following: In the last equation, we assumed H 2 to be a complete hypergraph. Then, by the union bound over all nodes, the probability of the existence of a mis-classified label is:
Consider a node i ∈ H 1 . Its degree is given by the following: Proof. Note that deg(i) ∼ Binom(( n−1 m−1 ), d n /log(n)). Then, the multiplicative Chernoff bound (see (iii) in Lemma A1) gives the following: where µ = ( n−1 m−1 ) d n /log(n). By the union bound, we have the following: The lemma above suggests that: Therefore, taking into account the incompleteness of H 2 , we will loosen the upper bound (12) by removing 2( n−1 m−1 ) d n /log(n) terms from both of the summations on the right-hand side of (12). That is,
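The flip step described above can be sketched as a single local-refinement pass. This is a hedged illustration of the idea only: the score form, the weight rho on the side information, and the function names are our assumptions, and the paper's actual estimator additionally restricts the edge counts to H 2 with a specific scaling.

```python
def refine(A, labels, Y, rho):
    """One local-refinement pass over all nodes (illustrative sketch).

    A:      dict mapping m-tuples of nodes to 0/1 hyperedge indicators
    labels: initial +/-1 community guesses (e.g., from weak recovery)
    Y:      side-information labels (+1/-1, or 0 for "none")
    rho:    assumed weight on the side information

    A node is re-assigned to +1 when the count of its incident hyperedges
    whose other nodes are currently all +1, minus the count of those whose
    other nodes are all -1, plus rho*Y[i], is nonnegative; otherwise to -1.
    """
    new = list(labels)
    for i in range(len(labels)):
        score = rho * Y[i]
        for e, a_e in A.items():
            if a_e and i in e:
                others = [labels[j] for j in e if j != i]
                if all(s == 1 for s in others):
                    score += 1
                elif all(s == -1 for s in others):
                    score -= 1
        new[i] = 1 if score >= 0 else -1
    return new
```

On a toy 2-uniform example with edges only inside the true communities, one pass driven mostly by the side information already recovers the planted partition from a poor initial guess.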

Conclusions
In this paper, we studied the effect of label information on the exact recovery of communities in uniform hypergraphs from an information-theoretic point of view. Specifically, we considered two types of label information: a noisy label for each node observed independently, with 1 − α n as the probability that the noisy label matches the true label, and the true label of each node observed independently, with probability 1 − α n . We used the maximum likelihood method to derive a lower bound for exact recovery and then constructed an estimator that exactly recovers the communities above the lower bound. In this way, we obtained sharp boundaries for exact recovery under both scenarios. We found that the label information improves the sharp detection boundary if and only if α n converges to zero at a rate of n −β for some positive constant β.
There are several possible future research directions: (I) The sharp recovery boundary for the general HSBM with label information is still unknown; characterizing the boundary in this case is an important problem. (II) In this paper, we focused on label information; it would be interesting to consider other types of side information, such as covariates observed for each node.
Acknowledgments: The authors are grateful to the editor and reviewers for helpful comments that significantly improved this manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Appendix A.1. Chernoff Bound
For a random variable X, denote its cumulant generating function (cgf) by ψ X (t) = log(E(e tX )). For any fixed D ∈ R, define φ X,D (t) = tD − ψ X (t), where t ranges over R or R + .
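For the variable X = C a,b (Z − W) used below, the cgf and the Chernoff exponent are straightforward to evaluate numerically. The sketch below assumes the standard Legendre-transform form φ X,D (t) = tD − ψ X (t) and independent Z ∼ Bern(q), W ∼ Bern(p); it is an illustration, not the paper's code.

```python
import math

def psi(t, p, q, C):
    """cgf of X = C*(Z - W) with independent Z ~ Bern(q), W ~ Bern(p):
    psi_X(t) = log E[e^(tX)] = log(q*e^(tC) + 1 - q) + log(p*e^(-tC) + 1 - p)."""
    return math.log(q * math.exp(t * C) + 1 - q) + math.log(p * math.exp(-t * C) + 1 - p)

def phi(t, D, p, q, C):
    """phi_{X,D}(t) = t*D - psi_X(t), the exponent optimized in the Chernoff bound."""
    return t * D - psi(t, p, q, C)

# Sanity checks: psi(0) = 0, and for D = E[X] = C*(q - p) the supremum of
# phi over t is attained at t = 0 with value 0 (psi is convex, so phi is concave).
```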
(i) (Lemma 15 in [23]) For any D ∈ R, where t max = arg sup t∈R (φ X,D (t)), X̃ is a random variable with the same alphabet as X but distributed according to e t max x P(x)/E X (e t max X ), and µ X̃ and σ 2 X̃ are the mean and variance of X̃, respectively. (ii) (Generic Chernoff bound) For any D ∈ R, where t max = arg sup t>0 (φ X,D (t)), and where t max = arg sup t>0 (φ −X,D (t)). (iii) (Multiplicative Chernoff bound) For any t > 1, P(X ≥ tµ) ≤ (e t−1 /t t ) µ , where µ = E(X).
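The multiplicative Chernoff bound in part (iii), in its standard form P(X ≥ tµ) ≤ (e^(t−1)/t^t)^µ for a binomial X with mean µ, can be compared against a Monte-Carlo tail estimate (an illustrative check under this assumed form of the bound):

```python
import math
import random

def chernoff_upper(mu, t):
    """Multiplicative Chernoff bound P(X >= t*mu) <= (e^(t-1) / t^t)^mu, t > 1."""
    return (math.exp(t - 1) / t ** t) ** mu

def empirical_tail(n, p, t, trials=20000, seed=0):
    """Monte-Carlo estimate of P(Binom(n, p) >= t * n * p)."""
    rng = random.Random(seed)
    mu = n * p
    hits = sum(
        sum(rng.random() < p for _ in range(n)) >= t * mu
        for _ in range(trials)
    )
    return hits / trials
```

For n = 200, p = 0.1, t = 1.5, the bound evaluates to about 0.11, while the empirical tail probability is roughly an order of magnitude smaller, as expected for a non-asymptotic upper bound.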
For later use, we consider X = C a,b (Z − W), where Z ∼ Bern(q), W ∼ Bern(p) with p = a log(n)/n m−1 and q = b log(n)/n m−1 , where C m,a,b = (log(a) − log(b))/(2 m−1 (m − 1)!) and ∆ ∈ R is a constant. In the special case ∆ = 0, we have where l m,n = ( n/2 m−1 ), when ∆ > −C m,a,b (a − b), and when ∆ < −C m,a,b (a − b). Here, l m,n and η m,a,b (∆) are defined as in (i).
Proof. We first calculate and approximate φ X (t) and ψ X (t). Denote s = (a/b) t . Then, direct calculation gives the following: Taking D = D m,n = (log(n)/l m,n )(∆ + o(1)) in the first equation of (A1), we have the following, where l m,n = ( n/2 m−1 ): Noting that Plugging into (A6) yields where η m,a,b (∆) and γ m,a,b (∆) are given in (A2).
(1) For part (i), note that s max = s * is the global maximum of φ̃ X,D (t) on R, since φ̃ ′′ X,D (t) < 0. Therefore, the first part is completed by applying the Chernoff bound (see (i) in Lemma A1).
If ∆ > −C m,a,b (a − b), then s max = s * and (A6) hold. This completes the proof of (A4) by applying the Chernoff bound (see (ii) in Lemma A1). Now, we are left to show (A5) in part (ii). Considering the random variable −X, we have Taking D = D m,n = (log(n)/l m,n )(∆ + o(1)) in the second equation of (A1), we have the following, where ∆ ∈ R and C m,a,b is defined as above.
Taking the first derivative of φ̃ −X,D (t) and noting that t > 0, we have the following: In this case, s max = 1 leads to a trivial bound, since e −l m,n φ̃ −X,D (t max =0) = 1.
Then, e −l m,n φ̃ −X (t * ) = n . This completes the proof of (A5) by applying the Chernoff bound (see (ii) in Lemma A1).

Appendix A.2. Proof of Lemma 3
Denote l m,n = ( n/2 m−1 ) and recall that θ n = log(n)/log log(n). Then, where D m,n ,y i = C a,b l m,n C α n C a,b y i + 1 + θ n + ε m,n , with ε m,n = (log(n)/n m−1 )o(1).
For the second part in (4), recall that β > C m,a,b (a − b). Then where D m,n ,y i = C a,b l m,n C α n C a,b y i + 1 + θ n .
. Then where c m depends only on m. We used δ → 0 as d n → ∞, and log 1 + d n log(n) 1 δ m−1 < log 1 δ m−1 for a sufficiently large d n in the last inequality.
For I, we again apply the Chernoff bound (see (ii) in Lemma A4).
where D λ,y i = C a,b l m,n C α n C a,b y i − λ log(n) .
Proof. Let ξ a,b (β m ) := η a,b (β m ) − β m ; we will show that it is convex in β m , with a global minimum value of 0. By (1), we have Here, we used the fact that Taking the first two derivatives of ξ a,b (β m ) w.r.t. β m and using Thus, ξ a,b (β m ) is convex with a unique critical point β * m = C m,a,b (a − b). Hence, ξ a,b (β m ) ≥ ξ a,b (β * m ) = 0.
Proof. Taking the first two derivatives of η a,b (β m ) = ξ a,b (β m ) + β m w.r.t. β m and using (A11), we have the following: Thus, η a,b (β m ) is convex with a unique critical point β * m = −C m,a,b (a − b) < 0. Hence, η a,b (β m ) is increasing in β m for any β m ≥ 0.