Estimating the Number of Communities in Weighted Networks

Community detection in weighted networks has been a popular topic in recent years. However, while several flexible methods exist for estimating communities in weighted networks, these methods usually assume that the number of communities is known, and it is often unclear how to choose that number. Here, to estimate the number of communities for weighted networks generated from an arbitrary distribution under the degree-corrected distribution-free model, we propose an approach that combines weighted modularity with spectral clustering. This approach allows a weighted network to have negative edge weights, and it also works for signed networks. We compare the proposed method with several existing methods and show, on both simulated and real-world networks, that it estimates the number of communities more accurately.


Introduction
For decades, network science has provided substantial quantitative tools for the study of complex systems [1][2][3][4]. Networks emerge in numerous fields including physics, sociology, biology, economics, and so forth [5][6][7][8][9][10][11][12][13][14][15]. The elementary parts of a network are nodes, links, and link weights. A network is unweighted when all link weights are 1 and weighted otherwise [16]. Networks usually have community structure such that nodes within the same community have more connections than nodes in different communities [17,18]. For example, in social networks, communities can be groups of students who attend the same school, belong to the same club, share a graduation year, or are interested in the same movie; in scientific collaboration networks, communities are scientists in the same field [19][20][21]; in protein-protein interaction networks, communities are proteins with similar functions [22,23]. However, in practice, the latent community structure of a network is generally not directly observable, and we need to develop techniques to infer it.
Community detection for unweighted networks has been widely studied for decades [17,18]. Numerous community detection methods have been developed to fit a statistical model that can generate a random network with a community structure. The stochastic blockmodel (SBM) [24] is a classical and popular generative model for unweighted networks. The popular degree-corrected stochastic blockmodel (DCSBM) extends SBM by considering node heterogeneity. Based on SBM and DCSBM, many community detection methods have been developed, such as [25][26][27][28][29][30][31][32][33][34][35][36]. However, most methods require the number of communities K to be known in advance, and this is often not the case for real-world unweighted networks. To address this problem, some methods have been developed to estimate K under SBM or DCSBM [37][38][39][40][41][42][43][44][45][46][47], where the approaches developed in [46] stand out because they estimate K for unweighted networks regardless of the statistical model. A significant drawback of the above SBM-based and DCSBM-based methods is that they ignore edge weights, which are common in network data and can help us understand the community structure of a network better [16]. Recently, community detection in weighted networks has become a hot topic and many statistical models have been developed to fit weighted networks, such as the weighted stochastic blockmodels (WSBM) proposed in [48][49][50][51][52][53][54], the distribution-free model (DFM) of [55], and the degree-corrected distribution-free model (DCDFM) introduced in [56]. Among these models, DFM and its extension DCDFM stand out because they allow edge weights to follow any distribution as long as the expected adjacency matrix follows a block structure related to the community partition.
However, similar to SBM-based and DCSBM-based methods, algorithms developed for the above models also assume that K is known in advance, which is usually impractical for real-world weighted networks. To close this gap, we provide a simple approach to estimate K for weighted networks generated from DCDFM.
The main contributions of this work include: (1) We propose a method by taking advantage of both spectral clustering and weighted modularity to estimate the number of communities for weighted networks generated from arbitrary distribution under DCDFM. The method determines K by increasing the number of communities until weighted modularity does not increase. The method is devised for DCDFM, but it can be naturally applied to weighted networks generated from DFM and unweighted networks generated from SBM and DCSBM since these three models are sub-models of DCDFM.
(2) We conduct a large number of experiments on both computer-generated weighted networks and real-world networks including signed networks. The experimental results show that our method can estimate the number of communities for weighted networks generated by different distributions under DCDFM even when the true K is 1 and it is more accurate than its competitors.

The Degree-Corrected Distribution-Free Model
In this article, we work with the degree-corrected distribution-free model (DCDFM) proposed in [56]. We assume that there exist $K$ non-overlapping clusters $C^{(1)}, C^{(2)}, \ldots, C^{(K)}$, and that each node belongs to exactly one cluster. Let the $n \times 1$ vector $\ell$ denote the node labels such that $\ell_i \in \{1, 2, \ldots, K\}$ is the community label of node $i$ for $i \in [n]$. Let $Z \in \{0, 1\}^{n \times K}$ be the community membership matrix such that $Z_{ik} = 1$ if $\ell_i = k$ and $Z_{ik} = 0$ otherwise. Let $\theta$ be an $n \times 1$ vector whose positive entry $\theta_i$ is the node heterogeneity of node $i$, and let $\Theta$ be the $n \times n$ diagonal matrix whose $i$-th diagonal entry is $\theta_i$. Let $P$ be the $K \times K$ symmetric connectivity matrix such that $P$ has rank $K$, its elements can be any real values in $[-1, 1]$, and $\max_{k,l \in [K]} |P_{kl}| = 1$, where we fix $P$'s maximum absolute element at 1 for convenience because the overall scale is carried by the node heterogeneity parameter $\theta$. For $i, j \in [n]$, DCDFM [56] generates the $(i, j)$-th element of the symmetric adjacency matrix $A$ of an undirected weighted network $N$ as follows: $A_{ij}$ is a random variable generated from an arbitrary distribution $F$ with expectation $\Omega_{ij}$, where $\Omega$ is defined as
$$\Omega = \Theta Z P Z' \Theta. \qquad (1)$$
DCDFM includes several previous models. For example, when $\theta_i = \sqrt{\rho}$ for all $i \in [n]$, DCDFM reduces to the distribution-free model [55]; when $F$ is the Bernoulli distribution and $P$'s elements are non-negative, DCDFM reduces to the classical degree-corrected stochastic blockmodel [57]; when $F$ is the Bernoulli distribution, all elements of $\theta$ are the same, and $P$'s elements are non-negative, DCDFM reduces to the popular stochastic blockmodel [24]. That is, SBM, DCSBM, and DFM are sub-models of DCDFM. As analyzed in [56], $F$ can be any distribution as long as the expectation of $A$ under $F$ is $\Omega$. Whether $P$'s elements can be negative depends on the distribution $F$.
For example, when $F$ is a Bernoulli, Binomial, Poisson, Geometric, or Exponential distribution, $P$'s elements should be non-negative or positive; when $F$ is a Normal or Laplace distribution, or when $A$ is the adjacency matrix of a signed network, $P$'s elements can be negative. Because $F$ is arbitrary, DCDFM can model a wide range of weighted networks. Once $n$, $K$, the node label vector $\ell$, $P$, and $\theta$ are set, we can generate the adjacency matrix $A$ for any distribution $F$ under DCDFM as long as Equation (1) holds. Given $A$ and the known number of clusters $K$, ref. [56] designs an efficient spectral algorithm called nDFA to estimate the node label vector and shows that nDFA enjoys consistent estimation under DCDFM for any distribution $F$ satisfying Equation (1). However, nDFA requires $K$ to be known in advance, which is not the case in practice. To address this problem, in this article we aim to develop an efficient method to estimate the number of communities $K$ when only the adjacency matrix $A$ is known, where $A$ is generated from DCDFM with $K$ communities for an arbitrary distribution $F$ satisfying Equation (1).
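The generative step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code: the function names `expected_matrix` and `sample_adjacency`, the 0-based labels, the toy parameter values, and the choice of a Normal distribution with unit variance for $F$ are all assumptions made for the example.

```python
import random

def expected_matrix(theta, labels, P):
    """Entrywise form of Omega = Theta Z P Z' Theta:
    Omega_ij = theta_i * theta_j * P[l_i][l_j] (labels are 0-based here)."""
    n = len(labels)
    return [[theta[i] * theta[j] * P[labels[i]][labels[j]] for j in range(n)]
            for i in range(n)]

def sample_adjacency(omega, draw, rng):
    """Symmetric A with E[A_ij] = Omega_ij.
    `draw(mean, rng)` is any sampler whose expectation equals `mean`."""
    n = len(omega)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # fill the upper triangle and mirror it
            A[i][j] = A[j][i] = draw(omega[i][j], rng)
    return A

# Toy parameters (hypothetical): 4 nodes, 2 communities, one negative entry in P.
rng = random.Random(0)
labels = [0, 0, 1, 1]
theta = [1.0, 0.5, 1.0, 0.5]
P = [[1.0, -0.2], [-0.2, 0.8]]  # symmetric, full rank, max |P_kl| = 1
omega = expected_matrix(theta, labels, P)

# F = Normal(Omega_ij, 1), which permits negative edge weights.
A = sample_adjacency(omega, lambda mean, r: r.gauss(mean, 1.0), rng)
```

Swapping the `draw` argument for another sampler with the same mean changes the distribution $F$ without touching the block structure, which is the sense in which the model is distribution-free.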

Estimation of the Number of Communities
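Our method for estimating $K$ is closely related to the modularity for signed networks introduced in [58], which extends the popular Newman-Girvan modularity [59] from unweighted networks to signed networks. Rather than considering only signed networks, we extend the modularity developed in [58] to weighted networks whose adjacency matrix $A$ may have any finite real entries by introducing indicator functions. Let the $n \times n$ symmetric adjacency matrix $A$ be generated from DCDFM for an arbitrary distribution $F$ satisfying Equation (1). Let $A^+$ and $A^-$ be the $n \times n$ matrices with $A^+_{ij} = \max(A_{ij}, 0)$ and $A^-_{ij} = \max(-A_{ij}, 0)$, let $d^+_i = \sum_{j} A^+_{ij}$ and $d^-_i = \sum_{j} A^-_{ij}$, and let $m^+ = \frac{1}{2}\sum_{i} d^+_i$ and $m^- = \frac{1}{2}\sum_{i} d^-_i$. Let $\hat{\ell}$ be an $n \times 1$ node label vector returned by running a community detection method $M$ on $A$ with $k$ communities such that $\hat{\ell}_i$ takes values in $\{1, 2, \ldots, k\}$. Based on the community partition $\hat{\ell}$ obtained from the method $M$, the positive modularity $Q^+$ and the negative modularity $Q^-$ are defined as
$$Q^+ = \frac{1}{2m^+} \sum_{i,j} \left( A^+_{ij} - \frac{d^+_i d^+_j}{2m^+} \right) \delta(\hat{\ell}_i, \hat{\ell}_j)\, 1_{m^+ > 0}, \qquad Q^- = \frac{1}{2m^-} \sum_{i,j} \left( A^-_{ij} - \frac{d^-_i d^-_j}{2m^-} \right) \delta(\hat{\ell}_i, \hat{\ell}_j)\, 1_{m^- > 0},$$
where $\delta(\hat{\ell}_i, \hat{\ell}_j)$ is the Kronecker delta function, and $1_{m^+ > 0}$ and $1_{m^- > 0}$ are indicator functions such that $1_{m^+ > 0} = 1$ if $m^+ > 0$ and 0 otherwise, and likewise for $1_{m^- > 0}$; when an indicator is 0, the corresponding modularity is taken to be 0. The weighted modularity considered in this article is defined as
$$Q_M(k) = \frac{2m^+}{2m^+ + 2m^-}\, Q^+ - \frac{2m^-}{2m^+ + 2m^-}\, Q^-. \qquad (2)$$
When all edge weights are non-negative, so that $m^- = 0$, the weighted modularity reduces to the Newman-Girvan modularity. When $A$ has both positive and negative entries, the weighted modularity reduces to the modularity introduced in [58]. The weighted modularity in Equation (2) measures the quality of a community partition for a weighted network whose adjacency matrix has any finite real elements, and it is more general than the modularity introduced in [58]. As with the Newman-Girvan modularity, a larger weighted modularity $Q_M(k)$ indicates a better community partition.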
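To make the weighted modularity concrete, the sketch below computes it for a small adjacency matrix. It assumes the signed-modularity form of [58] (positive and negative parts of $A$ treated separately and combined with weights proportional to $2m^+$ and $2m^-$) and treats a part with zero total weight as contributing 0, which is the role the indicator functions play. The function name `weighted_modularity` and the toy network are illustrative choices.

```python
def weighted_modularity(A, labels):
    """Weighted modularity of a partition, assuming the signed form of [58]:
    Q = (2m+ * Q+ - 2m- * Q-) / (2m+ + 2m-)."""
    n = len(A)

    def signed_part(sign):
        # B holds the positive part (sign=+1) or negative part (sign=-1) of A.
        B = [[max(sign * A[i][j], 0.0) for j in range(n)] for i in range(n)]
        d = [sum(row) for row in B]      # weighted degrees of this part
        two_m = sum(d)                   # equals 2m for this part
        if two_m == 0:                   # indicator 1_{m>0} is 0
            return 0.0, 0.0
        q = sum(B[i][j] - d[i] * d[j] / two_m
                for i in range(n) for j in range(n)
                if labels[i] == labels[j]) / two_m
        return q, two_m

    q_pos, two_m_pos = signed_part(+1)
    q_neg, two_m_neg = signed_part(-1)
    total = two_m_pos + two_m_neg
    if total == 0:                       # empty network: nothing to score
        return 0.0
    return (two_m_pos * q_pos - two_m_neg * q_neg) / total

# Toy signed network: positive edges inside communities, one negative edge across.
A_signed = [[0, 1, -1, 0],
            [1, 0, 0, 0],
            [-1, 0, 0, 1],
            [0, 0, 1, 0]]
q_two = weighted_modularity(A_signed, [0, 0, 1, 1])   # two communities
q_one = weighted_modularity(A_signed, [0, 0, 0, 0])   # everything in one community
```

Placing all nodes in a single community gives a modularity of zero for both the positive and the negative part, which is why a one-community network can be recognised by no split scoring higher than $k = 1$.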
In Equation (2), we write the weighted modularity as a function of the number of communities $k$ and the community detection method $M$ to emphasize that it may differ across values of $k$ and across community detection methods. We estimate the number of communities $K$ by increasing $k$ until the weighted modularity in Equation (2) no longer increases. Suppose the candidate values of $K$ lie in $\{1, 2, \ldots, K_0\}$. For a community detection algorithm $M$, our estimate of $K$, given by Equation (3), is the candidate $k$ selected by this rule. In this paper, to estimate the number of communities for weighted networks generated from DCDFM, we choose the method $M$ to be the nDFA algorithm designed in [56], because nDFA enjoys consistent estimation of community memberships under DCDFM and is computationally fast. For convenience, when $M$ is the nDFA algorithm, we call our method for estimating $K$ via Equation (3) nDFAwm, where "wm" stands for weighted modularity. The details of the nDFA algorithm [56] are given below.
• Let $\tilde{A} = \hat{U} \Lambda \hat{U}'$ be the top-$k$ eigendecomposition of $A$.
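The search over $k$ described in this section can be sketched independently of the clustering method. The function below takes a callable returning the weighted modularity $Q_M(k)$ for a given $k$ (in practice, the modularity of the partition produced by a method such as nDFA) and increases $k$ until the value stops increasing. The name `estimate_num_communities`, the argument `k_max`, and the modularity values in the example are hypothetical.

```python
from typing import Callable

def estimate_num_communities(q_of_k: Callable[[int], float], k_max: int) -> int:
    """Increase k from 1 and stop as soon as Q_M(k + 1) <= Q_M(k).

    q_of_k(k) should return the weighted modularity of the k-community
    partition found by the chosen detection method M.
    """
    if k_max < 1:
        raise ValueError("k_max must be at least 1")
    k = 1
    best = q_of_k(1)
    while k < k_max:
        nxt = q_of_k(k + 1)
        if nxt <= best:      # modularity no longer increases: stop here
            break
        k, best = k + 1, nxt
    return k

# Hypothetical modularity curve, for illustration only.
curve = {1: 0.0, 2: 0.31, 3: 0.42, 4: 0.40, 5: 0.37}
k_hat = estimate_num_communities(curve.__getitem__, k_max=5)
```

When the modularity curve has a single peak, this stopping rule returns the same value as taking the maximiser over all candidates; if the curve has several local maxima, the rule returns the first one, so the two choices can differ.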

Experimental Results
In this section, we present both simulation results and real-world experiments to compare our nDFAwm with three model-free methods in the literature for estimating the number of communities: the modularity eigengap (ME for short) method proposed in [60], the non-backtracking (NB) method designed in [46], and the Bethe Hessian matrix-based method BHac developed in [46].

Simulations
In this section, we investigate the performance of nDFAwm and competing algorithms on adjacency matrices generated from nine distributions under DCDFM. For each parameter setting, we report the accuracy rate of each method over 100 repetitions, where the accuracy rate is the fraction of repetitions in which the estimated number of clusters $\hat{K}$ equals the true number of clusters $K$.
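As a small worked illustration of this metric, the helper below computes the accuracy rate from a list of estimates, one per repetition. The function name `accuracy_rate` and the example estimates are made up for the illustration.

```python
def accuracy_rate(estimates, true_k):
    """Fraction of repetitions whose estimated K equals the true K."""
    if not estimates:
        raise ValueError("need at least one repetition")
    return sum(1 for k_hat in estimates if k_hat == true_k) / len(estimates)

# Four hypothetical repetitions with true K = 2; three of them are correct.
rate = accuracy_rate([2, 2, 3, 2], true_k=2)
```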
To generate simulated weighted networks from DCDFM, we first need to define $n$, $K$, $\theta$, $Z$, and $P$. For $n$, unless specified, we let $n = 50K$. For $Z$, we let each node belong to one of the $K$ clusters with equal probability, i.e., there are around 50 nodes in each cluster. For $\theta$, unless specified, we let $\theta_i = \mathrm{rand}(1) \cdot \sqrt{\rho}$, where the positive number $\rho$ controls network sparsity and $\mathrm{rand}(1)$ is a random number drawn from the uniform distribution on the interval $(0, 1)$. We set $n$, $K$, $P$, and $\rho$ independently for each simulation. After setting these model parameters, we generate $A$ under DCDFM for several distributions $F$ satisfying Equation (1). For our nDFAwm, we set the upper bound on the candidate number of communities to $K_c = 20$, since the largest $K$ in our simulations is six. In this paper, we consider Bernoulli, binomial, Poisson, geometric, exponential, normal, Laplace, and uniform distributions, where details on the probability mass function or probability density function of these distributions can be found in http://www.stat.rice.edu/~dobelman/courses/texts/distributions.c&b.pdf (accessed on 9 November 2022). We also consider the signed network case in our simulation studies.
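The default parameter choices described above ($n = 50K$, uniformly assigned labels, and $\theta_i$ equal to a uniform draw times $\sqrt{\rho}$) can be sketched as follows. The function name `simulation_parameters`, the 0-based labels, and the seed are assumptions for the example, and the connectivity matrix $P$ is left out because it is set separately for each experiment.

```python
import math
import random

def simulation_parameters(K, rho, rng, nodes_per_cluster=50):
    """Draw labels and node heterogeneity for one simulated network."""
    n = nodes_per_cluster * K
    # Each node joins one of the K clusters with equal probability (0-based labels).
    labels = [rng.randrange(K) for _ in range(n)]
    # theta_i = rand(1) * sqrt(rho); rng.random() lies in [0, 1), and the
    # value 0.0 has negligible probability, so theta_i is positive in practice.
    theta = [rng.random() * math.sqrt(rho) for _ in range(n)]
    return n, labels, theta

rng = random.Random(1)
n, labels, theta = simulation_parameters(K=3, rho=0.5, rng=rng)
```

Because labels are drawn independently, cluster sizes are only approximately 50; a balanced design would instead assign exactly 50 nodes to each cluster.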

Bernoulli Distribution
When $F$ is the Bernoulli distribution, DCDFM reduces to DCSBM. By the property of the Bernoulli distribution, $E[A_{ij}] = \Omega_{ij}$ satisfies Equation (1) and $\Omega_{ij}$ is a probability in $[0, 1]$. Hence $\rho$ ranges over $(0, 1]$, and all elements of $P$ should be non-negative. For the Bernoulli distribution, we consider the following simulations. Experiment 1 (d): connectivity across communities. Let $K = 2$, $\rho = 1$, $P$'s diagonal entries be 1, $P$'s off-diagonal entries be $\beta$, and $\beta$ range in $\{0.1, 0.2, \ldots, 0.8\}$. Figure 1 shows the accuracy rate of Experiment 1. Panel (a) of Figure 1 shows that as the network becomes denser, all methods estimate the number of clusters more accurately. For Experiment 1 (a), all methods perform similarly. For Experiment 1 (b), panel (b) of Figure 1 shows that our nDFAwm performs best. Panel (c) of Figure 1 shows that our nDFAwm performs worse than NB and BHac, while ME fails. Except for ME, all methods perform better as the network becomes denser in Experiment 1 (c). Panel (d) of Figure 1 shows that all methods perform worse as the off-diagonal entries of $P$ approach the diagonal entries, and that our nDFAwm performs slightly worse than ME while outperforming NB and BHac.

Binomial Distribution
When $F$ is the binomial distribution, $A_{ij} \sim \mathrm{Binomial}(m, \Omega_{ij}/m)$, so that $E[A_{ij}] = \Omega_{ij}$ satisfies Equation (1). Experiment 2 (d): connectivity across communities. Let $K = 2$, $\rho = 1$, $m = 5$, and $P$ be the same as in Experiment 1 (d). Figure 2 shows the accuracy rate of Experiment 2. For Experiments 2 (a), 2 (b), and 2 (c), the results are similar to those of Experiments 1 (a), 1 (b), and 1 (c), respectively, and we omit the analysis here. For Experiment 2 (d), panel (d) of Figure 2 shows that our nDFAwm performs similarly to NB and BHac, while ME performs best.

Poisson Distribution
When $F$ is the Poisson distribution, $A_{ij} \sim \mathrm{Poisson}(\Omega_{ij})$, i.e., $A_{ij}$ is a non-negative integer for $i, j \in [n]$. By the property of the Poisson distribution, $E[A_{ij}] = \Omega_{ij}$ satisfies Equation (1) and $\Omega_{ij}$ is non-negative. Hence $\rho$ ranges over $(0, +\infty)$, and all elements of $P$ should be non-negative. Experiment 3 (d): connectivity across communities. Let $K = 2$, $\rho = 2$, and $P$ be the same as in Experiment 1 (d). Figure 3 shows the accuracy rate of Experiment 3. The results are similar to those of Experiment 2, and we omit the analysis here.

Geometric Distribution
When $F$ is the geometric distribution, $A_{ij} \sim \mathrm{Geometric}(1/\Omega_{ij})$, i.e., $A_{ij}$ is a positive integer for $i, j \in [n]$. Since $P(A_{ij} = m) = \frac{1}{\Omega_{ij}} \left(1 - \frac{1}{\Omega_{ij}}\right)^{m-1}$ for $m = 1, 2, \ldots$ and $0 < 1/\Omega_{ij} \leq 1$, all elements of $P$ must be positive. By the property of the geometric distribution, $E[A_{ij}] = \Omega_{ij}$ satisfies Equation (1). For convenience, we let $\theta_i = \sqrt{\rho}$ for $i \in [n]$, so that DCDFM reduces to DFM in this case. Then we have $\Omega = \rho Z P Z'$. Since $\Omega_{ij} \geq 1$ for $i, j \in [n]$, we need $\rho \min_{k,l \in [K]} P_{kl} \geq 1$. Experiment 4 (d): connectivity across communities. Let $K = 2$, $\rho = 10$, and $P$ be the same as in Experiment 1 (d). Figure 4 shows the accuracy rate of Experiment 4. Unlike in Experiments 1-3, the numerical results of Experiment 4 show that our nDFAwm successfully estimates the number of communities in all cases, while NB and BHac fail when the network is generated from the geometric distribution under DCDFM. The ME method fails when the true $K$ is 1 and performs similarly to our nDFAwm in the other cases.
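The constraint $\Omega_{ij} \geq 1$ follows from using $1/\Omega_{ij}$ as the success probability. The sketch below draws such edge weights by inverse-transform sampling and checks the sample mean against the target. The function name `sample_geometric`, the seed, and the sample size are assumptions for the illustration.

```python
import math
import random

def sample_geometric(mean, rng):
    """Draw m in {1, 2, ...} with P(m) = p * (1 - p)**(m - 1), p = 1 / mean."""
    if mean < 1.0:
        raise ValueError("a geometric variable on {1, 2, ...} needs mean >= 1")
    p = 1.0 / mean
    if p == 1.0:
        return 1                      # degenerate case: always 1
    u = 1.0 - rng.random()            # uniform on (0, 1], so log(u) is finite
    # Inverse transform: P(m > t) = (1 - p)**t for t = 0, 1, 2, ...
    return math.floor(math.log(u) / math.log1p(-p)) + 1

rng = random.Random(0)
draws = [sample_geometric(4.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)   # should be close to 4
```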

Exponential Distribution
When $F$ is an exponential distribution, $A_{ij} \sim \mathrm{Exponential}(1/\Omega_{ij})$, i.e., $A_{ij} \in \mathbb{R}^+$ for $i, j \in [n]$. Since $1/\Omega_{ij} > 0$, all elements of $P$ must be positive and $\rho$ ranges over $(0, +\infty)$. By the property of the exponential distribution, $E[A_{ij}] = \Omega_{ij}$ satisfies Equation (1).
Experiment 5 (d): connectivity across communities. Let $K = 2$, $\rho = 5$, and $P$ be the same as in Experiment 1 (d). Figure 5 shows the accuracy rate of Experiment 5. In general, our nDFAwm estimates $K$ more accurately than its competitors, except in Experiment 5 (d), where ME performs slightly better. From panels (a) and (c) of Figure 5, it is interesting to find that NB and BHac perform worse as $\rho$ increases. Panels (b) and (d) of Figure 5 show that NB and BHac fail in Experiments 5 (b) and 5 (d).
Experiment 6 (d): connectivity across communities. Let $K = 2$, $\sigma^2 = 1$, $\rho = 2$, $P$'s diagonal entries be 1, $P$'s off-diagonal entries be $\beta$, and $\beta$ range in $\{-0.5, -0.4, \ldots, 0.9\}$. Figure 6 shows the accuracy rate of Experiment 6. In general, our nDFAwm outperforms its competitors, except in Experiment 6 (d), where it performs similarly to ME. Panels (a), (b), and (d) of Figure 6 show that NB and BHac fail. Panel (c) of Figure 6 shows that although NB and BHac perform worse than our nDFAwm, their estimates become more accurate as $\rho$ increases in Experiment 6 (c).

Laplace Distribution
When $F$ is the Laplace distribution, $A_{ij} \sim \mathrm{Laplace}(\Omega_{ij}, \sqrt{\sigma^2/2})$, i.e., $A_{ij} \in \mathbb{R}$ for $i, j \in [n]$, where $\Omega_{ij}$ and $\sigma^2$ are the expectation and variance of the Laplace distribution, respectively. As with the normal distribution, $E[A_{ij}] = \Omega_{ij}$ satisfies Equation (1), all elements of $P$ are real values, and $\rho$ ranges over $(0, +\infty)$. Figure 7 displays the accuracy rate of Experiment 7. The numerical results are similar to those of Experiment 6, and we omit the analysis here.
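A Laplace variable with scale $b$ has variance $2b^2$, so a target variance $\sigma^2$ corresponds to $b = \sqrt{\sigma^2/2}$. The sketch below samples such edge weights as a scaled difference of two unit exponentials, a standard construction, and checks the empirical mean and variance. The function name `sample_laplace`, the seed, and the parameter values are assumptions for the illustration.

```python
import math
import random

def sample_laplace(mean, variance, rng):
    """Draw from a Laplace distribution with the given mean and variance."""
    b = math.sqrt(variance / 2.0)     # scale parameter: variance = 2 * b**2
    # The difference of two independent Exp(1) variables is Laplace(0, 1).
    return mean + b * (rng.expovariate(1.0) - rng.expovariate(1.0))

rng = random.Random(0)
xs = [sample_laplace(1.0, 2.0, rng) for _ in range(40000)]
emp_mean = sum(xs) / len(xs)
emp_var = sum((x - emp_mean) ** 2 for x in xs) / len(xs)
```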

Uniform Distribution
When $F$ is the uniform distribution, $A_{ij} \sim \mathrm{Uniform}(0, 2\Omega_{ij})$. In this case, $E[A_{ij}] = \Omega_{ij}$ satisfies Equation (1), all elements of $P$ are non-negative, and $\rho$ ranges over $(0, +\infty)$: since $A_{ij} \in (0, 2\max_{i,j \in [n]} \Omega_{ij})$, there is no restriction on $\rho$ as long as it is positive.
Figure 9 displays the accuracy rate of Experiment 9. We see that our nDFAwm estimates the number of clusters more accurately than its competitors, except in Experiment 9 (d), where it performs similarly to ME. ME fails in Experiments 9 (a) and 9 (c). NB and BHac fail to estimate $K$ except in Experiment 9 (c), where their estimates improve as $\rho$ increases.

Real-World Networks
For real-world networks, we consider the eight data sets in Table 1. The ground-truth numbers of communities of these eight networks are known, which provides a reasonable baseline for comparing estimators. The Karate club (weighted) network is a weighted network with non-negative edge weights, the Gahuku-Gama subtribes network is a signed network, the Slovene Parliamentary Party network is a weighted network with positive and negative edge weights, and the other five data sets are unweighted. For visualization, Figure 10 displays the adjacency matrices of the weighted networks considered in this paper. The Karate club (weighted) network can be downloaded from http://vlado.fmf.uni-lj.si/pub/networks/data/ucinet/ucidata.htm#kazalo (accessed on 12 November 2022) and is the weighted version of the classical Karate club network. The Gahuku-Gama subtribes network can be downloaded from http://konect.cc/networks/ucidata-gama/ (accessed on 12 November 2022), and its ground-truth node labels can be found in Figure 9 (b) of [61]. The Slovene Parliamentary Party network can be downloaded from http://vlado.fmf.uni-lj.si/pub/networks/data/soc/Samo/Stranke94.htm (accessed on 12 November 2022). The other five data sets, with ground-truth node labels, can be downloaded from http://www-personal.umich.edu/~mejn/netdata/ (accessed on 12 November 2022). In particular, for the Dolphins network, as analyzed in [62], both $K = 2$ and $K = 4$ are reasonable.
For real-world networks, we compare our nDFAwm with the modularity eigengap (ME) [60], NB [46], BHm [46], BHa [46], BHmc [46], and BHac [46]. For our nDFAwm, we take $K_c = n$. Figure 11 displays the weighted modularity from Equation (2) obtained by the nDFA algorithm for different choices of the number of clusters, and nDFAwm's estimated $K$ for the eight real-world networks can be read directly from Figure 11. Table 1 shows the estimated number of clusters for these networks.
For all networks except the Political books network, our nDFAwm determines the correct number of communities. The ME method estimates the correct K for Karate club (weighted), Slovene Parliamentary Party network, Dolphins, and Political blogs, while it fails for the other four networks. The NB and BHm methods estimate K correctly only for Dolphins, Karate club, and Political books. BHa, BHmc, and BHac estimate K correctly only for Dolphins and Karate club. In particular, the non-backtracking method and the Bethe Hessian matrix-based methods proposed in [46] fail to estimate the number of communities for the three real-world weighted networks in Table 1. Overall, our nDFAwm outperforms its competitors on these real-world networks.

Conclusions and Future Work
In this paper, we proposed a method for determining the number of communities for weighted networks under DCDFM. The method combines weighted modularity with a spectral clustering algorithm. It can estimate the number of communities even when there is only one community in a weighted network generated by different distributions under DCDFM. On a large number of computer-generated weighted networks from DCDFM and several real-world networks, the numerical results show that our approach estimates the number of communities more accurately than its competitors, and that it also works for signed networks.
There are some open questions. First, building a theoretical guarantee on the consistency of our estimator for the true number of clusters under DCDFM is an attractive and challenging task. Second, determining the exact condition under which estimating the number of clusters is possible under DCDFM is a challenging problem. Third, in this paper, we are mainly interested in DCDFM for non-overlapping weighted networks, but the idea can be extended to overlapping weighted networks [70]. Fourth, in this paper, we estimate the number of communities for weighted networks generated from DCDFM by Equation (3) with the spectral method nDFA as the method M. If M is instead one of the algorithms developed in [48][49][50][51][52][53][54] to fit weighted stochastic blockmodels, it is natural to ask whether K can also be estimated for these models through Equation (3). We leave these questions for future work.