Spectral Clustering Community Detection Algorithm Based on Point-Wise Mutual Information Graph Kernel

To address the problem that traditional spectral clustering algorithms cannot obtain the complete structural information of networks, this paper proposes a spectral clustering community detection algorithm, PMIK-SC, based on the point-wise mutual information (PMI) graph kernel. The kernel is constructed according to the point-wise mutual information between nodes, which is then used as a proximity matrix to reconstruct the network and obtain the symmetric normalized Laplacian matrix. Finally, the network is partitioned by the eigendecomposition and eigenvector clustering of the Laplacian matrix. In addition, to determine the number of clusters during spectral clustering, this paper proposes a fast algorithm, BI-CNE, for estimating the number of communities. For a specific network, the algorithm first reconstructs the original network and then runs Monte Carlo sampling to estimate the number of communities by Bayesian inference. Experimental results show that the detection speed and accuracy of the algorithm are superior to other existing algorithms for estimating the number of communities. On this basis, the spectral clustering community detection algorithm PMIK-SC also has high accuracy and stability compared with other community detection algorithms and spectral clustering algorithms.


Introduction
A complex network is an important cross-cutting research branch of computer science, statistical physics, and systems science, and an indispensable tool for analyzing and studying interaction events in many real systems. Various kinds of networks and network-like systems are ubiquitous in real life, such as interpersonal networks [1], infectious disease networks, biological systems [2], and so on. In complex networks, the independent individuals in the system are usually referred to as nodes; the connections between these individuals are called edges; and clusters of closely connected nodes are called communities.
Community structure is an important feature of complex networks, and the question of how to effectively perform community detection has attracted many scholars in various fields. Watts et al. first proposed the small-world network model by observing and studying the "small-world phenomenon" (known as the six-degrees-of-separation theory) in real-world complex networks [3]. Barabási et al. found that the degree distribution of real-world complex networks obeys a power law, indicating the scale-free nature of complex networks [4]. The study and analysis of community structure help to uncover the laws of the dynamic evolution of a network, to find the weak points of a system, or to verify the system's corresponding functions, all of which are of great importance for predicting the future development of a network and its possible dynamic behavior.
Community detection is the process of revealing the latent community structure in complex networks, and it has important real-life applications: for example, mining social groups with common interests and similar social backgrounds in social networks for accurate content recommendation, or building dynamics models of infectious disease networks to predict epidemic trends and support accurate measures for epidemic prevention and control.
Since Newman et al. proposed the classical GN algorithm [5], research on community detection algorithms has continued steadily, and the spectral clustering algorithm based on spectral graph partitioning theory is one of the classical community detection algorithms. Through a comparative study of various community detection algorithms, including LPA-based, modularity-based, and information-entropy-based approaches, we found that spectral-clustering-based algorithms can achieve higher accuracy than these alternatives, on the condition that a suitable proximity matrix and the exact number of clusters are provided. However, traditional spectral-clustering-based algorithms can hardly handle community detection on complex networks, mainly for two reasons: first, most spectral clustering community detection algorithms cannot work effectively on complex networks with an unknown number of communities; second, limited by the singularity of the structural features extracted from proximity matrices, their generalization capability is not strong enough to effectively reflect the complex structural information in the network. To address these two problems, the BI-CNE algorithm and the PMIK-SC algorithm are proposed.
BI-CNE is a fast algorithm, based on Bayesian inference, for estimating the number of communities (CN for short). It addresses the problem that spectral clustering algorithms require prior knowledge of the CN, while traditional CN estimation methods are slow and inaccurate. The algorithm first performs a fast pruning reconstruction of the network, then performs Bayesian inference based on the degree-corrected stochastic block model, and finally obtains the estimated CN by Monte Carlo sampling. Meanwhile, to speed up the sampling process, the network reconstruction result is used as the initial state of sampling, and the overall sampling acceptance rate is improved by controlling the node transfer direction during sampling. Experiments show that the algorithm outperforms existing CN estimation algorithms.
Our preliminary research indicates that mutual information can effectively measure the relationships between communities in a network. Mutual-information-based community detection methods, such as MINC-NRL [6] and AMI-MLPA [7], can achieve accurate community detection. Similarly, the relationships between nodes can also be measured within an information-theoretic framework. We constructed a Laplacian kernel based on point-wise mutual information, referred to as the PMI kernel, and proposed a spectral clustering algorithm, PMIK-SC. The PMI kernel proves effective in addressing the problem that the proximity matrices used by traditional spectral clustering algorithms cannot capture the complete structural information of a given network, which contributes to the accuracy of community detection. Experiments show that the algorithm achieves better performance than state-of-the-art graph-kernel-based spectral clustering algorithms.
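As background for the PMI kernel introduced later, the point-wise mutual information between two nodes is PMI(i, j) = log(p(i, j) / (p(i) p(j))). A minimal sketch is given below; the edge-frequency estimator of p(i, j) used here is an illustrative assumption for exposition, not the kernel construction defined in this paper.

```python
import numpy as np

def pmi_matrix(A, eps=1e-12):
    """Point-wise mutual information between node pairs.

    The joint distribution p(i, j) is estimated as the frequency of the
    (undirected) edge (i, j) among all edges -- an illustrative estimator
    only, not the paper's kernel construction.
    """
    A = np.asarray(A, dtype=float)
    p_joint = A / A.sum()              # p(i, j) from edge frequencies
    p_node = p_joint.sum(axis=1)       # marginal p(i)
    return np.log((p_joint + eps) / (np.outer(p_node, p_node) + eps))
```

For the path graph 0-1-2, the PMI of the adjacent pair (0, 1) is higher than that of the non-adjacent pair (0, 2), so thresholding or transforming such a matrix yields a proximity structure usable by spectral clustering.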
The rest of the article is organized as follows: Section 2 summarizes the current research progress on estimating the number of communities and on graph-kernel-based spectral clustering algorithms. Section 3 introduces BI-CNE, the proposed Bayesian-inference-based algorithm for estimating the number of communities. Section 4 derives the definition of the point-wise mutual information graph kernel and introduces the spectral clustering community detection algorithm based on it. Section 5 presents experiments that verify the effectiveness of the two algorithms, and Section 6 concludes the article.

Traditional Algorithms for Estimating the Number of Communities
Most of the current spectral clustering community detection algorithms cannot efficiently partition complex networks with an unknown number of communities [8]. If information on the number of communities (CN) is available, the accuracy of these community detection algorithms can be greatly improved. Many scholars have proposed CN estimation algorithms for this purpose, among which the most well-known is the modularity optimization method, essentially a class of algorithms that combines community size selection with community partitioning operations [9]. Modularity is an evaluation metric that measures the quality of community detection, proposed by Newman et al. in 2004 and given an updated definition in 2006 [10]. The basic idea is that, since a random network has no community structure, a community partition is better if it differs more from a corresponding random network.
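Modularity can be computed directly from its standard definition, Q = (1/2m) Σ_ij (A_ij − d_i d_j / 2m) δ(g_i, g_j); a minimal sketch:

```python
import numpy as np

def modularity(A, g):
    """Newman's modularity: Q = (1/2m) * sum_ij (A_ij - d_i*d_j/2m) * delta(g_i, g_j)."""
    A = np.asarray(A, dtype=float)
    g = np.asarray(g)
    two_m = A.sum()                      # 2m: twice the number of edges
    d = A.sum(axis=1)                    # node degrees
    same = np.equal.outer(g, g)          # delta(g_i, g_j): same community
    return float(((A - np.outer(d, d) / two_m) * same).sum() / two_m)
```

For two disconnected triangles, the natural two-community partition gives Q = 0.5, while a partition that mixes the triangles gives a negative value, illustrating how modularity separates good partitions from random ones.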
Although the modularity method is widely used, two problems remain: first, for most real-world complex networks, the modularity value of the ground-truth community partition does not reach the optimal modularity; second, the modularity metric has a resolution limit [11], i.e., modularity-based methods cannot detect very small communities in complex networks but prefer to divide the network into multiple large communities. Therefore, when estimating the CN in large-scale complex networks, such methods tend to obtain a smaller value, which may differ greatly from the actual CN.
The topological potential method is another common type of CN estimation algorithm. The basic idea is to extend the concepts of potential and field from physics to complex networks and to partition the network by high and low potential values in order to estimate the number of communities. With the help of node topological potential, scholars have proposed a number of CN estimation algorithms based on the hill-climbing method, which searches for local extrema. After calculating the topological potential of the nodes in a complex network, the hill-climbing method traverses the nodes in the direction of rising potential and uses the local extremal points found as community centers to obtain the number of communities.
The traditional hill-climbing method needs to calculate the topological potential of the nodes, where the complexity of computing the shortest paths between all pairs of nodes is O(n³) and the complexity of searching for all local potential extrema is O(n²), which is prohibitive, especially for large-scale networks. Second, the number of local potential extrema obtained by the algorithm is not necessarily the final CN, because a community center may connect to other community centers, so that more than one community hides behind a single local potential extremum. To address this, scholars improved the algorithm by introducing a network concavity parameter to search for local potential maxima [12]. However, the improved algorithm can hardly handle the following two cases: first, multiple extrema detected in the same community need to be identified and merged; second, the potential extrema of different communities need to be split if they are covered by an edge to a larger extremum. Therefore, topological-potential-based algorithms for estimating the number of communities cannot be applied well to large-scale complex network analysis.
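A minimal sketch of the topological potential idea, using the commonly used Gaussian-decay form φ(v) = Σ_u exp(−(d(v, u)/σ)²) with unit node masses; the parameter choices and the local-maximum criterion below are illustrative assumptions, not the specific improved algorithm of [12]:

```python
import math
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src via breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def topological_potential(adj, sigma=1.0):
    """phi(v) = sum over other reachable nodes u of exp(-(d(v, u)/sigma)^2)."""
    phi = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        phi[v] = sum(math.exp(-(d / sigma) ** 2) for u, d in dist.items() if u != v)
    return phi

def potential_maxima(adj, phi):
    """Nodes whose potential is not exceeded by any neighbor (candidate community centers)."""
    return {v for v in adj if all(phi[v] >= phi[u] for u in adj[v])}
```

On two triangles joined by a bridge edge, the two bridge endpoints are the local maxima, giving a CN estimate of 2; as noted above, however, extrema may merge or split communities in harder cases.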
Neither modularity optimization algorithms nor heuristic algorithms, such as topological-potential-based methods, can give satisfactory estimates of the number of communities. Therefore, some scholars have tried to derive the actual number of communities by maximizing the approximation to the data likelihood under generative network graph models [13]. Among these methods, the most commonly used generative graph model is the stochastic block model (SBM) [14,15]. Estimation based on the SBM mainly focuses on how to sample the probability space of the parameters and therefore needs to determine the likelihood of the parameters. Newman et al. proposed an algorithm for estimating the number of communities based on statistical inference with the SBM [8], estimating the number of communities by Monte Carlo sampling. However, this method cannot be applied to large-scale networks due to computational speed limitations. Riolo et al. optimized the sampling speed of this method, using an improved Chinese restaurant process to determine the parameter priors, which can be applied to large-scale networks; however, the sampling speed is still far from satisfactory, and the estimates on large-scale networks are not completely accurate [16].

Graph-Kernel-Based Spectral Clustering Algorithm
To address the shortcomings of traditional clustering algorithms, especially the problem that they are easily trapped in local optima, researchers proposed spectral clustering algorithms based on spectral graph theory. In a spectral clustering algorithm, a suitable proximity matrix or similarity matrix needs to be constructed, which directly affects the final clustering result [17]. There are several methods to construct a proximity matrix; one of them is the kernel method. The core idea is to map linearly inseparable data onto a linearly separable kernel space. The kernel function directly calculates the inner product of the mapping to represent the similarity among data, without finding the specific mapping relationships, which are usually difficult or impossible to obtain explicitly.
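For reference, the standard normalized spectral clustering pipeline that these algorithms build on can be sketched as follows: form the symmetric normalized Laplacian from the proximity matrix, take the eigenvectors of the k smallest eigenvalues, row-normalize, and cluster the rows. The deterministic farthest-point k-means initialization below is an implementation convenience, not part of the paper's method.

```python
import numpy as np

def spectral_partition(W, k):
    """Cluster a graph from its proximity matrix W via normalized spectral clustering."""
    W = np.asarray(W, dtype=float)
    n = len(W)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]  # L_sym
    _, vecs = np.linalg.eigh(L)                    # eigenvalues in ascending order
    U = vecs[:, :k]                                # k smallest eigenvectors
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

    # deterministic farthest-point initialization, then Lloyd iterations
    idx = [0]
    for _ in range(1, k):
        gaps = np.min(((U[:, None] - U[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(gaps)))
    centers = U[idx].copy()
    for _ in range(100):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = U[labels == c].mean(axis=0)
    return labels
```

On two triangles joined by a single edge, this pipeline recovers the two triangles as communities; the choice of W is exactly where a graph kernel such as the PMI kernel enters.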
The study of kernel methods for complex network structures falls into two main categories: graph embedding, which uses kernel functions to embed network structures into vector spaces, and graph kernels, which are mainly used to measure the similarity of structures. Graph embedding yields a vectorized representation of the network structure, which is then processed by applying a vector-based kernel function. However, because the network data are reduced to a vector space, much of the structural information in the network cannot be preserved. In contrast, a graph kernel operates directly on the network structure data: by defining a suitable kernel, the input network data are mapped from the original space to a high- or infinite-dimensional feature space, and the structural information is preserved efficiently and completely in the Hilbert space. In the following, graph kernels and feature extraction methods based on the adjacency matrix, the Laplacian matrix, and path lengths are briefly described.
The communicability kernel [18] is a graph kernel based on the adjacency matrix and belongs to the symmetric exponential diffusion kernels. In complex networks, the communicability between nodes is usually measured by the shortest path length. The strategy of the communicability kernel is that a node can communicate with another node through all paths between them, but the longer the path, the lower its contribution to the communicability.
The heat kernel [19,20] is a graph kernel based on the Laplacian matrix and, like the communicability kernel, belongs to the symmetric exponential diffusion kernels. Since the decay coefficient of the heat kernel yields a higher decay rate, some scholars have applied it to large-scale networks for community detection and obtained good results [20].
The commute time kernel [21] is a path-length-based graph kernel. The commute time from node v i to v j is usually defined as the expected time for a random walk to start from node v i , reach node v j , and return to node v i . The commute time kernel is proved to be equal to the pseudo-inverse of the Laplacian matrix; since the Laplacian matrix always has a zero eigenvalue, L is singular, and the pseudo-inverse is usually computed using the Moore-Penrose generalized inverse.
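The three kernels above follow directly from their definitions: K = e^A for the communicability kernel, K = e^{−tL} for the heat kernel, and K = L⁺ for the commute time kernel. A minimal sketch, computing the matrix exponential of a symmetric matrix via eigendecomposition (the diffusion parameter t below is illustrative):

```python
import numpy as np

def sym_expm(M):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.exp(vals)) @ vecs.T

def communicability_kernel(A):
    return sym_expm(np.asarray(A, dtype=float))          # K = e^A

def heat_kernel(A, t=1.0):
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A                       # combinatorial Laplacian
    return sym_expm(-t * L)                              # K = e^{-tL}

def commute_time_kernel(A):
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(L)                             # Moore-Penrose pseudo-inverse
```

On the path graph 0-1-2, communicability is higher for the adjacent pair than for the distant pair, and each row of the heat kernel sums to 1 because the all-ones vector lies in the null space of L.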
All the graph kernels and methods mentioned above can extract certain aspects of the structural features of graphs and can be combined with community detection algorithms to analyze specific types of networks. However, limited by the singularity of the extracted structural features, their generalization ability is not strong, so it is necessary to design a graph kernel that can effectively reflect the complex structural information inside a graph while retaining strong generalization ability.

Network Pruning Reconstruction
For real-world complex networks, especially large-scale networks, it is hard to quickly estimate the CN due to the large number of nodes and edges. If the original network can be pruned by data pre-processing, such as removing unimportant nodes and edges, the effectiveness of subsequent analysis will improve. In addition, if the original network can be quickly decomposed into several small connected graphs with sparse interconnections, the computational complexity can be greatly reduced, and this network splitting approach loses less information than node compression methods (such as the Louvain algorithm [22]). Based on these two points, the original network is pruned and reconstructed according to the idea of common neighbors before estimating the actual CN. First, consider the definition of a clique in a network: a clique is a subgraph in which any node has an edge to all the remaining nodes. That is, a clique is a complete subgraph; but because the condition of a complete graph is too strict, the cliques in a real network are often not large enough to allow a reasonable splitting of the original network. Therefore, we consider a relaxation of the clique condition and use common neighbors to transform the condition of the complete graph as follows: n − 2 common neighbors exist between any two nodes of a complete graph of n nodes. The following section considers pruning the nodes and edges of the network graph using the number of common neighbors.
In a network graph, the number of common neighbors between two nodes is defined as the cutoff value of these two nodes. A cutoff value of 0 indicates that there are no common neighbors between the nodes, which usually correspond to peripheral nodes or inter-community bridge nodes of the network, as shown in Figure 1, where node 3 and node 4 are inter-community bridge nodes and node 5 and node 6 are peripheral nodes. At this point, if the edges with cutoff values less than 1 are removed from the network, the original network decomposes into three connected graphs, including two n = 3 complete subgraphs and one isolated node, as shown in Figure 2.
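The pruning step described above can be sketched as follows, with the graph stored as a dict of neighbor sets (an illustrative data layout):

```python
def prune_by_cutoff(adj, k):
    """Keep only the edges whose endpoints share at least k common neighbors
    (i.e., remove edges with cutoff value below k)."""
    return {u: {v for v in adj[u] if len(adj[u] & adj[v]) >= k} for u in adj}
```

On a Figure-1-style example, two triangles joined by a bridge edge plus one peripheral node, pruning at k = 1 removes the bridge edge and the pendant edge, decomposing the network into two complete subgraphs and an isolated node.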
To use the cutoff value to relax the clique condition, it is necessary to determine the appropriate range of cutoff values. The node connectivity in the network at cutoff values in [1, 4] is shown in Figure 3. The solid nodes are the observed objects, and the hollow nodes are the common neighbors of the two solid nodes. The solid lines indicate real edges, and the dashed lines indicate the edges that need to be added to form a clique. As the cutoff value increases, the proportion of dashed edges to be supplemented grows, the relaxation condition becomes stricter, and the degree of network decomposition becomes higher. Assuming that the cutoff value is k, the number of real edges in the local network is 2k + 1 and the number of supplementary edges is k(k − 1)/2; when k = 6, the number of supplementary edges exceeds the number of real edges. In other words, the number of supplementary edges required to restore two real nodes and their common neighbors to a clique exceeds the number of existing edges. Moreover, real-world complex networks are usually sparse, i.e., the average degree is low, and partitioning the network with a high cutoff value would produce a large number of free nodes, which is inconsistent with the original idea of network reconstruction. Therefore, the network should be pruned with a reasonable value of k selected in the range [0, 6].
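The edge-count comparison above is easy to verify numerically: for cutoff value k, the local structure has 2k + 1 real edges and needs k(k − 1)/2 supplementary edges, and k = 6 is the first value at which the supplement exceeds the real edges.

```python
def first_excess_cutoff():
    """Smallest k where supplementary edges k(k-1)/2 exceed real edges 2k+1."""
    k = 0
    while k * (k - 1) // 2 <= 2 * k + 1:
        k += 1
    return k
```

At k = 5 there are 10 supplementary versus 11 real edges, while at k = 6 there are 15 supplementary versus 13 real edges, which motivates restricting k to [0, 6].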


Bayesian Inference
After completing the pruning reconstruction of the network, the BI-CNE algorithm uses the degree-corrected stochastic block model to fit real-world complex networks. The generation process of a random network by the original stochastic block model is shown in Figure 4. First, given the number of nodes n and the number of communities k of the network, nodes are randomly assigned to the k communities with community assignment probability γ = {γ_r | r ∈ [1, …, k]}, Σ_r γ_r = 1. In the second step, edges are placed according to the connected-edge probability matrix ω, i.e., the probability that there exists a connected edge between node v i located in community r and node v j located in community s is ω rs . In an undirected graph, ω rs = ω sr . Since the probability of an edge between two nodes depends only on the communities to which they belong, and the specific node attributes have no effect on it, the original stochastic block model can only fit networks whose degrees obey Poisson distributions. This is the reason why the original stochastic block model does not give good results for CN estimation or community detection in real-world networks.
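The two-step generation process described above can be sketched as follows, with Poisson-distributed edge counts matching the model's assumptions; all parameter values used in the example are illustrative:

```python
import numpy as np

def generate_sbm(n, gamma, omega, seed=0):
    """Two-step SBM generation: (1) assign each node to a community with
    probabilities gamma; (2) draw the edge count between v_i and v_j
    as Poisson(omega[g_i, g_j])."""
    rng = np.random.default_rng(seed)
    k = len(gamma)
    g = rng.choice(k, size=n, p=gamma)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = rng.poisson(omega[g[i]][g[j]])
    return g, A
```

With a strongly assortative ω (large diagonal, small off-diagonal), sampled networks are much denser inside communities than between them, which is the structure Bayesian inference later inverts.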
The degree-corrected stochastic block model makes it possible to fit a network with an arbitrary degree distribution by differentiating the node connectivity probabilities; its process of generating a random graph is shown in Figure 5. The main difference from the original stochastic block model is that, in the second step, the connection probability between nodes depends not only on the communities to which they belong but also on the degree distribution of the nodes, i.e., the connection probability between a node v i located in community r and a node v j located in community s is θ i θ j ω rs . As the degree sequence of the four nodes of the red community in Figure 5 is [1, 2, 3, 3], which represents the expected degree of each node, the ratio of the edges connected to the nodes in the red community will be controlled as 1 : 2 : 3 : 3. Since the sizes of the communities are not consistent, the parameter θ within each community needs to be normalized.
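The normalization mentioned above can be sketched as rescaling θ so that it averages to 1 within each community (equivalently, sums to the community size n_r); this is one common convention, assumed here for illustration. The degree-corrected edge rate is then θ_i θ_j ω_rs.

```python
import numpy as np

def normalize_theta(theta, g):
    """Rescale theta so that, within each community r, sum_i theta_i = n_r
    (theta averages to 1 per community); relative ratios are preserved."""
    theta = np.asarray(theta, dtype=float).copy()
    g = np.asarray(g)
    for r in np.unique(g):
        mask = g == r
        theta[mask] *= mask.sum() / theta[mask].sum()
    return theta
```

For the red community of Figure 5, the expected degrees [1, 2, 3, 3] normalize to a sum of 4 (the community size) while keeping the 1 : 2 : 3 : 3 ratio.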

On the basis of the degree-corrected stochastic block model, consider the inverse process of its generation of random networks, i.e., deriving the number of real communities of the network backward from the existing network structure. The specific derivation process
is given below. Given an undirected network G = (V, E), the number of nodes is |V| = n and the number of edges is |E| = m. Let the adjacency matrix be A; then A_ij denotes the number of connected edges between nodes i and j, and the diagonal element A_ii is equal to twice the number of self-loop edges of node i. A community partition is defined as g = {g_i | i ∈ [1, . . . , n]}, where g_i denotes the number of the community to which node i belongs. The connected edges between nodes v_i and v_j, located in community r and community s, respectively, obey a Poisson distribution with mean θ_i θ_j ω_{g_i, g_j}, where {θ_i} and {ω_rs} are the model parameters. For computational convenience, the parameters θ are constrained and normalized so that ∑_{i=1}^{n} θ_i δ(g_i, r) = n_r, where n_r is the number of nodes within community r and δ(r, g_i) is the Kronecker function, equal to 1 if r = g_i and 0 otherwise. Given the number of communities k, the community partition g, and the parameters θ, ω, the probability of generating the specified network with adjacency matrix A is given by Equation (3), where the first product represents the inter-community edge probability between nodes and the second product represents the edge probability between nodes within the same community. Substituting Equation (1) into Equation (3) and neglecting the constant multiplier, we obtain Equation (4), where d_i denotes the degree of node v_i and m_rs denotes the total number of connected edges between community r and community s. Since the intermediate parameters θ and ω are irrelevant to the problem modeling, their prior selection and integral elimination are considered. Here, following reference [16] for the prior selection of the parameters θ and ω, the final likelihood function is obtained as shown in Equation (5), where κ_r = ∑_i d_i δ(r, g_i) is the sum of the degrees of all nodes within community r.
Entropy 2023, 25, 1617

On the basis of determining the parameter prior and obtaining the likelihood function, the Bayesian model can be used for the task of inferring the number of communities k. Given the network adjacency matrix A, the probability distribution P(g, k | A) is shown in Equation (6): P(g, k | A) = P(A | g, k) P(g, k) / P(A).
where the likelihood function P(A|g, k) has been determined.The joint probability distribution P(g, k) can be calculated using Equation (7) as in [16].At this point, the Bayesian inference process for the degree-corrected stochastic block model has been completed, providing a theoretical basis for Monte Carlo sampling in the next subsection.

Monte Carlo Sampling
After completing the prior derivation of the key parameters, the complete expression of the conditional probability P(g, k | A) is determined up to the evidence term P(A). The posterior probability P(k | A) of the number of communities k can be obtained by summing over all community partitions g, so that the most probable number of communities k can be deduced. However, the total number of possible partitions is k^n, which is impossible to traverse exhaustively. Here, the Monte Carlo method is introduced to sample these partitions. The pair (k, g) is considered as the "state" of the network to be sampled, and k is counted during sampling; finally, the k corresponding to the maximum value of P(k | A) is the estimated number of communities. The sampling process consists of two main types of sampling steps:

• Operation 1: Move node v_i from community r to an existing community s; the number of communities k remains unchanged.
• Operation 2: Move node v_i to a new community; then k = k + 1.
An effective Monte Carlo sampling algorithm needs to satisfy ergodicity and detailed balance. Ergodicity requires that any state of the system be reachable from any other through a finite sequence of Monte Carlo steps; the above process of moving a single node from one community to another satisfies this condition. Detailed balance requires that the rate R(g, k → g′, k′) of moving from the current state (g, k) to another state (g′, k′) and the rate R(g′, k′ → g, k) of returning satisfy P(g, k | A) R(g, k → g′, k′) = P(g′, k′ | A) R(g′, k′ → g, k). A traditional acceptance/rejection pattern is used in each step, where a move operation is proposed with probability π and accepted with probability α.
The final Monte Carlo sampling process is as follows: (1) Initialization: Shuffle the nodes and assign them to the given maximum of k_max communities, ensuring that no community is empty. (2) Sampling: Execute Operation 1 with probability 1 − 1/(n − 1) or Operation 2 with probability 1/(n − 1).
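As an illustration, the sampling loop above can be sketched as a Metropolis-style procedure. The sketch below is a simplification: `estimate_k` and `log_post` are hypothetical names, and the generic Metropolis acceptance rule stands in for the paper's exact acceptance ratio, which is not reproduced here.

```python
import math
import random
from collections import Counter

def estimate_k(A, k_max, steps, log_post, seed=0):
    """Schematic Metropolis sampler over partition states (g, k).

    log_post(g, A) is a stand-in for log P(A | g, k) + log P(g, k).
    The most frequently sampled k approximates argmax P(k | A).
    """
    rng = random.Random(seed)
    n = len(A)
    g = [rng.randrange(k_max) for _ in range(n)]   # random initial partition
    counts = Counter()
    for _ in range(steps):
        i = rng.randrange(n)
        old = g[i]
        if rng.random() < 1.0 - 1.0 / (n - 1):     # Operation 1: existing community
            s = rng.choice(sorted(set(g)))
        else:                                      # Operation 2: open a new community
            s = max(g) + 1
        cur = log_post(g, A)
        g[i] = s
        if math.log(rng.random() + 1e-300) > log_post(g, A) - cur:
            g[i] = old                             # reject the proposed move
        counts[len(set(g))] += 1                   # tally k for P(k | A)
    return counts.most_common(1)[0][0]             # most frequent k
```

A toy score such as `lambda g, A: -sum(1 for u in range(len(A)) for v in range(len(A)) if A[u][v] and g[u] != g[v])` (penalizing cut edges) can be plugged in as `log_post` to exercise the loop.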

Sampling Acceleration
Experiments on the Monte Carlo sampling proposed in the previous section reveal that both the initialization step and sampling Operation 1 can be optimized. For the initialization in Step (1), nodes are randomly assigned to k_max communities, so nodes assigned to the same community are, with high probability, not connected to each other, and the subsequent sampling process requires a large number of iterations to reach a reasonable community partition state. To address this problem, the network reconstruction method in Section 3.1 can be applied: nodes belonging to the same connected clique after reconstruction are assigned to the same initial community, and all free nodes are assigned to their own separate communities.
For Operation 1, i.e., randomly selecting communities r and s and moving a node v_i from community r to community s, the problem is the low sampling acceptance rate. Since the selection of community s is random and only a few communities have edges linked to node v_i, this node-transfer operation will be rejected with high probability. This leads to a slow sampling rate, so it is necessary to control the selection of community s. If a node's community transfer operation is accepted, the new partition state is more likely to reflect the original network topology than the one before the change. Therefore, when node v_i is not connected to the other nodes of its original community r, it may transfer to any other community with high probability, so community s is still selected equiprobably; when node v_i is connected to the rest of community r, a weight is assigned to each community according to the nodes' connectedness to determine the probability of selecting community s. The weight for node v_i to transfer from community r to community s is calculated as shown in Equation (12), where m_rs denotes the number of connected edges between community r and community s, n_t denotes the total number of nodes in community t, and k is the current number of communities. The communities connected to node v_i are called its neighboring communities, and the control parameter α ensures that each community has a certain probability of being selected even if its number of connected edges m_ts with the neighboring communities is 0; α can be set to 1.
β_it denotes the fraction of node v_i's degree contributed by the edges between node v_i and community t, calculated as β_it = ∑_j A_ij δ(g_j, t) / d_i, where A_ij is the number of edges connecting node v_i and node v_j, δ(g_j, t) is the Kronecker function, and d_i is the degree of node v_i. The main idea of the community transfer weight formula, Equation (12), is that the weight of node v_i transferring to community s will be greater if:
• Node v_i is more closely connected to its neighboring communities;
• Community s is more closely connected to the neighboring communities of node v_i;
• The size of community s is smaller.
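The degree fraction β_it can be computed directly from the adjacency matrix; the following is a minimal sketch (the function name `beta` is our own):

```python
def beta(A, g, i, t):
    """Fraction of node i's degree carried by edges into community t,
    i.e. beta_it = sum_j A[i][j] * delta(g[j], t) / d_i."""
    d_i = sum(A[i])                                   # degree of node i
    return sum(A[i][j] for j in range(len(A)) if g[j] == t) / d_i
```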
The final node transfer probability is calculated as shown in Equation (14).
So far, the derivation and algorithm design of the Bayesian-inference-based community number estimation algorithm (BI-CNE) have been completed. Figure 6 shows the main procedure of the algorithm. The specific experimental results are shown in Section 5.2.

PMIK-SC Algorithm

PMI-Kernel Derivation
One of the key problems in spectral clustering algorithms is how to construct a suitable proximity matrix, and graph kernels are one way to construct such a matrix. Graph kernels built in different ways extract different structural features of networks: kernels based on the shortest path may give more weight to short edges in the network, while kernels based on sub-tree structures can obtain richer information about the graph structure but suffer from path backtracking. Currently, many graph kernels are constructed based on R-convolution theory, but such kernels have three drawbacks:

• A large amount of structural information in non-isomorphic subgraphs is ignored.
• The positions of isomorphic sub-structures in the original network cannot be reflected by the kernels.
• The kernels only deal with small-size sub-structures, which cannot fully reflect the structural information of the network.
From the perspective of information theory, we introduce point-wise mutual information (PMI) and design the PMI-Kernel based on the exponentially decaying diffusion model.Point-wise mutual information is used to measure the information-theoretical correlation between two variables.
Compared to R-convolution-based kernels, PMI-based kernels have the following features: First, instead of directly extracting the features of nodes, it aims to reveal the correlation between nodes.This transforms the community detection problem into a node clustering problem and minimizes the information loss during the clustering process as much as possible.In addition to this, the point-wise mutual information matrix constructed based on the infinite-order transition probability matrix can not only express the local information of each node's neighborhood but also retain the global information of the target network.
Therefore, graph kernels based on PMI avoid the drawbacks of R-convolution-based kernels. PMI-based kernels no longer need to consider the isomorphism and position of sub-structures, and the neighborhood sub-structure information of a specific node is directly stored in the corresponding row of the PMI matrix in the form of multiple-order accumulation. In addition, since the PMI matrix is constructed based on the infinite-order transition probability matrix, the global information of the network is preserved, so PMI-based kernels can more fully reflect the structural information of a network.
For a pair of discrete random variables x and y, the point-wise mutual information is defined as the logarithm of the ratio of the joint probability distribution to the product of the marginal probability distributions, as shown in Equation (15): pmi(x; y) = log [p(x, y) / (p(x) p(y))]. The value range of PMI is [−∞, min(−log p(x), −log p(y))]. The PMI value indicates the correlation between the two random variables: if x and y are independent, pmi(x; y) is 0; if there is a negative correlation between x and y, pmi(x; y) is negative.
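The definition in Equation (15) amounts to a one-line computation:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Point-wise mutual information of one outcome pair:
    log of the joint probability over the product of the marginals."""
    return math.log(p_xy / (p_x * p_y))
```

For independent variables, p(x, y) = p(x)p(y), so `pmi(0.25, 0.5, 0.5)` is 0; a joint probability above the independent baseline gives a positive value.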
In order to apply PMI to the network structure, suitable probabilities must be selected as the marginal and joint probabilities in Equation (15). First, consider the first-order transfer probability matrix P_1 = D^{-1} A, where D and A are, respectively, the degree matrix and the adjacency matrix of the network. Experiments show that if the first-order transfer probability matrix is used to calculate the PMI values of a network with a low average degree, the accuracy of community detection is low. This is because global information cannot be obtained from the first-order transfer probability matrix in a sparse network. To address this problem, the exponentially decaying diffusion model based on the transfer probability matrix is introduced to calculate the infinite-order transfer probability matrix P of a specific network, as shown in Equation (18).
The exponentially decaying diffusion model follows the principle that the influence between nodes decays with the increase in their distance in the network.Such information on pairwise influence contributes to finding the closely connected nodes in a network, and that is exactly the goal of community detection.
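The decayed diffusion can be sketched as a truncated power series over the first-order transfer matrix. The decay factor and the truncation depth below are illustrative assumptions, since Equation (18) itself is not reproduced here and sums to infinite order.

```python
import numpy as np

def diffusion_matrix(A, decay=0.5, h_max=50):
    """Truncated exponentially decaying diffusion:
    P = sum_{h=1..h_max} decay^h * P1^h, with P1 = D^{-1} A."""
    A = np.asarray(A, dtype=float)
    P1 = A / A.sum(axis=1, keepdims=True)   # first-order transfer matrix D^{-1} A
    P = np.zeros_like(P1)
    term = np.eye(len(A))
    for _ in range(h_max):
        term = decay * (term @ P1)          # accumulates decay^h * P1^h
        P += term
    return P
```

Because each P1^h is row-stochastic, every row of P sums to the geometric series ∑ decay^h, so higher-order neighbors contribute with exponentially decreasing weight, matching the stated principle that influence decays with distance.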
The sum of the elements in each row of matrix P is not necessarily 1; the matrix therefore needs to be row normalized as P̃ = D_P^{-1} P, where D_P is a diagonal matrix whose diagonal elements are the sums of the corresponding rows of P, i.e., D_P(i, i) = ∑_j P(i, j), as shown in Equation (20). In the normalized transfer probability matrix P̃, each element P̃(i, j) is equal to the sum of the 1st, 2nd, . . ., and h-th order transfer probabilities from node i to node j. Then, the elements of the PMI matrix M_PMI can be calculated by substituting the corresponding probabilities into Equation (15), as shown in Equation (21). Note that the PMI matrix obtained at this point cannot be directly used as a proximity matrix or kernel matrix for spectral clustering. That is because:

• The matrix is not symmetric, since the transfer probabilities from node i to node j and from node j to node i are not necessarily equal; they are determined by the degrees of the two nodes and all possible paths starting from each node.
• From the range of PMI values, [−∞, min(−log p(x), −log p(y))], there may be negative elements in the matrix.
To address these two issues, the matrix is first symmetrized by averaging M_PMI with its transpose, as in Equation (22). Then, to handle the negative values, the elements of M_PMI are normalized to [0, 1] in order to retain the original structural information of networks with different average degrees, as shown in Equation (23).
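The pipeline from the diffusion matrix to the kernel can be sketched as follows. The specific choices of joint and marginal probabilities (joint taken as P̃(i, j)/n, marginals as the row and column masses of the joint) are our assumptions about Equations (19)-(23), not the paper's exact definitions.

```python
import numpy as np

def pmi_kernel(P):
    """From a diffusion matrix P to a symmetric, [0, 1]-scaled PMI kernel.

    Steps: row-normalize P, form assumed joint/marginal probabilities,
    take element-wise PMI, symmetrize, then min-max normalize."""
    P = np.asarray(P, dtype=float)
    n = len(P)
    Pt = P / P.sum(axis=1, keepdims=True)        # row normalization
    joint = Pt / n                               # assumed joint p(i, j)
    px = joint.sum(axis=1, keepdims=True)        # marginal p(i)
    py = joint.sum(axis=0, keepdims=True)        # marginal p(j)
    with np.errstate(divide="ignore"):
        M = np.log(joint / (px * py))            # point-wise mutual information
    M = (M + M.T) / 2.0                          # symmetrization
    finite = np.isfinite(M)                      # -inf where joint prob. is 0
    lo, hi = M[finite].min(), M[finite].max()
    return np.where(finite, (M - lo) / (hi - lo), 0.0)  # min-max to [0, 1]
```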

Implementation
Based on the BI-CNE algorithm and PMI-Kernel, the implementation of the spectral clustering community detection algorithm based on the point-wise mutual information graph kernel (PMIK-SC) will be described in detail in terms of the number of communities, graph reconstruction, and graph cut criterion.

Number of Communities
For the input requirement of the number of communities k for the PMIK-SC algorithm, the BI-CNE algorithm proposed in Section 3 is used to quickly estimate k for the network. This is conducted by Monte Carlo sampling over the community partition states of the network and selecting the k value corresponding to the maximum of P(k | A); in short, the most frequently sampled k value. Since the time complexity of the BI-CNE algorithm is only O(kn), and the number of iterations required for the algorithm to converge is greatly reduced by the sampling acceleration, the number of communities k can be quickly obtained as the input to the PMIK-SC algorithm.

Graph Reconstruction
The PMI-Kernel matrix K_PMI derived in the previous section is used as the proximity matrix for spectral clustering. For small- and medium-scale networks, the infinite-order transfer probability matrix can extract rich topology information from the network, and the similarity between any two nodes can be represented by the corresponding element of the PMI-Kernel matrix. This helps the spectral clustering algorithm better delineate the community structure, as shown in Figure 7. In contrast, for large-scale sparse networks, the computational cost can be reduced by limiting the influence range of nodes. For example, if the influence range of each node is set to l hops, the topological structural information of the network can be extracted by an l-order transfer probability matrix; this can be implemented simply by replacing ∞ with l in Equation (18). The impact of replacing the infinite-order transition probability matrix with an l-order transfer probability matrix is experimentally analyzed in Section 5.3.1.

To reconstruct the graph for spectral clustering using the PMI-Kernel matrix K_PMI, a distance matrix S is first constructed using the proximities-to-distances formula [23], as shown in Equation (24), where S(i, j) is an element of S. The distance matrix is then used to construct a weighted adjacency matrix W using the K-nearest-neighbor (KNN) construction, as shown in Equation (25), where KNN(i) denotes the K_N nearest neighbor nodes of node i, with the distances between nodes measured using S, and σ is the variance of the Gaussian distribution, set to 1.0 in the implementation. Finally, a graph cut criterion is used to partition the reconstructed graph to obtain the result of spectral clustering.
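The KNN reconstruction step can be sketched as below. The Gaussian weight follows the form shown in the paper's pseudocode (Algorithm 1); the function name and the default `k_nn` are illustrative.

```python
import numpy as np

def knn_adjacency(S, k_nn=3, sigma=1.0):
    """Weighted adjacency W from a distance matrix S via KNN reconstruction:
    W(i, j) = exp(-S(i, j) / (2 sigma^2)) if i is among j's neighbors
    or j among i's, else 0."""
    n = len(S)
    W = np.zeros((n, n))
    order = np.argsort(S, axis=1)                # columns sorted by distance
    for i in range(n):
        for j in order[i, 1:k_nn + 1]:           # skip self (distance 0)
            w = np.exp(-S[i, j] / (2 * sigma ** 2))
            W[i, j] = W[j, i] = w                # "i in KNN(j) or j in KNN(i)"
    return W
```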

Graph Cut Criterion
The normalized cut (NCut) is selected as the criterion to partition the reconstructed graph. This criterion measures both the similarity between nodes within the same community and the difference between nodes in different communities, and it performs better for spectral clustering than other graph-cut criteria. Note that the NCut criterion requires an eigendecomposition and eigenvector clustering of a symmetric normalized Laplacian matrix L_sym, constructed as L_sym = D_W^{-1/2} L D_W^{-1/2} (Equation (26)), where D_W is the degree matrix of W, whose diagonal elements are the sums of the corresponding rows of W, i.e., D_W(i, i) = ∑_j W(i, j), and L = D_W − W is the Laplacian matrix.
On this basis, the community partition results are obtained by calculating the eigenvectors corresponding to the smallest k eigenvalues of L sym , and finally, clustering using the k-means algorithm.
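The Laplacian construction and spectral embedding described above can be sketched as follows (the function name is our own; k-means on the returned feature matrix completes the partition):

```python
import numpy as np

def spectral_embedding(W, k):
    """Symmetric normalized Laplacian of W and the eigenvectors of its
    k smallest eigenvalues (the feature matrix passed to k-means)."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                           # unnormalized Laplacian L = D_W - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt          # D_W^{-1/2} L D_W^{-1/2}
    vals, vecs = np.linalg.eigh(L_sym)           # eigh returns ascending eigenvalues
    return L_sym, vecs[:, :k]
```

For a graph with k connected components, L_sym has exactly k zero eigenvalues, and the corresponding eigenvectors are piecewise-constant indicators of the components, which is why clustering their rows recovers the communities.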
The procedure of the algorithm is as follows. Input: adjacency matrix A of network G, the number of communities k.
(1) Calculate the first-order transfer probability matrix P_1 = D^{-1} A and the infinite-order transfer probability matrix P according to Equation (18).
(2) Calculate the PMI matrix M_PMI according to Equation (21).
(3) Symmetrize and normalize M_PMI to obtain the PMI-Kernel matrix K_PMI.
(4) Calculate the distance matrix S using Equation (24).
(5) Reconstruct the network based on S by Equation (25) and obtain the weight adjacency matrix W.
(6) Construct the symmetric normalized Laplacian matrix L_sym using Equation (26).
(7) Eigendecompose the Laplacian matrix L_sym to obtain the first k smallest eigenvalues and the corresponding eigenvectors to form the feature matrix.
(8) Perform k-means clustering on the row vectors of the feature matrix to obtain the final community partitioning result g.
The pseudo-code of the algorithm is shown in Algorithm 1.

Algorithm 1. PMIK-SC algorithm
Require: Adjacency matrix A, number of communities k
Ensure: Community partition g
1: P_1 ← D^{-1} A
2: P ← infinite-order transfer probability matrix by Equation (18)
3: P̃ ← D_P^{-1} P
4: for i ← 0 to n do
5:   for j ← 0 to n do
6:     M_PMI(i, j) ← PMI value by Equation (21)
7:   end for
8: end for
9: K_PMI ← symmetrize and normalize M_PMI
10: S ← proximities_to_distances(K_PMI)
11: for i, j ← 0 to n do
12:   if i in KNN(j) or j in KNN(i) then
13:     W(i, j) ← exp(−S(i, j)/2σ²)
14:   else
15:     W(i, j) ← 0
16:   end if
17: end for
18: L_sym ← D_W^{-1/2}(D_W − W)D_W^{-1/2}
19: g ← k-means on the eigenvectors of the k smallest eigenvalues of L_sym
20: return g

The experiments are conducted on both real-world network datasets and the LFR synthetic network datasets [24]. Their main properties are shown in Tables 1 and 2, respectively, where n is the number of nodes, m is the number of edges, K is the ground-truth number of communities, d is the average degree of nodes, and µ is the mixing parameter for LFR networks. Among them, L1-L4 have the same scale and are used to compare the accuracy of the BI-CNE and PMIK-SC algorithms under different mixing parameters (µ). L5-L9 are generated using the same d and µ but with different numbers of nodes, to measure the performance of PMIK-SC under different network scales.

Evaluation Indexes
For CN estimation, the ground-truth number of communities is used as the evaluation index: the closer the estimated CN is to the ground truth, the better. For the community detection task, the evaluation indexes are the normalized mutual information (NMI) and the modularity (Q). NMI [25] measures the difference between the partitions obtained by a community detection algorithm and the ground-truth partitions; the larger the NMI value, the closer the community partition result is to the ground truth.
The NMI is calculated as shown in Equation (27): NMI(A, B) = −2 ∑_i ∑_j C_ij log(C_ij N / (C_i· C_·j)) / [∑_i C_i· log(C_i· / N) + ∑_j C_·j log(C_·j / N)], where A represents the ground-truth partition and B represents the partition obtained by a community detection algorithm; C_A and C_B represent the numbers of communities of A and B, respectively; C is a confusion matrix in which the element C_ij represents the number of nodes that are both in community i of A and in community j of B; C_i· and C_·j are, respectively, the sums of the elements in the i-th row and the j-th column of C; and N represents the total number of nodes of the network.
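The confusion-matrix NMI can be sketched as below; the Danon et al. form of the index is assumed for Equation (27).

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information between two partitions (label lists)
    via the confusion matrix C."""
    a, b = np.asarray(a), np.asarray(b)
    N = len(a)
    C = np.zeros((a.max() + 1, b.max() + 1))
    for x, y in zip(a, b):
        C[x, y] += 1                                      # co-occurrence counts
    rows, cols = C.sum(axis=1), C.sum(axis=0)
    num = sum(-2 * C[i, j] * np.log(C[i, j] * N / (rows[i] * cols[j]))
              for i in range(C.shape[0])
              for j in range(C.shape[1]) if C[i, j] > 0)
    den = sum(r * np.log(r / N) for r in rows if r > 0) + \
          sum(c * np.log(c / N) for c in cols if c > 0)
    return num / den
```

Identical partitions (up to relabeling) give 1; independent partitions give values near 0.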
Modularity is an evaluation index measuring the quality of a given partition of a network [5,10], calculated by Equation (28): Q = (1/2m) ∑_{u,v} [A_uv − k_u k_v / (2m)] δ(c_u, c_v), where u, v denote a pair of nodes and k_u, k_v are their degrees; m denotes the total number of edges in the network; A is the adjacency matrix, in which the element A_uv denotes the number of edges connecting nodes u and v; c_u, c_v denote the communities to which nodes u and v belong; and δ(c_u, c_v) is the Kronecker delta defined as in Equation (2). Conductance was widely used in the study of graph cuts before being applied to community detection [26,27]. For a cluster c, the conductance is defined as φ(c) = m_c^ext / (2 m_c^int + m_c^ext), where m_c^int is the number of intra-cluster edges of cluster c and m_c^ext is the number of external edges linking the cluster to the rest of the network. Here, we use the average conductance (AC) of all communities to evaluate the result of community detection. Intra-cluster density [28] measures the density of edges within a cluster and is defined as ρ(c) = m_c / (n_c (n_c − 1)/2), where n_c and m_c are, respectively, the numbers of nodes and edges of cluster c, and the denominator is the number of possible edges within the cluster.
Similarly, we use the average intra-cluster density (AICD) to measure the quality of a partition, which is the average of the intra-cluster densities of all communities in the partition.

The experiments are conducted single-threaded on a PC with the technical details listed in Table 3.

The effect of pruning reconstruction is related to the average node degree, node degree distribution, and actual community size distribution of the network itself, which can be observed more easily in the experiments on LFR synthetic networks. The node degree distribution and community size distribution of an LFR synthetic network both obey a power-law distribution. For example, dataset L1 has a mixing parameter of 0.3, so its community structure is clearer than that of a network with a mixing parameter of 0.5; its average node degree is 15, and the number of connected sub-graphs obtained by pruning reconstruction at a cutoff value of 5 is 48, which is close to the actual CN of 49. For a dataset with a large average degree, such as L4, the network still cannot be split at a cutoff value lower than 4, and the number of connected sub-graphs is 1. Accordingly, fewer nodes are removed when pruning such a network, and the obtained connected sub-graphs preserve more information about the original community structure. In short, the pruned network is more similar to the real community partition, and using it as the initial state for Monte Carlo sampling helps to reduce the number of iterations required to reach the equilibrium state.
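The evaluation indexes above follow directly from their definitions; a minimal sketch (adjacency matrices as lists of lists, partitions as label lists; function names are our own):

```python
def modularity(A, g):
    """Newman modularity Q = (1/2m) sum_{u,v} [A_uv - k_u k_v / 2m] delta(c_u, c_v)."""
    n = len(A)
    two_m = sum(map(sum, A))                     # 2m: total degree
    deg = [sum(row) for row in A]
    q = sum(A[u][v] - deg[u] * deg[v] / two_m
            for u in range(n) for v in range(n) if g[u] == g[v])
    return q / two_m

def avg_conductance(A, g):
    """Average conductance: mean over clusters of m_ext / (2 m_int + m_ext)."""
    n, total = len(A), 0.0
    comms = set(g)
    for c in comms:
        m_int = sum(A[u][v] for u in range(n) for v in range(n)
                    if g[u] == c and g[v] == c) / 2
        m_ext = sum(A[u][v] for u in range(n) for v in range(n)
                    if g[u] == c and g[v] != c)
        total += m_ext / (2 * m_int + m_ext)
    return total / len(comms)

def avg_intra_density(A, g):
    """Average intra-cluster density: mean of m_c / (n_c (n_c - 1) / 2)."""
    total = 0.0
    comms = set(g)
    for c in comms:
        nodes = [u for u in range(len(A)) if g[u] == c]
        m_c = sum(A[u][v] for u in nodes for v in nodes) / 2
        total += m_c / (len(nodes) * (len(nodes) - 1) / 2)
    return total / len(comms)
```

For two disconnected triangles split into their natural communities, Q = 0.5, AC = 0, and AICD = 1, which matches the intuition that this is an ideal partition.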

Comparison Results
The following algorithms are selected for comparison with our CN estimation algorithm:
• Algorithm A1: the topological potential-based CN estimation algorithm [12].
• Algorithm A3: the statistical inference-based fast CN estimation algorithm [16].

Ten 10,000-round Monte Carlo sampling experiments are conducted on each of the four real-world network datasets; the run with the highest average likelihood P(A | g, k) is selected, and the frequency of each number of communities k during its sampling is counted, as shown in Figure 8. The value with the highest frequency, i.e., the k value corresponding to the highest posterior probability P(k | A), is taken as the estimated number of communities. The estimation results on the four real-world networks are shown in Table 6, compared with Algorithms A1, A2, and A3. An estimate is considered better if it is closer to the ground-truth number of communities K. The results show that the BI-CNE algorithm is able to exactly estimate the number of real communities on the first three datasets: Karate, Dolphins, and Polbooks. For the Football dataset, the number of real communities is 12, comprising 11 communities corresponding to 11 soccer federations plus a 12th community of soccer teams that do not belong to any federation. The number of communities estimated by the BI-CNE algorithm is 11, which is the closest to the ground truth among the existing community estimation algorithms. The experiments show that the BI-CNE algorithm generally outperforms the other comparative algorithms on real-world networks.
To verify the generalization capability of the BI-CNE algorithm, ten 100,000-round Monte Carlo sampling experiments were conducted on each of the LFR synthetic network datasets. In the same way, the k value with the highest frequency was taken as the final estimated number of communities. The comparison results on the LFR synthetic networks are shown in Table 7. The results show that Algorithm A1, based on topological potential, can only estimate a small number of communities on a synthetic network of 1000 nodes, which is caused by the lack of a rigorous theoretical basis for the heuristic operations of the algorithm itself and by its simple secondary processing of the results. Algorithm A2 cannot converge on the synthetic networks and fails to estimate the number of communities due to its low sampling acceptance rate and slow speed. Algorithm A3 estimates a close number on L1, but when the average degree is lower or the mixing parameter is higher, it tends to overestimate the number of communities, because the community structure becomes more obscure with a lower average degree and a higher mixing parameter.
The BI-CNE algorithm accurately estimates the number of communities on L1 and L2 with the same mixing parameter of 0.3 and obtains the closest results compared to the other algorithms on L3 and L4.

Sampling Acceleration
To verify the effectiveness of Monte Carlo sampling acceleration, the number of iterations required to converge to a smooth distribution and the sampling acceptance rate during the convergence process are counted separately on the BI-CNE algorithm before and after the acceleration is applied.For statistical convenience, the state of convergence is defined as the final estimated CN being sampled 10 times consecutively.For the LFR synthetic network datasets, a comparison of the number of iterations needed to converge before and after applying acceleration is shown in Figure 9.
Figure 9 shows that the number of iterations needed for convergence increases if the network has a lower average node degree and a higher mixing parameter, which usually means the structure of the network is more obscure.It also shows that after applying the acceleration to sampling, the number of iterations needed is far less, and the algorithm can quickly enter a smooth state for parameter probability sampling.There are two main reasons for the significant acceleration effect.First, the Monte Carlo sampling without acceleration starts with a random initial probability distribution to iterate, while the one with acceleration uses the connected sub-graphs after network pruning and reconstruction as the initial state.The initial probability distribution of the pruned network is closer to the final converged target probability distribution, thereby reducing the number of redundant iterations.Secondly, more effective operations are achieved in each round of sampling after applying the acceleration, which raises the sampling acceptance rate of the operations and makes Monte Carlo sampling reach the smooth state faster.

In order to observe the improvement of the sampling acceptance rate more intuitively, a sampling acceleration comparison experiment is conducted on a large-scale network dataset, Amazon, which has 334,863 nodes and 925,872 edges. To facilitate the observation, the statistics are counted from the twentieth iteration until the sampling acceptance rate drops to five percent, and the results are shown in Figure 10.

Complexity Analysis
The time complexity of pruning and reconstructing the network is O(m), where m is the total number of edges in the network. In each iteration of Monte Carlo sampling, the ratio of P(A|g, k) before and after the operation needs to be calculated. To reduce the computational cost, this ratio can be obtained from the change of log P(A|g, k), which brings the per-iteration time complexity down to O(k), where k is the number of communities. If sampling acceleration is applied to Operation 1, the weights between the current community r and all other communities need to be calculated in each iteration to obtain the probability of selecting a specific community as community s; the time complexity of this process is also O(k). Therefore, the overall time complexity of the BI-CNE algorithm is O(Rk), where R is the number of iterations. In practice, the algorithm runs much faster than the other comparative algorithms.
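The log-likelihood trick above can be sketched as a standard Metropolis acceptance step: instead of computing the ratio of P(A|g, k) before and after the operation directly, only the change in log P(A|g, k) is needed. The helper below is a minimal illustration of this idea; the name `metropolis_accept` and the argument `delta_log_p` are our placeholders, not identifiers from the paper.

```python
import math
import random

def metropolis_accept(delta_log_p, rng=random):
    """Accept a proposed sampling operation with probability
    min(1, exp(delta_log_p)), where delta_log_p is the change in
    log P(A|g, k) caused by the proposed operation."""
    if delta_log_p >= 0:
        # the proposed state is at least as likely: always accept
        return True
    # otherwise accept with probability exp(delta_log_p) < 1
    return rng.random() < math.exp(delta_log_p)
```

Because only the terms of log P(A|g, k) that involve the communities affected by the operation change, delta_log_p can be evaluated in O(k) time per iteration, which is what yields the O(Rk) overall complexity stated above.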

l-Order Transfer Probability Matrix for Approximating the Infinite-Order One
As mentioned in Section 4.2.2, when the target network is large, the time for calculating the PMI matrix can be reduced by using an l-order transfer probability matrix to approximate the infinite-order one. According to Equation (18), the infinite-order transfer probability matrix is essentially the weighted sum of the transition probability matrices from the 0-th order to the h-th order, where h approaches infinity, and the weight for the h-th order is e^(-h). As h grows large, the weight becomes very small, which means that by setting a threshold l, one can ignore the transfer probability matrices beyond the l-th order and obtain an approximate matrix P_a for the infinite-order one.
In order to avoid zero values in the cumulative transfer probability matrix, which would make the PMI incomputable, we assume that for the (l + 1)-th and higher orders, the probability for any node to reach any other node in the network is equal. Therefore, when computing the (l + 1)-th order transition probability matrix, we replace P_1 with a matrix whose every element has the value 1/n, where n is the size of the matrix.
Calculating the matrix P_a requires only l − 1 matrix multiplications. If the Coppersmith-Winograd algorithm [29] is used for matrix multiplication, the time complexity can be reduced to O(l·n^2.3729). When the network scale is large, replacing the infinite-order transfer probability matrix P with the l-th order cumulative transfer probability matrix P_a accelerates the generation of the PMI matrix.
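As a concrete sketch, the approximation can be computed as a weighted sum of transition-matrix powers, with a uniform 1/n matrix standing in for all orders beyond l, as described above. This is an illustrative reading of Equation (18), not the authors' code; the function name and the exact handling of the weights are our assumptions.

```python
import numpy as np

def cumulative_transfer_matrix(adj, l):
    """Approximate the infinite-order transfer probability matrix by summing
    the 0th- to l-th-order transition matrices with weight e^(-h) for order h,
    plus one uniform 1/n term for the (l+1)-th order so every entry stays
    positive and the PMI remains computable."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    P1 = adj / np.where(deg == 0, 1, deg)   # first-order transition matrix
    Pa = np.eye(n)                          # 0th-order term, weight e^0 = 1
    Ph = np.eye(n)
    for h in range(1, l + 1):
        Ph = Ph @ P1                        # h-th order transition matrix
        Pa += np.exp(-h) * Ph
    # orders beyond l: assume uniform 1/n reachability (avoids zero entries)
    Pa += np.exp(-(l + 1)) * np.full((n, n), 1.0 / n)
    return Pa
```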
In order to analyze the impact of using an l-order transition probability matrix for approximating the infinite-order one, we measure the difference between the PMI matrices produced by the two methods and the corresponding computation time on synthetic networks of different scales.
The difference between the two PMI matrices is quantified using the average square error (ASE), defined as

ASE = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} (PMI_1(i, j) − PMI_2(i, j))^2,

where n is the size of the two matrices PMI_1 and PMI_2. Table 8 shows the ASE values between the PMI matrices generated by l-order transition probability matrices with different values of l and those generated by the infinite-order ones.
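Assuming the standard mean-of-squared-differences reading of ASE, the metric is a few lines of NumPy (the function name is ours):

```python
import numpy as np

def average_square_error(pmi1, pmi2):
    """ASE between two n-by-n PMI matrices: the mean of the
    element-wise squared differences."""
    n = pmi1.shape[0]
    return float(((pmi1 - pmi2) ** 2).sum() / n ** 2)
```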
The impact of using l-order instead of infinite-order matrices on the calculation time is also analyzed. Table 9 shows the time taken to generate a PMI matrix using infinite-order and l-order transition probability matrices. The results show that, in general, using an l-order transition probability matrix takes more time as the order increases. It can also be inferred that using l-order instead of infinite-order matrices has a time advantage only when the network is large (for example, more than 1000 nodes); the larger the network, the more obvious the time advantage.

Complexity Analysis
First, the algorithm needs to compute the PMI matrix of the given network. An approximate calculation is performed using the l-th order cumulative transfer probability matrix, with a time complexity of O(l·n^2.3729). Then, the PMI is computed from the cumulative transfer probability matrix with a time complexity of O(n^2). Next come the KNN-based network reconstruction and the generation of the Laplacian matrix, with complexities of O(n·K_N) and O(n^2), respectively. In the spectral clustering step, finding the top k eigenvectors of the matrix costs O(kn^2), where k can be set to the estimated maximum possible number of communities. The time complexity of clustering with k-means is O(knT), where T is the number of k-means iterations. Since l, T, and k are all constants much smaller than n, the overall time complexity of the algorithm is O(kn^2).
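The pipeline costed above (KNN sparsification of the kernel, symmetric normalized Laplacian, eigenvectors of the k smallest eigenvalues, k-means on the embedding rows) can be sketched in NumPy. This is a generic spectral-clustering skeleton matching the section's cost model, not the authors' implementation; the function names and the deterministic farthest-point k-means initialization are our simplifications.

```python
import numpy as np

def _kmeans(X, k, iters=50):
    # deterministic farthest-point initialization, then Lloyd iterations
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def pmik_sc(kernel, k, kn=10):
    """Spectral-clustering stage: sparsify the kernel with KNN, build the
    symmetric normalized Laplacian, embed with the k bottom eigenvectors,
    and cluster the normalized rows with k-means."""
    n = kernel.shape[0]
    # KNN reconstruction: keep each node's kn strongest similarities
    W = np.zeros_like(kernel, dtype=float)
    for i in range(n):
        idx = np.argsort(kernel[i])[-kn:]
        W[i, idx] = kernel[i, idx]
    W = np.maximum(W, W.T)                    # symmetrize
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1))
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)        # eigenvalues in ascending order
    U = vecs[:, :k]                           # k smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return _kmeans(U, k)
```

The dense eigendecomposition here is the O(kn^2) bottleneck mentioned above; on large networks a sparse eigensolver over the KNN graph would be the natural replacement.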

Community Detection Tasks
To evaluate the effectiveness of the PMIK-SC algorithm, we applied it to the community detection task and compared it with multiple benchmark algorithms. Since PMIK-SC is essentially a spectral clustering algorithm, we first include some typical spectral clustering algorithms among the benchmarks, namely Comm [30], Heat [20], Katz [31], SCCT [32], and PPR [33]. Second, since the kernel of PMIK-SC, which measures the relationships between nodes, is generated within an information-theoretic framework, we also include the mutual-information-based algorithms MINC-NRL [6] and AMI-MLPA [7]. Finally, we consider some state-of-the-art community detection algorithms, including the node-embedding-based algorithm GEMSEC [34], the Egonet-based algorithms DEMON [35] and Ego-splitting [36], and the motif-enhanced algorithm EdMot [37]. The theoretical time and space complexities of these algorithms are shown in Table 10.
To evaluate the accuracy and efficiency of the PMIK-SC algorithm on community detection, we run the algorithm on the real-world networks Karate, Dolphins, Football, and Polbooks and on the synthetic networks L1, L2, L3, and L4, comparing it with the benchmark algorithms. Tables 11-14, respectively, show the NMI, modularity, average conductance, and average intra-cluster density of the results, and Table 15 shows the running time of each algorithm. Even on networks with a large mixing parameter (which means that the boundaries between a community and the rest of the network are blurred), PMIK-SC can still reach an NMI of 0.990. The accuracy of the mutual-information-based algorithms MINC-NRL and AMI-MLPA also reaches a high level, but the accuracy of MINC-NRL on synthetic networks is not perfect. It can be observed from Table 11 that even when µ = 0.1 on dataset L1 (which means the community boundaries are very obvious), MINC-NRL only achieves an NMI of 0.926. This may be because MINC-NRL learns some information irrelevant to the community structure in some dimensions of the embeddings during network representation learning. AMI-MLPA achieves an NMI of 1.0 on L1, but its accuracy also drops significantly as µ increases.
As can be inferred from Table 12, when modularity is used as the evaluation index, the motif-enhancement-based algorithm EdMot performs best, and the Egonet-based algorithm Ego-splitting also achieves good results. The average modularity obtained by these two algorithms on the experimental datasets exceeds that of PMIK-SC, but their NMI is not high. This prompts us to consider whether modularity is more inclined to reflect the local density of the network than NMI, which measures agreement with the ground-truth partition. We will conduct further research on this point in the future and try to introduce local enhancement mechanisms to improve the modularity performance of the PMIK-SC algorithm.
In terms of running time, PMIK-SC and the other spectral-clustering-based algorithms are less efficient than local-enhancement-based algorithms such as EdMot, DEMON, and Ego-splitting. However, compared with the other spectral clustering algorithms, PMIK-SC is more efficient on larger networks with larger mixing parameters, owing to the approximate acceleration used in generating the PMI kernel and in the matrix decomposition.

Conclusions
In this article, a fast algorithm, BI-CNE, for estimating the number of communities by Bayesian inference over a degree-corrected stochastic block model is proposed. The algorithm first prunes and reconstructs the network, then performs Bayesian inference with a correspondingly designed Monte Carlo sampling process to estimate the number of communities. The connected sub-graphs obtained from network pruning and reconstruction serve as the initial state of sampling, and the overall sampling acceptance rate is improved by controlling the direction of node transfer during sampling. Experimental comparisons on real-world networks and LFR synthetic networks show that the estimation speed and accuracy of this algorithm outperform other existing CN estimation algorithms. Its O(Rk) time complexity also allows the algorithm to be efficiently extended to community-number estimation tasks on large-scale complex networks.
Given the estimated number of communities, a new graph kernel based on point-wise mutual information is proposed and applied to spectral clustering. Experiments on real-world and synthetic networks show that the algorithm has higher accuracy and stability than existing community detection algorithms. Comparison with the current best graph-kernel-based spectral clustering algorithms also indicates that the point-wise mutual information kernel extracts network topological information more effectively.
The two algorithms are complementary for community detection. On the one hand, PMIK-SC can reach high accuracy when provided with an exact number of communities; on the other hand, taking the number of communities as a prior significantly improves spectral-clustering-based community detection compared with other approaches.
Due to the symmetry of the PMI matrix, one limitation of the current PMIK-SC algorithm is that it can only be applied to undirected networks. The other limitation is the relatively high time complexity of eigendecomposition, which affects the performance of PMIK-SC on large-scale networks.

Figure 3 .
Figure 3. Connection status of solid nodes under different cutoffs.

V_P̃ denotes the volume of the P̃ matrix, i.e., the sum of the values of all elements of matrix P̃; P̃(i, ·) and P̃(·, j), respectively, denote the sum of all elements of a row and of a column of matrix P̃.

Figure 7 .
Figure 7. PMI matrix of Football network before and after re-arrangement by clustering. (a) Original PMI matrix. (b) Clustered PMI matrix.

Figure 8 .
Figure 8. Frequency statistics of k in BI-CNE.

Figure 9 .
Figure 9. The number of iterations of the convergence process on LFR networks.

Figure 10 .
Figure 10. Comparison of sampling acceptance rate before and after acceleration.

• Operation 1: Randomly select communities r and s. Randomly select a node v_i from community r and move it to community s. If node v_i is the last node of community r, then delete community r and renumber the communities, which makes k = k − 1.
• Operation 2: Randomly select a community r. Randomly select a node v_i from community r and move it to a new empty community k + 1. If node v_i is the last node of community r, this operation is rejected, and k remains unchanged. Otherwise, it makes k = k + 1.
(3) Accept the operation: the operation in Step 2 will be accepted following the acceptance probability.

Table 1 .
Main properties of real-world networks.

Table 2 .
Main properties of LFR synthetic networks.

Table 6 .
Number of communities estimated on real-world networks.

Table 7 .
Number of communities estimated on LFR networks.

Table 9 .
Time spent (seconds) to generate a PMI matrix.