Modularity-Based Incremental Label Propagation Algorithm for Community Detection

Abstract: Label Propagation Algorithm (LPA) is a fast community detection algorithm. However, since each node is randomly assigned a different label at first, there is serious randomness in the label updating process of LPA, resulting in great instability of detection results. This paper proposes a modularity-based incremental LPA (MILPA) to address this problem. Unlike LPA, MILPA first assigns all nodes the same label, and then repeatedly uses a divide strategy to split locally dense connected nodes into a community and give them a new label. After that, MILPA uses modularity gain as the optimization function to fine-tune the labels of nodes so as to obtain an optimal partition. The proposed MILPA has been compared with LPA and other known methods. Experimental results show that MILPA has the best and most stable performance in LFR benchmark networks and is comparable to the best algorithm in many real networks.


Introduction
Community structure is an important feature of complex networks [1,2]. It can help us understand the nature or function of networks [3]. For instance, communities are likely to group proteins having the same specific function within the cell and they play a particularly important role in our understanding of how specific biological functions are encoded in cellular networks [4][5][6]. In most cases, however, the community structure of a network is not known in advance and needs to be detected by algorithms. Therefore, developing algorithms to detect community structure, i.e., community detection, has been one of the most widely studied topics in network science. From a classical view, community detection is a clustering process to detect communities in a large network, because the edge density of nodes inside each community is greater than that between communities. However, the current view focuses on the probability that edges may be generated between nodes. It indicates that community is a preferential linking pattern [7,8].
In the last two decades, many scholars have performed extensive studies on how to identify the community structure of a complex network. A number of powerful methods have been put forward to address this problem, which can be generally divided into three categories: divisive, modularity optimization based, and agglomerative algorithms.
Divisive algorithms aim to detect inter-community edges and then remove these links from the network; they include the GN algorithm proposed by Newman [9]. However, GN has a high computational complexity, $O(n^3)$, where n is the number of nodes in a network. To improve the efficiency of GN, Radicchi et al. [10] proposed a fast splitting algorithm to find the edge set with the [...]
In addition, LPA has linear time complexity, so it can identify the community structure of large networks. However, LPA always randomly selects a node to start spreading labels, so some nodes that are neither tightly nor sparsely connected are allocated to different communities during each update process, resulting in some instability of detection results. To solve this problem, Barber et al. [19] proposed a modularity-specialized LPA (LPAm), which uses modularity as an objective function for optimization, so that community detection always proceeds in the direction of increasing modularity. However, one obvious defect of this algorithm is that it easily falls into a local optimum and may assign nodes to the wrong community. Based on LPA and LPAm, Li et al. [20] proposed another improved LPA, called LPAMP, which first optimizes the modularity to obtain a coarse clustering of nodes and then performs label propagation. LPAMP can reduce the randomness of LPA and improve the stability and accuracy of the community detection results, but it is slow and cannot be used for large networks.
To solve the instability problem of LPA mentioned above, a modularity-based incremental LPA (MILPA) is proposed in this paper. First, the node degree and the node membership are introduced to determine the initial communities of nodes. After all nodes have initial labels, we define an objective function based on modularity gain to guide the label updating of nodes. Experiments on both synthetic and real-world networks show that the proposed MILPA greatly reduces the randomness of the label updating process and delivers stable, high-quality community detection.

LPA
Here, we first introduce the basic steps of LPA [21]. Nodes in a network are initially given unique labels. Then all nodes, in a random sequential order, update their labels: each node takes the label shared by the majority of its neighbors. If there is no unique majority, one of the majority labels is picked randomly. In this way, labels propagate across the network: most labels will disappear, and others will dominate. The process converges when each node has the majority label of its neighbors. Communities are defined as groups of nodes having identical labels at convergence. By construction, each node has more neighbors in its community than in any other community. The algorithm does not deliver a unique solution. Due to the random initialization of labels and the many ties encountered during label propagation, completely different partitions may be obtained each time LPA is run on the same network, resulting in the instability problem mentioned above.
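The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation of [21]: the function name `label_propagation`, the adjacency-dictionary input, and the `seed` and `max_sweeps` parameters are assumptions made for the example, as is the choice to let a node keep its current label when that label is already among the majority labels.

```python
import random
from collections import Counter


def label_propagation(adj, seed=None, max_sweeps=100):
    """Asynchronous LPA on an adjacency dict {node: set(neighbors)}."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}            # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)                  # random sequential order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue                    # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            majority = [lab for lab, c in counts.items() if c == top]
            if labels[v] in majority:
                continue                    # assumption: keep a label that is already a majority label
            labels[v] = rng.choice(majority)  # ties are broken at random
            changed = True
        if not changed:                     # every node holds a majority label
            break
    return labels
```

Running the function with different seeds on the same graph can return different partitions, which is the instability discussed in this section.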

Normalized Mutual Information
For a network with a known community structure, normalized mutual information (NMI) [22] can be used to evaluate how closely the result detected by an algorithm coincides with the ground truth, and thus to measure the quality of the community detection algorithm. NMI is defined as

$$\mathrm{NMI}(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} N_{ij} \log\left(\frac{N_{ij} N}{N_{i\cdot} N_{\cdot j}}\right)}{\sum_{i=1}^{C_A} N_{i\cdot} \log\left(\frac{N_{i\cdot}}{N}\right) + \sum_{j=1}^{C_B} N_{\cdot j} \log\left(\frac{N_{\cdot j}}{N}\right)} \qquad (1)$$

where A is the ground truth with $C_A$ communities and B is the result detected by the algorithm with $C_B$ communities. The mixing matrix has one row per real community and one column per detected community. $N_{ij}$ is the number of nodes of real community i that appear in detected community j, $N_{i\cdot}$ and $N_{\cdot j}$ denote the sum of row i and the sum of column j, respectively, and N is the sum of all entries, i.e., the total number of nodes. Analyzing Equation (1) shows the following:

1. NMI = 1 when the community partition generated by an algorithm is identical to the real community structure.

2. NMI = 0 when the community partition generated by an algorithm is completely independent of the real community structure.

3. 0 < NMI < 1 when the community partition generated by an algorithm is partly similar to the real community structure. The closer the NMI value is to 1, the closer the detection result is to the real community structure, and the better the performance of the algorithm.
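As an illustration, NMI can be computed directly from two label sequences. The sketch below assumes the widely used confusion-matrix form of NMI, $-2\sum_{ij} N_{ij}\log(N_{ij}N/(N_{i\cdot}N_{\cdot j}))$ divided by $\sum_i N_{i\cdot}\log(N_{i\cdot}/N) + \sum_j N_{\cdot j}\log(N_{\cdot j}/N)$. The function name `nmi` and the convention of returning 1 when both partitions consist of a single community are assumptions of this example.

```python
import math
from collections import Counter


def nmi(truth, found):
    """NMI of two partitions given as equal-length sequences of labels."""
    n = len(truth)
    rows = Counter(truth)                  # row sums: sizes of the real communities
    cols = Counter(found)                  # column sums: sizes of the detected communities
    joint = Counter(zip(truth, found))     # non-zero entries N_ij of the mixing matrix
    num = -2.0 * sum(nij * math.log(nij * n / (rows[i] * cols[j]))
                     for (i, j), nij in joint.items())
    den = (sum(ni * math.log(ni / n) for ni in rows.values())
           + sum(nj * math.log(nj / n) for nj in cols.values()))
    if den == 0:
        return 1.0                         # assumed convention: both partitions are one community
    return num / den
```

Relabeling the communities does not change the score: `nmi([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1, whereas two partitions that cut the nodes independently, such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]`, score 0.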

Modularity
When the real community structure of a network is unknown, modularity [23], proposed by Newman, is the most popular quality function to assess the community detection result of an algorithm. According to the general definition of community structure, nodes within the same community are densely connected, and nodes in different communities are sparsely connected. Modularity is used to measure whether a network has such a community structure, and is defined as

$$Q = \sum_{c=1}^{n_c} \left[ \frac{l_c}{m} - \left( \frac{d_c}{2m} \right)^2 \right] \qquad (2)$$

where m is the total number of edges in the graph, $n_c$ is the number of communities, $l_c$ is the number of edges joining nodes of community c, and $d_c$ is the sum of the degrees of the nodes of c. In Equation (2), the first term of each summand is the fraction of edges of the network that lie within the community, whereas the second term is the expected fraction of such edges if the network were a random graph with the same expected degree for each node. High values of modularity indicate good community structure.
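A short sketch may make the two terms concrete. It assumes the standard form of modularity for an undirected, unweighted graph, $Q = \sum_c [l_c/m - (d_c/2m)^2]$, with at least one edge; the function name `modularity` and the edge-list input are choices made for this example.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Q = sum_c [ l_c / m - (d_c / 2m)^2 ] for an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # l_c: edges with both endpoints in community c
    degree = defaultdict(int)      # d_c: sum of degrees of the nodes in community c
    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

For two triangles joined by a single edge (7 edges), splitting the graph into the two triangles gives $l_c = 3$ and $d_c = 7$ for each community, so $Q = 2(3/7 - 1/4) = 5/14$, while placing all nodes in one community gives $Q = 0$.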

MILPA
To solve the instability problem mentioned in the introduction, this paper proposes MILPA to improve the performance of LPA. There are two steps in the label propagation process of MILPA. The first step is incremental label propagation, and the second is modularity-based label updating. In LPA, the N nodes are initially given N unique labels. MILPA uses the opposite strategy: all nodes are assigned the same label, so that there is only one label at first. Then the whole graph is partitioned into two groups and a new label emerges. As this procedure goes on, the number of labels increases gradually, so it is called incremental label propagation. After this process ends, our algorithm uses the modularity gain as the optimization function to fine-tune the labels of nodes, i.e., modularity-based label updating, which greatly reduces the randomness of the label updating process. The entire label propagation process of the algorithm is shown in Figure 1 and the flow chart of the proposed algorithm is shown in Figure 2.
Appl. Sci. 2020, 10, x FOR PEER REVIEW
Given an undirected network G with n nodes and m edges, the relationship between nodes u and v is denoted by

$$W = (w_{uv})_{n \times n}, \quad w_{uv} \in \{0, 1\} \qquad (3)$$

where $w_{uv} = 1$ indicates that there is a connection between nodes u and v, while $w_{uv} = 0$ indicates that there is no connection between them. The goal of the algorithm is to find a good partition with k communities, $C_1, C_2, \ldots, C_k$, to maximize the modularity. Note that there is no overlap between any two communities $C_i$ and $C_j$, that is, each node belongs to only one community.
There are two important concepts in MILPA, intensity and membership degree. The intensity of node u is defined as

$$I_u = \sum_{v \in F(u)} w_{uv} \qquad (4)$$

where F(u) denotes the neighbors of u. In an undirected network, $I_u$ is the degree of u, i.e., $k_u$. The membership degree of a node indicates the degree to which it belongs to a community C, defined as

$$M(u, C) = \frac{\sum_{v \in F(u),\, v \in C} w_{uv}}{I_u}. \qquad (5)$$

In fact, M(u, C) is the ratio between the number of neighbors of node u in community C and the degree of u. A high value of membership degree indicates that the node is more likely to belong to the community.
Incremental label propagation. The detailed steps of this process are as follows:

1. All nodes in the network are assigned to a set τ and given the same label i = 1.

2. The intensity of all nodes in set τ is calculated according to Equation (4).

3. Among the nodes in τ, the node with the highest intensity and its neighbors are put into a new set, denoted by C.

4. For any node v ∈ C, if the membership degree of v to C is less than ε, node v is not closely connected with the other nodes in C and is deleted from C. When all nodes in C satisfy M(u, C) ≥ ε, they are marked F.

5. A new label, i = i + 1, is assigned to the nodes in C, and then the set C is cleared.

6. Steps (2) to (5) are repeated to create new densely connected subgraphs until no new label is generated, resulting in the initial community partition.

The threshold of membership degree is set to ε = 0.5 in this paper, since it proved to be the best choice for most networks.
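The six steps can be sketched as follows. This is one possible reading of the procedure, not the authors' code, and several details are assumptions of the sketch: the graph is an adjacency dictionary with sortable node identifiers, intensity is the node degree, nodes that receive a new label leave τ, ties for the highest intensity are broken by the smallest identifier instead of randomly, weakly attached nodes are removed one at a time starting from the lowest membership degree, and the loop stops as soon as a candidate set becomes empty. The names `membership` and `incremental_label_propagation` are hypothetical.

```python
def membership(adj, v, group):
    """M(v, C): fraction of v's neighbors that lie in the set C."""
    return len(adj[v] & group) / len(adj[v]) if adj[v] else 0.0


def incremental_label_propagation(adj, eps=0.5):
    """Return {node: label}; adj maps each node to the set of its neighbors."""
    labels = {v: 1 for v in adj}              # step 1: one shared label
    tau = set(adj)                            # nodes still carrying the initial label
    next_label = 1
    while tau:
        # steps 2-3: the node of highest intensity (degree) in tau seeds the set C
        seed = max(sorted(tau), key=lambda v: len(adj[v]))
        group = ({seed} | adj[seed]) & tau
        # step 4: delete weakly attached nodes until all satisfy M(v, C) >= eps
        while group:
            worst = min(sorted(group), key=lambda v: membership(adj, v, group))
            if membership(adj, worst, group) >= eps:
                break
            group.remove(worst)
        if not group:                         # no new label can be generated
            break
        next_label += 1                       # step 5: a fresh label for the set C
        for v in group:
            labels[v] = next_label
        tau -= group                          # assumption: labeled nodes leave tau
    return labels
```

On two triangles joined by the edge (2, 3), node 2 seeds the first set {0, 1, 2, 3}; node 3 has a membership degree of 1/3 and is deleted, so {0, 1, 2} receives label 2 and the remaining triangle receives label 3.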
Modularity-based label updating. Incremental label propagation can greatly reduce the randomness of LPA and yield a more stable community partition. However, when two or more nodes share the highest intensity in step (3), one of them has to be selected at random, so the solution derived from incremental label propagation may not be unique. Moreover, the solution may not be the best partition. To improve the performance of our method, MILPA uses the modularity gain as an optimization function to fine-tune the labels of nodes based on the initial community partition. In particular, one calculates the modularity values before and after a node is moved from its original community to the community of one of its neighbors. As can be seen in Figure 1, when node i in community $C_1$ is moved to community $C_2$, only the terms of $C_1$ and $C_2$ in Equation (2) change. Before and after the move, these terms are, respectively,

$$Q_{\mathrm{before}} = \frac{l_1}{m} - \left( \frac{d_1}{2m} \right)^2 + \frac{l_2}{m} - \left( \frac{d_2}{2m} \right)^2 \qquad (6)$$

$$Q_{\mathrm{after}} = \frac{l_1 - k_{i,c_1}}{m} - \left( \frac{d_1 - k_i}{2m} \right)^2 + \frac{l_2 + k_{i,c_2}}{m} - \left( \frac{d_2 + k_i}{2m} \right)^2 \qquad (7)$$

where $l_1$, $d_1$ and $l_2$, $d_2$ are the number of internal edges and the sum of degrees of $C_1$ and $C_2$ before the move, $k_i$ is the degree of node i, $k_{i,c_1}$ denotes the number of edges connecting node i with nodes in community $C_1$, and $k_{i,c_2}$ denotes the number of edges connecting node i with nodes in community $C_2$. Therefore, one has the modularity gain

$$\Delta Q = Q_{\mathrm{after}} - Q_{\mathrm{before}} = \frac{k_{i,c_2} - k_{i,c_1}}{m} - \frac{k_i \left( d_2 - d_1 + k_i \right)}{2m^2}. \qquad (8)$$

If the neighbors of a node carry more than two different labels, there are multiple communities into which the node can move. One can compute the ∆Q generated by each move and select the community with the largest ∆Q (if ∆Q > 0). If two or more communities share the optimal ∆Q, the node is moved to one of them at random. If the maximum of ∆Q is not positive, the node remains in its original community. When a node is moved to a new community, its label is updated accordingly. This process is repeated until the labels of all nodes no longer change or a specified number of iterations is reached, at which point MILPA yields the final community partition.
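The update rule can be illustrated with a short sketch. It assumes the standard modularity $Q = \sum_c [l_c/m - (d_c/2m)^2]$, for which moving node i from $C_1$ to $C_2$ changes Q by $(k_{i,c_2} - k_{i,c_1})/m - k_i(d_2 - d_1 + k_i)/(2m^2)$, where $d_1$ and $d_2$ are the degree sums of the two communities before the move and $k_i$ is the degree of i. The function names, the adjacency-dictionary input, the `max_iter` parameter, and the deterministic tie-breaking (the first best community found, rather than a random one) are assumptions of this example.

```python
def modularity_gain(k_i, k_i_c1, k_i_c2, d_1, d_2, m):
    """Delta Q for moving node i from C1 to C2; d_1 still includes k_i."""
    return (k_i_c2 - k_i_c1) / m - k_i * (d_2 - d_1 + k_i) / (2 * m * m)


def refine_labels(adj, labels, max_iter=100):
    """Greedy modularity-based label updating on adj = {node: set(neighbors)}."""
    labels = dict(labels)
    m = sum(len(nb) for nb in adj.values()) // 2
    tot = {}                                   # d_c: sum of degrees per community
    for v, nb in adj.items():
        tot[labels[v]] = tot.get(labels[v], 0) + len(nb)
    for _ in range(max_iter):
        moved = False
        for i in adj:
            k_i, c1 = len(adj[i]), labels[i]
            links = {}                         # k_{i,c}: edges from i into community c
            for u in adj[i]:
                links[labels[u]] = links.get(labels[u], 0) + 1
            best, best_gain = c1, 0.0
            for c2, k_i_c2 in links.items():
                if c2 == c1:
                    continue
                gain = modularity_gain(k_i, links.get(c1, 0), k_i_c2,
                                       tot[c1], tot[c2], m)
                if gain > best_gain:           # keep the largest positive gain
                    best, best_gain = c2, gain
            if best != c1:                     # move only when Delta Q > 0
                tot[c1] -= k_i
                tot[best] += k_i
                labels[i] = best
                moved = True
        if not moved:                          # labels no longer change
            break
    return labels
```

For two triangles joined by the edge (2, 3), with node 2 initially placed in the wrong triangle, the gain of moving it back is $1/7 + 9/98 = 23/98$, which equals the difference between the modularity of the corrected partition (5/14) and that of the initial one.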

Experimental Datasets
To verify the performance of the proposed MILPA, experiments are conducted on both LFR benchmark datasets [24] and eight real network datasets, as shown in Tables 1 and 2. The performance of six community detection algorithms, MILPA, LPA, LPAMP, Fastgreedy, GN and Louvain, is compared in this paper. The LFR benchmark is a widely used program for building artificial networks to test the performance of community detection algorithms. It can flexibly generate high-quality synthetic networks that are close to real networks. The LFR program provides a series of configuration parameters for users to define, including the size of a network N, i.e., the number of nodes in the network; the maximum and minimum node degree, maxk and mink, respectively; the maximum and minimum number of nodes in a community, maxc and minc, respectively; and the ratio of the number of links between communities to the total number of edges in the network, mu, which indicates how pronounced the community structure of the network is. The smaller the value of mu, the more obvious the community structure of the network. In this paper, the LFR program is used to generate two groups of benchmark datasets, Group A and Group B. Group A has five networks with 1000 nodes each and mu values ranging from 0.1 to 0.5. Similarly, Group B has five networks with 5000 nodes each. The two groups simulate small networks and large networks, respectively.
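To make the role of mu concrete, the sketch below measures it on a labeled graph as the fraction of edges whose endpoints lie in different communities, following the description above, and enumerates the ten benchmark settings described in the text. The names `mixing_ratio`, `mu_values`, `groups` and `configs` are hypothetical, and the remaining LFR parameters (maxk, mink, maxc, minc) are omitted because their values are not given here.

```python
def mixing_ratio(edges, labels):
    """Fraction of edges whose endpoints lie in different communities (mu)."""
    between = sum(1 for u, v in edges if labels[u] != labels[v])
    return between / len(edges)


# Sweep described in the text: mu = 0.1, 0.2, ..., 0.5 for both groups.
mu_values = [round(0.1 * step, 1) for step in range(1, 6)]
groups = {"A": 1000, "B": 5000}                # group name -> number of nodes N
configs = [{"group": g, "N": n, "mu": mu}
           for g, n in groups.items() for mu in mu_values]
```

For two triangles joined by one edge, `mixing_ratio` returns 1/7, a strongly pronounced community structure compared with the hardest benchmark setting, mu = 0.5.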

Real-World Networks
Real networks are not random but exhibit characteristic structure. For example, many nodes have a small degree while a few nodes have a large degree, and the nodes with a large degree play a very important role in the entire network. In addition, edges are not uniformly distributed: they are denser within communities and sparser between communities. Such a network has a community structure, which means that the nodes in a community are likely to have some common attributes or play a similar role in the network. Community structures exist in many real networks, such as social networks, biological networks, engineering networks, and political networks. Eight real-world networks that are often used in community detection problems are applied to test MILPA: Karate [25], Dolphins [26], Polbooks [27], Email [28], Football [29], Hamsterster [30], DM-CX [31] and Facebook [32], as shown in Table 2.

Results
All the experiments were implemented in PyCharm with Python 3.6 on a PC with 16 GB of memory, an Intel Core i7 processor and Windows 10. The NetworkX library was mainly used to implement the algorithm.

NMI
In this experiment, LFR networks are applied to test the performance of MILPA, LPA, LPAMP, Fastgreedy and Louvain. GN has a worst-case complexity of $O(n^3)$ on a sparse graph, which is infeasible for large networks, so it is not taken into consideration. The NMI of each method in the different networks of each group is shown in Figure 3. The mu value in each group is sampled at intervals of 0.1 over [0.1, 0.5]. The results in Figure 3 show that the proposed MILPA performs best, followed by Louvain and LPAMP. Especially in the networks with 5000 nodes, the NMI values of the other methods show a significant downward trend, whereas those of MILPA remain close to 1 as mu varies. LPA and Fastgreedy perform considerably worse: their NMI values decrease significantly as mu grows in both groups. Besides, LPA is unstable, as shown by its large performance gap between the two groups. In the networks with 1000 nodes, when mu ≥ 0.3, the NMI of LPA declines sharply and LPA finally cannot detect communities at all.
MILPA, LPAMP and Louvain, as modularity-based methods, all have relatively stable performance in the LFR networks. MILPA and LPAMP, which are based on LPA, both improve on LPA. As a greedy algorithm, Fastgreedy easily falls into locally optimal solutions, resulting in the worst performance.

Modularity
To test the ability of MILPA to detect the community structure of real networks, eight real-world datasets are applied, including four small networks and four large networks. To better show the effect of community division, three small networks are selected and their community detection results are shown in Figure 4. The nodes in a community are marked with the same color and form a circle. It can be seen that most edges lie inside communities and only a few edges run between communities.
Then, to show the stability of the proposed MILPA, twenty independent tests are performed on these networks with LPA, LPAMP, and MILPA, and the results for three small networks are shown in Figure 5. Firstly, it can be clearly seen from this figure that the average modularity of MILPA is much higher than that of LPA and LPAMP. Secondly, the modularity distribution of LPA is relatively scattered, and the difference between the maximum and minimum of its modularity in each network is much larger than that of MILPA. Besides, while the modularity of LPAMP may sometimes be greater than that of MILPA, the overall stability of MILPA is better than that of LPAMP. The same observations hold for the other networks.
For a further comparison, three other methods are also used for community detection, i.e., Fastgreedy, GN, and Louvain. To ensure the reliability of the results, every experiment is performed ten times independently and the average modularity of each method in each network is shown in Figure 6. In each chart of this figure, the methods are sorted by modularity in descending order. Note that in the four large networks, GN and LPAMP take a lot of time, so they are not considered. It is evident that, among these networks, MILPA shows outstanding performance compared to LPA, LPAMP, GN, and Fastgreedy, and is second only to the best method, Louvain. Moreover, in the four small networks, MILPA is almost as good as Louvain and always better than the other methods. Another significant phenomenon is that as the network size grows, the performance gap between methods becomes larger. For instance, MILPA improves on LPA by no more than 14% in the four small networks, while it achieves 16.6%, 23.9%, 32.3% and 58.1% improvement in the four large networks, Email, Facebook, DM-CX, and Hamsterster, respectively. LPA is the weakest of these methods, since its modularity is the lowest in six of the eight networks.
The above results show that, in both small and large networks, MILPA achieves a high modularity value and greatly improves on the stability of the standard LPA in community detection.
Figure 6. The modularity of different methods in eight real-world networks.

Conclusions and Future Work
This paper proposed a new approach, namely MILPA, to detect community structures in networks. While it is an improved version of LPA, MILPA has several unique features. The major difference is that it carries out label propagation with an incremental strategy, which is the opposite of LPA. In the incremental strategy, MILPA always centers on the node with the highest intensity and then looks for neighbors that are closely connected to it. Experimental results have shown that this strategy can speed up the convergence of label propagation and greatly reduce the randomness of the process. Another feature of MILPA is the introduction of modularity optimization, which enables some nodes to be moved into more appropriate communities to produce a better partition. This is a key reason why the proposed method performs better than LPA. Due to this operation, however, MILPA is slower than LPA, which has linear time complexity. Complexity analysis and improvements of MILPA will be important tasks in our future research.
One point that cannot be ignored is the execution order of the algorithm, that is, incremental label propagation first and modularity-based label updating later. This combination, i.e., staged operation, makes MILPA outperform GN, LPAMP, and Fastgreedy which are also based on modularity optimization. The three existing methods do not take this operation but find the best community by modularity optimization every time they move nodes, so it is easy for them to fall into local optimal solutions. Interestingly, Louvain, the best algorithm which is based on modularity optimization, takes the staged operation too. Thus, this phenomenon sheds some light on the design of community detection algorithms.
MILPA has another important advantage. Experiments on LFR benchmark networks demonstrated that it can find the community structure of a network even when that structure is not obvious. As the community structure of synthetic networks becomes more and more difficult to distinguish, the performance of other algorithms (including Louvain) declines significantly, but the proposed algorithm maintains a high detection performance. MILPA may therefore be viewed as a high-quality competitor when new methods are tested in LFR networks.
In summary, the proposed MILPA offers a number of major improvements over traditional LPA, and it can be considered an important supplement to community detection techniques because it outperforms many modularity-based methods. In future work, we will focus on the time complexity of the algorithm to further improve its performance. We will also consider extending our algorithm to weighted networks and directed networks and generalizing it to larger networks.