Algorithm for Detecting Communities in Complex Networks Based on Hadoop

Abstract: With the explosive growth of the scale of complex networks, the existing community detection algorithms are unable to meet the need for rapid analysis of the community structure in complex networks. A new algorithm for detecting communities in complex networks based on the Hadoop platform, called Community Detection on Hadoop (CDOH), is proposed in this paper. Based on the basic idea of the modularity increment, our algorithm implements parallel merging and accomplishes fast and accurate detection of the community structure in complex networks. Our extensive experimental results on three real complex network datasets demonstrate that the CDOH algorithm can significantly improve the efficiency of the current memory-based community detection algorithms without affecting the accuracy of the community detection.


Introduction
In the era of Web 2.0, objects are connected to each other by various technologies, such as the Internet and the Internet of Things, and form a variety of complex networks, such as interpersonal interaction, citation, transportation, and protein interaction networks. Complex networks are widely used in sociology, management, computer science, operations research, biology, and other disciplines, and their wide application prospects have attracted the interest of many researchers. For example, Watts and Strogatz [1] applied complex network theory in the field of biology and considered the nervous system to be a complex network of large numbers of nerve cells connected by nerve fibers. Faloutsos [2] applied complex network analysis to computer networks and evaluated their stability by analyzing their robustness. Sen et al. [3] mapped the transportation network to a complex network and implemented an optimal planning and configuration of the transportation network using dynamic analysis of the complex network. Xiao et al. [4] constructed a directed and weighted complex network based on the Beijing traffic network, analyzed the load-bearing pressure of the traffic network, and mined the corresponding regional centers, which provided theoretical support for optimizing urban public transport network systems. Ruguo [5] proposed a method for social coordination governance based on a characteristic analysis of complex networks and provided ideas for handling mass public events.
Many studies have analyzed the inherent characteristics of complex networks and discovered the relationships between node attributes and connections within networks. To discover the features of complex networks, several community detection algorithms have been proposed. A so-called "community" is a sub-network composed of a group of nodes that are closely connected to each other and sparsely connected to the nodes of other, external communities. The community structure, made up of one or more communities, is a common feature of complex networks. The accurate identification of the community structure in complex networks plays an important theoretical role in public opinion monitoring, interest recommendation, identification of the internal structure of networks, and other related research. As a result, many researchers have studied community detection algorithms from the aspects of modularity and edge structure. For example, Newman and Girvan [6] proposed the concept of modularity and mined the community structure of complex networks, and Yang et al. [7] introduced a method for analyzing edge structure and node properties that improves the accuracy of detecting the community structure of complex networks. The accurate identification of the community structure in complex networks has broad applications, such as influence maximization, discovery of influencers within a community, interest recommendation, edge intelligence empowered recommendation [8], and so on.
However, the existing studies on complex network community detection algorithms have focused on small-scale data sets and have been limited to improving the detection accuracy while neglecting efficiency. At the same time, the number of nodes in complex networks demonstrates an explosive growth trend with the advent of the big data era, the increasing number of network users, and the exponential increase of generated content. At present, many social networking platforms, such as WeChat, Weibo, Facebook, and Twitter, have more than 100 million online users and various interaction forms, including follow-ups, comments, and sharing. The large-scale complex network data sets generated by such platforms have the characteristics of node diversity, complex structure, and multi-complexity fusion, which challenge the accuracy of traditional complex network community detection algorithms. Furthermore, the traditional community detection algorithms are based on matrix iterations, which makes them unable to meet the requirements of real-time processing and flexibility.
In this paper, we propose a new complex network community detection algorithm based on the Hadoop framework, called Community Detection on Hadoop (CDOH). Hadoop is a distributed system infrastructure developed by the Apache Foundation. Our contributions are as follows:
• Based on the idea of maximum modularity, and combining it with the distributed characteristics of the Hadoop platform, we propose a new modularity matrix update method and construct a corresponding community merging strategy to implement fast and accurate detection and discovery of complex network community structures;
• We theoretically analyze the proposed CDOH algorithm and show that it can achieve an O(n) computational cost when enough parallel nodes are used;
• Experimental results on 3 real datasets demonstrate that CDOH significantly outperforms the traditional complex network community detection algorithms in terms of efficiency without sacrificing the accuracy of community detection.
The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the proposed CDOH algorithm and analyzes its computational complexity. Section 4 presents the experimental results with analysis. Section 5 concludes the paper and outlines future work.

Related Works
Since Newman [6,9] proposed the modularity optimization algorithm, the modularity-based community detection approach has been used in many network community mining algorithms, such as the classic fast Newman community division algorithm [9] and the CNM algorithm [10]. The fast Newman community detection algorithm is an agglomerative hierarchical clustering algorithm that starts from a state in which each node is the sole member of one of n communities and repeatedly joins communities together in pairs, choosing at each step the join that results in the greatest increase (or smallest decrease) in modularity. More recently, Lei et al. [11] implemented an edge community mining algorithm based on the local information of the considered network. Xiong [12] proposed a community discovery algorithm that combines user closeness with clustering algorithms. Weiping [13] proposed the concept of user gravity for accurate community discovery. Leng [14] proposed a new network community detection algorithm based on a greedy optimization technique. Zhang et al. [15] further improved the fast Newman algorithm by introducing an improved closeness centrality index to classify overlapping nodes; the proposed method demonstrated a high classification accuracy in detecting overlapping communities with a time complexity of O(n^2).
Blondel et al. [16] improved the modularity increment solution method by merging communities iteratively using a new calculation formula and achieved good results. Parsa et al. [17] used a probability vector model based on a single-variable edge distribution algorithm that combines an evolutionary algorithm with a community discovery method to enable community detection. Oliveira et al. [18] used an improved Kuramoto coupled oscillator synchronization model to analyze networks through their dynamic factors and implemented a method for community discovery in complex networks. Ling Xing et al. [19] proposed a method that combines the sliding time-window method with a hierarchical encounter model based on association rules to increase the fidelity of the extracted networks by alleviating the homophily effect. Yuhui Gong et al. [20] focused on customers' conformity behaviors in a symmetric market where customers are located in a social network; their simulation results showed that topology structure, network size, and initial market share have significant effects on the evolution of customers' conformity behaviors. Recently, Aceto et al. [21,22] and Ruoyu Wang et al. [23] applied deep learning and machine learning technologies to research on social networking.
Recently, researchers have proposed complex network community detection algorithms based on big data platforms. Clauset [24] proposed a community-based parallel detection method based on the CNM algorithm. Its basic idea is to calculate the maximum community modularity in parallel and recognize the communities of large-scale networks while decreasing the communication overhead. The limitation of this algorithm is that it fails to run when the network scale increases and the amount of data rises to a certain level. Jinpeng [25] proposed a link community recognition algorithm based on the Hadoop platform. While this algorithm resolves the limitation that the linked community method cannot store and process large matrices when analyzing big networks, it is still not efficient enough: its processing time exceeds 5000 seconds when the number of nodes reaches 15,000. Riedy et al. [26] used servers with multi-core processors to calculate the maximum community modularity in parallel to identify communities; however, this method has strong hardware dependencies.
Moon et al. [27] proposed a parallel GN algorithm [6] based on Hadoop that can be divided into 4 stages, each including a map and a reduce process. In the first stage, the tuples of all node pairs are generated; in the second and third stages, the edges with large edge betweenness values are identified and removed, respectively; in the fourth stage, the tuples are recalculated according to the new network. The experimental results demonstrated that the efficiency of the algorithm increases linearly with the number of reducers in charge of the reduce process. Weijiang et al. [28] proposed a parallel Louvain algorithm that addresses the main time-consuming step of the Louvain algorithm [29], namely calculating the modularity and traversing the modularity increments. This algorithm outputs the information about all neighbors of a node in the map phase and decides the new home community of the node in the reduce phase accordingly. When computing the new community of a node, it is necessary to ensure that the neighbors' communities are up-to-date, which is hard to guarantee in a distributed environment. Therefore, the problems of "community interchange" and "community ownership delay" easily arise, which can be solved by resolving the associated connected graph. To reduce the high complexity of the fast Newman algorithm [10] in calculating the modularity of nodes, Bingzhou [30] proposed a parallel fast Newman algorithm based on Hadoop that calculates, in the map stage, the modularity increment of each node merged with each of its neighbors in parallel; in the reduce stage, the 2 nodes with the largest modularity increment are found and merged. The map and reduce processes are executed iteratively until all nodes are merged into 1 community. To deal with the problems of the fast-unfolding algorithm in processing large-scale networks,
Bingzhou [30] also proposed a parallel fast-unfolding algorithm based on Hadoop and the divide-and-conquer principle: first, a large-scale network is partitioned and each partition is merged separately; then the network is reconstructed according to the merging results of each partition; finally, the network is merged and reconstructed iteratively until the community structure no longer changes. Conte et al. [31] proposed an algorithm able to find large k-plexes of very large graphs in just a few minutes and to scale up to tens of machines with tens of cores each. Vincenzo et al. [32] proposed a novel algorithm for community detection in social networks based on game theory and showed that it outperforms other algorithms in terms of computational complexity and effectiveness; however, it cannot scale to a huge number of nodes and edges.
The traditional community detection algorithms focus on small-scale data sets and are hard to scale to large data sets. While parallel community detection algorithms are more scalable, they cannot achieve a good trade-off between efficiency and accuracy. To overcome the shortcomings of both families of algorithms, we propose a new complex network community detection algorithm based on Hadoop that implements fast and accurate detection of the complex network community structure. Compared with traditional community detection algorithms, it scales to large data sets; compared with parallel community detection algorithms, it achieves a good trade-off between efficiency and accuracy.

Complex Network Community Detecting Algorithm Based on Hadoop
The proposed CDOH algorithm is based on the idea of the maximal modularity increment, which employs a new modularity matrix updating method and a community merging strategy.

Definitions
This section provides formal definitions of the basic concepts involved in the proposed complex network community detection algorithm. The symbols and their meanings are shown in Table 1.

Table 1. Symbols and Definitions.

Symbols | Meanings
e_ij | The connection between node v_i and node v_j; if they are connected, e_ij is 1, otherwise e_ij is 0.

Definition 1. (Complex network) A complex network is denoted by N = (V, E), where V = {v_i | i = 1, 2, · · · , n} represents the set of nodes and E = {e_ij | v_i, v_j ∈ V} represents the set of edges; e_ij denotes the connection between nodes v_i and v_j: if they are connected, then e_ij = 1; otherwise, e_ij = 0.

Definition 2.
(Node degree) In a complex network N = (V, E), the node degree d_i of node v_i is defined as the number of edges connected to v_i, as given by Equation (1):

d_i = Σ_{v_j ∈ V} e_ij    (1)

Figure 1 illustrates a simple network community structure. According to Definitions 1 and 2, there are 12 nodes in the network (from v_1 to v_12), where e_12 = 1, e_19 = 0, and v_1 has a node degree d_1 = 4.

Definition 3. (Modularity) The modularity M of a network is defined by Equation (2):

M = Σ_{c ∈ C} [ l_c / m − (D_c / (2m))² ]    (2)

Here, C = {c_i | i = 1, 2, · · · , k} denotes the detected set of network communities, l_c denotes the number of edges between nodes within community c, m denotes the total number of edges in the network, and D_c denotes the sum of the node degrees of all nodes in community c; that is, D_c equals twice l_c plus the number of edges connecting community c to other, external communities. According to Equation (2), the modularity of a complex network measures the closeness of connections within communities and the sparseness of connections between communities: the denser the internal connections and the sparser the external connections, the larger the modularity M, and vice versa. Thus, the community detection result is optimal when the modularity M of the complex network is largest. However, it is quite difficult to determine directly whether M has reached its maximum. Therefore, the concept of the modularity increment ΔM proposed by Newman is adopted: ΔM is the increase (or decrease) of the modularity M caused by merging communities c_i and c_j, defined by Equation (4):

ΔM_ij = 2 ( R_ij / (2m) − a_i a_j )    (4)

where a_i = D_{c_i} / (2m) is the vector value of community c_i given by Equation (3); initially each node forms its own community, so a_i = d_i / (2m).
Here, R_ij denotes the number of edges connecting communities c_i and c_j (i ≠ j). The modularity M increases as long as some ΔM > 0; once ΔM < 0 for every pair of communities, the modularity M has reached its maximum and the community detection process ends.
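To make Definitions 2 and 3 concrete, the following self-contained Python sketch (our own toy graph and function names, not from the paper) computes Equation (2) and checks that the increment of Equation (4) matches the change in M when two singleton communities are merged:

```python
from collections import defaultdict

def modularity(edges, comm):
    """Equation (2): M = sum over communities c of [ l_c/m - (D_c/(2m))^2 ]."""
    m = len(edges)
    l, D = defaultdict(int), defaultdict(int)
    for u, v in edges:
        D[comm[u]] += 1                      # each edge end adds 1 to D_c
        D[comm[v]] += 1
        if comm[u] == comm[v]:
            l[comm[u]] += 1                  # intra-community edge
    return sum(l[c] / m - (D[c] / (2 * m)) ** 2 for c in D)

def delta_m(edges, comm, ci, cj):
    """Equation (4): dM_ij = 2 * (R_ij/(2m) - a_i * a_j), a_i = D_i/(2m)."""
    m = len(edges)
    R, D = 0, defaultdict(int)
    for u, v in edges:
        D[comm[u]] += 1
        D[comm[v]] += 1
        if {comm[u], comm[v]} == {ci, cj}:
            R += 1                           # edge between communities ci and cj
    a_i, a_j = D[ci] / (2 * m), D[cj] / (2 * m)
    return 2 * (R / (2 * m) - a_i * a_j)

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
before = list(range(6))                      # each node is its own community
after = [0, 0, 2, 3, 4, 5]                   # communities 0 and 1 merged
gain = delta_m(edges, before, 0, 1)
assert abs(gain - (modularity(edges, after) - modularity(edges, before))) < 1e-12
```

Any graph in edge-list form can be substituted for the toy graph; the check holds for every pair of communities.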
When the numbers of nodes and edges in a complex network are kept the same and different communities are merged to form a new community, the number of edges among nodes within the new community is the sum of the numbers of edges within the 2 merged communities and the number of edges between them. Accordingly, [14] points out that, when the numbers of nodes and edges are kept the same, the modularity increment between a new community formed by merging multiple known communities and any other community can be computed as Equation (5):

ΔM(c_z, c_k) = Σ_{c_i ∈ c_z} f(c_i, c_k),  where f(c_i, c_k) = ΔM(c_i, c_k) if <c_i, c_k> ≠ ∅, and f(c_i, c_k) = −2 a_i a_k otherwise    (5)
Here, c_z denotes the new community after merging, c_k denotes an old community that does not belong to c_z, c_i denotes an old community merged into c_z, and <c_i, c_k> denotes the set of edges between communities c_i and c_k.
Taking the network structure in Figure 1 as an example, each node initially represents a community. Equation (4) can be used to calculate the modularity increment ΔM between any 2 communities and form a matrix, as shown in Table 2, where the first row and column contain the community numbers. Since only 2 distinct communities are ever merged and changes within a single community need not be considered, the diagonal of the matrix can be initialized to 0. From the values of the matrix, we can observe that the communities that can be merged in this example are c_2 and c_4, c_2 and c_5, c_7 and c_12, c_11 and c_12, for which ΔM is maximal, that is, 0.036. Taking the community c_13 formed by merging c_2 and c_4 as an example, the results after merging are listed in Table 3.
As can be noticed from Tables 2 and 3, the modularity increment between community c_13 and any other community is the sum of the modularity increments between the communities c_2 and c_4 and that community. For example, in Table 3, the modularity increment between communities c_1 and c_13 is 0.021, which is the sum of the modularity increment of c_1 and c_2, 0.033, and the modularity increment of c_1 and c_4, −0.012, shown in Table 2.
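This additivity can be verified numerically. Below is a self-contained toy check of Equation (5) (two triangles joined by one edge; our own graph, not the one in Figure 1):

```python
# Toy check of Equation (5): two triangles {0,1,2} and {3,4,5} joined by
# the edge (2, 3); m = 7 edges, and each node starts as its own community.
m = 7
deg = {0: 2, 1: 2, 2: 3, 3: 3, 4: 2, 5: 2}
a = {v: d / (2 * m) for v, d in deg.items()}        # Equation (3)
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)}

def linked(i, j):
    return (i, j) in edges or (j, i) in edges

def dM(i, j):
    # Equation (4) for singleton communities
    R = 1 if linked(i, j) else 0
    return 2 * (R / (2 * m) - a[i] * a[j])

def dM_merged(z, k):
    # Equation (5): new community z (a set of old communities) vs old community k
    return sum(dM(i, k) if linked(i, k) else -2 * a[i] * a[k] for i in z)

# Recomputing dM(z = {0, 1}, k = 2) directly from Equation (4) gives the
# same value: R(z, 2) = 2 and a_z = a_0 + a_1.
direct = 2 * (2 / (2 * m) - (a[0] + a[1]) * a[2])
assert abs(dM_merged({0, 1}, 2) - direct) < 1e-12
```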
Considering that this modularity matrix update method allows communities to be merged in parallel and thus conforms to the parallel processing characteristics of the Hadoop platform, we select the modularity increment update method represented by Equation (5) to construct the proposed CDOH algorithm. According to the modularity increment defined by Equation (4), we initialize the entire network, treat each node as a community, and calculate the modularity increment for merging any 2 communities. Then, we iterate to find new communities. Based on the MapReduce parallel programming model, all community pairs with the maximum modularity increment are identified and merged in parallel, and Equation (5) is used to update the modularity increments of the merged communities in parallel. The community discovery process ends when the maximum modularity increment is negative. Finally, the CDOH algorithm stores the node set V as tuples (vId, cId), where vId denotes the node number and cId denotes the community number, and the edge set E as tuples (s, d, ΔM), where s denotes the source node of the edge, d denotes the destination node, and ΔM is the modularity increment corresponding to this edge.

The CDOH Algorithm
Following the research framework of the complex network community detection algorithm on the Hadoop platform shown in Section 3.1, CDOH consists of 4 steps: first, we initialize the parameters; second, we find the maximum modularity increment; third, we merge the communities and update the modularity increment; finally, we generate the final community discovery results. Steps 2 and 3 are repeated to find new communities until the maximum modularity increment is negative. Figure 2 shows the flow chart of the CDOH algorithm. Step 1 (parameter initialization), step 2 (finding the maximum modularity increment), and step 3 (merging communities and updating the modularity increment) are implemented with the MapReduce parallel programming model of Hadoop.

Parameter Initialization
The initialization phase calculates the necessary parameters of the algorithm: the total number of nodes n, the total number of edges m, the degree d of each node, the vector a, and the modularity increment ΔM between each pair of nodes. The process is listed in Algorithm 1, and its main steps are the following:
• First, we load the complex network data from the input file, calculate the numbers of nodes n and edges m of the complex network, and broadcast the number of edges m to all nodes;
• Second, we calculate the degree d of each node and the vector a according to Equation (3);
• Finally, we use Equation (4) to calculate the modularity increment ΔM between each pair of nodes and construct a new network N using these modularity increments.
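The steps above can be sketched in plain, non-distributed Python (standing in for the map tasks; the names are ours, and the record layout follows the (vId, cId) and (s, d, ΔM) tuples described in Section 3.1):

```python
from collections import defaultdict

def initialize(edge_list):
    """Sketch of step 1: count n and m, compute node degrees, the vector a
    (Equation (3)), and the initial dM for every edge (Equation (4))."""
    deg = defaultdict(int)
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    n, m = len(deg), len(edge_list)
    a = {v: deg[v] / (2 * m) for v in deg}           # a_i = d_i / (2m)
    V = {v: v for v in deg}                          # (vId, cId): own community
    # (s, d, dM) per edge; for an edge pair R = 1, so dM = 2*(1/(2m) - a_s*a_d)
    E = {(s, d): 2 * (1 / (2 * m) - a[s] * a[d]) for s, d in edge_list}
    return n, m, a, V, E

# Two triangles joined by the edge (2, 3):
n, m, a, V, E = initialize([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
assert (n, m) == (6, 7) and abs(E[(0, 1)] - 5 / 49) < 1e-12
```

In the real algorithm this work is sharded across mappers, but the per-record arithmetic is exactly the above.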

Here, we first divide the n × n matrix into multiple sub-matrices and deploy multiple mappers; each mapper computes the vector a of each node and the modularity increment between each pair of nodes of its sub-matrix. All mappers work in parallel.

Find the Maximum Modularity Increment
After completing the modularity increment calculation, we initiate the iterative community discovery, find the community pairs with the largest modularity increment, and merge them into the corresponding new communities. Take the network shown in Figure 1 as an example: according to the ΔM matrix shown in Table 2, communities c_2 and c_4, c_2 and c_5, c_7 and c_12, c_11 and c_12 can be merged. Clearly, communities c_2, c_4, c_5 and c_7, c_11, c_12 should be merged into communities c_13 and c_14, respectively. Algorithm 2 describes the steps involved in finding the maximum modularity increment, which has 4 steps.
• First, we compare the ΔM value of each edge e in network N, find the maximum modularity increment max(ΔM), and broadcast it to all nodes in the cluster;
• Second, we compute the Cartesian product T of the edge set E and node set V, T = (s, sc, d, dc, ΔM), where s denotes the number of the source node, d denotes the number of the destination node, sc and dc denote the community numbers of the source and destination nodes, respectively, and ΔM denotes the modularity increment between the source node and the destination node;
• Third, we find the subset MC of T in which ΔM equals max(ΔM);
• Finally, to organize the merged communities, we obtain the community number i of the source node and the community number j of the destination node, which represent the current communities to be merged. If i or j already belongs to a new community in C, we merge i and j into that community; otherwise, we merge i and j into another new community, whose number is n + 1. The final output is the community set C after merging.
Here, we find the maximum modularity increment max(ΔM) with MapReduce. After dividing the n × n matrix into multiple sub-matrices, each mapper finds the maximum modularity increment of its sub-matrix in the map phase and outputs it to the reducer; in the reduce phase, the reducer outputs the global maximum max(ΔM). Afterwards, we find the community pairs with the largest modularity increment, again with MapReduce: each mapper finds such pairs in its sub-matrix in parallel.
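The final grouping step, collapsing overlapping maximal-ΔM pairs such as (c_2, c_4) and (c_2, c_5) into a single new community, amounts to finding connected components over the selected pairs. A standalone union-find sketch (function names are ours, not the paper's):

```python
def group_merge_pairs(pairs, next_id):
    """Collapse overlapping (i, j) pairs into merge groups via union-find,
    assigning each group a fresh community number starting at next_id."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)            # union the two communities

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return {next_id + k: g for k, g in enumerate(groups.values())}

# The example from Figure 1: four maximal-dM pairs collapse into c13 and c14.
new = group_merge_pairs([(2, 4), (2, 5), (7, 12), (11, 12)], next_id=13)
assert set(map(frozenset, new.values())) == {frozenset({2, 4, 5}),
                                             frozenset({7, 11, 12})}
```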

Merging and Updating Communities
Merging and updating communities are the core of the proposed algorithm. After step 2, the community pairs with the maximum modularity increment have been identified, so the mappers update the numbers of the communities to be merged, and the community numbers of the corresponding nodes, to their new community numbers in parallel, and the ΔM between any 2 communities is updated by the mappers in parallel.
The steps of merging and updating communities, listed in Algorithm 3, are the following.
• First, we obtain the Cartesian product T of the node set V and edge set E. Then, we look up the new community numbers corresponding to sc and dc in t = (s, sc, d, dc, ΔM). Let X be the set of community numbers, to be merged in this round, contained in the new community corresponding to t.sc, and let Y be the analogous set for t.dc;
• Second, using Equation (5), we merge and update each community i in X with each community j in Y. If there is an edge connecting communities i and j, the modularity increment between the new communities X and Y includes the stored modularity increment between i and j; if there is no edge connecting i and j, the modularity increment between X and Y is reduced by 2 a_i a_j, twice the product of the vector values of communities i and j.

Algorithm 3 Merging and Updating Communities
Input: node set V, edge set E, community set C
...
6: if (t.sc ∈ C or t.dc ∈ C) and t.sc ≠ t.dc then
7:   X = the set of community numbers to be merged in this round contained in the new community corresponding to t.sc;
8:   Y = the set of community numbers to be merged in this round contained in the new community corresponding to t.dc;
9:   for each community i in X and each community j in Y do
10:    if there exists at least one edge connecting i and j then ...
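A runnable Python rendering of the loop in Algorithm 3, i.e., Equation (5) generalized to two merge groups X and Y (the data layout and names are our own assumptions):

```python
def merged_dM(dM_edges, a, X, Y):
    """dM between the new communities formed from groups X and Y (Equation (5)).
    dM_edges holds an entry only for community pairs joined by >= 1 edge."""
    total = 0.0
    for i in X:
        for j in Y:
            if (i, j) in dM_edges:
                total += dM_edges[(i, j)]    # connected: reuse stored dM
            elif (j, i) in dM_edges:
                total += dM_edges[(j, i)]
            else:
                total -= 2 * a[i] * a[j]     # no edge between i and j
    return total

# Toy data: two triangles joined by the edge (2, 3); singletons, m = 7 edges.
m = 7
deg = {0: 2, 1: 2, 2: 3, 3: 3, 4: 2, 5: 2}
a = {v: d / (2 * m) for v, d in deg.items()}
edge_list = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
dM_edges = {(s, d): 2 * (1 / (2 * m) - a[s] * a[d]) for s, d in edge_list}

# Merging X = {0, 1} with Y = {2}: both members of X connect to node 2.
assert abs(merged_dM(dM_edges, a, {0, 1}, {2}) - 32 / 196) < 1e-12
```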

Generating Community Discovery Results
After the community discovery finishes, redundant data in the data set (primarily the matrix data) should be cleared, while the initial node set and their community number should be kept. Here, the node storage structure in the network is considered to be V = (vId, cId), where vId denotes the node number and cId denotes the community number indicating which community each node belongs to. Algorithm 4 presents the process of generating the results of the community partitions, which has 2 steps:

• First, we traverse all nodes and group together the nodes with the same community number cId. If cId is already in C, the corresponding community has already appeared: the node ids already stored for community cId in C are taken out, merged with the current node id, and stored back in C; otherwise, the node id is stored in C directly;
• Then, we store each community and its node set on the Hadoop distributed file system (HDFS) one by one. Thus, CDOH stores the final result of community discovery as a set of tuples (cId, vIds) and finishes the detection and discovery of complex network communities on the Hadoop platform.
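A minimal sketch of this grouping step in Python (names are ours; the HDFS write is omitted):

```python
from collections import defaultdict

def generate_results(V):
    """V maps vId -> cId; the output maps cId -> sorted list of member vIds,
    i.e., the (cId, vIds) tuples that CDOH writes to HDFS."""
    out = defaultdict(list)
    for vid, cid in V.items():
        out[cid].append(vid)                 # group nodes by community number
    return {c: sorted(vs) for c, vs in out.items()}

assert generate_results({1: 13, 2: 13, 3: 14}) == {13: [1, 2], 14: [3]}
```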

Computational Complexity Analysis of the CDOH Algorithm
As presented before, in step 1, multiple mappers take charge of initializing the n × n matrix. Suppose the matrix is divided into p sub-matrices (we write p to avoid confusion with the number of edges m) and each mapper handles one sub-matrix in parallel; then the computational complexity of initializing the matrix is that of initializing one sub-matrix, that is, O(n²/p). In step 2, the maximum modularity increment max(ΔM) and the community pairs with the largest modularity increment are found with MapReduce. Again, if we divide the matrix into p sub-matrices and let each mapper handle one sub-matrix in parallel, the computational complexity of step 2 is also O(n²/p). In step 3, the mappers update the numbers of the communities to be merged and the community numbers of the corresponding nodes in parallel, with complexity O(1). After merging, the ΔM values between any 2 communities are updated by the mappers in parallel; supposing again that each mapper works on one sub-matrix, the complexity of updating ΔM is O(n²/p). In step 4, all nodes are traversed and the nodes with the same community number are grouped together, with complexity O(n). Steps 2 and 3 are repeated until the maximum modularity increment max(ΔM) becomes negative, and after some iterations the n × n matrix shrinks to a constant size. As a result, the cost of our algorithm is inversely proportional to the number of sub-matrices p, which is determined by the number of nodes in the Hadoop cluster. Supposing we have enough machines to run n mappers in parallel (p = n), the algorithm can achieve an O(n) computing cost.

Datasets and Evaluation Algorithms
To evaluate the accuracy and running time of CDOH, 3 real complex network data sets from the Stanford Network Analysis Project (SNAP) were selected. The data sets contain the nodes and connection status of real complex networks and mark the communities to which the nodes belong. Table 4 gives the characteristics of the data sets used in the experiments. For comparison, we use 2 existing algorithms in our experiments: the traditional complex network community detection algorithm Fast Community Detection (FCD) proposed by Newman [9] and the non-overlapping community detection algorithm (OCDI) proposed by Zhang et al. [15].
All the algorithms were implemented in Java, and our algorithm was deployed on a Hadoop cluster made of 3 computers, 1 serving as the master node and the other 2 as slave nodes. The experimental results below are averages over 10 runs.

Analysis of Community Detection Accuracy
We used the community detection accuracy (CDA) metric to measure the accuracy of community detection. CDA is defined as the ratio of the number of nodes in the correctly identified communities to the total number of nodes in the network, as shown in Equation (6):

CDA = ( Σ_{i=1}^{k} max_j |c_i ∩ c′_j| ) / n    (6)

Here, C = {c_1, c_2, · · · , c_k} denotes the original, accurate community set, C′ = {c′_1, c′_2, · · · , c′_l} denotes the community set identified by the community detection algorithm, max_j |c_i ∩ c′_j| denotes the maximum number of common nodes between the i-th accurate community c_i and any detected community, and n denotes the number of nodes. The larger the value, the higher the accuracy of the community detection algorithm and the better the quality of the resulting communities. Figure 3 shows the community discovery accuracies of the considered algorithms on the 3 data sets.
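A small sketch of Equation (6) as defined above, with communities represented as node sets (names are ours):

```python
def cda(true_comms, found_comms, n):
    """Equation (6): for each true community c_i, take the detected community
    with the largest overlap; CDA is the summed overlap divided by n."""
    return sum(max(len(c & f) for f in found_comms)
               for c in true_comms) / n

truth = [{1, 2, 3}, {4, 5}]
found = [{1, 2}, {3, 4, 5}]
assert cda(truth, found, n=5) == 0.8     # overlaps 2 + 2 out of 5 nodes
```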

It can be noticed from Figure 3 that the accuracy of the CDOH algorithm is slightly lower than that of the FCD algorithm (on average by 1.7%) and similar to that of OCDI. The reason is that CDOH and OCDI use similar community merging strategies and modularity update principles. While CDOH and OCDI merge multiple communities at a time in the same iteration, FCD only merges 2 communities in a single iteration, which explains the accuracy gap between FCD and the other 2 algorithms.
We also used the normalized mutual information (NMI) to evaluate our algorithm in comparison to the other 2 algorithms. NMI [33] is a standard measure often used to quantify the difference between the detected division and the true partition of the network. NMI is described by Equation (7), in which H(X) is the entropy of X and H(X|Y) = H(X, Y) − H(Y).
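NMI admits several normalizations in the literature; the sketch below uses the common form 2·I(X;Y)/(H(X)+H(Y)) together with the identity H(X|Y) = H(X, Y) − H(Y) quoted above (a sketch of one variant, not necessarily the exact form used in [33]):

```python
from collections import Counter
from math import log

def entropy(labels):
    # Shannon entropy of a labeling (natural log; the base cancels in NMI)
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(x, y):
    """NMI = 2*I(X;Y) / (H(X) + H(Y)), with I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))           # joint entropy H(X, Y)
    mutual = hx + hy - hxy
    return 2 * mutual / (hx + hy) if hx + hy > 0 else 1.0

# Identical partitions (even with permuted labels) score 1.0:
assert abs(nmi([0, 0, 1, 1], [1, 1, 0, 0]) - 1.0) < 1e-12
```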
We can see from Figure 4 that the NMI of the 3 algorithms reaches at least 75%. Our algorithm, CDOH, has an NMI score very similar to that of the FCD algorithm and slightly higher than that of OCDI. Again, we consider this to be due to the fact that the algorithms share similar community merging strategies and modularity update principles.
However, the computing cost of our algorithm is much better than that of FCD, which will be discussed in Section 4.3.

Analysis of Community Detection Efficiency
CDOH is a community detection algorithm based on the Hadoop platform for large-scale complex networks. For processing large-scale data, the run time of an algorithm is an important metric of its performance. Figure 5 compares the run times of the 3 considered algorithms.
It can be noticed from Figure 5 that CDOH is highly efficient: it is about 2.1 times faster than OCDI and 3.2 times faster than FCD, a speedup mainly determined by the number of slave nodes on the Hadoop platform. Compared with the traditional community detection algorithms, CDOH needs significantly less time for community merging and modularity updating.

Conclusions
In this paper, we proposed a community detection algorithm called CDOH based on the Hadoop platform to implement accurate and fast community identification in large-scale complex networks. The algorithm is based on the modularity increment calculation method and employs the theory of complex networks to find multiple communities satisfying the merging conditions. The parallel merging and modularity updating of communities based on MapReduce reduce the number of iterations. CDOH was compared with traditional complex network community detection algorithms on real large-scale complex networks, and the experimental results demonstrated its effectiveness in large-scale network community detection.

Future Works
Our proposed CDOH algorithm is independent of the underlying big data platform. To prove its effectiveness and efficiency, we implemented CDOH and the other complex network community detection algorithms on the Hadoop platform. However, on the Hadoop platform the intermediate MapReduce results are first stored in disk files, and the large number of I/O operations affects the overall computation time, whereas on the Spark platform the intermediate results are stored in memory, which avoids the performance overhead of I/O. In the future, we will implement the CDOH algorithm on the Spark platform and evaluate its efficiency. Furthermore, our proposed CDOH algorithm focuses on static complex network community discovery; in the future, we plan to adapt it to evolving community networks.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.