InfoFlow: A Distributed Algorithm to Detect Communities According to the Map Equation

Abstract: Formidably sized networks are becoming increasingly common in the social sciences, biology, neuroscience, and the technology space. Many networks are expected to be large enough to challenge the storage capability of a single physical computer. Here, we take two approaches to handling big networks. First, big data technology and distributed computing offer a way to store and process big data. Second, most networks can be partitioned or labeled into communities, clusters, or modules through the class of algorithms known as community detection, thus capturing the crux of the network while reducing detailed information. In this paper, we combine these two approaches, developing a distributed community detection algorithm to handle big networks. In particular, the map equation provides a way to identify network communities according to the information flow between nodes, and InfoMap is a greedy algorithm that uses the map equation. We develop discrete mathematics to adapt InfoMap into a distributed computing framework and then further develop the mathematics for a greedy algorithm, InfoFlow, which has logarithmic time complexity, compared to the linear complexity of InfoMap. Benchmark results on graphs of up to millions of nodes and hundreds of millions of edges confirm the time complexity improvement, while maintaining community accuracy. Thus, we develop a map equation based community detection algorithm suitable for big network data processing.


Introduction
Formidably sized networks are becoming increasingly common in the social sciences, biology, neuroscience, and the technology space, where the number of nodes and edges may exceed millions or billions. In such cases, the sheer size of the network presents challenges in processing, visualizing, understanding, or even storing the network [1][2][3].
When the network data exceeds the memory or disk storage capacity of a single computer, big data technology, including distributed filesystems and distributed processing techniques, can be used to store and process the data [1][2][3]. The caveat with big data technology is that parallel algorithms have to be designed and implemented in place of the original, serial algorithms [4].
Whilst big data technology provides a means of network storage and processing, smaller networks are much preferred for visualization and analysis. Thus, coarse-grained representations of big networks that preserve important network properties are paramount. For instance, many social systems show homophily in their network representations: nodes with similar properties tend to form highly connected groups called communities, clusters, or modules. Community detection [5][6][7][8][9][10] has been an active area of research, with ample algorithms to identify network communities.
Different approaches to community detection exist [5][6][7][8][9][10]. A more statistically oriented approach is clustering, where sets of points (nodes in the graph; edge information tends not to be used) are categorized into groups called clusters, based on distance or density distributions in the state space [11]. One popular algorithm is k-means clustering, where each point is iteratively assigned to the cluster with the nearest arithmetic mean [11].
When graph edges are considered, community detection algorithms often rely on the implicit heuristic that intra-community connections are more abundant than inter-community ones. Thus, the most popular approach, exemplified by the Louvain algorithm [12], is based on maximizing some measure of modularity that quantifies the ratio of intra- and inter-community edges relative to a random network. However, such approaches suffer from the "resolution limit" problem, where communities that should be distinct are merged into a bigger community [13].
Another approach is the information theoretic approach, where we interpret the edges of the network as transportation or flow between nodes. The map equation [14] provides an associated information cost for describing the movements within the network, given a community partitioning. If a network has regions in which a PageRank random surfer tends to stay for a long time, minimizing the cost of movement as described by the map equation would identify such regions as communities. Thus, this approach takes advantage of the duality between finding community structure in networks and minimizing the description length of a random surfer's movements on a network. Compared to the modularity approach, which tends to view graph edges as structural connections, and where the detected communities reflect how the network formed, the information theoretic approach sees graph edges as flow, and detects communities that reflect network dynamics [7]. In particular, the InfoMap algorithm [15] is a greedy algorithm that uses the map equation to partition a network into communities, performing well on both synthetic benchmarks and real networks [7]. Further, it does not suffer from any resolution limit problems [16]. It is under active research, with successful extensions that capture higher-order flow and time-dependent networks [17], overlapping communities [18], and multi-level communities [19].
In this paper, we adapt InfoMap into a distributed algorithm, given its strength in capturing network dynamics and its numerous extension possibilities. Similar projects exist to parallelize InfoMap [20][21][22][23], usually by parallelizing a certain subset of the serial algorithm, with further assumptions of thread-locking or graph locality on top of InfoMap. In this paper, we propose two advancements. First, we develop discrete mathematics to adapt InfoMap into a distributed computing framework. This is distinct from other existing works, since the entire algorithm and all data structures are parallelized and distributed, while keeping the algorithm identical to the serial InfoMap. Second, we further develop the mathematics for a greedy algorithm, InfoFlow, which has logarithmic time complexity, compared to the linear complexity of InfoMap. Benchmark results on graphs of up to millions of nodes and hundreds of millions of edges confirm the time complexity improvement, while maintaining community accuracy. Thus, we develop a map equation based community detection algorithm suitable for big network data processing.
This paper is structured as follows: In Section 2, we adapt InfoMap into a distributed computing framework, and develop the InfoFlow distributed algorithm. In Section 3, we perform benchmarks and review the results. In Section 4, we summarize and discuss future directions.

Methodology
In this paper, we build on the map equation and InfoMap to develop the distributed algorithm InfoFlow, which has improved runtime complexity and can be easily deployed and applied to big datasets. We briefly present the map equation in Section 2.1 for easy reference. Then, we develop the discrete mathematics in Section 2.2, which allows InfoMap to be adapted to a distributed computing framework. In Section 2.3, we further develop the discrete mathematics for the InfoFlow algorithm, which has logarithmic time complexity, compared to the linear time complexity of InfoMap. In Section 3, we benchmark the two algorithms.

The Map Equation
For a given network partition, the map equation [14] specifies the theoretical limit of how concisely we can describe the trajectory of a PageRank random surfer on the network. The underlying code structure of the map equation is designed such that the description can be compressed if the network has regions in which the random walker tends to stay for a long time.
Given a network with $n$ nodes indexed by Greek letters, we first perform PageRank analysis. The edges in the network are interpreted as transition probabilities, so that given an edge from node $\alpha$ to node $\beta$, with weight $\omega_{\alpha\beta}$, a PageRank random surfer transits from node $\alpha$ to node $\beta$ with probability proportional to the edge weight, not yet accounting for the teleportation probability. Since the edges signify transition probabilities, the edge weights are normalized with respect to the outgoing node, so that:

$$ \sum_{\beta} \omega_{\alpha\beta} = 1 $$

Each node is then associated with its ergodic frequency $p_\alpha$.
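The PageRank step above can be sketched in a few lines of Python. This is only an illustrative power iteration, not the Spark implementation used in this paper: the function name `pagerank`, the edge dictionary layout, the default teleportation probability of 0.15, uniform teleportation, and the uniform redistribution of mass from nodes without outgoing edges are all assumptions made for the sketch.

```python
def pagerank(n, edges, tau=0.15, tol=1e-12, max_iter=1000):
    """Return (p, omega): ergodic frequencies and normalized edge weights.

    n     -- number of nodes, labeled 0..n-1
    edges -- dict mapping (alpha, beta) -> positive weight
    tau   -- teleportation probability (assumed uniform over all nodes)
    """
    # Normalize weights per outgoing node so that sum_beta omega[alpha, beta] = 1.
    out_sum = [0.0] * n
    for (a, _), w in edges.items():
        out_sum[a] += w
    omega = {(a, b): w / out_sum[a] for (a, b), w in edges.items()}

    p = [1.0 / n] * n
    for _ in range(max_iter):
        # Assumption: mass on nodes with no outgoing edges is spread uniformly.
        dangling = sum(p[a] for a in range(n) if out_sum[a] == 0.0)
        new_p = [tau / n + (1.0 - tau) * dangling / n] * n
        for (a, b), w in omega.items():
            new_p[b] += (1.0 - tau) * p[a] * w
        diff = sum(abs(x - y) for x, y in zip(new_p, p))
        p = new_p
        if diff < tol:
            break
    return p, omega
```

For example, on a directed 3-cycle every node receives the same ergodic frequency of 1/3, as expected by symmetry.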
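The map equation [14] specifies that, if we partition the network nodes into modules, where each module is indexed by Latin letters, then the network will have an information entropy, or codelength, associated with its PageRank random walk:

$$ L = \operatorname{plogp}\Big(\sum_i q_i\Big) - 2\sum_i \operatorname{plogp}(q_i) - \sum_\alpha \operatorname{plogp}(p_\alpha) + \sum_i \operatorname{plogp}(p_i + q_i) \tag{1} $$

where:

$$ \operatorname{plogp}(x) = x \log_2 x \tag{2} $$

$p_i$ is the ergodic frequency of the module. This is simply the sum of the ergodic frequencies of the nodes within the module:

$$ p_i = \sum_{\alpha \in i} p_\alpha \tag{3} $$

and $q_i$ is the probability of exiting the module, accounting for PageRank teleportation:

$$ q_i = \tau \frac{n - n_i}{n - 1}\, p_i + (1 - \tau) \sum_{\alpha \in i} \sum_{\beta \notin i} p_\alpha \omega_{\alpha\beta} \tag{4} $$

with $n_i$ being the number of nodes within module $i$, and $\tau$ the probability of PageRank teleportation.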
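As a minimal sketch of how the codelength can be evaluated once the modular quantities are known, the following Python function assumes the standard expanded form of the map equation from [14], written with $\operatorname{plogp}(x) = x \log_2 x$. The function names and the list-based inputs are illustrative choices, not part of the original implementation.

```python
from math import log2


def plogp(x):
    """x * log2(x), with the usual convention that 0 * log2(0) = 0."""
    return x * log2(x) if x > 0.0 else 0.0


def codelength(node_freq, module_p, module_q):
    """Codelength of a partition.

    node_freq -- ergodic frequency p_alpha of every node
    module_p  -- ergodic frequency p_i of every module
    module_q  -- exit probability q_i of every module (same order as module_p)
    """
    q_total = sum(module_q)
    return (plogp(q_total)
            - 2.0 * sum(plogp(q) for q in module_q)
            - sum(plogp(p) for p in node_freq)
            + sum(plogp(p + q) for p, q in zip(module_p, module_q)))
```

A useful sanity check: if all nodes sit in a single module, the exit probability is zero and the codelength reduces to the Shannon entropy of the node frequencies, which is 2 bits for four equally visited nodes.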

InfoMap
Here, we develop the mathematics to construct $p_i$ and $q_i$, the ergodic frequency and the exit probability of each module, thereby allowing calculation of Equation (1). In particular, the InfoMap algorithm [15] starts with each node being its own module, and then in each iteration merges two modules into one to reduce codelength. We construct $p_i$ and $q_i$ by providing formulae for the quantities of the merged module based on those of the two modules to be merged.
We can rewrite Equation (4) as:

$$ q_i = \tau \frac{n - n_i}{n - 1}\, p_i + (1 - \tau)\, w_i \tag{5} $$

with:

$$ w_i = \sum_{\alpha \in i} \sum_{\beta \notin i} p_\alpha \omega_{\alpha\beta} \tag{6} $$

being the exit probability without teleportation. We can define a similar quantity, the transition probability without teleportation from module $j$ to module $k$:

$$ w_{jk} = \sum_{\alpha \in j} \sum_{\beta \in k} p_\alpha \omega_{\alpha\beta} \tag{7} $$

Now, if we merge modules $j$ and $k$ into a new module with index $i$, the exit probability follows Equation (5) with:

$$ n_i = n_j + n_k \tag{8} $$

$$ p_i = p_j + p_k \tag{9} $$

and the exit probability without teleportation can be calculated via:

$$ w_i = \sum_{\alpha \in i} \sum_{\beta \notin i} p_\alpha \omega_{\alpha\beta} $$

Since we are looking at the exit probability of a module, there are no self-connections within modules, so that the specification of $p_\alpha \omega_{\alpha\beta}$ given $\alpha \in i$, $\beta \notin i$ is redundant. Then we have:

$$ w_i = \sum_{\alpha \in j} \sum_{\beta \notin j,\, \beta \notin k} p_\alpha \omega_{\alpha\beta} + \sum_{\alpha \in k} \sum_{\beta \notin j,\, \beta \notin k} p_\alpha \omega_{\alpha\beta} $$

which conforms with intuition: the exit probability without teleportation of the new module is equal to the exit probability of all its nodes, without counting the connections from $j$ to $k$ or from $k$ to $j$. We can further simplify the mathematics by expanding the non-inclusive set specification:

$$ w_i = \sum_{\alpha \in j} \Big( \sum_{\beta \notin j} - \sum_{\beta \in k} \Big) p_\alpha \omega_{\alpha\beta} + \sum_{\alpha \in k} \Big( \sum_{\beta \notin k} - \sum_{\beta \in j} \Big) p_\alpha \omega_{\alpha\beta} $$

Expanding gives:

$$ w_i = \sum_{\alpha \in j} \sum_{\beta \notin j} p_\alpha \omega_{\alpha\beta} - \sum_{\alpha \in j} \sum_{\beta \in k} p_\alpha \omega_{\alpha\beta} + \sum_{\alpha \in k} \sum_{\beta \notin k} p_\alpha \omega_{\alpha\beta} - \sum_{\alpha \in k} \sum_{\beta \in j} p_\alpha \omega_{\alpha\beta} $$

which by definition is:

$$ w_i = w_j + w_k - w_{jk} - w_{kj} \tag{16} $$

So now we can calculate $w_i$, and by Equation (5) we can calculate $q_i$. We can do the same for $w_{il}$, if we merge modules $j$ and $k$ into $i$, and $l$ is some other module:

$$ w_{il} = \sum_{\alpha \in i} \sum_{\beta \in l} p_\alpha \omega_{\alpha\beta} = \sum_{\alpha \in j} \sum_{\beta \in l} p_\alpha \omega_{\alpha\beta} + \sum_{\alpha \in k} \sum_{\beta \in l} p_\alpha \omega_{\alpha\beta} = w_{jl} + w_{kl} \tag{20} $$

and similarly for $w_{li}$:

$$ w_{li} = w_{lj} + w_{lk} \tag{21} $$

With these calculations, we are able to construct the modular properties after each pair-wise merge, where the relevant properties are $n_i$, $p_i$, $w_i$, and $w_{ij}$. We can forget about the actual nodal properties; after each merge, we only need to keep track of modular properties. Now, we can implement the InfoMap algorithm, where initially each node is its own module, and then in each iteration, we merge the two modules that offer the greatest reduction in codelength, with the new module having modular properties according to Equations (8), (9), (16), (5), (20), and (21). The algorithm terminates when no more merges are possible to reduce codelength. Since the maximum number of merges is $e - 1$, where $e$ is the number of edges in the network, the number of merges has complexity $O(e)$. A graphical illustration of the InfoMap algorithm is shown in Figure 1.
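The pair-wise update rules can be sketched as follows. This is a hedged illustration rather than the Spark code of [25]: the `Module` record, the function names, and the dictionary of inter-module links are hypothetical, and the teleportation term in `exit_probability` assumes the form $\tau (n - n_i)/(n - 1)\, p_i$.

```python
from dataclasses import dataclass


@dataclass
class Module:
    n: int    # number of nodes in the module
    p: float  # ergodic frequency of the module
    w: float  # exit probability without teleportation


def exit_probability(n_total, module, tau=0.15):
    """Exit probability with teleportation (assumed form of Equation (5))."""
    return (tau * (n_total - module.n) / (n_total - 1) * module.p
            + (1.0 - tau) * module.w)


def merge_pair(mj, mk, w_jk, w_kj):
    """Modular properties of the module obtained by merging j and k."""
    return Module(n=mj.n + mk.n,
                  p=mj.p + mk.p,
                  w=mj.w + mk.w - w_jk - w_kj)


def merge_links(links, j, k, i):
    """Relabel modules j and k as i and aggregate inter-module links.

    links -- dict mapping (source module, target module) -> w
    Links that become internal to the merged module are dropped.
    """
    merged = {}
    for (src, dst), w in links.items():
        src2 = i if src in (j, k) else src
        dst2 = i if dst in (j, k) else dst
        if src2 != dst2:
            merged[(src2, dst2)] = merged.get((src2, dst2), 0.0) + w
    return merged
```

After a merge, only the modular records and the aggregated links need to be kept, which is what makes the update local and easy to distribute.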

InfoFlow
InfoMap merges two modules in each iteration, giving $O(e)$ loops. One obvious possible improvement is to perform multiple merges per loop. However, this idea is not compatible with performing pair-wise merges, unless we can make sure no module is involved in more than one merge at once.
Here, rather than making sure that no module is involved in more than one merge at once, we explore the idea of merging multiple modules at once. Thus, we can perform parallel merges in the same loop iteration, where possibly all modules are involved in some merge.
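Consider multiple modules $M_i$ merging into a module $M$. Equivalently, the module $M$ is partitioned into non-overlapping subsets $M_i$:

$$ M = \bigcup_i M_i, \qquad M_i \cap M_j = \emptyset \ \text{ for } i \neq j $$

Expanding the nodal sum over module $M$ into the sum over all nodes in all submodules $M_i$, the exit probability without teleportation of the merged module $M$ becomes:

$$ w_M = \sum_{\alpha \in M} \sum_{\beta \notin M} p_\alpha \omega_{\alpha\beta} = \sum_i \sum_{\alpha \in M_i} \sum_{\beta \notin M} p_\alpha \omega_{\alpha\beta} = \sum_i \sum_{\alpha \in M_i} \Big( \sum_{\beta \notin M_i} p_\alpha \omega_{\alpha\beta} - \sum_{\beta \in M,\, \beta \notin M_i} p_\alpha \omega_{\alpha\beta} \Big) $$

where we expand the second term with respect to the $M_j$'s to give:

$$ w_M = \sum_i \sum_{\alpha \in M_i} \sum_{\beta \notin M_i} p_\alpha \omega_{\alpha\beta} - \sum_i \sum_{j \neq i} \sum_{\alpha \in M_i} \sum_{\beta \in M_j} p_\alpha \omega_{\alpha\beta} $$

which we can recognize as:

$$ w_M = \sum_i w_{M_i} - \sum_i \sum_{j \neq i} w_{M_i M_j} \tag{29} $$

$$ w_{M_i} = \sum_{\alpha \in M_i} \sum_{\beta \notin M_i} p_\alpha \omega_{\alpha\beta} \tag{30} $$

$$ w_{M_i M_j} = \sum_{\alpha \in M_i} \sum_{\beta \in M_j} p_\alpha \omega_{\alpha\beta} \tag{31} $$

where we can immediately see that Equation (29) is a linear generalization of Equation (16), while Equations (30) and (31) are identical to the previous definitions, and may be calculated iteratively as in the previous algorithm. We can calculate $w_{M_i M_j}$ by expanding on the partitioning, with $M_k$ and $M_l$ denoting the submodules that were merged to form $M_i$ and $M_j$:

$$ w_{M_i M_j} = \sum_{M_k \subseteq M_i} \sum_{M_l \subseteq M_j} \sum_{\alpha \in M_k} \sum_{\beta \in M_l} p_\alpha \omega_{\alpha\beta} = \sum_{M_k \subseteq M_i} \sum_{M_l \subseteq M_j} w_{M_k M_l} \tag{33} $$

so that when we merge a number of modules together, we can calculate the connections of the merged module with other modules by aggregating the existing modular connections. This is directly analogous to Equation (20). Thus, the mathematical properties of merging multiple modules into one are identical to those of merging two modules. This is key to developing our multi-merge algorithm, InfoFlow.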
As in InfoMap, each node is initially its own module. During each iteration, each module seeks to merge with the connected module that offers the greatest reduction in codelength, considering only the merging of the two modules. If no such merge exists, the module does not seek to merge. Then, the weakly connected components of modules linked by these merge seeks are each merged into one module, according to Equations (29) and (33). This seek-merging and bulk-merging is iterated until the codelength cannot be reduced. A graphical illustration of InfoFlow is shown in Figure 2.
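One loop of the bulk-merging step can be sketched as below. The sketch takes the merge seeks as given (choosing the best partner by codelength reduction is not shown) and uses a simple union-find to obtain the weakly connected components; both the union-find and the tuple-based data layout are assumptions made for illustration, not a description of the Spark implementation in [25].

```python
def find(parent, x):
    """Root of x in a union-find forest, with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def bulk_merge(modules, links, seeks):
    """Merge every weakly connected component of the merge-seek graph.

    modules -- dict: module id -> (n, p, w)
    links   -- dict: (source id, target id) -> w between modules
    seeks   -- dict: module id -> id of the module it seeks to merge with
    Returns (new_modules, new_links), keyed by component representative.
    """
    parent = {m: m for m in modules}
    for a, b in seeks.items():
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    label = {m: find(parent, m) for m in modules}

    # First term of Equation (29): sum n, p and w over the submodules.
    new_modules = {}
    for m, (n, p, w) in modules.items():
        n0, p0, w0 = new_modules.get(label[m], (0, 0.0, 0.0))
        new_modules[label[m]] = (n0 + n, p0 + p, w0 + w)

    # Second term of Equation (29): subtract links that became internal.
    # All other links are aggregated between the merged modules.
    new_links = {}
    for (src, dst), w in links.items():
        ls, ld = label[src], label[dst]
        if ls == ld:
            n0, p0, w0 = new_modules[ls]
            new_modules[ls] = (n0, p0, w0 - w)
        else:
            new_links[(ls, ld)] = new_links.get((ls, ld), 0.0) + w
    return new_modules, new_links
```

Because every module contributes to exactly one component, the two aggregation passes are plain group-by-and-sum operations, which is the shape of computation that maps naturally onto a distributed framework.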
Suppose $n$ nodes are partitioned into $m$ final modules by InfoFlow. If we assume that in each loop $k$ modules merge into one on average, and that there are $l$ loops, we have:

$$ n k^{-l} = m \tag{34} $$

$$ k^{l} = \frac{n}{m} \tag{35} $$

$$ l = \log_k n - \log_k m \tag{36} $$

so that we have $O(\log_k n)$ merge loops, while each loop has $O(k)$ time complexity related to the connected components of $O(k)$ modules. Thus, the overall average time complexity is $O(k \log_k n)$.
The worst-case complexity arises in two cases: first, when InfoFlow degenerates into InfoMap, i.e., we have $O(e)$ loops, each merging only two modules into one; second, when $l = 1$ and $k = n/m$, so that the overall complexity is $O(k) = O(n/m)$.
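The loop estimate can be checked numerically with a small sketch. The helper names are hypothetical; the numbers come from the 25-node illustration in Figure 2, where 25 initial modules end in 4 communities after 2 loops.

```python
from math import log


def estimated_loops(n, m, k):
    """Number of loops l such that n / k**l = m, i.e. l = log_k(n / m)."""
    return log(n / m) / log(k)


def average_merge_factor(n, m, loops):
    """Average number of modules merged into one per loop, k = (n / m)**(1 / l)."""
    return (n / m) ** (1.0 / loops)


k = average_merge_factor(25, 4, 2)   # average merge factor for the Figure 2 graph
halfway = 25 / k                     # expected number of modules after one loop
```

With these inputs the average merge factor is 2.5 and the expected module count after one loop is 10, the geometric mean of 25 and 4, which is close to the 9 modules observed in Figure 2.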

Simulation and Results
InfoMap and InfoFlow are implemented on Apache Spark [24] using the Scala language. The code, which is open source and can be accessed online [25], is set up and run on a local desktop computer, a Dell Precision T3610 running Windows 7 Professional. For reference, the computer has an Intel Xeon E5-1620 v2 CPU @ 3.70 GHz and 64 GB of RAM. Apache Spark 2.1.1 and Hadoop 2.7 are used. When possible, a RAM disk was used to speed up simulations.
First, we ran both InfoMap and InfoFlow on real-world data of small to moderately sized graphs from [26][27][28], and compared the runtime performance and resultant communities. The results are tabulated in Table 1, listing the nodes and edges of each graph, the number of loops run before the algorithms completed, the runtime (for the sake of comparison, runtime common to the two algorithms, including initialization time and PageRank runtime, is not counted), and the final partitioning codelength. The normalized mutual information between the community partitionings given by the two algorithms is also tabulated.
Table 1. Benchmarking results for InfoMap and InfoFlow, for datasets of various sizes, on the Dell Precision T3610. To aid comparison, runtime common to both algorithms, including file reading time and PageRank runtime, is excluded from the table. Runtime is in units of seconds. In these test cases, while runtime complexity is linear for InfoMap, it is pseudo-constant for InfoFlow, which keeps very high accuracy in the final partitioning according to normalized mutual information (NMI). The datasets are real-world graphs taken from [26][27][28].

We see that, for InfoMap, the number of loops and the runtime follow a linear relationship with the size of the graph, while for InfoFlow the number of loops is kept within 4 and the runtime within 30 s, confirming a pseudo-constant or logarithmic complexity relationship with the size of the graph, in line with the estimate of Equation (36). As a concrete example of the speedup in InfoFlow, for the graph with 27,770 nodes and 352,807 edges, InfoFlow takes 1000 times fewer loops, and the runtime is nearly 3.5 h for InfoMap versus 25 s for InfoFlow, representing nearly a 500-fold speedup.
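As a quick arithmetic check of the quoted speedup, taking the rounded runtimes of 3.5 h and 25 s at face value (the exact values are those listed in Table 1):

$$ \frac{3.5\ \text{h} \times 3600\ \text{s/h}}{25\ \text{s}} = \frac{12600}{25} = 504 \approx 500 $$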

As analyzed in Equations (34)-(36), this speedup in runtime and complexity is a direct consequence of going from the pair-wise merging in InfoMap to the multi-merging in InfoFlow. A natural concern is whether the bulk-merging of InfoFlow sacrifices partitioning accuracy. From the results of Table 1, we see that the codelengths of the two algorithms are very similar, rarely differing by more than 5%. The normalized mutual information (NMI) is often kept higher than 80%, so the accuracy in community detection is not compromised when going from InfoMap to InfoFlow.
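For readers unfamiliar with NMI, the following sketch shows one common way to compute it between two partitions given as per-node community labels. The normalization used here, $2 I(A;B) / (H(A) + H(B))$, is an assumption; this paper does not state which NMI variant was used for Table 1.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two partitions, each given as one community label per node."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # Entropies of the two partitions.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    # Mutual information from the joint label counts.
    mi = sum(c / n * log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in count_ab.items())

    if h_a + h_b == 0.0:
        return 1.0  # both partitions place every node in a single community
    return 2.0 * mi / (h_a + h_b)
```

The measure depends only on how nodes are grouped, not on the label values, so two identical partitions with different community names still score 1, while statistically independent partitions score 0.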
Having compared the runtime and accuracy of InfoMap and InfoFlow, we now apply InfoFlow to bigger data, with graphs of up to millions of nodes and hundreds of millions of edges [26][27][28], until the limit of the computing resource is challenged on the Dell Precision T3610. The results of the simulations are tabulated in Table 2. We see that the number of loops is kept within 20, so a pseudo-constant or logarithmic complexity is well followed, while the runtime is kept within a few hours. The runtime does not follow any obvious relationship with the number of nodes or edges, since the processing of the dataset challenges the limit of the computer, and complex performance issues with memory caching and paging come into play. Importantly, the runtime is of the same order of magnitude as the PageRank calculation time. Since the latter is a prerequisite for the map equation approach, the benchmarking results indicate we have optimal runtime complexity within the map equation approach to community detection.

Table 2. Benchmarking results for InfoFlow, on the Dell Precision T3610, for datasets that challenge the limits of the computing resource. The runtimes for graph reading and PageRank calculations are listed to add perspective on community detection speed. Importantly, the number of loops is very small, indicating logarithmic complexity, and the community detection runtime is of the same order of magnitude as PageRank. Since the latter is a prerequisite for the map equation approach to community detection, InfoFlow may have optimal runtime complexity for this class of algorithm. Runtime is in units of seconds. The datasets are real-world graphs taken from [26][27][28].

Conclusions
With a view to developing a distributed community detection algorithm, we developed discrete mathematics on the map equation to provide formulae for the modular properties of pairwise-merged modules, which enabled the implementation of the InfoMap algorithm on a distributed computing framework. We then generalized this from a pairwise merge to merging an arbitrary number of modules, which gave rise to the algorithm InfoFlow. Benchmarking results on an Apache Spark implementation confirmed that InfoFlow has logarithmic runtime complexity, compared to the linear time complexity of InfoMap, while retaining accuracy in the community results.
Similar projects to develop distributed community detection algorithms, in particular InfoMap, exist [20][21][22][23]. These projects parallelize certain segments of the algorithm while keeping other segments and data structures in serial, with assumptions on thread locking or graph locality. In contrast, in this paper, we developed discrete mathematics to adapt InfoMap into a distributed computing framework, while keeping the algorithm identical, with no additional assumptions being made. In other words, we developed the mathematical formulation that enables parallel and distributed computing, rather than developing an inherently parallel algorithm. This is a significant development in InfoMap. InfoFlow was built with only one additional development, multi-merging, as opposed to the pair-wise merging in InfoMap. Benchmark results showed that this development improves runtime complexity while retaining result accuracy. Thus, the mathematics is a significant contribution to the research in InfoMap, which will be valuable for future research in InfoMap extensions such as hierarchical structures, overlapping structures, and higher-order Markov dynamics [17][18][19].
The coding implementation [25] is open source and implemented in Apache Spark, which is actively maintained, with proven performance, reliability and scalability, and with contributions from companies such as Google, Facebook, and IBM [29]. It can be easily configured and deployed on clusters and cloud platforms. This is in contrast to, for example, the implementation in [23], which used the GraphLab PowerGraph library [30], which was not actively maintained. Another example is [20], which used the Thrill library [31], which is still in the experimental phase. The choice of distributed computing library framework, along with the computational environment, is one of the major factors affecting runtime, memory consumption and other performance metrics. Meanwhile, algorithmic performance, which is agnostic to the coding implementation and library environment, may be measured via theoretical space and time complexity, where the logarithmic runtime complexity of InfoFlow shines.

Figure 1. Graph with 25 nodes as a visual illustration of the InfoMap algorithm. Each node is labeled with a node number for reference and is colored according to its assigned community. In each step, two communities are merged into one. There are 21 merges in total; for brevity, only the first four and the last two merges are shown. (a) In the initial graph, each node is its own module; (b) nodes 23 and 24 are merged; (c) node 22 is merged with module 23-24; (d) nodes 17 and 19 are merged; (e) node 25 is merged with module 22-23-24; (f,g) the last two merges, which leave a final partitioning of four communities.

Figure 2. The same 25-node graph, showing all merges according to the InfoFlow algorithm. There are 2 loops in total, compared to 21 in InfoMap. (a) The initial graph is shown for easy reference; (b) after one loop, the 25 communities are merged into 9, which is roughly the geometric mean of 25 and 4; (c) the final partitioning is identical to that of InfoMap.
