Community Structure Detection for Directed Networks through Modularity Optimisation

: Networks constitute powerful means of representing various types of complex systems, where nodes denote the system entities and edges express the interactions between the entities. An important topological property in complex networks is community structure, where the density of edges within subgraphs is much higher than across different subgraphs. Each of these subgraphs forms a community (or module). In literature, a metric called modularity is deﬁned that measures the quality of a partition of nodes into different mutually exclusive communities. One means of deriving community structure is modularity maximisation. In this paper, a novel mathematical programming-based model, DiMod, is proposed that tackles the problem of maximising modularity for directed networks.


Introduction
It is increasingly clear that a wide range of systems across different disciplines can be described using network representations.The edges in a network can be binary, weighted, directed and with a positive or negative sign, thus rendering networks a suitable tool to model diverse types of interactions [1].One common observation is that real world networks are not random graphs with homogeneous edge distribution, rather, a high density of edges exist within subsets of the network, while edge density across these subgraphs is much lower, giving rise to the emergence of a property known as community structure [2].
Community structure detection refers to the procedure of identifying the inherent higher order structure of a network by partitioning nodes into different modules.The modularity metric (Q) was introduced by Newman and Girvan [3] for undirected networks and measures the quality of a network partition into communities.A modularity value close to 1 suggests strong community structure, i.e., a larger proportion of edges falling within modules than at random.
Community detection is often formulated as an optimisation problem, where the partition that yields the maximum modularity for a target network is sought.This metric, however, has been proven to be NP-complete [4], and, therefore, exhaustive search algorithms are only applicable to small networks, as the number of possible partitions increases at least exponentially with the number of nodes.Modularity has also been shown to have a resolution limit [5] and a degenerate solution space [6].There have been attempts in the literature to address resolution limit issues by adding a resolution parameter, but, as this parameter is hard to determine and increases method complexity, the problem is still not fully resolved [7].
While modularity optimisation for undirected networks has been well studied over the past decade, little has been done for module detection in directed networks.Many real world networks are inherently directed, including the World Wide Web [18], brain neural networks [19] and metabolic networks, where the directed edge represents material flow from one substrate to a product that may be irreversible [20].In literature, the conventional manner of tackling the problem of community detection in directed networks is simply to treat the networks as undirected and apply the methods described above [21]; however, such approaches lose valuable information carried in edge directionality.Some algorithms use edge directionality to transform directed graphs to weighted networks or bipartite networks [22].Recent algorithms propose the use of spectral optimisation [23], local extraction of communities [24] and blockmodeling or statistical inference models [25].
In this study, we use the mathematical description of modularity for directed networks introduced by Leicht and Newman [26] that explicitly considers the in and out-degree distributions of nodes in the network.We propose a two-step algorithm named DiMod composed of two mathematical programming models to detect modules in directed networks by maximising modularity.The two models have equivalent objective functions, but the terms of the equations are rearranged differently to accentuate a desired property of the mathematical model, as described in the section below.We compare the results of our method to three algorithms implemented in the Radatools software (v.4.0,DEIM, Tarragona, Spain) [27] on synthetic and real networks used as test cases.

Iterative Mathematical Programming Model for Modularity Optimisation on Directed Networks
We propose an iterative procedure that contains two major mathematical models to optimise modularity for directed networks.The first is a Mixed Integer Non-linear Programming Model (MINLP) that can provide a good partition of the network quickly but is likely to converge to a local optimal region, while the second model, a Mixed Integer Linear Programming Model (MIP), is harder to solve but is capable of finding a solution of higher quality.The algorithm starts by solving the MINLP a number of times, and the best partition is then selected and given as an initial point to the reduced MIP model that will iteratively improve the solution.We reduce the complexity of the MIP by allowing a few nodes to change modules while keeping all other nodes in their initial allocation and we solve each of these reduced MIPs for each module in an iteration.
As mentioned previously, the reason for proposing two different models for solving the same problem is that although a MINLP can find an acceptable solution for a large problem, it often converges to locally optimal solutions, and thus the quality of the solution is hard to guarantee.On the other hand, an MIP model can be solved to global optimality for small problems but consumes large computational resources for larger networks.We combine the best of these two types of models by first solving the MINLP and then the reduced MIPs to improve the acquired solution.The entire computational strategy is named DiMod and the mathematical description follows.
The sets, parameters and variables used in the models are described below: The next two sections describe the two models, and the following section describes our proposed algorithm and how both stages are employed to find the models of directed networks.

First Model-MINLP
Modularity is defined as the number of edges that fall within communities minus the expected number of edges that should fall into communities in a null configuration of the equivalent network with edges being placed at random [3].The modularity for directed networks can be formulated as below [26]: where L m L represents the fraction of (weighted) edges that fall into module m, while is the expected value for module m.
The sum of weights of edges in module m, L m , is computed as: indicating that an edge pointing from node n to node e is included in a module m if and only if both nodes belong to module m, i.e., Y nm = 1 and Y em = 1.For a given directed network, the sum of weights of edges coming into node n is denoted as parameter d in n , while the sum of weights of edges pointing from node n is d out n .For an unweighted network, d in n and d out n are simply the in-degree and out-degree of node n.For a module m, the sum of d in n and d out n over all nodes belonging to this module (Y nm = 1) are, respectively, calculated as below: We are interested in non-overlapping partitions where each node can only be allocated to exactly one module, and this is modeled via the following constraints: The first step of the algorithm, the full MINLP model, is summarised below: maximize Q subject to constraints (Equations (1)-( 5))

Second Model-MIP
The presence of non-linearity, combined with the use of integer variables, present considerable computational difficulty for finding globally optimal solutions.Solving MINLP problems typically involves repeatedly specifying different initial starting points and solving the model to identify locally optimal solutions, which can generally be realised in affordable computational time.Thus, the MINLP model in the previous section is used to provide an initial network division before a more sophisticated method can be applied to refine the division.In this section, the MINLP model is reformulated as an MIP by redefining the two non-linear constraints (Equation ( 2) and the multiplication D in m D out m in Equation ( 1)) as linear constraints.
Firstly, Equation ( 2) can be replaced by the following three sets of constraints: where LS nem are newly introduced positive intermediate variables.For edges from node n to e, LS nem is equal to its weight, β ne , if it belongs to module m; 0, otherwise.The non-linear term in the objective function, D in m D out m , can be re-written as below: where D in m Y nm is the product of a continuous variable D in m and a binary variable Y nm and can be made linear using the following set of equations: where DY nm are introduced as new positive variables to replace the term D in m Y nm , the variable DD in_out m is introduced to replace D in m D out m , and U is an arbitrarily large number.The formulation of the objective function for this model becomes: and the full model is given by the set of equations below: maximize Q subject to constraints (Equations (3), ( 5)-( 11))

Full Algorithm
The proposed algorithm has three user-defined parameters: N MI NLP , the number of iterations for the MINLP model; N MIP , the number of iterations for the MIP model and N relaxed , the number of nodes to be released for each module at each MIP run.The optimal number of modules is determined by the algorithm itself, but we set a maximum number of modules to generate the set m.After the first model is solved N MI NLP times, the best value of modularity (Q best ) and its corresponding partition (Y best nm ) are selected and used in the second model.At each iteration of the MIP, N relaxed nodes are allowed to change modules, while the remaining nodes are kept in their previous modules.Each node within the current module is relaxed once only and the procedure of random node selection and re-allocation continues until all the nodes within the current module have been relaxed.In our preliminary tests, we have found that N relaxed = 60 provided the best compromise in calculation time and quality of solution for networks of different sizes.The full proposed model is summarised in Algorithm 1.
Algorithm 1: Our proposed algorithm DiMod for detecting modules in directed networks.

Data: Directed network
Result: The solution at the end of the final iteration is taken as the final network division Q best = 0; for N runs ← Compared to solving one large MIP model that directly optimises the community memberships of all nodes simultaneously, solving a series of reduced models has the advantage of reaching global optimality for each reduced problem at a small computational cost.In the next section, a number of directed networks are used to demonstrate the applicability and efficiency of the proposed community detection algorithm.

Synthetic Networks
The performance of our proposed community detection method is tested against other established methods in literature using synthetic networks generated by the Lancichinetti-Fortunato-Radicchi (LFR) benchmark [28].We have generated seven non-overlapping directed unweighted networks with the following parameters: 500 nodes; default minus exponent for degree sequence and community size distribution, t1 = 2 and t2 = 1, respectively; maximum in-degree maxk = 50; and minimum and maximum number of nodes per community, minc = 20 and maxc = 100, respectively.Each of the networks had a different mixing parameter (µ ∈ {0.1, 0.2, . . ., 0.7}), where larger µ indicates more connections across different communities.Overall, the collection of synthetic networks used allows validating the efficiency of our algorithm in networks with different properties of community structure.
We compare our results to three popular algorithms that optimise modularity for directed networks: extremal optimisation [11], fast algorithm [10] and Tabu search [29], as implemented in Radatools [27].We have executed each algorithm individually 100 times or until they reach a time limit of 24 h, and the best network division is reported for comparison.In terms of the proposed community detection approach, we solve 10 MINLPs (N MI NLP = 10), we perform 100 iterations on the MIP step of the algorithm (N MIP = 100) and 60 nodes are relaxed for each solve of the MIP model (N relaxed = 60).We implemented the models in GAMS [30], we use SBB [31] to solve the MINLP models and CPLEX [32] to solve the MIP models, and the optimality gap is set to 0, i.e., global optimum is achieved if the solver is allowed to run for unlimited time.The CPU time limit for each model was set to 1500 s.The deterministic nature of the proposed approach means only one single execution is required.
Figure 1 shows how the solution compares to the true community structure using the Normalized Mutual Information metric (NMI) [33].We note that DiMod uncovers the true network communities even in larger mixing parameters.The Extremal Optimisation algorithm performs relatively well, while Fast algorithm and Tabu search fail to identify the ground truth communities before µ = 0.5.

Real Networks
We demonstrate the performance of our proposed method in tests of five real networks.Table 1 shows a summary of the network properties.We include the directed and weighted neural network of Caenorhabditis elegans [34].Networks for Mycobacterium tuberculosis and Plasmodium falciparum, represent biological pathway interactions as extracted from the Reactome database (May 2014) [35].The network for Roget's Thesaurus details cross-references between categories of English words, where one edge points from one category to another if a reference is provided to the latter among the words of the former [36].In order to show that DiMod can solve larger problems, we include a snapshot of the Gnutella peer-to-peer network, gnutella08, obtained from the Stanford Large Network Dataset Collection (SNAP) [37].Our tests have shown that the second model (MIP) improves the quality of network division obtained by the MINLP model in the first stage.Table 2 below outlines the improvement in modularity between the more coarse initial network division provided by the MINLP and that of the final solution identified by the MIP step.Roget's Thesaurus is the network benefiting the most from the iterative procedure, where modularity is improved by about 16%.The iterative algorithm also boosts modularity by nearly 10% for Mycobacterium tuberculosis while improvements of around 4% can be observed in C. elegans and Plasmodium falciparum networks.The gnutella08 network also has an improvement of approximately 8% from the first to the second step and Figure 2 shows how modularity is improved at the end of each MIP iteration.As for synthetic networks, results obtained through DiMod for real networks are compared against modularity maximisation approaches in the literature (extremal optimisation [11], fast algorithm [10] and Tabu search [29]).Overall, the methodology proposed here consistently outperforms other methods, as shown in Table 3, which summarises our computational results.

Conclusions
This works reports novel mathematical programming formulation to address the problem of community structure detection in directed networks.While modularity optimisation has been extensively studied for undirected networks, there is little research effort to detect modules in directed networks.A mathematical programming-based optimisation approach has been introduced to fill the gap in literature.Modularity optimisation for a directed network can be conveniently formulated as an MINLP model, which can converge to locally optimal solutions quickly even in large networks.The MINLP model is reformulated as an MIP model by re-writing non-linear terms into linear equivalents, which can then be solved to obtain globally optimal solutions for small networks.The novel community detection method proposed consists of two major steps, taking advantage of both models.Firstly, the MINLP model is solved to produce an initial coarse network division.Given the initial network division, the iterative algorithm works by repeatedly removing the community memberships of random sets of nodes, solving the reduced MILP model and re-allocating the relaxed nodes to communities.
Using synthetic and real-world directed networks covering a wide range of node and edge sizes, the proposed iterative algorithm considerably improves the quality of the initial coarse network partition.Compared with three community detection methods in literature, the proposed approach is demonstrated to identify the best network division consistently, yielding the largest modularity value.Another advantage of the proposed approach is its deterministic nature, which means multiple executions are likely to converge into the same network division.
Finally, we note that owing to the nature of the mathematical programming framework used here, modelling can be flexible enough to allow additional constraints and parameters to be easily implemented according to user requirements [38].Prior knowledge on a particular system can be incorporated, for example in the form of nodes with similar functional annotations that may be constrained to be allocated in the same community [39].In future work, we plan to study how the density of nodes and distributions of in-degree and out-degree affect the hardness and calculation time of the optimisation problem.We also plan to modify current integer programming models so as to detect and prevent the resolution limit.

Figure 1 .
Figure 1.Performance of algorithms on synthetic directed networks.
for all nodes that belong to module m (Y nm = 1) L m sum of edge weights in module m LS nem a positive intermediate variable.LS nem = β ne if both nodes n to e belong to module m; 0 otherwise D in m Y nm represent the product of D in m and Y nm , used as an intermediate variable for the MIP model DD in_out n, e node m module l ne directed edge pointing from node n to e Parameters β ne weight of edge point from node n to e d in n sum of weights over all edges points to node n; incoming edge weight d out n sum of weights over all edges points from node n; outgoing edge weight L total amount of weights over all edges in the given network Binary Variables Y nm 1 if node n belongs to module m; 0 otherwise m represent the product of D in m and D out m , used as an intermediate variable for the MIP model

Table 1 .
Summary of networks used as benchmarks in this study.

Table 2 .
Modularity improvement achieved by second step of the proposed method over the initial division network given by the MINLP.

Table 3 .
Comparison of different community detection methods on real networks.