Network Community Detection on Metric Space

Community detection in complex networks has attracted much interest in recent years. In general, a community detection algorithm chooses an objective function that captures the communities of the network and then uses various heuristics to solve the resulting optimization problem and extract the communities of interest to the user. In this article, we demonstrate a procedure to transform a graph into points of a metric space and develop methods of community detection with the help of a metric defined for a pair of points. We have also studied and analyzed the community structure of the network therein. The results obtained with our approach are very competitive with most of the well-known algorithms in the literature, and this is justified over a large collection of datasets. Moreover, the time taken by our algorithm is considerably lower than that of other methods, which agrees with the theoretical findings.


Introduction
The rise of on-line networking communities in real-world graphs, such as large social networks, web graphs and biological networks, has initiated the important direction of network community detection [1][2][3][4]. A network community (also known as a module or cluster) is typically a group of nodes with more interactions among its members than with the remaining part of the network [5][6][7]. To extract such groups of nodes from a network, one typically selects an objective function that captures the intuition of a community as a set of nodes with better internal connectivity than external connectivity [8,9]. The objective is generally NP-hard to optimize [6,8]; heuristics [10,11] or approximation algorithms [6] are used in practice to find sets of nodes that approximately optimize the objective function, and these sets are interpreted as real communities.
Another important approach is to define communities as the output of an algorithm that converges automatically, with some intuitive hope to extract good communities [12,13]. Identified communities carry different meanings in different domains. In a social network, a community means an organizational unit; in a biochemical network, a functional unit; in a collaboration network, a scientific discipline; and so on [14].
Our observations regarding the development of network community detection algorithms are as follows: (1) network community detection is NP-hard, but, unlike data clustering, it is not "easy" NP-hard, due to the lack of good heuristics; (2) both graph traversal-based methods and spectral methods are computationally overloaded due to the verification of the objective function value, which is required to guide the next iteration; and (3) the rich literature of clustering is not very suitable for graph data.
Some methods are available for network community detection that try to develop a similarity or distance function among the nodes of a complex network and to use that similarity or distance for partitioning the network [15][16][17][18][19][20][21]. Most of the methods of community detection based on similarity or distance mainly use the shortest path, Jaccard similarity, set similarity or Euclidean distance, and they are less successful for network community detection in terms of conductance and modularity. In some cases, weighted graphs are a requirement, and weights are not always obtained naturally in real networks. Complex networks are characterized by a small average path length and a high clustering coefficient; the way the metric is defined should be able to capture these crucial properties of complex networks. Therefore, we need to create the metric very carefully, so that it can explore the underlying community structure of real-life networks.
In this work, we develop the notion of a metric among the nodes using some new matrices derived from the modified adjacency matrix of the graph, which is flexible over the networks and can be tuned to enhance the structural properties of the network required for community detection. The main contributions of this work include:
• A detailed study of the community detection algorithms.
• Transforming a graph to a metric space, preserving its structural properties.
• Studying the complex properties of real-world networks on induced metric space.
• Developing community detection algorithms on induced metric space.
• Analyzing the results and complexities of the developed algorithms.
• Comparing the community detection algorithms with other existing methods.
The rest of this paper is organized as follows: Section 2 describes the state of the art of the network community detection literature. In Section 3, the problem of transforming a graph into a metric space is discussed, and the properties of a real complex network are studied. In Section 4, the problem of network community detection is formulated, and several possible solutions are presented in the induced metric space. Furthermore, the initialization procedures, termination criteria and convergence are discussed in detail. The results of the comparison between community detection algorithms are illustrated in Section 5. The computational aspects of the proposed framework are also discussed in this section.

Network Community Detection
Community detection in real networks aims to capture the structural organization of the network using the connectivity information as the input [6,8]. Early work on this domain was attempted by Weiss and Jacobson while searching for a work group within a government agency [5].
Most of the methods developed for network community detection are based on a two-step approach. The first step is specifying a quality measure (evaluation measure, objective function) that quantifies the desired properties of communities, and the second step is applying algorithmic techniques to assign the nodes of a graph into communities by optimizing the objective function.
Several measures for quantifying the quality of communities have been proposed; they mostly consider that communities are a set of nodes with many edges between them and few connections with nodes of different communities. Some of the community evaluation measures are described in the next subsection.

Community Evaluation
Several measures for quantifying the quality of communities have been proposed:
• Modularity: The notion of modularity is the most popular for network community detection purposes. The modularity index assigns high scores to communities whose internal edges are more numerous than expected in a random network model that preserves the degree distribution of the given network.
• Internal density: Density is defined as the number of edges ($m_s$) in subset S divided by the total number of possible edges between all of its nodes ($n_s(n_s - 1)/2$). The "2" is there to cancel out duplicated edges. Internal density = $m_s/(n_s(n_s - 1)/2)$.
• Edges inside: The total number of edges ($m_s$) present in subset S. This is somewhat useless by itself, since it is not related to any other attribute of subset S. Edges inside = $m_s$.
• Average degree: This is the average internal degree across all nodes ($n_s$) in subset S. Average degree = $2m_s/n_s$.
• The fraction over the median degree: This determines the number of nodes that have an internal degree greater than the median degree of nodes in subset S.
• Triangle Participation Ratio: The fraction of nodes in S that belong to a triad. It is the best measure for density, cohesiveness and clustering within the goodness scales, and it is robust under random and expand perturbations. TPR = (number of nodes in S belonging to a triad)/$n_s$.
• Expansion: This measure of separability gives the average number of external connections ($c_s$) per node ($n_s$) in subset S with graph G. It can be thought of as the external degree. Expansion = $c_s/n_s$.
• Cut ratio: This metric is a measure of separability and can be thought of as external density. It is the fraction of external edges ($c_s$) of subset S out of the total number of possible edges between S and the rest of graph G, where n is the number of nodes of G. Cut ratio = $c_s/(n_s(n - n_s))$.
• Conductance: This is the ratio of the number of edges leaving the cluster to the total number of edge endpoints that belong to the cluster (it captures the surface area to volume ratio). Conductance = $c_s/(2m_s + c_s)$. It measures best in separability (goodness scale), measuring well-separated non-overlapping communities. It is robust under node swap and shrink perturbation. Community-like sets of nodes have lower conductance.
• Normalized cut: This represents how well subset S is separated from graph G. It sums up the fraction of external edges over all edges in subset S (conductance) with the fraction of external edges over all non-community edges.
• Maximum out degree fraction: This metric first finds the fraction of external connections to internal connections for each node in S. It then returns the fraction with the highest value.
• Average out degree fraction: This is the sum, over the nodes of subset S, of the fraction of edges of each node that point outside of the community out of the total connections of that node. It is then divided by the total number of nodes ($n_s$) in subset S.
• Flake out degree fraction: This is the ratio of the number of nodes that have fewer internal connections than external connections to the number of nodes ($n_s$) in subset S.
There are several other measures of quality determination for a network community. However, the most widely-used measures are modularity and conductance. The majority of the algorithms are developed using either of the measures as their optimization criteria.
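As a small illustration of the scoring functions listed above, the following sketch computes a few of them for one node subset S of an undirected, unweighted simple graph. The function name `community_scores` and the edge-list input are assumptions made for this example; expansion, cut ratio and conductance use the commonly used forms $c_s/n_s$, $c_s/(n_s(n - n_s))$ and $c_s/(2m_s + c_s)$, and the subset is assumed to have at least two nodes, at least one incident edge, and to be a proper subset of the graph.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def community_scores(edges: Iterable[Edge], community: Set[Hashable], n_total: int) -> Dict[str, float]:
    """Score one node subset S of an undirected, unweighted simple graph."""
    n_s = len(community)
    m_s = 0  # edges with both endpoints inside S
    c_s = 0  # edges with exactly one endpoint inside S
    for u, v in edges:
        inside = (u in community) + (v in community)
        if inside == 2:
            m_s += 1
        elif inside == 1:
            c_s += 1
    return {
        "internal_density": m_s / (n_s * (n_s - 1) / 2),
        "average_degree": 2 * m_s / n_s,
        "expansion": c_s / n_s,
        "cut_ratio": c_s / (n_s * (n_total - n_s)),
        "conductance": c_s / (2 * m_s + c_s),
    }


# Two triangles joined by a single bridge edge (2, 3); S is the first triangle.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
scores = community_scores(edges, {0, 1, 2}, n_total=6)
```

For the first triangle, the subset is fully connected internally and has a single outgoing edge, so its internal density is maximal while its expansion, cut ratio and conductance are all small.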

Popular Algorithms
In this subsection, we give a brief list of the algorithms developed for network community detection purposes. The basic approach and the complexity of execution of each algorithm are also given briefly (Table 1).
• Fast greedy algorithm: This algorithm was developed by Newman et al. [22,23]. It is modularity based and uses a hierarchical agglomerative approach. It is called fast greedy, because it is significantly faster than older algorithms and uses a greedy method.
• Walktrap algorithm: This algorithm by Pons and Latapy [15] uses a hierarchical agglomerative method. Here, the distance between two nodes is defined in terms of a random walk process. The basic idea is that if two nodes are in the same community, the probability to get to a third node located in the same community through a random walk should not be very different. The distance is constructed by summing these differences over all nodes, with a correction for degree.
• Eigenvector algorithm: This algorithm by Newman [24] is modularity based, and it uses an optimization method inspired by graph partitioning techniques. It relies on the eigenvectors of a so-called modularity matrix, instead of the graph Laplacian traditionally used in graph partitioning.
• Label propagation algorithm: This algorithm by Raghavan et al. [13] uses the concept of node neighborhood and the diffusion of information in the network to identify communities. Initially, each node is labeled with a unique value. Then, an iterative process takes place, where each node takes the label that is the most spread in its neighborhood. This process goes on until one of several conditions is met, for instance no label change. The resulting communities are defined by the last label values.
• Spinglass algorithm: This algorithm by Reichardt and Bornholdt [25] is an optimization method relying on an analogy between the statistical mechanics of complex networks and physical spinglass models.
There are more algorithms developed to solve the network community detection problem; a complete list can be obtained in several survey articles [7,12,14]. Some interesting recent articles are [26][27][28][29][30][31][32].
A partial list of algorithms developed for network community detection purposes is tabulated in Table 1. The algorithms are categorized into three main groups: spectral (SP), graph traversal based (GT) and semi-definite programming based (SDP). The categories and complexities are also given in Table 1.
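To make the label propagation idea from the list above concrete, the following is a minimal sketch, not the reference implementation of [13]. Two simplifications are assumptions of this example: nodes are visited in a fixed sorted order, and ties are broken deterministically (keep the current label if it is among the most frequent ones, otherwise take the smallest), whereas the procedure is usually run with random order and random tie-breaking. The function name `label_propagation` and the adjacency-list input are likewise illustrative, and node identifiers are assumed to be sortable.

```python
from collections import Counter
from typing import Dict, Hashable, List


def label_propagation(adj: Dict[Hashable, List[Hashable]], max_iter: int = 100) -> Dict[Hashable, Hashable]:
    """Deterministic label propagation on an undirected graph given as adjacency lists."""
    labels = {node: node for node in adj}  # every node starts with a unique label
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):  # fixed visiting order (assumption; usually random)
            if not adj[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[nb] for nb in adj[node])
            best = max(counts.values())
            candidates = [lab for lab, cnt in counts.items() if cnt == best]
            # Tie-breaking (assumption): keep the current label when possible,
            # otherwise take the smallest candidate.
            new_label = labels[node] if labels[node] in candidates else min(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # no label changed during a full pass
    return labels


# Two disjoint 4-cliques should end up with one label each.
clique_a, clique_b = [0, 1, 2, 3], [4, 5, 6, 7]
adj = {u: [v for v in group if v != u] for group in (clique_a, clique_b) for u in group}
labels = label_propagation(adj)
```

The communities are read off by grouping the nodes that share the same final label.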

Observations and Motivations
Community detection is an extensively studied research problem of network science. However, a good algorithm for large real networks is still in demand. Two major criteria to be satisfied by a good algorithm are: (1) it must find a partition of the network that is optimal with respect to modularity or conductance; and (2) it should be computationally efficient on large networks. The notable pitfall of most of the existing algorithms based on spectral methods or semi-definite programming is that they rely on global optimization and need to compute costly functions under the evaluation criteria in each iteration, which increases the computational burden drastically and makes them inefficient for large networks. On the other hand, graph-based algorithms rely on local heuristic methods or exhaustive search. The algorithms based on exhaustive search are not suitable for large networks. The local methods are computationally efficient, but fail to achieve a modularity value close to the optimum for large networks.
A good alternative is to transform a network to a metric space, where we can achieve good optimality along with automatic convergence, thus leading to less computational burden for large networks. However, we need to create the metric very carefully, so that it can explore the underlying community structure of real-life networks.

Graph to Metric Space Transformation
In this section, we demonstrate the procedure to transform a graph into points of a metric space and develop the methods of community detection with the help of a metric defined for a pair of points. We have also studied and analyzed the community structure of the network therein.
As discussed in sub-section 2.3, the nodes of the graph do not lie on a metric space, e.g., edges do not reflect the Euclidean distance between the nodes. The standard Euclidean distance and spherical distance defined over the adjacency or Laplacian matrices fail to capture similarity information among the nodes of a complex network. On the other hand, the algorithms developed based on the shortest path or Jaccard similarity are computationally inefficient and have less success in terms of standard evaluation criteria (like conductance and modularity).
In this work, we have tried to develop the notion of similarity among the nodes using some new matrices derived from the adjacency matrix and the degree matrix of the graph. Let A be the adjacency matrix and D the degree matrix of the graph G = (V, E). The Laplacian is L = D − A. We define two diagonal matrices of the same size, $D(\lambda)$ and $D(\lambda_x)$, where λ is a parameter determined from the given graph and can be optimized from the optimization criteria of the problem under consideration. In $D(\lambda)$, a fixed, optimally-determined value is used in the diagonal entries of the matrix D, and in $D(\lambda_x)$, a variable value, also optimally determined, is used in the diagonal entries of the matrix D. The similarities are defined on the matrices $L_1 = D(\lambda) + A$ and $L_2 = D(\lambda_x) + A$: they are spherical similarities among the rows, determined by applying a concave function φ over a standard notion of similarity, like the Pearson coefficient ($\sigma_{PC}$), the Spearman coefficient ($\sigma_{SC}$) or the cosine similarity ($\sigma_{CS}$). The composite φ(σ(·)) must be chosen using the chord condition to obtain a metric.

Graph to Metric Space Algorithm
In this subsection, we demonstrate the algorithm to convert the nodes of the graph to the points of a metric space preserving the community structure of the graph. The algorithm depends on two sub-modules: (1) construction of $L_x$ ($L_1$ or $L_2$) and (2) obtaining a structure-preserving distance function. The algorithm works by picking a pair of nodes from $L_x$ and computing the distance defined in the second module.

L_x Construction
$L_1$ is defined as $L_1 = D(\lambda) + A$, where A is the adjacency matrix of the given network and $D(\lambda)$ is a diagonal matrix of the same size with diagonal values equal to a non-negative constant λ.
$L_2$ is defined as $L_2 = D(\lambda_x) + A$, where A is the adjacency matrix of the given network and $D(\lambda_x)$ is a diagonal matrix of the same size with diagonal values determined by a non-negative function $\lambda_x$ of the node x.
The choice of λ and $\lambda_x$ plays a crucial role in combination with the function chosen in the second module for the determination of a suitable metric and is discussed later in this subsection.
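The two constructions can be sketched with NumPy as follows. The helper names `build_l1` and `build_l2` are assumptions of this example, and the degree-based choice of the node-dependent diagonal is purely illustrative, since the text does not fix a particular function of the node at this point.

```python
import numpy as np


def build_l1(A: np.ndarray, lam: float) -> np.ndarray:
    """L1 = D(lambda) + A with the same constant lambda on every diagonal entry."""
    return lam * np.eye(A.shape[0]) + A


def build_l2(A: np.ndarray, lam_x: np.ndarray) -> np.ndarray:
    """L2 = D(lambda_x) + A with one non-negative diagonal value per node."""
    return np.diag(lam_x) + A


# Path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
L1 = build_l1(A, lam=2.0)
# Hypothetical node-dependent choice: lambda_x = degree of x (illustration only).
L2 = build_l2(A, lam_x=A.sum(axis=1))
```

Both matrices keep the off-diagonal structure of A unchanged; only the diagonal, which controls how strongly a node "sees itself" in its own row, differs.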

Function Selection
The function selection module determines the metric for a pair of nodes. The function selector φ converts a similarity function (Pearson coefficient ($\sigma_{PC}$), Spearman coefficient ($\sigma_{SC}$) or cosine similarity ($\sigma_{CS}$)) into a distance function. In general, the similarity function satisfies the positivity and symmetry conditions of a metric, but not the triangle inequality. φ is a metric-preserving ($\varphi(d(x_i, x_j)) = d_\varphi(x_i, x_j)$), concave and monotonically-increasing function. The three conditions above are referred to as the chord condition. The φ function is chosen to have minimum internal area with the chord.

Choice of λ and φ(σ(·))
The choices in the above sub-modules play a crucial role in the graph to metric transformation algorithm to be used for community detection. A complex network is characterized by a small average diameter and a high clustering coefficient. Several studies on network structure analysis reveal that there are hub nodes and local nodes characterizing the interesting structure of the complex network. Suppose we have taken φ = arccos, $\sigma_{CS}$ and a constant λ ≥ 0. λ = 0 penalizes the effect of the direct edge in the metric and is suitable to extract communities from a highly dense graph. λ = 1 places similar weights on the direct edge and the common neighbor, which reduces the effect of the direct edge in the metric, and is suitable to extract communities from a moderately dense graph. λ = 2 gives more importance to the direct edge than to the common neighbor (this is the common case of available real networks). λ > 2 penalizes the effect of the common neighbor in the metric and is suitable for extracting communities from a very sparse graph.
The choice of λ depends on the data complexity for community detection (DCC) value (sub-section 4.5) of the input graph, i.e., whether it is sparse or dense, and its cluster structure.
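The trade-off between the direct edge and the common neighbors can be made explicit for the constant-λ case. The following derivation is a sketch under the assumption of a simple undirected graph with a 0/1 adjacency matrix and no self-loops; the symbols $r_i$ (the i-th row of $L_1 = \lambda I + A$), $N(i)$ (the neighborhood of node i) and $d_i$ (its degree) are introduced here for illustration.

```latex
\begin{aligned}
\langle r_i, r_j \rangle &= |N(i) \cap N(j)| + 2\lambda A_{ij}, \qquad i \neq j,\\
\|r_i\|^2 &= d_i + \lambda^2,\\
\sigma_{CS}(r_i, r_j) &= \frac{|N(i) \cap N(j)| + 2\lambda A_{ij}}{\sqrt{(d_i + \lambda^2)(d_j + \lambda^2)}}.
\end{aligned}
```

With λ = 0, a direct edge contributes nothing to the numerator, and the similarity is driven by common neighbors alone. As λ grows, the term $2\lambda A_{ij}$ increasingly dominates the common-neighbor count, which matches the qualitative behavior described above.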
The algorithm for transforming a graph to the points of a metric space is given in Algorithm 1.
Theorem 1. M = (V, d) constructed in Algorithm 1 above is a metric space with respect to the metric d.
The proof of the theorem is straightforward: d satisfies the metric properties.
Algorithm 1 Mapping a graph into the metric space.
Here, $v_i, v_j \in V$, $a_k$ is the k-th row of A and φ is an affine function.
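A minimal sketch of the graph to metric space mapping is given below. It assumes the combination reported later in the parameter settings (φ = arccos applied to the cosine similarity of the rows, with a constant λ = 2); the function name `graph_to_metric` and the dense NumPy representation are assumptions of this example, not the listing of Algorithm 1 itself. The sketch also assumes λ > 0, so that no row has zero norm.

```python
import numpy as np


def graph_to_metric(A: np.ndarray, lam: float = 2.0) -> np.ndarray:
    """Pairwise distances d(v_i, v_j) = arccos(cosine similarity of rows i and j of L1)."""
    Lx = lam * np.eye(A.shape[0]) + A           # L1 = D(lambda) + A
    norms = np.linalg.norm(Lx, axis=1)          # positive whenever lam > 0
    cos = (Lx @ Lx.T) / np.outer(norms, norms)  # cosine similarity of every row pair
    cos = np.clip(cos, -1.0, 1.0)               # guard against rounding outside [-1, 1]
    dist = np.arccos(cos)
    np.fill_diagonal(dist, 0.0)                 # remove rounding noise on the diagonal
    return dist


# Two triangles joined by the bridge edge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
D = graph_to_metric(A, lam=2.0)
```

In this toy graph, two nodes of the same triangle end up closer to each other than the two endpoints of the bridge, and nodes in different triangles that share neither an edge nor a neighbor are at the maximal distance of π/2.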

Community Detection on Induced Metric Space
In this section, we explore the k partitioning algorithm for the purpose of network community detection by using the metric space constructed above for each graph. We have also studied and analyzed the advantages of the k partitioning method over the standard algorithm for network community detection.

k-Partitioning
The community detection methods based on k-partitioning of a graph are possible using the newly-defined node distance, because the nodes of the graph are converted into the points of a metric space. The k-partitioning of a graph using this distance converges automatically and does not compute the value of the objective function in each iteration; therefore, it reduces the computation compared to standard graph partitioning methods. The results of k-partitioning of a graph using a metric are competitive on the large set of networks shown in Section 5. The algorithm for community detection using k-partitioning and its detailed analysis are given below (Algorithm 2). Before that, we need to determine the value of k, which is discussed in the next sub-section.

k Selection
Determining the optimal value of k is an important problem for community detection researchers. An extensive analysis can be found in the work of Leskovec et al. [47]. The standard practice is to solve an optimization problem with respect to k and to select the k for which the optimal value of the objective function is achieved. Another method, based on farthest-first traversal, is also very useful in terms of computational efficiency [48]. For small networks, the global optimization works better, and for a very large network, the second choice gives a faster approximate solution.
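The farthest-first traversal mentioned above can be sketched in its usual greedy form on a precomputed distance matrix. The function name `farthest_first`, the choice of the first center and the toy data are assumptions of this example; the sketch shows only the traversal itself and does not cover how the traversal is combined with the selection of k.

```python
import numpy as np
from typing import List


def farthest_first(dist: np.ndarray, k: int, first: int = 0) -> List[int]:
    """Pick k center indices by farthest-first traversal on a distance matrix."""
    centers = [first]
    nearest = dist[first].copy()  # distance of every point to its closest chosen center
    while len(centers) < k:
        nxt = int(np.argmax(nearest))  # the point farthest from all current centers
        centers.append(nxt)
        nearest = np.minimum(nearest, dist[nxt])
    return centers


# Five points on a line at positions 0, 1, 2, 10, 11.
pos = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
dist = np.abs(pos[:, None] - pos[None, :])
centers = farthest_first(dist, k=2)
```

Each step costs one pass over the points, so selecting k centers takes time proportional to k times the number of points once the distances are available.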

Initialization for k-Partitioning
The choice of the set of initial nodes is also a very important problem for the k-partitioning algorithm. The problem is stated as follows:
• Input: graph G = (V, E), with the node similarity $sim(x_a, x_b)$ defined on it,
• Output: a partition of the nodes into k communities $C_1, C_2, \ldots, C_k$,
• Objective function: maximize the minimum intra-community similarity.
Algorithm 2 k-center partitioning algorithm.
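The following is a minimal sketch of a k-partitioning loop on an induced metric space, not a transcription of Algorithm 2. It makes several assumptions that are not stated in the text: the input is a precomputed distance matrix, the cost is the sum of distances from each node to the center of its cluster, each cluster is re-centered at one of its own members (a medoid) because a mean is not directly available for graph nodes, and the function name `k_partition` is hypothetical. The initial centers can come from any initializer, for instance a farthest-first traversal.

```python
import numpy as np
from typing import List, Tuple


def k_partition(dist: np.ndarray, centers: List[int], max_iter: int = 100) -> Tuple[np.ndarray, List[int], List[float]]:
    """Alternate assignment and re-centering on a precomputed distance matrix."""
    centers = list(centers)
    costs: List[float] = []
    for _ in range(max_iter):
        # Step 1: assign every node to its closest center.
        assign = np.argmin(dist[:, centers], axis=1)
        # Step 2: re-center each cluster at the member with the smallest total distance.
        new_centers = []
        for j, old in enumerate(centers):
            members = np.flatnonzero(assign == j)
            if members.size == 0:
                new_centers.append(old)  # empty cluster: keep the old center
                continue
            totals = dist[np.ix_(members, members)].sum(axis=0)
            best = int(members[np.argmin(totals)])
            # Keep the old center unless the new one is strictly better for this cluster.
            old_total = dist[members, old].sum()
            new_centers.append(best if totals.min() < old_total else old)
        # Cost after re-centering: distance of every node to the center of its cluster.
        costs.append(float(dist[np.arange(len(assign)), np.array(new_centers)[assign]].sum()))
        if new_centers == centers:
            break  # centers are stable, so the partition no longer changes
        centers = new_centers
    assign = np.argmin(dist[:, centers], axis=1)  # final assignment for the final centers
    return assign, centers, costs


# Five points on a line at positions 0, 1, 2, 10, 11, with initial centers 0 and 4.
pos = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
dist = np.abs(pos[:, None] - pos[None, :])
assign, centers, costs = k_partition(dist, centers=[0, 4])
```

The loop never evaluates modularity or conductance; it stops as soon as the centers no longer change, and the recorded costs do not increase from one iteration to the next.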

Convergence
The convergence of network community detection algorithms is one of the least studied areas of network science. However, the rate of convergence is an important issue, and a low rate of convergence is the major pitfall of most of the existing algorithms. Owing to the transformation into the metric space, our algorithm inherits the quick convergence of k-partitioning on a metric space, provided that a good set of initial points is supplied. Another crucial pitfall suffered by the majority of the existing algorithms is the validation of the objective function in each iteration during convergence. Our algorithm converges automatically to the optimal partition, thus reducing the cost of validation during convergence.
Theorem 2. During the course of the k center partitioning algorithm, the cost monotonically decreases.
Proof. Let $Z^t = \{z^t_1, \ldots, z^t_k\}$ and $T^t = \{C^t_1, \ldots, C^t_k\}$ denote the centers and clusters at the start of the t-th iteration of the k-partitioning algorithm. The first step of the iteration assigns each data point to its closest center; therefore, $cost(T^{t+1}, Z^t) \le cost(T^t, Z^t)$.
In the second step, each cluster is re-centered at its mean; therefore, $cost(T^{t+1}, Z^{t+1}) \le cost(T^{t+1}, Z^t)$. Combining the two inequalities gives $cost(T^{t+1}, Z^{t+1}) \le cost(T^t, Z^t)$, so the cost does not increase from one iteration to the next.

Theorem 3. If T is the solution returned by farthest-first traversal and $T_o$ is the optimal solution, then $cost(T_o) \le cost(T) \le 2\,cost(T_o)$.
Proof. The proof of the theorem can be found in [48].

Data Complexity
The key characteristics of a complex network are a "high clustering coefficient" and a "small average path length". The first property justifies the community structure of the network, whereas the second property justifies the small-world phenomenon of real networks. Given a network, that is, given a number of nodes and a number of edges, what are the bounds of the average distance and the clustering coefficient? The two properties of the optimal complex network (OCN) are (1) the minimum possible average distance and (2) the maximum possible clustering coefficient. There is usually a unique graph with the largest average clustering, which at the same time has the smallest possible average distance. In contrast, there are many graphs with the same minimum average distance, ignoring their average clustering. The objective of this work is to measure the community detectability of the complex network G(N, m, L, C), where N is the number of vertices, m is the number of edges, L is the average path length and C is the average clustering coefficient.
Average path length: $L_{N,m}$. We denote the smallest possible average distance of a graph with N vertices and m edges by $L_{N,m} = \frac{1}{m} \sum_{u,v \in E} d(u, v)$. Clustering coefficient: If $d_u$ (> 1) is the degree of a vertex u and $t_u$ is the number of edges among its neighbors, its clustering coefficient is $C(u) = t_u / \binom{d_u}{2}$.
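The local clustering coefficient defined above can be computed directly from adjacency sets, as in the following sketch. The function name `clustering_coefficient` and the dictionary-of-sets representation are assumptions of this example; the graph is assumed to be undirected and simple.

```python
from itertools import combinations
from typing import Dict, Hashable, Set


def clustering_coefficient(adj: Dict[Hashable, Set[Hashable]], u: Hashable) -> float:
    """C(u) = t_u / binom(d_u, 2) for a vertex u of an undirected simple graph."""
    neighbours = adj[u]
    d_u = len(neighbours)
    if d_u < 2:
        raise ValueError("C(u) is only defined for vertices of degree greater than one")
    # t_u: number of edges among the neighbours of u
    t_u = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return t_u / (d_u * (d_u - 1) / 2)


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
c0 = clustering_coefficient(adj, 0)
c1 = clustering_coefficient(adj, 1)
```

Vertex 0 has three neighbors with a single edge among them, so only one of its three possible neighbor pairs is connected, whereas both neighbors of vertex 1 are adjacent to each other.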
In some graphs, community detection is easy, and most of the algorithms work very well (e.g., disjoint cliques). On the other hand, in some graphs, community detection is very difficult, and some algorithms rarely work well (e.g., circular graph).
Data complexity of community detection: Informally, given a graph G(N, m) with N vertices and m edges, the data complexity for community detection of that graph is the extent to which we can reveal its community structure. The data complexity for community detection (DCC) is denoted by α(G(N, m, L, C)); α(G(N, m, L, C)) is near one for a graph in which it is easy to detect communities and near zero for a graph with no community structure. DCC is calculated as the ratio of the number of edges common to G*(N, m, L, C) and G(N, m, L, C) to m, the number of edges of G or G*, where G*(N, m, L, C) is a graph with the same average path length, constructed by adding the minimum number of edges to an empty graph of N nodes, followed by the addition of more edges, chosen to maximize the clustering coefficient, until the total number m is reached.
A higher value of DCC for a particular network signifies that we can extract a good community structure of the network; however, a lower value of DCC signifies that none of the algorithms are very useful to capture the community structure of the network. Another advantage of DCC is that it can assess the quality of an algorithm. When DCC is high and the value of the evaluation measure is low, it simply signifies that there is enough room to improve the algorithm.

Experiments and Results
We performed many experiments to test the proposed network community detection method via induced metric space over several real networks given in Table 2. The objective of the experiments is to verify the behavior of the algorithm and the time required to run it. One of the major goals of the experiments is to see the behavior of the algorithm with respect to changes in the values of the crucial limits of the data and the parameters of the algorithm.
Experiments are also conducted to compare the results (Tables 3, 4 and 5) of our algorithm with the state-of-the-art algorithms (Table 1) available in the literature in terms of the common measures mostly used by researchers in the domain of network community detection. The details of several experiments and the analysis of the results are given in the following subsections.

Experimental Designs
Experiment for comparison: In this experiment, we compared several algorithms for network community detection with our proposed algorithm based on metric space. The experiment is performed on a large list of network datasets. Two versions of the experiment are developed for comparison purposes based on two different quality measures: conductance and modularity. The results are shown in Tables 3 and 4, respectively.
Experiment on the performance and time: In this experiment, we evaluated the performance of our algorithm on the network collection (Table 2). We evaluated the time taken by our algorithm on networks of different sizes, and this is shown in Table 5.

Performance Indicator
Modularity: The notion of modularity is the most popular for network community detection purposes. The modularity index assigns high scores to communities whose internal edges are more than expected in a random network model, which preserves the degree distribution of the given network.
Conductance: Conductance is widely used in the graph partitioning literature. The conductance of a set S with complement $S^C$ is the ratio of the number of edges connecting nodes in S to nodes in $S^C$ to the total number of edges incident to S or to $S^C$ (whichever number is smaller).
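The text does not spell out the modularity formula. The sketch below assumes the standard Newman-Girvan form for an undirected, unweighted graph, $Q = \sum_c \left[ e_c/m - (\deg_c/2m)^2 \right]$, where $e_c$ is the number of edges inside community c, $\deg_c$ is the sum of the degrees of its nodes and m is the total number of edges. The function name `modularity` and the edge-list and membership inputs are assumptions of this example.

```python
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]], membership: Dict[Hashable, int]) -> float:
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    edges = list(edges)
    m = len(edges)
    internal: Dict[int, int] = {}  # edges inside each community
    degree: Dict[int, int] = {}    # sum of node degrees in each community
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (e_c / m) - (deg_c / 2m)^2
    return sum(internal.get(c, 0) / m - (deg / (2 * m)) ** 2 for c, deg in degree.items())


# Two triangles joined by the bridge edge (2, 3), split into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q = modularity(edges, membership)
```

Placing all six nodes in a single community gives a modularity of zero under this formula, so the positive value obtained for the two-triangle split reflects the community structure of the toy graph.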

Datasets
A list of real networks taken from several real-life interactions is considered for our experiments, and they are in Table 2 below. We have also listed the number of nodes, the number of edges, the average diameter, the data complexity for community detection (DCC) and the k value used (sub-section 4.2). The values of the last column can be used to assess the quality of detected communities, as discussed in the sub-section 4.5.

Computational Results
In this subsection, we compare two groups of algorithms for network community detection with our proposed algorithm based on metric space. The experiment is performed on a large list of network datasets. Two versions of the experiment are developed for comparison purposes based on two different quality measures: conductance and modularity. The results based on conductance are shown in Table 3, and the results based on modularity are shown in Table 4. Regarding the two groups of algorithms, the first group contains algorithms based on semi-definite programming, and the second group contains algorithms based on graph traversal approaches. For each group, we have taken the best value of conductance in Table 3 and the best value of modularity in Table 4 among all of the algorithms in the group. The results obtained with our approach are very competitive with most of the well-known algorithms in the literature, and this is justified over the large collection of datasets. Moreover, the time taken by our algorithm (Table 5) is considerably lower than that of the other methods, which agrees with the theoretical findings described in Sections 3 and 4.

Parameter Settings
The values of several parameters are crucial in our algorithm. Here, we discuss the different settings of k, λ, DCC and the affine function. For each dataset described in Table 2, the k value is obtained by optimizing the conductance value, as described in Subsection 4.2, and the values are provided in Table 2. For small datasets (not considered for our experiments), the results are very sensitive to k, whereas for large networks (all of the above list), the results are less sensitive to k. The value λ = 2 is used in all of the computations above; however, the results can be improved further by optimizing λ. The DCC value provides us with prior information about the community structure; it can be observed that we obtained a good community structure where the DCC value is high. In all of the experiments described above, φ(σ(·)) is constructed with the arccos function and cosine similarity.

Results Analysis and Achievements
In this subsection, we describe the analysis of the results obtained in our experiments shown above and also highlight the achievements from the results. It is evident from the results shown in Tables 3, 4 and 5 that the proposed metric-based method for network community detection provides very competitive performance with respect to conductance, modularity and time. However, a good community detection algorithm must provide results close to the unknown optimal community structure. To assess the optimality, we have considered the best results of each class of algorithms and treated them as one of the best known estimates of the optimal community structure of the network. It is also evident from the results that our method provides results very close to the considered estimates of optimal communities.

Conclusions
Network community detection has become an important research problem in recent years. In this article, we have demonstrated and analyzed a new approach to network community detection via a metric space induced by the graph. The main achievement of the work is to make the rich literature of clustering in metric space available for this problem. Clustering is "easy" NP-hard in metric space, whereas network community detection is NP-hard. The results obtained with our approach were very competitive with most of the well-known algorithms in the literature, and this was justified over a large collection of datasets. Our algorithm converges automatically to the optimal clustering. Unlike popular approaches, it does not require verifying the objective function value to guide the next iteration, thus saving computation time.