Large Scale Community Detection Using a Small World Model

Abstract: In a social network, small or large communities within the network play a major role in deciding the functionalities of the network. Despite diverse definitions, communities in a network may be defined as groups of nodes that are more densely connected to each other than to nodes outside the group. Revealing such hidden communities is one of the challenging research problems. A real-world social network follows the small world phenomenon, which indicates that any two social entities are reachable from each other in a small number of steps. In this paper, nodes are mapped into communities based on a random walk in the network. However, uncovering communities in large-scale networks is a challenging task due to the unprecedented growth in the size of social networks. A good number of community detection algorithms based on random walk exist in the literature, but when large-scale social networks are considered, these algorithms are observed to take considerably longer. In this work, with the objective of improving efficiency, a parallel programming framework, Map-Reduce, has been considered for uncovering the hidden communities in a social network. The proposed approach has been compared with some standard existing community detection algorithms on both synthetic and real-world datasets in order to examine its performance, and it is observed that the proposed algorithm is more efficient than the existing ones.


Introduction
In the real world, various categories of networks play different roles in society, viz. social networks, which represent social interactions among human beings; citation networks, which represent the articles published by various authors in a particular field and their associated citations in other papers; technological networks, which represent the distribution of resources; biological networks, which represent protein-protein interactions; etc. Social networks are considered an interesting research domain because they involve human social activities. Network evolution [1], network modeling [2], centrality analysis [3], information diffusion [4], link prediction [5], and community detection [6] are some of the interesting research directions in social networks. Power-law degree distributions [7], small world networks [8], and community structures are some of the important properties observed in social networks.
Communities are found to be one of the most important features of large-scale social networks. Uncovering such hidden features enables analysts to explore the functionalities of the social network. There exist quite a good number of definitions of community, depending on the contexts pertaining to different applications. However, as per the few commonly accepted definitions, a community is considered to be a group of nodes that are densely connected among themselves and sparsely connected to nodes outside the group. Communities in a social network represent groups of people who share common ideas and knowledge. Hidden communities can be explored through learning from social dynamics in the network [9]. Their identification helps in gaining insight into the social and functional behavior of the social network. However, due to the unprecedented growth in the size of social networks, it is quite hard to discover subgroups in the network within a specified time limit. Real-world social networks are observed to follow a power law in both degree distribution and community size distribution [7].
A distributed framework like Hadoop may be considered a better alternative for processing a large volume of data in complex and heterogeneous social networks. Hadoop internally uses the Map-Reduce model to distribute computation over multiple nodes in a cluster. It uses a dedicated file system, known as the Hadoop Distributed File System (HDFS), for storing data across multiple nodes in the cluster. A network is said to have small world properties if the geodesic distance between any two nodes is small. In a small world network, for a fixed average degree, the average path length between pairs of nodes increases logarithmically with the number of nodes or, in other words, the number of nodes reachable from a given node grows exponentially with the walk length [10]. These inherent properties of real-world networks make graph mining difficult. There exists a plethora of community detection algorithms in the literature, most of which emphasize maximizing a quality parameter in order to detect communities in a large-scale network. Sometimes, they are ineffective at exploring communities in a reasonable amount of time, owing to the resolution limit of modularity [11]. A community detection algorithm is said to be fast and efficient only if it follows the small world network phenomenon. The small world network concept is based on the six degrees of separation principle [12].
In this study, the Map-Reduce approach has been used to uncover the hidden communities in a large-scale network. Map-Reduce algorithms always follow two crucial phases: the mapper and the reducer. In this work, the mapper phase has been used for mapping the nodes to their corresponding communities. An effort has been made to discover the communities using a small world model. In the reducer phase, nodes are clustered based on their walk length and similarity index (η) with the source node. A random walk based similarity index is introduced to measure the strength of social ties.
The subsequent sections of this paper are organized as follows: in Section 3, some preliminaries about community structure, small world networks, power-law degree distribution and clustering coefficient are discussed. Section 4 presents the random walk process in the network. In this section, a new similarity index is devised based on a random walk in the network. The proposed methodology is presented in Section 5, where the community detection and clustering phases are discussed. Section 6 presents the implementation part of the work. Experimental results are discussed in Section 7. Section 8 presents the possible threats to the validity of the work. Conclusions and future work are discussed in Section 9.

Related Work
Community detection is similar to a graph partitioning problem. Most graph partitioning methods are based on optimizing a quality function. Girvan and Newman proposed one of the first community detection algorithms, which is based on hierarchical partitioning of a graph. In their work, modularity has been chosen as the objective function for assessing the quality of the obtained partition [13]. In this algorithm, edges are removed iteratively in the order of their edge-betweenness value until the maximum modularity is reached. The edge-betweenness value of an edge can be expressed as the number of shortest paths between pairs of nodes that pass through the edge. The modularity used in that paper does not consider information about unrelated pairs inside the network. A new modularity, known as MIN-MAX modularity, has been devised by R. Behera and M. Jena to optimize the community partitions [14]. It not only considers dense connections within the group, but also penalizes unrelated pairs within the group.
A random graph is a kind of graph where edges are distributed randomly among the nodes, but, unfortunately, it does not resemble a real-world network. A random graph generation model was proposed by Watts and Strogatz, which helps in generating random graphs with small world properties such as short average path length and high clustering coefficient [15]. It has been observed that a small world network lies in between regular and random graphs. Communities are frequently observed in small world networks.
An ego network is a group of nodes consisting of a central actor and the other nodes that are directly connected to it. Ego networks resemble small world networks in their properties. The central node in an ego network may have the highest influence as compared to other nodes in the network. Exploring focal nodes may help in modeling influence propagation in the network [16]. A social network allows users to form groups based on a common interest or a common event in their social life. However, automatic group construction is a difficult task when friends are added or removed dynamically. McAuley and Leskovec have proposed an efficient model for detecting ego circles in large-scale networks that captures both structural and user profile information [17]. In that paper, the authors developed a model in which a user can belong to different ego circles. This model allows the detection of overlapping communities as well as hierarchically nested circles in a large-scale social network. Social circles in ego networks have a great impact on the evolution of the network.
Three fundamental network models come into the picture while discussing structural parameters of a network. The first one is the random network, where nodes and edges are distributed randomly over the network. The degree distribution of a random network follows a binomial or Poisson distribution [18]. It is similar to a homogeneous network, where most of the vertices have the same degree. The small world model is another kind of network model, which lies in between the random and lattice network models. It exhibits a high clustering coefficient like a lattice network and a small average path length like a random network. The degree distribution of a small world network follows the binomial distribution. A real-world network follows a power-law degree distribution, which resembles the scale-free network model. A small world network model may have a scale-free distribution like a real-world social network. Chopade and Zhan have discussed the structural and functional characteristics relevant to the community detection process in complex social networks [19]. Community detection based on the structural parameters of the network topology has attracted more research interest than community detection based on the functional parameters of the network.
Several community detection techniques have been developed, and each has its own strengths and weaknesses [6,13,20,21]. An efficient community detection method that considers both local and global information about the topological structure has been explained by De Meo et al. [22]. Global information about the network topology helps to yield good community results; however, it is not suitable for large-scale complex networks. A large-scale network needs to be preprocessed through dimension scaling in order to map the global information to local information [23]. Local information about network topology may lead to faster community detection, but the results are less accurate. In this paper, the community detection process is based on optimizing the modularity value using global information about the structure, and yet achieves scalability comparable to that of local methods. In this work, communities have been detected in two phases. In the first phase, the walk length of each node from a source node is detected using an information propagation model, which is based on the random walk in the network. In the second phase, the Euclidean distance between nodes is used in the clustering process to partition the network.
Steve Gregory proposed a label propagation algorithm for community detection with linear time complexity [24]. The main idea behind the algorithm is that a node is more likely to be part of the community to which most of its neighboring nodes belong. The label of a node is propagated through its neighboring nodes in multiple iterations until each label is confined to a group of nodes. It is regarded as one of the fastest available community detection methods and has been claimed to have linear time complexity. A community detection algorithm spends most of its time measuring the similarity values between pairs of nodes, especially in the case of unweighted graphs.
Community detection using random walk has been discussed by Pons et al. [25]. The algorithm discussed in that paper is well known as the Walktrap algorithm. The intuition behind the Walktrap algorithm is that a walker is more likely to get trapped inside a dense region if it moves randomly inside the network. In that paper, the authors discover clusters by observing the movement of the walker inside the network. The time complexity of the Walktrap algorithm is $O(mn^2)$ in the worst case, where m is the number of edges and n is the number of nodes in the network. In the present work, similarity between nodes has likewise been calculated based on the random walk in the network.
Spin-Glass is a community detection algorithm based on the statistical mechanics of spins in the network [26]. It tends to overestimate the expected number of communities, and its approximation of the communities becomes worse as the complexity and size of the network increase. It works well for a small world model but is computationally expensive. Similarity between objects is determined by the spins associated with the objects in the graph configuration. Similarity between nodes is higher if their spins are in the same state. Communities are detected based on the Potts spin-glass model.
In the literature, most community detection algorithms deal with undirected networks. However, real-world complex networks often resemble directed graphs. Agreste et al. have made an extensive comparison of community detection algorithms for directed networks [27]. Infomap and label propagation are among the first algorithms to be implemented on directed networks. However, for the sake of simplicity, we have implemented these two algorithms on undirected networks. Peng and Lill have proposed a framework for mapping community detection algorithms from undirected to directed networks. They have applied a modularity optimization technique to obtain an optimal partitioning of the network [28].
Xiaolong et al. have proposed an optimized community detection algorithm based on the vector influence clustering coefficient and the directed information transfer gain of vertices in the network. They have implemented their algorithm on directed networks. In their work, they have also proposed an efficient optimization parameter (target optimal function) to evaluate the community partition of the network [29].
Rosvall and Bergstrom have developed an elegant community detection algorithm for discovering modules in a large-scale network, which is known as Infomap [30]. It is based on optimizing a map function. This algorithm is similar to the Louvain algorithm: initially, each node in the network is assigned to its own module, and, in each iteration, nodes are migrated into the nearest module in order to minimize the map function. The proposed algorithm is quite similar, in that nodes are migrated into modules based on the walk length at which they are discovered. Nodes belonging to the same module may be forced to migrate into another module to optimize the map function when the network structure is rebuilt. As a result, the module assigned to a node at one point may differ from the module assigned at a later point in time. The Infomap algorithm is well suited for a small network. Its accuracy is found to be the best as compared to other standard community detection algorithms [31].
Since real-world networks follow a power-law degree distribution and are quite complex in nature, traditional algorithms are practically inefficient unless they are implemented on some parallel architecture. The proposed work is similar to the random walk community detection algorithm proposed by Pons et al. [25]. However, we have used the concept of a small world model to evaluate the similarity between vertices in the network. Unlike the Walktrap algorithm, which processes the whole network to quantify the similarity between vertices, the proposed similarity measure is calculated recursively in fewer steps, which improves the performance of community detection. The proposed algorithm is further improved by implementing it on a Hadoop distributed platform. The proposed algorithm may surpass the previously discussed community detection algorithms in terms of accuracy, as most of them are based on either a regular graph model or a random graph model [15]. Both of these models have little resemblance to real-world networks, whereas a small world model closely resembles them.
The proposed algorithm behaves in a more consistent manner than the Infomap and Spin-Glass community detection algorithms. The Infomap algorithm is based on the information flow in the network, which is calculated through random walk probabilities. Huffman coding is used to generate a two-level encoding schema for the network. The community partition is identified by optimizing the mapping function, which tries to compress the encoded schema by simulated annealing. This approach seems to be unrealistic and inconsistent for a large-scale network. The optimizing criteria in the Spin-Glass algorithm are similar to those of the Infomap algorithm. The proposed algorithm outperforms these two algorithms due to its efficient optimizing criteria for community evaluation.

Structural Definition of Community
A social network may be represented in the form of a graph G = (V, E), where V is the set of nodes and E is the set of edges in G. A group of nodes $c \subseteq V$ is said to form a community if it satisfies the following condition:

$$\frac{e_{in}}{n_c(n_c-1)/2} \;>\; \frac{|E|}{n(n-1)/2} \;>\; \frac{e_{out}}{n_c(n-n_c)} \qquad (2)$$

where $e_{in}$ and $e_{out}$ are the number of edges existing inside the community and the number of edges from a node in the community to a node outside the community, respectively, $n_c$ is the number of nodes within the community c, and $n = |V|$ is the number of nodes in the graph. The first and third parts of Equation (2) correspond to the fraction of possible links that are present within the community and between the community and the rest of the graph, respectively. The middle part corresponds to the density of the graph. For a graph with community structure, the fraction of intra-community links is expected to be larger than the graph density, and the graph density is expected to be larger than the fraction of inter-community links.
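As an illustration, the community condition can be checked directly on a small graph. The sketch below is not the authors' code; it assumes the densities are the intra-community link fraction, the graph density, and the inter-community link fraction as described above, that the group has at least two nodes and does not cover the whole graph, and it uses a hypothetical toy graph of two triangles joined by one bridge edge.

```python
def satisfies_community_condition(edges, nodes, community):
    """Return (holds, (intra, density, inter)) for one candidate node group."""
    community = set(community)
    n, n_c = len(nodes), len(community)
    # Edges with both endpoints inside the group.
    e_in = sum(1 for u, v in edges if u in community and v in community)
    # Edges with exactly one endpoint inside the group.
    e_out = sum(1 for u, v in edges if (u in community) != (v in community))
    intra = e_in / (n_c * (n_c - 1) / 2)       # fraction of possible internal links
    density = len(edges) / (n * (n - 1) / 2)   # density of the whole graph
    inter = e_out / (n_c * (n - n_c))          # fraction of possible outgoing links
    return intra > density > inter, (intra, density, inter)


# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
ok, (intra, density, inter) = satisfies_community_condition(edges, nodes, {0, 1, 2})
print(ok, intra, density, inter)
```

For the group {0, 1, 2}, all three possible internal links exist, while only one of the nine possible links to the other triangle exists, so the condition holds.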

Small World Phenomenon
The small world phenomenon is one of the inherent principles behind the analysis of today's large-scale social networks; it indicates that any two people in the network can be linked by a small number of acquaintances [10]. It may be observed that there always exists a path of short length that can be discovered using local information. A small world network is often associated with a high clustering coefficient, and its characteristic path length decreases more rapidly than the clustering coefficient as the randomness increases.

Definition 1. A graph is said to be a small world network if its average path length $L_{avg}$ is comparable to (at most slightly larger than) the average path length $L_{rand}$ of a comparable random graph, while its average clustering coefficient $C_{avg}$ is much greater than the clustering coefficient $C_{rand}$ of that random graph:

$$L_{avg} \gtrsim L_{rand} \quad \text{and} \quad C_{avg} \gg C_{rand}.$$

This phenomenon is based on the six degrees of separation principle.
Definition 2. Six degrees of separation is the principle which indicates that any two people in the world are connected by a chain of no more than six acquaintances.
A small world network exhibits the following important characteristic: exhaustive search using local information.
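The two quantities compared in Definition 1 can be computed with a breadth-first search and a neighbour count. The following is a minimal sketch rather than the authors' procedure. It assumes a connected, undirected graph stored as an adjacency dictionary, and it uses the common approximations $L_{rand} \approx \ln n / \ln k$ and $C_{rand} \approx k/n$ for a random graph with n nodes and average degree k, which are an assumption not stated in this paper. The ring lattice helper is hypothetical test input.

```python
import math
from collections import deque


def average_path_length(adj):
    """Mean BFS distance over all ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def average_clustering(adj):
    """Mean over nodes of (links among neighbours) / (possible such links)."""
    coeffs = []
    for nbrs in adj.values():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a in range(k) for b in range(a + 1, k)
                    if nbrs[b] in adj[nbrs[a]])
        coeffs.append(links / (k * (k - 1) / 2))
    return sum(coeffs) / len(coeffs)


def ring_lattice(n, k):
    """Regular ring where each node links to its k/2 nearest nodes on each side."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            adj[i].add((i + step) % n)
            adj[(i + step) % n].add(i)
    return adj


n, k = 10, 4
adj = ring_lattice(n, k)
L_avg, C_avg = average_path_length(adj), average_clustering(adj)
L_rand, C_rand = math.log(n) / math.log(k), k / n   # assumed random-graph estimates
print(L_avg, L_rand, C_avg, C_rand)
```

A regular lattice of this size is only a test input; on a real network the same two functions would be applied to the observed graph, and the comparison with $L_{rand}$ and $C_{rand}$ becomes meaningful as n grows.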

Power Law Degree Distribution
The degree of a user in a social network is the number of relationships that the user maintains. The distribution of relationships among users is known as the degree distribution. A network is said to be scale-free if it follows a power-law degree distribution [12].
Definition 3. A network is said to have a power-law degree distribution if the fraction of nodes having degree d depends on a power of d, up to a constant factor. Social networks often follow a power-law degree distribution, as indicated by the following equation:

$$p_d = k\, d^{-\gamma}$$

where $p_d$ is the fraction of nodes having degree d, and k and γ are the power-law intercept and power-law exponent, respectively. Usually, γ ranges from 2 to 3.
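To make the roles of the intercept k and the exponent γ concrete, the sketch below computes the empirical degree distribution $p_d$ and fits a straight line to $(\log d, \log p_d)$. This is an illustrative estimator only: least squares on a log-log plot is a rough method, and the function names and the degree sequence are hypothetical.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Return {d: p_d}, the fraction of nodes having each positive degree d."""
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}


def fit_power_law(p):
    """Fit log p_d = log k - gamma * log d by ordinary least squares."""
    xs = [math.log(d) for d in p]
    ys = [math.log(v) for v in p.values()]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - slope * mean_x), -slope   # (k, gamma)


# Hypothetical degree sequence whose counts fall off exactly as d^-2.
degrees = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
p = degree_distribution(degrees)
k_hat, gamma_hat = fit_power_law(p)
print(k_hat, gamma_hat)
```

Because the toy counts are constructed to follow $d^{-2}$ exactly, the fit recovers an exponent of 2 and an intercept equal to $p_1$; real degree sequences are noisy in the tail and the estimate should be read with caution.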

Clustering Coefficient
The clustering coefficient is a measure used to characterize a network as a small world network. It quantifies the transitivity of friendship in a network [32]. The measure has two versions: one local and one global. The local clustering coefficient is associated with each node in the network, whereas the global clustering coefficient represents the clustering density of the whole network.

Global Clustering Coefficient (GCC)
A small world network is often observed to have a high global clustering coefficient. The GCC of a network is defined as the ratio of the number of triangles to the number of connected triplets in the network. A triplet is a set of three connected nodes in the network. Each triangle in a network contributes three triplets. The GCC of a network may be framed as below:

$$GCC = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$$
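The definition above can be evaluated by counting, for every node, the pairs of its neighbours (connected triplets centred on that node) and the pairs that are themselves linked (closed triplets). The sketch below is illustrative, with hypothetical helper names and a toy graph of two triangles joined by a bridge edge.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def global_clustering_coefficient(adj):
    """Closed triplets / connected triplets, i.e. 3 * triangles / triplets."""
    # Every pair of neighbours of a node forms one triplet centred on it.
    triplets = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    # A triplet is closed when its two end nodes are also linked; each
    # triangle is counted three times here, once per centre node.
    closed = sum(1 for nbrs in adj.values()
                 for u, v in combinations(nbrs, 2) if v in adj[u])
    return closed / triplets if triplets else 0.0


# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
gcc = global_clustering_coefficient(build_adjacency(edges))
print(gcc)
```

The toy graph has two triangles (six closed triplets) and ten connected triplets in total, so the ratio is 0.6.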

Local Clustering Coefficient (LCC)
The LCC signifies the tendency of a node to form a cluster. The higher the clustering coefficient of a node, the greater its chance of being involved in a cluster. It is defined as the ratio of the number of edges existing between the neighboring nodes to the total number of edges possible between them. The LCC of a node i can be represented mathematically as follows:

$$LCC(i) = \frac{\left|\{(u, v) \in E : u, v \in Neighbor(i)\}\right|}{n_i(n_i - 1)/2}$$

Here, (u, v) is an edge between neighboring nodes, E is the set of edges in the network, Neighbor(i) is the set of neighboring nodes of i and $n_i$ is the number of elements in that set. The average of the local clustering coefficients over all nodes may be considered as an alternative measure of global clustering; it need not coincide with the triplet-based GCC.
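A short sketch of the local clustering coefficient and its average follows. It is illustrative only; the helper names are hypothetical, nodes with fewer than two neighbours are assigned a coefficient of 0 by assumption, and the toy graph is two triangles joined by a bridge edge.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering_coefficient(adj, i):
    """Edges among the neighbours of i divided by the number possible."""
    nbrs = list(adj[i])
    n_i = len(nbrs)
    if n_i < 2:
        return 0.0   # assumption: undefined cases are reported as 0
    links = sum(1 for a in range(n_i) for b in range(a + 1, n_i)
                if nbrs[b] in adj[nbrs[a]])
    return links / (n_i * (n_i - 1) / 2)


# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = build_adjacency(edges)
lcc = {i: local_clustering_coefficient(adj, i) for i in sorted(adj)}
average_lcc = sum(lcc.values()) / len(lcc)
print(lcc, average_lcc)
```

Nodes 0, 1, 4 and 5 have both of their neighbours linked, while the bridge endpoints 2 and 3 have only one of three possible neighbour pairs linked, which lowers their coefficient.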

Random Walk
Community detection in a large complex network can be carried out by capturing the topological structure using a random walk in the network. The intuition behind the random walk is that the walker tends to be trapped inside a denser region (community) for a longer period of time. This idea can be used for the inclusion of nodes in a community. In this paper, an efficient similarity metric based on the random walk has been proposed to include a node in a community. This metric may provide the following features:

• The structure of the network is well captured in the process of community detection;
• It can be used in agglomerative hierarchical clustering;
• Computation to find communities may be more efficient.
A social network can be represented in the form of a graph, where nodes represent social entities and edges represent relationships between the entities. The graph can be stored in the form of an adjacency matrix A, where $A_{ij} = 1$ if there is an edge between nodes i and j, and 0 otherwise.
A random walk in a graph is the process of visiting a randomly chosen neighboring node from the source node and continuing this process throughout the graph. The random walk process is well explained on the basis of a Markov chain, in which each state corresponds to a vertex in the visited path [33]. In this paper, the transition matrix obtained by a random walk in the graph has been considered for the detection of communities. The transition matrix describes the probability of visiting each node from every other node in k steps, i.e., $T^k_{i,j}$ corresponds to the probability of visiting node j from i in k steps. $T^1$, $T^2$, $T^3$ and $T^k$ are the transition matrices for random walks of length 1, 2, 3 and k, respectively. The transition probability from vertex i to vertex j in a random walk of length one is defined by the following equation:

$$T_{i,j} = \frac{A_{ij}}{d_i}$$

where $A_{ij}$ is the entry of the adjacency matrix of the network and $d_i$ is the degree of vertex i.
Property 1. The probability for a random walker to visit a node j from node i, as the walk length tends to infinity, depends only on the degree of j, rather than on the degree of the source or of the intermediate nodes. It may be represented as:

$$\lim_{k \to \infty} T^k_{i,j} = \frac{d_j}{2|E|}$$

where $d_j$ is the degree of j and $|E|$ is the total number of edges in the network.
Property 2. The ratio of the probabilities for a random walker to go from i to j and from j to i through a fixed walk length depends only on the degrees of i and j. It may be represented as follows:

$$\frac{T^k_{i,j}}{T^k_{j,i}} = \frac{d_j}{d_i}$$
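The one-step transition probabilities and the two properties can be verified numerically on a small graph. The sketch below is not the authors' implementation: it uses plain nested lists and repeated multiplication, a hypothetical toy graph (two triangles joined by a bridge), and a walk length of 200 as a stand-in for "tends to infinity". Property 1 is stated here in the form $d_j / (2|E|)$ and assumes a connected, non-bipartite graph so that the limit exists.

```python
def transition_matrix(A):
    """T[i][j] = A[i][j] / d_i, the one-step random-walk probability."""
    return [[a / sum(row) for a in row] for row in A]


def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][l] * Y[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def walk_matrix(T, k):
    """T^k: probability of going from i to j in exactly k steps."""
    result = T
    for _ in range(k - 1):
        result = mat_mul(result, T)
    return result


# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1
d = [sum(row) for row in A]

T = transition_matrix(A)
T3 = walk_matrix(T, 3)
T200 = walk_matrix(T, 200)

# Property 2 in product form: d_i * T^k[i][j] == d_j * T^k[j][i].
property_2_gap = max(abs(d[i] * T3[i][j] - d[j] * T3[j][i])
                     for i in range(n) for j in range(n))
# Property 1: rows of T^k approach d_j / (2|E|) for long walks.
property_1_gap = max(abs(T200[i][j] - d[j] / (2 * len(edges)))
                     for i in range(n) for j in range(n))
print(property_2_gap, property_1_gap)
```

Both gaps come out at the level of floating-point rounding, which is consistent with the two properties for this graph.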

Vertex Similarity Based on Random Walk
Vertices belonging to the same community tend to behave similarly as compared to vertices outside the community: any two nodes inside a community see the other nodes in the network in the same way. Consider, as an example, a random walk of length k in a graph G(V, E) that represents a social network. The probabilities of visiting every node from every other node through a random walk of length k are represented by the transition matrix $T^k$. Row i of the transition matrix holds the probabilities of visiting all other nodes from node i in a walk of length k. These probabilities are based on structural information in the network. From the structure of the network, the following inferences may be drawn:

• If two nodes i and j are in the same community, the probability of visiting node j from i is higher than that of visiting a node outside the community. The converse may not be true, i.e., a high probability does not imply that the two nodes belong to the same community.
• The probability $T^k_{i,j}$ depends on the degree of j, because the walker tends to move towards vertices of high degree.
• Two vertices i and j belonging to the same community tend to see all other vertices in the same way, that is, $T^k_{i,l} \approx T^k_{j,l}$ for every vertex l.

In this paper, the similarity between two vertices is identified from the transition matrix $T^k$ based on the walk length k. The probability of reaching one node from another differs for different walk lengths. The similarity between i and j for walk length k can be computed as the Euclidean distance between the row vectors corresponding to nodes i and j in the matrix $T^k$:

$$sim(i, j) = \sqrt{\sum_{l=1}^{n} \frac{\left(T^k_{i,l} - T^k_{j,l}\right)^2}{d_l}} \qquad (10)$$

where $d_l$ is the degree of vertex l and n is the number of vertices. Since this measure is a distance, a smaller value indicates that the two vertices are more similar.
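The similarity measure can be sketched as a degree-weighted Euclidean distance between two rows of $T^k$. The code below is an assumption-laden illustration rather than the authors' implementation: it takes the measure to be the square root of the sum over l of $(T^k_{i,l} - T^k_{j,l})^2 / d_l$, uses walk length k = 3, and runs on a hypothetical toy graph of two triangles joined by a bridge.

```python
import math


def transition_matrix(A):
    """T[i][j] = A[i][j] / d_i."""
    return [[a / sum(row) for a in row] for row in A]


def walk_matrix(T, k):
    """T^k by repeated multiplication (adequate for small graphs)."""
    n = len(T)
    result = T
    for _ in range(k - 1):
        result = [[sum(result[i][l] * T[l][j] for l in range(n))
                   for j in range(n)] for i in range(n)]
    return result


def similarity(Tk, degrees, i, j):
    """Degree-weighted Euclidean distance between rows i and j of T^k."""
    return math.sqrt(sum((Tk[i][l] - Tk[j][l]) ** 2 / degrees[l]
                         for l in range(len(degrees))))


# Hypothetical toy graph: two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1
degrees = [sum(row) for row in A]
T3 = walk_matrix(transition_matrix(A), 3)

same_side = similarity(T3, degrees, 0, 1)    # both nodes in the first triangle
across = similarity(T3, degrees, 0, 4)       # nodes in different triangles
print(same_side, across)
```

Nodes 0 and 1, which sit in the same triangle, obtain a much smaller value than nodes 0 and 4, which sit in different triangles; this is the behaviour the third inference above relies on.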

Proposed Methodology
Detecting communities in a social network having a large number of nodes in a reasonable amount of time is a challenging task due to the size and structure of the network. In this work, a distributed computing environment has been considered for processing large-scale networks. Hadoop is a framework for solving complex problems by distributing the computation over multiple nodes in a cluster. Hadoop implicitly schedules the pieces of a task on different computing nodes. It automatically takes care of load balancing and resource scheduling over the cluster of nodes. Users need not worry about the internal execution policy; however, they provide the application-specific mapper and reducer programs to the Hadoop framework. Users thus have control over the Map-Reduce program structure but not over the execution environment. In this work, the community detection process in Hadoop has been carried out in two phases, a mapper phase and a reducer phase. Prior to the computation in Hadoop, the random walk process is carried out in order to visit all the nodes in the network. Similarity values between all pairs of nodes are first calculated from the transition probability matrix. The transition matrix has been generated separately for random walk lengths 1 to 3.
In the mapper phase, community memberships of nodes are evaluated using the local information available at each node. In a small world network model, quick navigation is possible through two fundamental processes: one is to find a short chain of acquaintances, and the other is to use only local information regarding the structure around the node. The community detection process starts from a randomly selected source node. Neighboring nodes whose similarity value with the source is less than the threshold value η are identified and included in the community to which the source node belongs. Here, η is treated as the clustering parameter of the small world model, which determines the probability of having a connection between two nodes in the network. Small world model parameters have been determined based on the random walk in the network. Once a node has been mapped to a community number, it is temporarily deleted from the graph in order to reduce computation. For the sake of simplicity, an undirected graph has been taken into consideration. It is observed that a real-world social network follows a small world network; in other words, by following a small number of steps, one can visit all other nodes in the network. In the mapper phase, nodes are mapped into their communities in the order of their reachability from the source node. The steps adopted in the mapper phase are shown in Algorithm 1. The execution flow for community detection is presented in Figure 1.
In the reducer phase, the clustering process is carried out by developing a dendrogram structure for the network. A dendrogram is a hierarchical structure from which one can identify communities at a certain level of granularity. If the structure is partitioned at a certain level, the groups formed below the partition can be considered as individual communities. The input to the reducer phase is in the form of key-value pairs, where the community number detected in the mapper phase is treated as the key and the rest of the parameters, such as walk length, node id, and the number of detected communities, are together treated as the value for the corresponding key. The dendrogram structure is built at two levels: one at the inner level of each community, and the other at a global level, where each detected community is treated as a node. In the inner hierarchy, all nodes detected at the same walk length in the mapper phase belong to the same level of the dendrogram. Nodes that are assigned to the same community at a particular walk length are grouped together. To develop the dendrogram at the higher level of the hierarchy, we have proposed an efficient distance measure between each pair of communities that depends on the dissimilarity between the communities. It is calculated as the maximum dissimilarity possible between any two nodes of the different communities. It can be represented mathematically as follows:

$$dist(C_i, C_j) = \max_{i \in C_i,\; j \in C_j} sim(i, j) \qquad (11)$$

where $C_i$ and $C_j$ are two different communities, and i and j are two nodes belonging to $C_i$ and $C_j$, respectively. Here, sim(i, j) has already been calculated from Equation (10). The pair of communities with the minimum distance is identified first for grouping at the same level of the dendrogram.
After the grouping, the merged pair is treated as a single community. The next pair of communities is then identified by following the same procedure. The process is continued until all of the communities are included in the structure. The steps followed for obtaining the dendrogram structure are presented in Algorithm 2. The clustering process of detected communities is presented in Figure 2.

Algorithm 2. Dendrogram construction in the reducer phase.
1. for every Community_Number
2.     for every Walk_Length in Community_Number
3.         Combine the nodes at the same level
4.         Increase the level for the next Walk_Length
5.     end for
6.     Choose the next community for further clustering
7. end for
8. for every pair of communities
9.     Find the distance by using Equation (11)
10.    Combine the communities with the smallest distance at the same level
11.    Increase the level to choose the next pair of communities
12. end for
13. Return the dendrogram structure for the network
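The two phases described in this section can be sketched in plain Python as a stand-in for Hadoop mapper and reducer programs. Everything below is an illustrative assumption rather than the authors' code: the source node is chosen deterministically (smallest remaining id) instead of randomly, walk length is taken to be the breadth-first depth from the source, `sim` is any precomputed node distance (here a hypothetical one-dimensional placement of nodes), and the function names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def mapper(adj, sim, eta, max_walk=3):
    """Emit (community_number, (walk_length, node)) records."""
    remaining = set(adj)
    records, community = [], 0
    while remaining:
        source = min(remaining)            # deterministic stand-in for a random source
        remaining.remove(source)
        records.append((community, (0, source)))
        frontier = [source]
        for walk in range(1, max_walk + 1):
            reached = []
            for u in frontier:
                for v in sorted(adj[u]):
                    if v in remaining and sim(source, v) < eta:
                        remaining.remove(v)   # a mapped node is dropped from the graph
                        records.append((community, (walk, v)))
                        reached.append(v)
            frontier = reached
        community += 1
    return records


def inner_levels(records):
    """Inner hierarchy: nodes found at the same walk length share a level."""
    levels = defaultdict(lambda: defaultdict(list))
    for community, (walk, node) in records:
        levels[community][walk].append(node)
    return {c: dict(w) for c, w in levels.items()}


def community_distance(ci, cj, sim):
    """Largest dissimilarity over all cross-community node pairs."""
    return max(sim(i, j) for i in ci for j in cj)


def global_dendrogram(communities, sim):
    """Repeatedly merge the closest pair of communities; return the merge order."""
    clusters = {c: list(nodes) for c, nodes in communities.items()}
    merges, next_id = [], max(clusters) + 1
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda p: community_distance(clusters[p[0]], clusters[p[1]], sim))
        merges.append((a, b))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Hypothetical input: three dense groups chained by single edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (5, 6), (6, 7)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
pos = {0: 0.0, 1: 0.1, 2: 0.2, 3: 1.0, 4: 1.1, 5: 1.2, 6: 3.0, 7: 3.1}
sim = lambda i, j: abs(pos[i] - pos[j])    # hypothetical stand-in for the node distance

records = mapper(adj, sim, eta=0.5)
levels = inner_levels(records)
communities = defaultdict(list)
for c, (_, node) in records:
    communities[c].append(node)
merges = global_dendrogram(dict(communities), sim)
print(dict(communities), levels, merges)
```

Using the maximum cross-community distance makes the global step a complete-linkage agglomeration: two communities are merged early only if even their most dissimilar members are close.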

Metrics for Evaluation Performance
The following evaluation metrics have been considered for measuring performance of the proposed algorithm.

• Normalized Mutual Information (NMI): NMI is a suitable measure to compare the quality of different community partitions. It can be evaluated with the help of a confusion matrix (CM), where each row corresponds to a community present in the real partition and each column corresponds to a community detected by the proposed algorithm. The confusion matrix is obtained from the number of communities and the community membership of each node, which are available as ground truth in the datasets [34,35]. Each element $CM_{ij}$ of the confusion matrix represents the number of vertices of the ith real community that are also present in the jth detected community. The NMI of the detected partition may be formulated, in its standard confusion-matrix form, as:

$$NMI(X, Y) = \frac{-2 \sum_{i} \sum_{j} CM_{ij} \log\left(\dfrac{CM_{ij}\, N}{CM_i\, CM_j}\right)}{\sum_{i} CM_i \log\left(\dfrac{CM_i}{N}\right) + \sum_{j} CM_j \log\left(\dfrac{CM_j}{N}\right)}$$

where X and Y are the community partition structures corresponding to the ground truth and the detected structure, respectively. $CM_i$ and $CM_j$ indicate the communities in the true and detected community partitions, respectively, taken here as the sum of row i and the sum of column j of the confusion matrix, and N is the total number of nodes.
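• Modularity (Q): Modularity is a metric used to quantify the quality of a community partition. This measure was proposed by Girvan and Newman [13]. It is defined as the difference between the number of edges existing inside the communities and the number of edges that would be expected inside them under a random assignment of edges in a network with the same degree distribution. The expected number of edges between nodes i and j with degrees $d_i$ and $d_j$, respectively, is $d_i d_j / 2m$, where $m = |E|$ is the number of edges. The modularity value for a given partition $P = \{c_1, c_2, \ldots, c_k\}$ of the graph $G = (V, E)$ is defined as follows [36]:

$$ Q = \frac{1}{2m} \sum_{c \in P} \sum_{i, j \in c} \left( A_{ij} - \frac{d_i d_j}{2m} \right) $$

where $A_{ij}$ is the element of the adjacency matrix, equal to 1 if nodes i and j are connected and 0 otherwise.
• F-Measure: F-measure is a metric used to find the accuracy of the proposed algorithm when the ground truth about the communities is available in the dataset. It is the harmonic mean of precision and recall, where precision and recall can be obtained from the confusion matrix obtained from the experiment. The confusion matrix for community detection is described in Table 1. Precision can be defined as follows:

$$ Precision = \frac{TP}{TP + FP} $$

Recall can be defined as follows:

$$ Recall = \frac{TP}{TP + FN} $$

F-measure can be defined as follows:

$$ F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} $$

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative node pairs, respectively.

Table 1. Confusion matrix for community detection.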
• Execution Time: A major issue in community detection algorithms is to uncover communities in a reasonable amount of time. In this paper, the performance of different algorithms has been measured in terms of execution time. Execution time includes only CPU running time, without considering external time factors. Execution time for all community detection algorithms has been measured on machines with an i7 processor with a 3.4 GHz clock speed. Running times have been measured in seconds.
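The three quality metrics above can be sketched in plain Python as follows. This is an illustrative sketch, not the evaluation code used in the experiments. It assumes the standard confusion-matrix form of NMI, the usual pair-counting convention for false positives and false negatives (only the true positive case is spelled out in the text), and an undirected simple graph given as an edge list. All function names are hypothetical.

```python
import math
from itertools import combinations

def nmi(cm):
    """NMI from a confusion matrix (rows: true communities, columns: detected)."""
    n = sum(map(sum, cm))
    rows = [sum(r) for r in cm]
    cols = [sum(c) for c in zip(*cm)]
    num = -2.0 * sum(cm[i][j] * math.log(cm[i][j] * n / (rows[i] * cols[j]))
                     for i in range(len(rows))
                     for j in range(len(cols)) if cm[i][j])
    den = (sum(r * math.log(r / n) for r in rows if r)
           + sum(c * math.log(c / n) for c in cols if c))
    # Convention (assumption): two single-community partitions count as identical.
    return num / den if den else 1.0

def pair_f_measure(truth, found):
    """Precision, recall and F-measure over all node pairs.

    truth and found map each node to its community label.
    """
    tp = fp = fn = 0
    for u, v in combinations(sorted(truth), 2):
        same_true = truth[u] == truth[v]
        same_found = found[u] == found[v]
        tp += same_true and same_found          # together in both partitions
        fp += (not same_true) and same_found    # together only after detection
        fn += same_true and (not same_found)    # together only in ground truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)

def modularity(edges, membership):
    """Modularity of a partition, computed per community.

    Uses the equivalent form: (fraction of edges inside communities)
    minus the sum over communities of (degree sum / 2m) squared.
    """
    m = len(edges)
    degree_sum = {}
    for u, v in edges:
        for node in (u, v):
            c = membership[node]
            degree_sum[c] = degree_sum.get(c, 0) + 1
    inside = sum(1 for u, v in edges if membership[u] == membership[v])
    return inside / m - sum((s / (2 * m)) ** 2 for s in degree_sum.values())
```

For example, `modularity` applied to two triangles joined by a single edge, with each triangle as one community, gives 6/7 - 1/2, which reflects that six of the seven edges lie inside the communities.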

Datasets Used
A social network dataset is often represented in the form of a graph structure, where nodes in the graph represent social entities and edges represent the relationships among the entities. In this paper, the experiment has been carried out using both synthetic and real-world datasets. Details of the datasets are listed in Table 2. The Lancichinetti-Fortunato-Radicchi (LFR) benchmark has been used for generating synthetic data for the social network. This benchmark is an established one for evaluating different community detection algorithms. Synthetic networks that resemble real-world social networks have been generated by tuning a set of parameters in the LFR benchmark. Parameters in the LFR benchmark include the number of nodes in the network, the degree distribution, the community size distribution, the maximum and average degree of a node, etc. The degree distribution in the network follows the power law in the LFR benchmark. The probability of having a node with degree k varies with the parameter γ as mentioned below:

$$ P(k) \propto k^{-\gamma} $$

Here, the value of γ is chosen between 2 and 3 in order to resemble real-world social networks. µ is another parameter in the LFR benchmark, also known as the mixing parameter. A small value of µ indicates fewer edges between the planted communities in the network. The complexity of the network increases with the value of µ. In this paper, the complexity of the network has been increased by scaling the mixing parameter from 0.2 to 0.5. The β parameter in LFR controls the community size distribution in the network. The value of β often varies between 1 and 2. For each of the synthetic datasets, the performance of the algorithms has been measured by tuning the mixing parameter from 0.2 to 0.5 in steps of 0.05. Thus, a total of fourteen data points (seven values of µ for each of the two artificial datasets listed in Table 2) have been generated, and performance has been measured for these data points.
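The parameter settings described above can be made concrete with a short Python sketch. It only enumerates the mixing-parameter grid and normalises a power-law degree distribution over a bounded degree range; it does not generate LFR networks. The dataset names and the degree bounds are hypothetical placeholders, not values taken from Table 2.

```python
def mixing_grid(start=0.2, stop=0.5, step=0.05):
    """Mixing-parameter values from start to stop, inclusive."""
    count = round((stop - start) / step) + 1
    return [round(start + k * step, 2) for k in range(count)]

def power_law_pmf(gamma, k_min, k_max):
    """P(k) proportional to k**(-gamma), normalised over k_min..k_max."""
    weights = {k: k ** -gamma for k in range(k_min, k_max + 1)}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Hypothetical names standing in for the two synthetic LFR networks.
datasets = ["synthetic_1", "synthetic_2"]

# One data point per (dataset, mixing parameter) combination.
data_points = [(name, mu) for name in datasets for mu in mixing_grid()]

# Degree distribution for an assumed exponent and degree range.
degree_pmf = power_law_pmf(gamma=2.5, k_min=2, k_max=50)
```

With the default grid, `mixing_grid()` yields seven values of µ, so `data_points` contains the fourteen combinations referred to in the text.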

Real World Datasets
Real-world datasets are more complex and heterogeneous as compared to synthetic data. Revealing communities in real-world networks is an NP (non-deterministic polynomial)-hard problem. For measuring the performance of the proposed algorithm, real-world datasets have also been taken into consideration (details are listed in Table 2). All of these datasets have been collected from the Stanford Large Network Dataset Collection (SNAP), which is publicly available for social network analysis [35].

Experimental Results
The experiment has been carried out on a cluster of five nodes, each with an i7 processor with a 3.4 GHz clock speed. The master node has a configuration with a 1 TB hard disk and 10 GB of RAM. It also acts as a worker node. Each of the other four nodes acts as a slave or worker node. They all have a symmetric configuration with a 1 TB hard disk and 20 GB of RAM. A similarity measure based on the random walk has been used to identify the neighboring nodes for inclusion in the community. The threshold value for the similarity measure has been set to 0.5. A number of synthetic datasets have been generated by tuning the parameters available in the LFR benchmark. The proposed algorithm, i.e., CDSW, has been applied to both synthetic and real-world social network datasets. It has been compared with the following community detection algorithms available in the literature: (a) Infomap Community Detection (INF) [30]; (b) Spin-Glass Community Detection (SG) [26]; (c) Girvan Newman Community Detection (GN) [13]; (d) Walktrap Community Detection (WT) [25].
NMI is one of the accepted measures for comparing detected clusters with ground truth partitions. NMI for each partition obtained from the different algorithms has been evaluated. The higher the NMI value, the better the community partition. Figure 3a shows the box plot analysis for NMI of all data points. It is observed that the median NMI for the proposed community detection algorithm using small world phenomenon (CDSW) is more than 0.8, which is better than that of the other algorithms. The minimum NMI for the CDSW algorithm is higher than the minimum value for the other algorithms. The WT algorithm also performs well with respect to NMI, close to the performance of the proposed algorithm.
Figure 3b shows the box plot analysis of F-measure for the different algorithms. F-measure is related to the accuracy of the algorithms. It has been calculated using the confusion matrix for the community partition. The structure of the confusion matrix has been presented in Table 1. From Figure 3b, it is observed that the median F-measure of the CDSW algorithm is higher than 0.85. Although the maximum of F-measure values for the CDSW algorithm is not as good as those obtained for WT and GN algorithms, its average and maximum values have been observed to be higher than all other algorithms. Figure 3c shows the box plot analysis of the modularity value for the community partitions generated by the different algorithms. The higher the modularity value, the better the community partition. The modularity value decreases when the link density between the communities, or the value of the mixing parameter µ, increases. The median modularity value for the proposed algorithm is observed to be better than that of the other community detection algorithms. Modularity values obtained from the Spin-Glass and Infomap algorithms are relatively similar. From Figure 3c, it is observed that the CDSW algorithm provides a better community structure as compared to the other traditional algorithms.
In the social network, the number of entities and their relationships are observed to be increasing exponentially. Since community detection in large-scale networks in a reasonable amount of time is the focus of this study, an effort has been made to measure the execution times of different community detection algorithms. Figure 3d shows the comparative study of execution time for the different community detection algorithms. The structure of the synthetic network changes when the mixing parameter of the benchmark increases. In this work, the comparative analysis of execution time has been carried out separately for synthetic and real-world datasets. The comparative analysis of execution time for

Figure 1. Execution flow for community detection using small world phenomenon.

Figure 3. Comparative study of different community detection algorithms.

Mapper (Source Node, Community Membership, Graph G)
Input: The social network graph in the form of edge lists.
Output: Community Membership for each node in the form of a (Key, Value) pair. Here, Community Membership is the Key and (Walk Length (WL), Node ID) is the Value.

Table 1. In this work, all pairs of nodes are considered to obtain the values of a, b, c, and d for each dataset, where a is the number of pairs that are in the same community in the ground truth and are assigned to the same community after community detection. It is treated as the True Positive (TP) count.

Table 2. Datasets used for evaluation.