Approximation Algorithm for Shortest Path in Large Social Networks

: Proposed algorithms for calculating the shortest paths such as Dijikstra and Flowd-Warshall’s algorithms are limited to small networks due to computational complexity and cost. We propose an e ﬃ cient and a more accurate approximation algorithm that is applicable to large scale networks. Our algorithm iteratively constructs levels of hierarchical networks by a node condensing procedure to construct hierarchical graphs until threshold. The shortest paths between nodes in the original network are approximated by considering their corresponding shortest paths in the highest hierarchy. Experiments on real life data show that our algorithm records high e ﬃ ciency and accuracy compared with other algorithms.


Introduction
The internet and its associated technologies have changed the way society conducts businesses, and the way that families and friends relate with each other.Social networking has become so popular such that users have drifted from the traditional physical media (the print and broadcasting media) to a more sophisticated social interaction.Computer scientists are looking into designing tools, models, and applications in diverse fields in internet application technologies [1] such as communication [2], recommendations, social marketing, terrorist threats [3], and key node mining [4].Efficiently computing the shortest path between any two nodes in a network is one of the most important concerns of researchers since it has a great potential for mobilizing people [5].Exact algorithms such as Dijkstra's and Floyd-Warshall's algorithms have not performed so well on large scale networks due to their high computational complexity.Even though these algorithms are so popular for their accuracy and applications, they cannot be efficiently applied to our current trend of large-scale datasets.We design an approximate algorithm to calculate the distances of the shortest paths in a modeled large-scale network.Chow [6] presented a heuristic algorithm for searching the shortest paths on a connected, undirected network.Chow's algorithm relies on a heuristic function whose quality affects the efficiency and accuracy of estimation.Rattgan et al. [7] designed a network structure index (NSI) algorithm to estimate the shortest path in networks by storing data in a structure.Construction of the structure however consumes so much time and space.Tang et al. [8] presented an algorithm, CDZ (Center Distance to Zone), based on local centrality and existing paths through central nodes (10% of all nodes) and approximates their distances by means of the shortest paths between the central nodes computed by Dijkstra's algorithm.Although CDZ achieved high accuracy on some social networks within a reasonable time, it performed poorly on large scale networks due to the large number of central nodes.Tretyakov et al. [9] proposed two algorithms, LCA (Lowest Common Ancestor) and LBFS, (Lexicographic Breadth-First Search) based on landmark selection and shortest-path trees (SPTs).
Although these algorithms perform well in practice, they do not provide strong theoretical guarantees on the quality of approximation.LCA depends on the lowest common ancestors derived from SPTs and landmarks to compute the shortest path.LBFS adopts SPTs to collect all paths from nodes to landmarks by using best coverage approach and split the network into sub networks.Based on the usual BFS traversal in these sub networks, LBFS can approximate the shortest path.Saeed Maleki [10] proposed the Dijkstra strip mined relation (DSMR) algorithm for calculating the single source shortest path in networks.With their approach, the entire network graph is passed to a distributor engine which slices the graph into subgraphs equal to the number of processors so that all subgraphs have approximately the same number of edges.After each subgraph is assigned to a processor, they are passed to optional processing procedures, pruning, and subgraph extraction.Pruning removes all edges that do not play major role in the shortest path calculation from any source vertex, while subgraph extraction extracts a subgraph from the original graph that contains most of the shortest paths traversing through that subgraph.They apply dijkstra's algorithm on each subgraph and compute the shortest distances from source to each vertex by synchronizing with all other subgraphs.Shortest path is computed only from a single source.A* [11] (A star) algorithm is an extension of Dijkstra's algorithm but in this case the direction to the shortest path is optimized by heuristics.A* calculates the cost of neighboring nodes and then select the possible path with the lowest cost to traverse to the target node.It does that by identifying which n node to the target node has the lowest cost after summing it with the neighbor node with the lowest cost.The entire graph has to be loaded into memory to find the shortest path between nodes which makes it computationally expensive.Even though A* enjoys optimality and completeness, the algorithm is not scalable to large graphs.Geisberger et al. [12] proposed a shortest path algorithm based on node contraction.Contraction is done by calculating the shortest path between all node pairs (immediate neighbors) and then introduce a shortcut path between them.The paths between two nodes can be reduced by the shortcut edge but not the cost.The shortest path between a pair of nodes is calculated in two ways, one from the shortcut edges of the starting node and the other from the ending nodes, respectively, until they meet.The total distance from both ends is taken as the shortest path between these two nodes.The major challenge with this algorithm is with the order of contracting nodes.The fewer shortcut edges introduced, the faster it is to calculate the shortest path.
Even though there exists a large variety of algorithms to calculate the shortest paths, there are few approximate algorithms based on hierarchical networks.
To deal with large scale hierarchical networks, we present a novel approximate algorithm based on the hierarchy of networks and parallel computing, which is able to efficiently and accurately scale up to large networks.To ensure high efficiency, we condense the central nodes and their neighbors into super nodes to construct higher-level networks iteratively, until the scale of the network is reduced to a threshold scale.In order to increase computational power and memory, for example if we use a machine with 32 cores, we pass a subset of the entire network (level i) hierarchy to each core evenly.After which the distances of the shortest paths in the original network are calculated by means of their central nodes in the hierarchical network.The performance of our algorithm was tested on four different real networks.Experimental results show that the runtime per query is only a few milliseconds on large networks, while accuracy is still maintained.
Based on G (parent network), we construct a series of undirected and weighted networks with different scales.The original network G is taken as the bottom level or level 0 network.We define the role of various nodes at each level of graph construction as follows:

•
Normal nodes are the immediate neighbors of nodes with the highest degree centrality at each of hierarchical graph construction.

•
Super nodes are condensed nodes that are represented by a single node in the next hierarchy.

•
Central nodes are nodes that have been selected to absorb its neighborhood nodes.After absorption, the central nodes become super nodes.

•
Sub-nodes are all other nodes that have a path to the central node, a path of radius r.
For each level of construction, i.e., from bottom to top, we iteratively perform the following steps.A normal node with the largest degree (having lot of clusters) is selected as a central node and condensed with its normal neighbors (other than super nodes) into a super node.The edges between the normal nodes are redirected to their corresponding super nodes.Condensing of current level network is completed when all nodes are merged into super nodes.These nodes are regarded as normal nodes in the next level network.Edges between two super nodes in the previous hierarchy results in a single link for two normal nodes in the next level network.The weight of an edge between two adjacent nodes in the next level network represents the approximate distance between the nodes.The topmost network is obtained when the number of nodes in the next level network is below a given threshold t.
Figure 1 shows the process of constructing a hierarchical network.Figure 1a is the current level network whose red nodes represent central nodes.Central nodes condense with their neighbors (blue nodes) to form super nodes respectively, which results in the next level network as shown in Figure 1b.All super nodes in the previous level network (Figure 1a) are considered normal nodes in the next level network as shown in Figure 1b.The red node in Figure 1a will be selected as the central node, and condense with its neighbors into a super node in the next level hierarchy.For each level of construction, i.e., from bottom to top, we iteratively perform the following steps.A normal node with the largest degree (having lot of clusters) is selected as a central node and condensed with its normal neighbors (other than super nodes) into a super node.The edges between the normal nodes are redirected to their corresponding super nodes.Condensing of current level network is completed when all nodes are merged into super nodes.These nodes are regarded as normal nodes in the next level network.Edges between two super nodes in the previous hierarchy results in a single link for two normal nodes in the next level network.The weight of an edge between two adjacent nodes in the next level network represents the approximate distance between the nodes.The topmost network is obtained when the number of nodes in the next level network is below a given threshold t Figure 1 shows the process of constructing a hierarchical network.Figure 1a is the current level network whose red nodes represent central nodes.Central nodes condense with their neighbors (blue nodes) to form super nodes respectively, which results in the next level network as shown in Figure 1b.All super nodes in the previous level network (Figure 1a) are considered normal nodes in the next level network as shown in Figure 1b.The red node in Figure 1a will be selected as the central node, and condense with its neighbors into a super node in the next level hierarchy.

Algorithm Based on the Hierarchy of Networks
After transforming the original network into a number of hierarchical networks, the distance of the shortest path between any two nodes in the lower network can be estimated by considering their corresponding central nodes in the higher-level network.
Let ˆ( , ) i d s t be the approximate distance between nodes s and t in the level i network.The distance of shortest path between nodes s and t in the original network is approximated by 0 ˆ( , ) where cs and ct are the central nodes of nodes s and t respectively.Figure 2 shows an example of shortest path approximation using Equation (1).ˆ( , )

Algorithm Based on the Hierarchy of Networks
After transforming the original network into a number of hierarchical networks, the distance of the shortest path between any two nodes in the lower network can be estimated by considering their corresponding central nodes in the higher-level network.
Let di (s, t) be the approximate distance between nodes s and t in the level i network.The distance of shortest path between nodes s and t in the original network is approximated by d0 (s, t).
where c s and c t are the central nodes of nodes s and t respectively.Figure 2 shows an example of shortest path approximation using Equation (1).di (s, c s ) a nd di (t, c t ) are the approximate distances from nodes s i and t i to their respective central nodes c s i and c t i in the level i network.di+1 (s, c s ) and di+1 (c, t s ) are the distances from nodes s i+1 and t i+1 to their common central nodes in the level i + 1 network.We define the longest distance from sub nodes to its corresponding central node as the radius  of the super node.For example, assuming normal nodes in the level i network in Figure 2 have the same radius r, then the radiuses of their corresponding super nodes will range from r to 3r + 1 (Note: the longest path to a central node is 3r).However, computing for all radii of super nodes in the network hierarchy will consume so much memory and time.For this reason, we define an appr radius  for all super nodes in the level i network by approximating the distance between two adjacent normal nodes in the level i network.As shown in Figure 3, the appr radiuses of super nodes in the level i + 1 network are the approximate distances between the centers of two adjacent normal nodes (represented as hollow circles).Normal nodes in the level i + 1 network were derived from super nodes in the level i network whose appr radiuses were same.The apprr adius of normal nodes in level i network is defined by where k is the number of hierarchical networks with different scales including the original network.
Furthermore, Equation ( 2) can be written as We approximate the distance from node s to its central node c in the level i network by The approximate distance between adjacent nodes in level i network can also be written as We define the longest distance from sub nodes to its corresponding central node as the radius r of the super node.For example, assuming normal nodes in the level i network in Figure 2 have the same radius r, then the radiuses of their corresponding super nodes will range from r to 3r + 1 (Note: the longest path to a central node is 3r).However, computing for all radii of super nodes in the network hierarchy will consume so much memory and time.For this reason, we define an appr radius r i for all super nodes in the level i network by approximating the distance between two adjacent normal nodes in the level i network.As shown in Figure 3, the appr radiuses of super nodes in the level i + 1 network are the approximate distances between the centers of two adjacent normal nodes (represented as hollow circles).Normal nodes in the level i + 1 network were derived from super nodes in the level i network whose appr radiuses were same.The apprr adius of normal nodes in level i network is defined by where k is the number of hierarchical networks with different scales including the original network.Furthermore, Equation (2) can be written as r i = 2 i − 1 when 0 ≤ i < k.We approximate the distance from node s to its central node c in the level i network by di (s, c) The approximate distance between adjacent nodes in level i network can also be written as 2r i + 1. Substituting Equations ( 2) and (3) into Equation (1), the upper bound of di (s, t) can be calculated by The distance between any two adjacent nodes in the top-level network is approximated as 4) is the length of shortest path between nodes s and t in the top-level network computed by Dijsktra's algorithm.The construction of each level of the hierarchical network is described in Algorithm 1. Algorithm 1 consumes O(n i logn i ) time to rank the n i nodes in the level i network and O(n i + e i ) time to generate the level i + 1 network, where e i is the number of edges in level i network.The space complexity of Algorithm 1 is O(n + e).

Parallelization
In order to reduce the overall computational runtime for computing the shortest path from s to t, we parallel to jobs to each processor on the computing hardware.Parallelization is simultaneously employing multiple compute resources by sharing tasks among these resources to solve a computational problem.Two common approaches to run our algorithm are via either threads or multiple processes.Threads treat each job as sub tasks of a single process and therefore would have to access the same memory location.If synchronization is not done properly, there could be conflicts for example when there is access to the same memory location at the same time.A safer approach is to submit each task to a completely separate memory location.We distribute networks evenly on to multiple cores thus passing level i networks to a cpu core for computation.Dijkstra's algorithm can only compute the shortest path from s to t after it has completed computation of shortest path d(s, w) where w is any vertex that satisfies d(s, w) < d(s, t).It is evident that this process will increase the idleness on processors that leads to a reduction in resource utilization and efficiency [13] where efficiency p was about √ p. Figure 4 shows the architecture of the various stages of our proposed algorithm.The given network is the input data to the hierarchical graph construction process at the pre-processing stage.The graph in partitioned into k number of hierarchies and distributed across the processor cores of the computing machine.

Parallelization
In order to reduce the overall computational runtime for computing the shortest path from  to , we parallel to jobs to each processor on the computing hardware.Parallelization is simultaneously employing multiple compute resources by sharing tasks among these resources to solve a computational problem.Two common approaches to run our algorithm are via either threads or multiple processes.Threads treat each job as sub tasks of a single process and therefore would have to access the same memory location.If synchronization is not done properly, there could be conflicts for example when there is access to the same memory location at the same time.A safer approach is to submit each task to a completely separate memory location.We distribute networks evenly on to multiple cores thus passing level  networks to a cpu core for computation.Dijkstra's algorithm can only compute the shortest path from s to t after it has completed computation of shortest path (, ) where w is any vertex that satisfies ||(, )|| < ||(, ).It is evident that this process will increase the idleness on processors that leads to a reduction in resource utilization and efficiency [13] where efficiency  was about . Figure 4 shows the architecture of the various stages of our proposed algorithm.The given network is the input data to the hierarchical graph construction process at the pre-processing stage.The graph in partitioned into k number of hierarchies and distributed across the processor cores of the computing machine.Algorithm 2 describes the subsequent construction hierarchies.Algorithm 1 is repeatedly executed until the top-level network is achieved which is essentially equal to the partitioning value (p) nodes given a threshold (t).The time complexity for constructing hierarchical networks is given as  ∑ ( log  +  +  ) and space complexity as ( + ) .In the top level network, Dijsktra's algorithm require ( ) time to compute the shortest paths between each pair of nodes and ( ) space to store the distances, where  is the number of nodes in the top level network.Algorithm 2 describes the subsequent construction hierarchies.Algorithm 1 is repeatedly executed until the top-level network is achieved which is essentially equal to the partitioning value (p) nodes given a threshold (t).The time complexity for constructing hierarchical networks is given as O k−1 i=0 (n i log n i + n i + e i ) and space complexity as O(kn + e).In the top level network, Dijsktra's algorithm require O m 2 time to compute the shortest paths between each pair of nodes and O m 2 space to store the distances, where m is the number of nodes in the top level network.Incorporating time and space complexity into Dijsktra's algorithm requires O kn + e + m 2 space to store hierarchical networks and corresponding distances in the top level network.After the construction of a number of networks from Algorithm 2, Algorithm 3 distributes the networks across each processor core of the computing hardware for computation.Finally, the approximate distances of the shortest paths in the original network can be calculated by Algorithm 4. Algorithm 4 requires at most O kn 2 /p time to compute the shortest paths for all pairs of nodes, where p is the number of processor cores.

Algorithm 4. Calculate Shortest Path
Preconditions: Network network = G(V, E), d[i][j] records shortest distances between nodes in the top level network calculated by Algorithm 2; k denotes the number of hierarchical networks constructed by Algorithm 2, including the original network; c s and c t are the central nodes of s and t respectively; supernode(c s ) is the super node which contains nodes c s and s and its regarded as a normal node in the next level network.1: function IterativeApproximation(s,t,i) 2: for i in parallel_mapping 3: Based on the above analysis, the time complexity for approximating the shortest distances in an undirected and unweighted network with n nodes and e edges can be calculated as O m 3 + k−1 i=0 (n i log n i + n i + e i ) + kn 2 /p , and the memory complexity as O kn + e + m 2 .We compared the complexity of CDZ [9] and LBFS [10] with our algorithm.CDZ algorithm selects c central areas in the network with n nodes and e edges and computes the distances between c and its central nodes by Dijkstra's algorithm.The time complexity of CDZ algorithm is given as O ed + nlogn + c 3 .The number of central areas in CDZ algorithm is about 10% of the number of nodes.LBFS algorithm selects M pairs of nodes from the network to obtain i landmarks by "best coverage strategy" in a network with n node and e edges.The time complexity of LBFS is given as O M 3 + le + l 2 D 2 , where D is the average size of the sub network related to the landmarks.Sometimes, the number of pairs of nodes in LBFS could be too large to make a best coverage and selection.The advantage of our algorithm over LBFS is that, the number of the nodes in the top network is relatively very small (m is below a certain threshold), which significantly reduces the complexity, moreover, optimum utilization of hardware and memory was realized by paralleling the computing tasks which yielded a significant improvement.

Completeness, Soundness and Cost Optimality
In this section, we will show that our proposed algorithm is sound, complete, and optimal.An algorithm is sound if it always returns an answer.Completeness on the other hand is the guarantee that an algorithm will always return the true results for any arbitrary inputs.Our proposed algorithm will always find the best solution to finding the shortest path even in worse case scenarios, the lower bound for the function for time complexities is used in computing the shortest path.

Sound and Completeness
Suppose we are to calculate the shortest path between two arbitrary nodes s and t.Our algorithm searches for their corresponding graph in the top-level hierarchy i.If both nodes are contained in a super node c s or c s , the shortest path distance is approximated by the distances from their respective nodes to central nodes c s or c t and c s ⊃ s and t.For c t c s , the distances are approximated by the node s's distance to Cs + the distance between node c s and c t in the next higher network (i + 1) + the distance from t to c t .If s, t c s , or c t , then the approximated distance between s and t existed within different central nodes within or across different levels of i. the algorithm will always search for the central nodes of s and t until the topmost hierarchy.If there is path between central nodes of s and t, and shortest path between s and t is not recorded at the topmost hierarchy, then the algorithm returns a path value of 0, which means there is no edge between s and t that is no path between s and t.

Optimality
An algorithm is optimal if the time complexity for finding the solution to a problem in worst case scenario is in the lowest range of the functions that describes the time complexity in worst case scenario to the problem.Our algorithm finds the shortest path distance from the smaller graph to the largest, i.e., a top bottom approach, if the path exists in the smaller graph, the algorithm selects that path terminates.Approximation between hierarchies are performed iteratively so as optimize the task rather than searching through the entire levels of graph to select the shortest path.If a path doesn't exist in a level hierarchy, that graph is discarded and the next level graph loaded, a smaller graph is always considered which in effect minimizes the cost.The algorithm always searches within the smallest ith value(s) which contain the approximated path of the two nodes in question.

Experimental Results and Discussions
Our algorithm was implemented by java programming language, running on a PC with intel core i5 4300M, CPU at 2.60 GHZ dual core and a 4 GB RAM.Performance of the algorithm was evaluated on four real undirected and weighted networks; Email-Enron [14], itdk0304_rlinks [15], DBLP [16] and roadNet.Email-Enron contains about half a million email communications among users whose nodes are the email addresses of senders or receivers and edges are the communication relationships.User's email addresses represents nodes on the dataset such that if a user i sent at least one email to another user j, then there exists an undirected edge from i to j. Tdk0304_rlinks is a CAIDA Skitter Router-Level Topology and Degree Distribution of an undirected internet router-level, which contains the relationships among nodes that access the router where the nodes are users or websites and the edges are the relationships between them.DBLP network contains information on computer science publications, in which each node corresponds to an author; two authors are connected by an edge if they have co-authored at least one publication.The roadNet is a California road traffic network composed of roads and sites network with the intersections and endpoints as nodes and the connecting roads as edges.Since most roads are unidirectional, the network graph is represented as an undirected graph.The number of nodes V, edges E, diameter of network D, average degree <k>, and the largest degree k max of these four networks are shown in Table 1.We use Path Ratio p to assess the accuracy of the algorithm.p is defined as where Pr is the total number of pairs of nodes, P f i i is the distance between pairs of nodes computed by the approximation algorithm, and P O i is the accurate distances computed by Dijkstra's algorithm.The value of p is always greater than 1 since approximate distances are always longer than their corresponding accurate distances.Table 2 shows experimental results on preprocess time, average query time, and path ratio from our algorithm compared with that of CDZ.The threshold value is set at 100.T init is the total time for preprocessing.T q is the average runtime for 10,000 random queries.From table 2, it can be construed that our algorithm performed relatively better on four different networks.It ran 10 times faster than CDZ especially on DBLP and roadNet networks.Moreover, the approximation for the shortest path is more accurate on Email-Enron and itdk0304_rlinks, compared with CDZ on DBLP and roadNet.Table 3 shows results from LCA, LBFS, and our algorithm.It can be seen from Table 3 that our algorithm outperforms LBFS in terms of efficiency and accuracy.In DBLP and roadNet, our algorithm ran twice as fast as LBFS.Compared with LCA, our algorithm approximates more accurately but with a slightly higher runtime.Moreover, parallelization improved the runtime by about 15% compared with the latter which computed shortest paths sequentially.From our experiments, we deduced that threshold t affects the efficiency and accuracy of our algorithm.Table 4 shows the influence of threshold t.From Table 4, it can be seen that runtime increases and path ratio decreases when threshold is increased.When there are more nodes at the top level network, approximation is more accurate but requires more time for Dijkstra's algorithm.When t increases from 40 to 100, Tq also increases a little, but p improves significantly.When the threshold value increases from 100 to 180, Tq increases sharply, but p remains nearly constant.We therefore set the threshold value at 100 in order to obtain a good trade-off.

Conclusions
Based on hierarchical networks, we propose a parallel approximate shortest path algorithm which is efficient and maintains high approximation accuracy on large scale networks.The algorithm condenses central nodes and their neighbors into super nodes to iteratively construct higher level networks until the scale of the top level meets a set threshold value.The algorithm approximates the distances of the shortest paths in the original network by means of super nodes in the higher-level network.The performance of our algorithm was tested on four real networks.Results from our tests show that our algorithm has a runtime per query within a few milliseconds and at the same time delivers high accuracy on large scale networks.Compared with other algorithms, our algorithm runs twice as fast as LBFS and over 10 times faster than CDZ.
The proposed algorithm mainly focuses on undirected and unweighted networks.In the future, we seek to focus on directed and weighted networks by exploring the approximate distance between a node and its central node based on hierarchical networks.We will also consider an adaptive algorithm for dynamic networks.

Figure 1 .
Figure 1.Construction of hierarchical networks.(a) graph at level i, (b) level i + 1 network.

Figure 1 .
Figure 1.Construction of hierarchical networks.(a) graph at level i, (b) level i + 1 network.

+
are the distances from nodes si+1 and ti+1 to their common central nodes in the level i + 1 network.

−
The distance between any two adjacent nodes in the top-level network is approximated as in Equation (4) is the length of shortest path between nodes s and t in the toplevel network computed by Dijsktra's algorithm.

Algorithm 1 .
First Hierarchy Notations: old = (V, E) is the original network, new = (V , E ) is the next level network, S is the super node c and its neighboring normal nodes repectively 1: function SingleCluster (old) 2: C←HighestDegree (old) 3: for c ∈ C do 4: if c S 5: S c = {c} U {c.neighbors\S} 6: S←S U {S c } 7: Edges connected to the sub nodes inside S c are redirected to S c , also the multiple edges are merged into a single one.in S are regarded as normal nodes in V of new.11: return new 12: end function 13: function HighestDegree (old) 14: for each v ∈ V let d[v] ←degree (v) 15: sort V by d[v] 16: v (i) denotes the vertex with the i-th highest d[v] 17: return sequence {v (1) , v (2) , . . . . . ., v (|V|) } 18: end function

Figure 4 .
Figure 4. Architecture of the algorithm for computing shortest path.

Figure 4 .
Figure 4. Architecture of the algorithm for computing shortest path.

Algorithm 2. Hierarchy Networks Preconditions:
Network original = (V, E), d[i][j]records the shortest distance between node i and j in the top level network.

Table 1 .
Basic information of four networks.

Table 2 .
Runtime and accuracy of CDZ and our algorithm on four networks.

Table 3 .
Results from LCA, LBFS and our algorithm.

Table 4 .
Comparison of different thresholds.