4.1. Local Communities from Personalized Centrality
Given a seed set of vertices of interest, we can calculate the personalized Katz or PageRank scores by solving the corresponding linear system, where the Katz and PageRank matrices and their right-hand sides are as discussed in Section 2.3. If we want the personalized scores w.r.t. vertex i, then both right-hand sides are the indicator vector of vertex i (1 in position i and 0 elsewhere). Intuitively, the resultant scores from a personalized centrality metric with respect to vertex i answer the question of how likely we are to reach vertex i from the rest of the graph. For the question of local community detection, this can be translated into how likely vertices in the graph are to belong to the community of vertex i. For a community of size R, we therefore take the top R vertices as ranked by the personalized Katz or PageRank centrality vector as the local community.
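As a rough illustration of this procedure, the sketch below computes personalized Katz scores for one seed vertex and keeps the top R vertices. It assumes the standard Katz system (I - alpha*A) x = e_i, which may differ in detail from the formulation in Section 2.3, and approximates the solution with a truncated fixed-point iteration x <- alpha*A*x + e_i, which converges when alpha is smaller than the reciprocal of the largest eigenvalue of A. The function names, the adjacency-list input, and the default values of alpha and the iteration count are hypothetical choices made for this example.

```python
def personalized_katz(adj, seed, alpha=0.1, iters=50):
    """Approximate x solving (I - alpha*A) x = e_seed (assumed Katz form).

    adj: dict mapping each vertex to a list of its neighbors (undirected).
    """
    x = {v: 0.0 for v in adj}
    for _ in range(iters):
        new_x = {}
        for v in adj:
            walk_sum = sum(x[u] for u in adj[v])  # entry v of A x
            new_x[v] = alpha * walk_sum + (1.0 if v == seed else 0.0)
        x = new_x
    return x


def local_community(adj, seed, size, alpha=0.1):
    """Return the `size` vertices with the highest personalized scores."""
    scores = personalized_katz(adj, seed, alpha)
    # A full sort is used only for brevity at this toy scale.
    ranked = sorted(adj, key=lambda v: scores[v], reverse=True)
    return set(ranked[:size])


# Two triangles joined by the single edge 2-3; vertex 0 is the seed.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
community = local_community(graph, seed=0, size=3)
```

On this toy graph the seed's own triangle receives the highest scores, because walks that reach the second triangle must cross the single bridging edge and are damped by additional factors of alpha.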
Once personalized Katz or PageRank centrality scores are computed, the local community is formed from the vertices with the highest centrality values. Sorting the entire length-n vector to obtain these top entries is too computationally expensive, especially in the dynamic setting where updated results are needed quickly after changes occur. Therefore, we extract the vertices with the top k values using a min-heap. The centrality values of the first k vertices are added to the heap. Thereafter, each centrality score is compared to the minimum value in the heap in O(1) time and, if larger, the minimum value is removed from the heap and the new value inserted in O(log k) time. In the worst case, the centrality values are in ascending order and every such check results in a removal and insertion, leading to a running time of O(n log k). However, experiments on real graphs show far fewer replacements.
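The heap-based selection can be sketched with Python's standard `heapq` module as follows. This is a minimal illustration rather than the implementation used in the experiments; the function name and the replacement counter are hypothetical additions, the latter included only to make the number of removals and insertions visible.

```python
import heapq


def top_k_vertices(scores, k):
    """Return (indices of the k largest scores, number of heap replacements)."""
    if k <= 0:
        return [], 0
    heap = []          # min-heap of (score, vertex); heap[0] is the smallest kept
    replacements = 0
    for vertex, score in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (score, vertex))       # fill phase
        elif score > heap[0][0]:                        # constant-time peek
            heapq.heapreplace(heap, (score, vertex))    # pop minimum, push new
            replacements += 1
    top = [v for _, v in sorted(heap, reverse=True)]    # sorts only k entries
    return top, replacements


scores = [0.02, 0.91, 0.15, 0.40, 0.07, 0.66]
top, swaps = top_k_vertices(scores, k=3)
```

When the input is in ascending order, every one of the n - k scores after the fill phase triggers a replacement, which is the worst case described above.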
4.2. Results on Static, Synthetic Graphs
This section validates using personalized centrality for local community detection. Static, synthetic graphs with known community structure provide test cases for the Katz centrality approach. We also compare our approach to the popular method of greedy expansion [34], which is described in Section 3.1. To test, we generate multiple stochastic block model (SBM) graphs with varying parameters, randomly choose seed vertices, and detect local communities with both personalized Katz centrality and greedy expansion.
The greedy expansion method uses conductance as its fitness function. A simple stochastic block model graph can be generated with four parameters: the total number of vertices n, the number of communities k, the average degree of vertices d, and the proportion of inter-community edges. All communities in such a graph are generated with the same parameters and are interchangeable. Note that SBM graphs can also be parameterized differently. Instead of an average degree and a proportion of inter-community edges, two edge probabilities can be used: the probability of placing an edge between a pair of vertices in the same community and the probability of placing an edge between a pair in different communities. Although the parameters differ, the two models are the same when all communities are generated with the same parameters, because for a fixed community size the average degree and inter-community proportion determine the two probabilities and vice versa. The code was implemented in Python and run on an 18-core Intel Xeon CPU at 2.10 GHz.
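One way to relate the two parameterizations is to count expected neighbors: a vertex in a community of size s has s - 1 possible intra-community neighbors and n - s possible inter-community neighbors. The sketch below derives the edge probabilities on that basis and samples a small graph. The exact expressions used in the original experiments may differ (for example, by using s in place of s - 1), and the function names, the symbol `inter_fraction`, and the example value 0.1 are assumptions made for illustration.

```python
import random


def sbm_probabilities(n, k, d, inter_fraction):
    """Convert (n, k, d, inter_fraction) to edge probabilities (p_in, p_out).

    Assumes k equal-sized communities of s = n // k vertices (s >= 2).
    """
    s = n // k
    p_in = (1.0 - inter_fraction) * d / (s - 1)
    p_out = inter_fraction * d / (n - s) if k > 1 else 0.0
    return p_in, p_out


def generate_sbm(n, k, d, inter_fraction, seed=None):
    """Sample an undirected SBM graph; returns (edge list, community labels)."""
    rng = random.Random(seed)
    p_in, p_out = sbm_probabilities(n, k, d, inter_fraction)
    s = n // k
    # If k does not divide n, leftover vertices join the last community,
    # in which case the probabilities above are only approximate.
    label = [min(v // s, k - 1) for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if label[u] == label[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, label


# Illustrative values only; 0.1 is not taken from the experiments.
p_in, p_out = sbm_probabilities(n=1000, k=2, d=20, inter_fraction=0.1)
edges, label = generate_sbm(n=200, k=2, d=10, inter_fraction=0.1, seed=0)
```

With these probabilities the expected degree of each vertex is p_in * (s - 1) + p_out * (n - s) = d, so both parameterizations describe the same random graph model.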
Table 2a–c shows the recall of communities found with each method compared to the known ground truth. The recall is the fraction of the ground truth recovered by each method. For these results, random stochastic block model graphs were created with 1000 vertices and two communities and a random seed was chosen. The minimum, mean, and maximum recall values shown are obtained from 100 runs, each with a random graph and seed vertex.
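The recall measure and the summary statistics reported per configuration can be written down in a few lines. The helper names below are hypothetical, and the numbers are toy values, not results from the experiments.

```python
def recall(detected, ground_truth):
    """Fraction of ground-truth community members present in `detected`."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground truth community must be non-empty")
    return len(truth & set(detected)) / len(truth)


def summarize(recalls):
    """Minimum, mean, and maximum over a list of per-run recall values."""
    return min(recalls), sum(recalls) / len(recalls), max(recalls)


# Toy example: three of the four true members were recovered.
r = recall(detected={0, 1, 2, 7}, ground_truth={0, 1, 2, 3})
lo, mean, hi = summarize([0.5, 0.75, 1.0])
```

Note that recall ignores false positives such as vertex 7 above; when the detected community is constrained to the known ground-truth size, every false positive also displaces a true member, so recall alone is informative.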
For the results shown in Table 2a, the average degree of vertices is varied from 5 to 490, while the proportion of inter-community edges is held fixed; all other edges are intra-community. Because the proportion is fixed, the numbers of both intra-community and inter-community edges increase as the average degree increases. Overall, recall scores are at or near 1, showing that the Katz method returns good communities for all average degrees considered.
This suggests that using personalized centrality is a viable method of local community detection. In fact, on SBM graphs with low degrees, the personalized Katz method performs better than greedy expansion. This occurs because the greedy expansion method stops adding new vertices once a local quality maximum is reached. On very low-degree SBM graphs, it stops expanding after adding only a few vertices, which results in very small communities and thus low recall. Therefore, we also show results for a modified version of the greedy algorithm in which expansion is forced to occur until the community reaches the desired size (labeled Force Expand in Table 2a–c). Normally, the greedy algorithm is not run in this way, but because we know the size of the community ahead of time, we can obtain these results. Note that results for the normal greedy algorithm and the forced expansion version tend to differ only for graphs in which the average degree is low compared to the community size.
Table 2b,c shows how the quality of communities detected varies for SBM graphs with an increasing proportion of inter-community edges. For these experiments, the average degree is fixed at 20 (Table 2b) and 100 (Table 2c), and the proportion of edges that are inter-community is varied (the proportion of intra-community edges varies correspondingly). As the percentage of inter-community edges increases, the community structure becomes less defined, making community detection more difficult. As expected, all methods achieve the best recall for graphs with a low proportion of inter-community edges. For graphs with a lower average degree of 20, both the Katz and greedy expansion methods return high-quality communities only when a small proportion of edges exist between communities. However, when the average degree is increased to 100, both methods are less sensitive to a large proportion of inter-community edges and achieve higher recall values. Overall, the quality of communities returned by the personalized Katz method is comparable to that of communities returned by greedy expansion. While the mean recall is sometimes lower, the minimum recall tends to be higher, making the results more consistent. Note that both the standard greedy and forced expansion greedy algorithms can return communities with very low recall. This may occur if the standard greedy method stops expanding too early or if either version returns the wrong community. Because the method greedily minimizes conductance, if there is a single seed vertex, the next vertex added is its lowest-degree neighbor. If this neighbor belongs to a different community, the algorithm may detect and return the wrong community.
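The first step of greedy expansion from a single seed can be illustrated with a small example. The sketch assumes a common definition of conductance (edges leaving the set divided by the smaller of the inside and outside volumes, with lower values indicating a better-separated community) and a greedy step that adds the neighboring vertex giving the lowest conductance; the precise variant described in Section 3.1 may differ, and the function names are hypothetical.

```python
def conductance(adj, community):
    """Cut edges leaving `community` divided by min(inside, outside volume)."""
    community = set(community)
    cut = sum(1 for u in community for v in adj[u] if v not in community)
    vol_in = sum(len(adj[u]) for u in community)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else 1.0


def greedy_step(adj, community):
    """Return the neighboring vertex whose addition gives the lowest conductance."""
    community = set(community)
    frontier = {v for u in community for v in adj[u]} - community
    # Ties are broken by vertex id so the result is deterministic.
    return min(frontier, key=lambda v: (conductance(adj, community | {v}), v))


# Seed 0 has a degree-2 neighbor (vertex 1) and a degree-4 neighbor (vertex 2).
graph = {
    0: [1, 2], 1: [0, 3], 2: [0, 3, 4, 5],
    3: [1, 2], 4: [2, 5], 5: [2, 4],
}
first_added = greedy_step(graph, {0})
```

Here adding vertex 1 yields conductance 2/4, while adding vertex 2 yields 4/6, so the lower-degree neighbor is chosen regardless of which community it belongs to.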
An interesting phenomenon occurs in Table 2b for SBM graphs with degree 20 and a very high proportion of inter-community edges. There, the minimum recall obtained with greedy expansion increases compared to graphs with a somewhat lower proportion of inter-community edges. This reversal in trend occurs because, at such a high proportion of inter-community edges, the community structure is almost gone and the greedy algorithm returns an almost random set of vertices, which includes many correct vertices. For graphs with a stronger community structure, on the other hand, the minimum recall corresponds to cases in which the greedy algorithm did return a community, but not the correct one.
Next, we consider the relative running time of the personalized Katz approach compared to the greedy expansion method. Figure 1 plots the ratio of running times, where a value of x greater than 1 indicates that the Katz method is x times faster than greedy expansion. For these tests, we also use static, synthetic SBM graphs. As before, graphs are generated with four parameters: the total number of vertices, the number of communities, the average degree of vertices, and the percentage of inter-community edges. For each experiment, three of these parameters are held constant, while one is varied in order to isolate its effect on the running time. The results shown use the modified version of greedy expansion in which the algorithm is forced to expand to the desired community size. We used this version because, for SBM graphs with a very low average degree compared to community size, the standard greedy algorithm stopped expanding after only a few vertices, leading to small and incorrect communities (see Table 2). This is likely an artifact of the synthetic SBM graphs, in which vertices have uniformly random degrees. For all plots in Figure 1, the proportion of inter-community edges is held at a single fixed value. Overall, we see that the personalized Katz approach tends to be faster than the greedy expansion method. Next, we discuss various ways in which the structure of a graph affects the relative running times.
In Figure 1a, speedup is shown for graphs with an increasing number of vertices, while the average degree is fixed at 20 and the number of communities is fixed at 2. This experiment shows that, with all other parameters held constant, the larger the number of vertices in the graph (and therefore the larger the community detected), the greater the speedup of our Katz approach compared to greedy expansion. This occurs because the complexity of the greedy approach is governed by the community size c and the average degree d, while the complexity of the Katz approach is governed by the number of edges m in the graph.
The advantage of our centrality approach compared to greedy expansion is greatest when the size of the community is large relative to the total number of vertices. This can be seen in Figure 1b, where we vary the number of communities, while keeping the size of the graph constant at 47,104 vertices with an average degree of 20. It is clear that the speedup of the Katz method is greatest for the graphs with a small number of large communities. This occurs because the personalized Katz centrality computation is global and processes the entire graph, while greedy expansion processes only a local subgraph composed of the community and its one-hop neighborhood. If, however, the community to be found is much smaller than the full graph, the greedy expansion method may be preferable.
Finally, we consider how the average vertex degree affects relative running times in Figure 1c. The number of vertices is held constant at 1000 with 2 communities. As the average degree increases, the speedup of the Katz method over greedy expansion first increases and then decreases. The increase in speedup occurs because a higher average degree results in larger community neighborhoods, and the larger the neighborhood of the community is as it expands, the slower the greedy expansion is. However, once the average degree grows large enough, few or no new vertices are added to the neighborhood and the clustering coefficient of the graph simply increases, so greedy expansion no longer slows down while the Katz computation must still process more edges. These results show that the running time advantage of the personalized Katz method compared to greedy expansion is greatest in graphs in which the community of interest has a large neighborhood and a low clustering coefficient.