1. Introduction
Finding the community structure within a complex network that relates to its function or behavior is a fundamental challenge in Network Science [1,2,3,4,5]. It is a highly non-trivial problem, as even what one means by "structure" must be carefully specified [6]. A variety of approaches can be used to partition the nodes of a network into communities [7,8]. Each approach divides the nodes according to a different definition of structure and, in general, finds a different partition [9,10]. Often, the goal is to find a partition that maximizes an objective function. However, finding such a partition can be a computationally difficult NP-complete problem [11,12]. Finding a guaranteed exact solution for a large network is therefore generally infeasible. Thus, for practical applications, it is important to have an approximate algorithm that has polynomial-time complexity, i.e., is fast, and that finds near-exact best solutions for large networks, i.e., is accurate.
Recently, an algorithmic scheme has been introduced that uses information in an ensemble of partitions to produce a better, more accurate partition when seeking to maximize an objective function. The scheme, Reduced Network Extremal Ensemble Learning (RenEEL) [13], uses a machine learning paradigm for graph partitioning, Extremal Ensemble Learning (EEL). EEL begins with an ensemble $\mathcal{P}$ of $K$ unique partitions, where $K$ is the maximum ensemble size. It then iteratively updates the ensemble using extremal criteria until consensus is reached within the ensemble about what the "best" partition is, i.e., the one with the largest value of the objective function. Each update considers adding a new partition $P$ to $\mathcal{P}$ as follows: If $P$ is already in $\mathcal{P}$, then the "worst" partition in $\mathcal{P}$, $P_{\rm w}$, is removed from $\mathcal{P}$, reducing the size of $\mathcal{P}$ by one, and the update is complete. If $P$ is not in $\mathcal{P}$ and $P$ is worse than $P_{\rm w}$, then again $P_{\rm w}$ is removed from $\mathcal{P}$, reducing the size of $\mathcal{P}$ by one, and the update is complete. If $P$ is not in $\mathcal{P}$ and $P$ is better than $P_{\rm w}$, then if the current size $k$ of $\mathcal{P}$ equals $K$, $P_{\rm w}$ is replaced by $P$ in $\mathcal{P}$ and the update is complete, or if $k < K$, $P$ is added to $\mathcal{P}$, increasing the size of $\mathcal{P}$ by one, and the update is complete. Iterative updates continue until $k = 1$. The remaining partition is the consensus choice for the best partition.
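As an illustration, the EEL update logic described above can be condensed into a short sketch. The helper names (`eel_update`, `eel_consensus`, `propose`, `q`) are hypothetical, and the actual RenEEL implementation [13] differs in detail:

```python
def eel_update(ensemble, P, q, K):
    """One EEL update of the partition ensemble (illustrative sketch).

    ensemble: list of unique partitions; P: candidate partition;
    q: objective function (e.g., Modularity); K: maximum ensemble size.
    """
    worst = min(ensemble, key=q)              # the "worst" partition P_w
    if P in ensemble or q(P) <= q(worst):     # duplicate, or no better than P_w:
        ensemble.remove(worst)                #   shrink the ensemble by one
    elif len(ensemble) == K:                  # better than P_w and ensemble full:
        ensemble[ensemble.index(worst)] = P   #   replace P_w with P
    else:                                     # better than P_w, room to grow:
        ensemble.append(P)                    #   grow the ensemble by one
    return ensemble

def eel_consensus(ensemble, propose, q, K):
    """Iterate EEL updates until the ensemble reaches consensus (size 1)."""
    while len(ensemble) > 1:
        ensemble = eel_update(ensemble, propose(ensemble), q, K)
    return ensemble[0]                        # the consensus best partition
```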
To ensure fast convergence to a consensus choice in EEL updates, RenEEL preserves the consensus that exists within $\mathcal{P}$ up to that point each time it finds a new partition to be used in an update. It achieves this by partitioning a reduced network rather than the original network. The reduced network is constructed by collapsing the nodes that every partition in $\mathcal{P}$ agrees should be in the same community into "super" nodes. Reduced networks are smaller than the original network and can be analyzed faster, focusing effort only on improving the partitioning where there is disagreement within $\mathcal{P}$. The consensus within $\mathcal{P}$ increases monotonically, and the size of the reduced networks decreases monotonically, as the EEL updates are made. To decide the partition to use in each EEL update, RenEEL creates a second ensemble $\mathcal{P}'$ consisting of $L$ partitions found by analyzing the reduced network. The best partition in $\mathcal{P}'$ is then used in the update.
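A minimal sketch of the reduced-network construction, assuming each partition is a dict mapping nodes to community labels and the network is an edge list (hypothetical function names, not the reference implementation of RenEEL [13]):

```python
from collections import defaultdict

def reduce_network(edges, ensemble):
    """Collapse nodes that every partition in the ensemble groups together.

    edges: iterable of (u, v) links; ensemble: list of partitions, each a
    dict mapping node -> community label. Returns the node -> super-node
    map and the weighted edge list of the reduced network.
    """
    # Nodes agree iff their label signatures across the ensemble coincide.
    signature = {u: tuple(p[u] for p in ensemble) for u in ensemble[0]}
    ids = {sig: i for i, sig in enumerate(set(signature.values()))}
    super_node = {u: ids[sig] for u, sig in signature.items()}

    # Links between super-nodes accumulate as weights of the reduced network.
    weight = defaultdict(int)
    for u, v in edges:
        a, b = sorted((super_node[u], super_node[v]))
        weight[(a, b)] += 1
    return super_node, dict(weight)
```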
There is wide flexibility within the RenEEL scheme. A base algorithm is used to find the partitions that initially form the ensemble $\mathcal{P}$ and those that form each ensemble $\mathcal{P}'$. The base algorithm can be any algorithm that finds a partition that maximizes an objective function. Multiple base algorithms can even be used. There is also freedom in choosing the values of $K$ and $L$, which are the maximum size of $\mathcal{P}$ and the size of each $\mathcal{P}'$, respectively. The best choice of base algorithm and of the values of $K$ and $L$ depends on the network being analyzed, the desired accuracy, and the available computational resources.
This paper investigates the effect of varying $K$ and $L$ on the performance of RenEEL. Larger values of $K$ and $L$ will typically lead to a final, consensus best partition with a larger value of the objective function [13]. But how does the value of the objective function of the consensus partition typically depend on $K$ and $L$? How quickly is the value found expected to approach its true maximum value? Given only limited computational resources, is it better to increase $K$ or $L$? We empirically study these questions when seeking the partition of three well-known real-world networks that maximizes the objective function Modularity.
2. Results
A commonly used approach to find structure in a complex network is to partition the nodes into communities that are more densely connected than expected in a random network. In this approach, the community structure corresponds to the partition that maximizes an objective function called Modularity [2,14]. For a given nodal partition $C$, the Modularity $q$ is defined as

$$ q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), \qquad (1) $$

where the sum is over all pairs of nodes $i$ and $j$, $c_i$ is the community of the $i$th node, and $m$ is the total number of links present in the network. $k_i$ and $A_{ij}$ are, respectively, the degree of the $i$th node and the $ij$th element of the adjacency matrix. Thus, Modularity is the difference between the fraction of links inside the partition's communities, the first term in Equation (1), and what the expected fraction would be if all links of the network were randomly placed, the second term in Equation (1). The task is to find the partition $C$ that maximizes $q$. We denote the maximum value of $q$ as $Q$, which is called "the Modularity" of the network.
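For concreteness, Equation (1) can be evaluated directly from the adjacency matrix; below is a minimal NumPy sketch (a dense-matrix illustration, not an efficient implementation for large sparse networks):

```python
import numpy as np

def modularity(A, c):
    """Modularity q of a partition, per Equation (1).

    A: symmetric 0/1 adjacency matrix (NumPy array);
    c: array of community labels, c[i] is the community of node i.
    """
    k = A.sum(axis=1)                  # node degrees k_i
    m = A.sum() / 2.0                  # total number of links m
    same = c[:, None] == c[None, :]    # delta(c_i, c_j) for all pairs
    return float(((A - np.outer(k, k) / (2 * m)) * same).sum() / (2 * m))
```

As a quick check, for a network of two triangles joined by a single link, grouping each triangle as a community gives $q = 6/7 - 1/2 \approx 0.357$.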
A number of algorithms with polynomial-time complexity have been developed to find a partition that maximizes Modularity. They range from very fast but not-so-accurate algorithms, such as the Louvain method [15] or randomized greedy agglomerative hierarchical clustering [16], to more accurate but slower algorithms [17], such as one that combines both agglomeration and division steps [18,19]. The accuracy of all of these algorithms tends to decrease as the size of the network increases.
All of the fast Modularity-maximizing algorithms are stochastic: at intermediate steps of their execution, there are seemingly equally good choices that are made at random. In the end, those choices can be consequential, because different runs of an algorithm, with different sets of random intermediate choices, can result in different solutions. Because of this, multiple runs of an algorithm, say 100, are often made to analyze a network, producing an ensemble of approximate partitions. The partition in the ensemble with the largest Modularity is then taken as the network's community structure, while all other partitions in the ensemble are discarded. RenEEL instead uses the information within the entire ensemble to find a more accurate partition.
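The conventional best-of-the-ensemble procedure that RenEEL improves upon amounts to a few lines. Here is a sketch using NetworkX's stochastic Louvain routine as a stand-in for the base algorithm (the study itself uses the randomized greedy algorithm of [16]):

```python
import networkx as nx
from networkx.algorithms import community

def best_of_runs(G, runs=100):
    """Run a stochastic Modularity maximizer many times; keep only the best."""
    best_q, best_partition = float("-inf"), None
    for seed in range(runs):
        partition = community.louvain_communities(G, seed=seed)  # one run
        q = community.modularity(G, partition)
        if q > best_q:                       # keep the largest-Modularity result
            best_q, best_partition = q, partition
    return best_q, best_partition            # all other partitions are discarded

G = nx.karate_club_graph()                   # small example network
print(best_of_runs(G)[0])
```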
Here, we use a RenEEL algorithm that has a randomized greedy base algorithm [16] to find the community structure by maximizing Modularity in real-world networks A, B, and C. Network A is the As-22july06 network [20]. It is a snapshot in time of the structure of the Internet at the level of autonomous systems. It has 22,963 nodes, which represent autonomous systems, and 48,436 links of data connection. Network B is the PGP network [20]. It is a snapshot in time of the giant component of the Pretty-Good-Privacy (PGP) algorithm user network. It has 10,680 nodes, which are the users of the PGP algorithm, and 24,316 links indicating the interactions among them. Lastly, Network C is the Astro-ph network. It is a coauthorship network of scientists in Astrophysics, consisting of 16,706 nodes representing scientists and 121,251 links representing coauthorship of preprints in the Astrophysics arXiv database [21].
For each of the three networks, 300 different runs of RenEEL were made for each combination of the values 10, 20, 40, 80, 160, and 320 for $K$ and $L$. The compute time required to find the consensus partition was measured for each run. The mean and standard errors of the compute times for the runs at a given combination of $K$ and $L$ were then calculated. The full results are listed in Table A1, Table A4 and Table A7 in Appendix A. For a fixed value of $L$ or $K$, we find that the mean compute time $\bar{t}$ increases asymptotically as a power of the other ensemble size,

$$ \bar{t} \simeq C\,K^{\beta_K} \ \ \text{(fixed } L\text{)}, \qquad \bar{t} \simeq C\,L^{\beta_L} \ \ \text{(fixed } K\text{)}. \qquad (2) $$

For example, Figure 1 shows this power-law behavior for Network A when $L$ and $K$ have fixed values of 80. Two-parameter, nonlinear least-squares fits to the data for ensemble sizes greater than 10 were then used to determine the proportionality constant $C$ and the exponents $\beta_K$ and $\beta_L$. Table 1a,b show the values of $\beta_K$ and $\beta_L$, respectively, that result from fits at different fixed values of $L$ and $K$ for each of the three networks. All statistical errors reported in this paper are $1\sigma$ standard errors.
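The two-parameter fits can be reproduced with standard nonlinear least squares. Below is a sketch using SciPy's curve_fit with synthetic stand-in data (the actual measured compute times are in Appendix A):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(K, C, beta):
    return C * K**beta                       # Equation (2), at fixed L

# Synthetic stand-in for the mean compute times at each K (fixed L);
# the fits in the paper use ensemble sizes greater than 10.
K_vals = np.array([20, 40, 80, 160, 320], dtype=float)
rng = np.random.default_rng(0)
t_mean = 0.5 * K_vals**1.8 * rng.normal(1.0, 0.03, K_vals.size)

(C, beta), cov = curve_fit(power_law, K_vals, t_mean, p0=(1.0, 2.0))
C_err, beta_err = np.sqrt(np.diag(cov))      # 1-sigma parameter errors
print(f"beta = {beta:.3f} +/- {beta_err:.3f}")
```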
The values of the exponents $\beta_K$ and $\beta_L$ vary weakly with the values of $L$ and $K$, respectively. The standard errors of $\beta_K$ and $\beta_L$ tend to remain consistent between smaller and larger ensemble sizes. The distribution of compute times, as depicted in Figure A1 for Network A, does not follow a normal distribution. Consequently, increasing the ensemble size does not lead to a decrease in the standard error. For each network, however, the value of $\beta_K$ is significantly larger than $\beta_L$. Thus, the expected compute time increases faster with $K$ than with $L$. Given that larger values of $K$ and $L$ typically lead to a better result, i.e., a consensus partition with a larger $Q$, one might naively conclude that it is better to increase $L$ rather than $K$. But, to determine whether that conclusion is, in fact, correct, the way that $Q$ increases with $K$ and $L$ must be taken into account.
To this end, we begin by noting that for any finite-size network, there is only a finite number of possible partitions. Many modularity-maximizing algorithms will consistently find the actual best partition of very small networks. As the network size and the number of possible partitions grow, the task becomes harder; algorithms start to fail to find the exact solution and only provide estimates of the actual, or exact, best partition. RenEEL appears to perform very well at finding the actual best partition of networks with sizes of up to a few thousand nodes [13]. Still, even RenEEL can only find estimates of the exact best partition of larger networks, such as the three we analyze in this paper. As the values of $K$ and $L$ increase, the estimates improve, and the value of $Q$ of the consensus partition approaches $Q_{\max}$, the Modularity of the exact best partition. To explore how the values of $Q$ of RenEEL's consensus partitions approach $Q_{\max}$ as a function of $K$ and $L$, the mean and standard errors of $Q$ found in the runs that were made on each network were calculated as a function of $K$ and $L$. The results are listed in Table A2, Table A3, Table A5, Table A6, Table A8 and Table A9 in Appendix A. For a fixed value of $L$ or $K$, we find that $Q$ approaches a maximum value, $Q_{\max}$, as a power-law of the other ensemble size,

$$ \bar{Q} \simeq Q_{\max} - A_K\,K^{-\alpha_K} \ \ \text{(fixed } L\text{)}, \qquad (3) $$

$$ \bar{Q} \simeq Q_{\max} - A_L\,L^{-\alpha_L} \ \ \text{(fixed } K\text{)}, \qquad (4) $$

where the $A$s are constants.
Figure 2 shows this behavior for Network A when $L$ and $K$ have fixed values of 80. The exact value of $Q_{\max}$ is unknown for Networks A, B, and C. Three-parameter, nonlinear least-squares fits were used to determine the values of $Q_{\max}$, $A$, and $\alpha$. Table 2a,b list the values of $Q_{\max}$ and $\alpha_K$, respectively, that result from fits at fixed values of $L$ for each of the three networks. Similarly, Table 3a,b list the values of $Q_{\max}$ and $\alpha_L$ that result from fits at fixed values of $K$ for each of the three networks.
The fitted values of $Q_{\max}$ increase systematically with increasing $L$ and $K$ and converge to statistically equivalent values at the largest ensemble sizes studied (320), regardless of whether $L$ or $K$ is increased. However, the values of $Q_{\max}$ are generally larger when fixing $L$ rather than $K$, when comparing results from when they are fixed at the same size. This implies that the maximum value of $Q$ is approached faster by fixing $L$ and increasing $K$ rather than the opposite. The rate of convergence to $Q_{\max}$ is quantified by the exponents. The values of the exponents $\alpha_K$ and $\alpha_L$ depend on the network but vary only weakly with the values of $L$ and $K$, respectively. For each network, however, the value of $\alpha_K$ is significantly larger than that of $\alpha_L$. Thus, the consensus $Q$ approaches $Q_{\max}$ faster with increasing $K$ than with increasing $L$.
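The three-parameter fits of Equation (3) can be performed in the same way as those of Equation (2); a sketch with hypothetical stand-in numbers (the measured values of $Q$ are in Appendix A):

```python
import numpy as np
from scipy.optimize import curve_fit

def q_approach(K, Q_max, A, alpha):
    return Q_max - A * K**(-alpha)           # Equation (3), at fixed L

# Hypothetical stand-in for the mean consensus Q at each K (fixed L).
K_vals = np.array([10, 20, 40, 80, 160, 320], dtype=float)
rng = np.random.default_rng(1)
Q_mean = 0.678 - 0.004 * K_vals**(-0.9) + rng.normal(0, 2e-6, K_vals.size)

p0 = (Q_mean[-1], 1e-3, 1.0)                 # starting guesses for the fit
(Q_max, A, alpha), cov = curve_fit(q_approach, K_vals, Q_mean, p0=p0)
Q_max_err, A_err, alpha_err = np.sqrt(np.diag(cov))
print(f"Q_max = {Q_max:.6f} +/- {Q_max_err:.6f}, alpha = {alpha:.2f}")
```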
To understand these results, recognize that finding the best partition is an extremal process that, when repeated, is akin to the process of record-breaking. Let us recall some of the theory of the extreme value statistics of record-breaking [22]. Consider a sequence of independent and identically distributed random numbers $\{x_1, x_2, \ldots\}$ chosen from a probability distribution of the form

$$ p(x) \propto (B - x)^{\delta}, \qquad 0 \le x \le B, \qquad (5) $$

where $B$ is the maximum possible value of $x$, and define the record $R_t$ as the maximum value of $x$ in the first $t$ numbers in the sequence:

$$ R_t = \max\{x_1, x_2, \ldots, x_t\}. \qquad (6) $$

Then, in the limit of large $t$, the mean record will approach $B$ as

$$ B - \langle R_t \rangle \sim t^{-1/(1+\delta)}, \qquad (7) $$

i.e., as a power-law function with an exponent of $1/(1+\delta)$. From this, we see that an exponent of 1 is a borderline case; from Equation (5), it is the case of a uniform distribution of $x$, i.e., $\delta = 0$. If $\delta < 0$, then $p(x)$ is maximal at $x = B$ and the exponent is greater than 1, and if $\delta > 0$, then $p(x)$ vanishes as $x \to B$ and the exponent is less than 1.
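The scaling in Equation (7) is straightforward to check numerically. Here is a sketch that draws sequences from $p(x) \propto (B - x)^{\delta}$ by inverse-transform sampling and measures the record-approach exponent:

```python
import numpy as np

B, delta = 1.0, 0.5          # delta > 0: p(x) vanishes at x = B
trials, t_max = 500, 5000
rng = np.random.default_rng(2)

# Inverse-transform sampling from p(x) ~ (B - x)**delta on [0, B].
u = rng.random((trials, t_max))
x = B * (1.0 - u**(1.0 / (1.0 + delta)))

records = np.maximum.accumulate(x, axis=1)   # R_t, the running maximum
gap = B - records.mean(axis=0)               # B - <R_t>

# The measured exponent should approach 1/(1 + delta) = 2/3 here.
t = np.arange(1, t_max + 1)
slope, _ = np.polyfit(np.log(t[100:]), np.log(gap[100:]), 1)
print(f"measured exponent {-slope:.3f}, predicted {1/(1+delta):.3f}")
```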
While the analogy with this simple, analytically tractable model of record-breaking is not perfect, Equation (3) can be compared with Equation (7) by identifying $Q_{\max}$ with $B$ and $\alpha_K$ with $1/(1+\delta)$, and similarly for Equation (4). Then, the fact that empirically $\alpha_K > 1$ and $\alpha_L < 1$ suggests that the distributions of the $Q$ of the consensus partitions found by increasing $K$ and $L$ correspond to different cases. Namely, as $K$ is increased, the consensus partition $Q$ is likely to be near $Q_{\max}$, as it is for $\delta < 0$, while as $L$ is increased, it is more likely to have a smaller value, as it is for $\delta > 0$. To confirm this, we made 800 runs of RenEEL analyzing Network A, once with a large $K$ and a small $L$ and once with a small $K$ and a large $L$. Figure 3 shows the consensus values of $Q$ found in those runs. As expected, the values found with large $K$ and small $L$ (red bars) are much more likely to be near the maximum value than those found with small $K$ and large $L$ (blue bars).
We can now answer the central question of this paper: Given only limited computational resources, is it better to increase $K$ or $L$? We have found that the average compute time grows faster with $K$ than with $L$, but also that the consensus $Q$ approaches $Q_{\max}$ faster with $K$ than with $L$. Does the consensus $Q$ approach $Q_{\max}$ faster as a function of compute time by increasing $K$ or $L$? To answer this, we invert Equation (2), giving $K \propto \bar{t}^{\,1/\beta_K}$, and combine it with Equation (3) to obtain

$$ Q_{\max} - \bar{Q} \propto \bar{t}^{\,-\alpha_K/\beta_K}, \qquad (8) $$

and similarly for $L$ with exponent $\alpha_L/\beta_L$. So, the larger the ratio $\alpha/\beta$ is, the faster $Q$ approaches $Q_{\max}$ as a function of average compute time. Table 4a shows the values of $\alpha_K/\beta_K$ at different fixed $L$ for the three networks. Similarly, Table 4b shows the values of $\alpha_L/\beta_L$ at different fixed $K$ for the three networks.
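For completeness, the ratios in Table 4 combine two independently fitted exponents; below is a small sketch of the ratio and its propagated $1\sigma$ error, assuming the two fit errors are independent (hypothetical input values, for illustration only):

```python
import math

def exponent_ratio(alpha, alpha_err, beta, beta_err):
    """alpha/beta with its 1-sigma error from standard error propagation,
    assuming the two fitted exponents have independent errors."""
    r = alpha / beta
    r_err = r * math.sqrt((alpha_err / alpha)**2 + (beta_err / beta)**2)
    return r, r_err

# Hypothetical fitted exponents, for illustration only.
print(exponent_ratio(1.25, 0.05, 1.60, 0.04))   # -> (0.78125, ~0.037)
```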
From these results, it can be clearly concluded that increasing $K$ rather than $L$ will cause $Q$ to approach $Q_{\max}$ faster. With limited computational resources, it is therefore better to increase $K$ rather than $L$. Although we have shown this only for three example networks and only when maximizing Modularity, we speculate that these networks are not special and that maximizing Modularity, rather than a different objective function, is also not special. Therefore, the conclusion that it is more computationally efficient to increase $K$ rather than $L$ in RenEEL should be generally true. Nevertheless, it would be interesting to explore this question when maximizing other objective functions with RenEEL.