Multi-Threading a State-of-the-Art Maximum Clique Algorithm

: We present a threaded parallel adaptation of a state-of-the-art maximum clique algorithm for dense, computationally challenging graphs. We show that near-linear speedups are achievable in practice and that superlinear speedups are common. We include results for several previously unsolved benchmark problems


Introduction
A clique in a graph is a set of vertices, each of which is adjacent to every other vertex in this subset.Deciding whether a graph contains a clique of a given size is one of the archetypal NP-complete problems [1].We consider the optimisation variant-known as the maximum clique problem-and focus upon dense, computationally challenging graphs [2].Within computing science, applications include computer vision and pattern recognition; beyond, they extend to mathematics, biology, biochemistry, electrical engineering and communications [3,4].
Multi-core machines are now the norm [5].Our goal is to adapt a state-of-the-art maximum clique algorithm to make use of the multi-core parallelism offered by today's processors.We look either to produce a speedup or to allow the tackling of larger problems within a reasonable amount of time [6].

Preliminaries
We represent a graph as a pair of finite sets, (V, E).The elements of V are known as vertices.The elements of E are edges, represented as pairs, (v 1 , v 2 ) ∈ V × V .We call vertices v 1 and v 2 adjacent if (v 1 , v 2 ) ∈ E. Throughout, we assume that our graphs are undirected, that is, (v 1 , v 2 ) ∈ E ⇒ (v 2 , v 1 ) ∈ E, and contain no loops, that is, for all (v 1 , v 2 ) ∈ E, we have The neighbourhood of a vertex, v, in a graph, G, is the set of vertices adjacent to v, denoted N(G, v).The degree of a vertex is the size of its neighbourhood.
A graph G = (V , E ) is called a subgraph of a graph G = (V, E) if V ⊆ V and E ⊆ E. The subgraph induced by V is the subgraph with vertices, V , and all edges between those vertices.A graph G = (V, E) is called complete if all its vertices are adjacent-that is, for every distinct pair of vertices, v 1 and v 2 , we have (v 1 , v 2 ) ∈ E. A complete subgraph is known as a clique.We may represent a clique by the vertex set that induces it, and we define the size of the clique to be the size of this vertex set.A clique is maximum if there is no clique with a larger size.We denote the size of a maximum clique in a graph by ω.
A colouring of a graph is an assignment of colours to vertices, such that adjacent vertices are given different colours.Computing a minimal colouring is NP-hard, but a greedy colouring may be obtained in polynomial time.

Algorithms for the Maximum Clique Problem
The starting point for our work is a series of algorithms of Tomita [7][8][9].These are sequential branch and bound algorithms using a greedy graph colouring as a bound and as an ordering heuristic.San Segundo observed that bit parallelism may be used to improve the performance of these algorithms without altering the steps taken [10,11].A recent computational study analyses these algorithms [12].We begin by outlining one variant.
Let G = (V, E) be a graph and v ∈ V .We observe that any clique, C, in G either does not contain v or contains only v and possibly vertices adjacent to v.This provides the basis for the branching part of a branch and bound algorithm: incrementally construct the set, C (initially empty), by choosing a candidate vertex, v, from the candidate set, P (initially, all of the vertices in V ) and, then, adding v to C. Having chosen a vertex, the candidate set is updated, removing vertices that cannot participate in the evolving clique.If the candidate set is empty, then C cannot grow further, so we backtrack; otherwise, the search continues, selecting from P and adding to C.
We keep track of the largest clique found so far, which we call the incumbent, and denote C max .For a bound, we make use of a greedy graph colouring: if we can colour the vertices in P using k colours, we know we cannot extend the size of C by more than k.If this is not enough to unseat the incumbent, we may abandon the search and backtrack.
Producing a new graph colouring at every step is relatively costly.Tomita's key observation is that if we are given a colouring of a graph, we know not only that we may colour the entire graph using a certain number of colours, but also how to colour certain subgraphs.We demonstrate this in Figure 1 on the following page.Colour classes have been filled greedily: we colour vertex 1 with the first colour, blue, which prevents us from giving vertices 2, 3 or 4 the same colour.We then colour vertex 5 blue, which prevents us from colouring vertex 6 blue.We now start a second colour class (green), and reconsidering uncoloured vertices, we colour vertex 2, then vertex 3. Finally, we colour vertex 4 and, then, vertex 6 in the third colour, yellow.Thus the entire graph can be three-coloured, and we may colour the subgraph induced by {1, 5, 2} or {1, 5, 2, 3} using only two colours, the subgraph induced by {1, 5} using only one colour, and so on.Consequently, as well as a bound, this process gives us an ordering heuristic for selecting a vertex from P : select vertices in non-increasing colour order.We describe this colouring method formally in Algorithm 1.This algorithm produces the same ordering and bounds as the NUMBER-SORT procedure from Tomita's MCQ [7], but in the manner of San Segundo's BBColour procedure in the bitset encoded BBMC [10].The algorithm iteratively builds colour classes (lines 9 to 15), where a colour class is an independent set (non-adjacent vertices) of similarly coloured vertices.Line 15 updates the current set of uncoloured vertices, Q, to be the set of vertices not adjacent to the most recently coloured vertex, v.The algorithm delivers as a result a pair (colour, order), where the vertex, order[i], has been coloured colour[i] and colour Algorithm 1: Sequentially colour vertices and sort into non-decreasing colour order.We may now discuss Algorithm 2, our baseline sequential algorithm.Following Prosser [12], we opt to order vertices in a non-increasing degree order at the top of the search, use a static ordering throughout, use a bitset encoding and do not include a colour repair mechanism-thus, our algorithm is a bitset encoded version of MCSa1 (although the techniques presented are not sensitive to this choice).
Algorithm 2: An exact algorithm to deliver a maximum clique.Procedure maxClique permutes the graph, such that the vertices are in non-increasing degree order (and this order is exploited by the sequential colouring), then calls the recursive search procedure, expand, on the graph, G, with an initially empty growing clique and a full candidate set.Procedure expand (lines 7 to 19) searches for cliques.First, the vertices in the candidate set, P , are coloured and sorted (line 9).The procedure then iterates from highest to lowest coloured vertex (line 10).If the size of the growing clique plus the colour number of the current vertex is sufficient to potentially unseat the incumbent, the search progresses (line 11).The ith vertex v, ordered by colour, is selected and added to the growing clique (lines 12 and 13).A new candidate set, P , is created, where P is the set of vertices in P adjacent to v. If C cannot grow any further (line 15) and is larger than the incumbent, it is saved (line 16).Otherwise, the search proceeds via a recursive call, with the enlarged clique, C, and reduced candidate set P (line 17).Regardless, having explored the option of taking vertex v (lines 12 to 17), the search then explores the option of not taking vertex v (lines 18 and 19, then iterating back to line 10).
Note that sets are encoded as bit strings, i.e., a bitset encoding similar to San Segundo's BBMC [10] is used.This allows the implementation to exploit bit parallelism.The graph should be encoded as an array of n bitsets (this encoding may be done when the graph is permuted).Thus, the call, N(G, v), in line 14 of Algorithm 2 delivers the vth adjacency bitset, and similarly, line 15 of Algorithm 1 delivers the complement of the vth adjacency bitset.To add or remove a vertex from a set, a bit is flipped, and to take the intersection of two sets, a logical-and operation is performed.
Other maximum clique algorithms exist.Recent work by Pattabiraman, Patwary, Rossi et al. [13,14] presents a maximum clique algorithm for sparse graphs and discusses parallelism.Our approach differs in that we primarily consider dense, computationally challenging graphs-on several of the benchmarks where Pattabiraman et al. aborted their run after 10,000 s, our sequential runtime is under one second.Although large sparse graphs have real world applications, we do not wish to neglect the really hard problems [2].
Pattabiraman et al. claim that Tomita's algorithms are "inherently sequential or otherwise difficult to parallelize".We agree with the first half of this statement.

Parallel Algorithm Design
We may view the branch and bound as being like a depth first search over a tree, where nodes in the tree represent recursive calls.There is a left-to-right dependency between nodes, since we do not know until after we have explored the leftmost subtree whether subtrees further to the right may be eliminated by the bound.However, we may speculatively ignore this dependency and explore subtrees in parallel, sharing the (size of the) incumbent between threads.This technique for branch and bound is generally well known [15,16], and previous experiments by the authors [17], and earlier work by Pardalos [18] suggested that the approach is feasible for a maximum clique.
Our parallel version of Algorithm 2 is presented as Algorithm 3 on the next page.We use threads for parallelism; adapting this approach to use message passing is unlikely to be trivial.For work distribution, we use a global queue of work items.Each queue item corresponds to a deferred recursive call to expand.The queue is initially populated by splitting the search tree immediately below the root, in the same order as the sequential algorithm would do.The queue may be treated as bounded when doing so, to avoid excessive memory usage.Worker threads take items from this queue and process them as if they were a sequential subproblem.There is a single variable for the incumbent that must be shared between worker threads-operations involving the incumbent require appropriate synchronisation.
The size of these subtrees can vary dramatically, and we cannot know in advance where the large trees are.Thus, without an additional work splitting mechanism, it is possible that the runtime of a single work item could dominate the overall runtime.We opt for a variation upon work stealing, which we call work donation: if the initial populating has completed, if the queue is empty and if there are idle workers, we set a globally visible flag advising workers that they may choose to place some of their subtrees onto the global queue.We provide a general illustration of this queueing mechanism in Figure 2 on page 624.
The advantage of this method is that it does not require workers to advertise what they are doing and does not require locking, except for rare queue accesses.The success of the sequential algorithm is in part due to its high throughput (we expect to be handling over 100,000 recursive calls per second on modern hardware), and we do not wish to introduce unnecessary overheads.We caution that care must be taken to ensure that worker threads do not exit until no more work can be enqueued (not simply when the queue is empty).
else if the populating thread is done, and q is empty, and there are idle workers then populate ← true if populate then enqueue(q, (C, P )) else expand(G, q, C, P , C max ) If the initial populating thread is done, and the queue is empty, and workers are idle, then busy worker threads may donate their current node and every node to its right onto the queue.

Goals of Parallelism
We define the speedup obtained from a parallel algorithm to be the runtime for a good sequential implementation divided by parallel runtime.We call a speedup of n from n cores linear and a speedup of greater than n superlinear.
Intuitively, we may expect that doubling the number of cores available could at best halve the runtime of the algorithm-that is, the speedup we obtain could at best be linear.However, for branch and bound algorithms, this is not the case, and superlinear speedup is possible [19][20][21][22].This is because we are not simply dividing up a fixed amount of work: if an additional thread finds a better incumbent quickly, we will have less overall work to do.On the flip side, we cannot guarantee that we will produce a speedup at all: it may be that some threads spend all their time exploring parts of the tree, which would have been eliminated in a sequential run, and, so, contribute nothing to the solution.
A parallel algorithm is work efficient if it does "the same amount of work" as a sequential algorithm.We ignore implementation overheads and focus upon a "representative operation", as is done when considering complexity; here, we count the number of recursive calls to expand.If an algorithm is not work efficient, we define its efficiency to be the ratio of work done by the sequential algorithm to work done by the parallel algorithm, expressed as a percentage.If the time taken to execute each "unit of work" is roughly equivalent, we would expect an efficiency of greater than 100% to be a necessary, but not sufficient, condition for obtaining a superlinear speedup (This is not entirely true, due to cache effects: it could be that the working memory will fit entirely in cache when using multiple cores, but not when using a single core.However, for our purposes, this effect is negligible.).
The goal of parallelisation is not necessarily to produce a speedup.We are also interested in being able to tackle larger problems within a reasonable amount of time [6].For this work, we consider it more important to be able to reduce a runtime from, say, days to hours, rather than from one second to one tenth of a second.
We must also be careful not to introduce the possibility of a slowdown.Ignoring overheads, this may be done by ensuring that one worker follows "the same path" (or a subset thereof) as the sequential version of the algorithm [23], which our queueing mechanism provides.We must also ensure that newly discovered incumbents are immediately visible to every worker.

Complications from Hyper-Threading
Hyper-threading "makes a single physical processor appear as two logical processors; the physical execution resources are shared, and the architecture state is duplicated for the two logical processors" [24].The system we will be using for experimental evaluations is hyper-threaded; this causes complications.
Firstly, this means that when going from using one thread per physical processor to one thread per logical processor, we should not expect to be able to double our speedup.The cited Intel literature suggests a performance increase of "up to 30%"-this figure is derived from benchmarks (which show performance increases of "21%" and "16 to 28%"), not theory.Taken at face value, this means that a speedup of around 15.6 on the 12-core, hyper-threaded system that we describe later should be considered "linear".
Secondly, and more problematically, this means that if two (software) threads are running on the same physical processing core, each will run more slowly than if it had the core to itself [25].Because we are not executing a fixed amount of work on each thread, this can lead to a slowdown anomaly-this is a variation of what Bruin et.al. describe as the "[danger of increasing] the processing power of a system by adding a less powerful processing element" [23].We will assume that so long as the number of threads we use is no greater than the number of physical processing cores, the operating system scheduler will be smart enough to allow us to ignore this issue.For larger numbers of threads, we proceed with the understanding that this could possibly make matters worse, not better (The same problem arises if we use more worker threads than can be executed simultaneously.This, however, is directly under our control.).

Implementation
We implemented the sequential and threaded algorithms in C++, using C++11 threads [26].Since we are using shared memory parallelism and since the graph data structure is not modified during execution, we may share the graph between threads to save memory (and improve cache performance).C++ provides us with strong enough guarantees to do this without locking.Sharing the incumbent requires more attention.To avoid the possibility of a slowdown, improved incumbents must be made visible to every worker.To simplify this problem, we note that we only need to share the size of the incumbent, not what its members are.Thus, we need only share a single integer and can retrieve the actual maximum clique when joining threads.We make use of an atomic to do this.
The number of worker threads to use is left as a configuration parameter-this allows experiments to be performed evaluating scalability.We do not count the initial populating thread towards the thread count, as it is expected that the amount of work performed for the population will be a very small part of the overall work.We also do not count the main program thread, which just spawns, then joins, child threads.No attempt was made to parallelise the initial preprocessing and sorting of the graph.

Experimental Data and Methodology
We work with three sets of experimental data.The first set, from the DIMACS Implementation Challenges [27], contains a smörgåsbord of random and real-world problems of varying difficulty-some can be solved in a few milliseconds, whilst others have not been solved at all.The second set consists of "Benchmarks with Hidden Optimum Solutions for Graph Problems" [28].Each of these contains a maximum clique of a known size that has been hidden in a way intended to make it exceptionally hard to find.Finally, we consider random graphs.
For timing results, following standard practice, we exclude the time taken to read the graph from the file.We include the time to do preprocessing on the graph (this is not entirely standard, but we consider it the more realistic approach).We measure the wall-clock time until completion of the algorithm.In the case of threaded algorithms, we include the time taken to launch and join threads and to accumulate results as part of the runtime.When giving speedups, we compare threaded runtimes against the sequential algorithm, not against the threaded algorithm running with a single thread, and all experiments have actually been performed on real hardware and are not simulations.We also spend the following section verifying that our implementation of the sequential algorithm is competitive with published results.In other words, our speedup figures measure what we can genuinely gain over a state-of-the-art implementation.
Except where otherwise noted, experiments are performed on a computer with two 2.4 GHz Intel Xeon E5645 processors.Each of these processors has six cores, and hyper-threading is enabled, giving a total of twelve "real" cores or twenty-four hardware threads.To get a better view of scalability, we report results using four, eight, twelve and twenty-four worker threads.Due to the nature of hyper-threading, we should not expect speedup to double when going from twelve to twenty-four threads.

Comparison of Sequential Algorithm to Published Results
The source code for the implementation by San Segundo [11] is not publicly available.However, we obtained access to a machine with the same CPU model as was used to produce the published results and compared performance for the "brock400" family of graphs from DIMACS.Although our algorithm is not identical, due to differences in initial vertex ordering, we see from Table 1 that runtimes are comparable.The final column presents runtimes using Prosser's BMCS1 (which is identical to our sequential algorithm, but coded in Java) on the same machine; we are always faster by a factor of greater than 2.5.
Table 1.Comparison of runtimes (in seconds) for our sequential implementation with San Segundo's published results for BBMCI [11] (which differs slightly from our algorithm) and with runtimes using Prosser's BBMC1 [12] (which is the same as our algorithm, but in Java).The system used to produce these results has the same model CPU as was used by San Segundo.Experimental results on graphs from DIMACS are presented in Table 2 and from BHOSLIB in Table 3.The DIMACS benchmarks include a wide variety of problem sizes and difficulties.Problems that took under one second to solve sequentially are omitted.We attempted every DIMACS instance at least with 24 threads.Some problems took over a day to solve with 24 threads, and for these, we indicate the largest clique we found in that time.Blank entries in the table indicate problems that were not attempted with that number of threads.

Problem
For four of the problems, we present results that took much longer to calculate.We did not have exclusive access to the machines in question when producing these results, and other CPU-intensive tasks were sometimes being run at the same time.Thus, these runtimes should be considered as an upper bound, rather than an indication of performance.We believe the proofs of optimality for "p hat1500-3" and "MANN a81" are new, although in both cases, the maximum clique had already been found (but was not known to be maximum) by heuristic methods.These proofs may be useful to those wishing to evaluate the strength of non-exact algorithms and demonstrate that the tried and tested approach of "throwing more CPU power at a problem" need not be abandoned now that we have more cores rather than more MHz.
Our worst speedups are 3.3 from 4 threads, 6.7 from 8 threads and 7.7 from 12 threads, all from problems where the parallel runtime is under one second.We nearly always produce a near-linear speedup, and we get superlinear speedups for half of the problems.Some of these superlinear speedups are substantial: our best speedups are 19.5 from 4 threads, 75.3 from 8 threads and 102.2 from 12 threads.For the BHOSLIB benchmarks, which are designed to be exceptionally hard to solve, our results are particularly consistent: our speedups are nearly always superlinear, with a speedup of between 11.8 and 15.0 from 12 threads.For the omitted trivial problems, we sometimes got a speedup and sometimes got a slowdown, due to overheads.For larger problems, we do not see any signs of scalability problems when using all of the cores available to us.Table 2. Experimental results for DIMACS instances.Shown are the size of a maximum clique, then sequential runtimes and the number of search nodes (calls to expand).Next is parallel runtimes, speedups and efficiency using 4 to 24 threads on a 12-core hyper-threaded system.Superlinear speedups and efficiencies greater than 100% are shown in bold; blanks indicate unattempted problems.Problems where the sequential run took under one second are omitted.When using 24 threads, we must worry about hyper-threading; nonetheless, our speedups typically (but not always) improve over those for 12 threads, and even when they do not, a speedup is still obtained.The "gen400 p0.9 75" graph benefits strongly from hyper-threading: with 24 threads, a maximum clique is found (but not proven) after 3,757 s by taking the 17th, 8th, 3rd and 4th choices given by the heuristic, followed by the first choice thereafter.With 12 threads, it takes 11,536 s to find this same maximum, resulting in a lower speedup.On the other hand, "gen400 p0.9 65" is worse with 24 threads: here, it takes 16,288 s rather than 14,868 s to find a maximum, in both cases by taking the 11th, 7th, 3rd and 2nd heuristic choices.This is because hyper-threading provides increased parallelism, at the expense of making each individual thread run more slowly [25], and our additional threads do not reduce the amount of work needed to find a maximum.In the worst case, this property of hyper-threading could even cause a slowdown, although we have never observed this.
As well as speedups, we present efficiencies.We see that, as expected, superlinear speedups are due to efficiencies of greater than 100%.The converse does not always hold, due to overheads.
The "C2000.5"instance from DIMACS provides an indication of how effective our work splitting mechanism is.Here, we see that the efficiency is in each case 100% (at least up to rounding).A speedup of 11.9 from 12 processors would be considered good even for an "embarrassingly parallel" algorithm with a regular work item size, which we do not have.The speedup of 15.4 from 24 threads shows that hyper-threading provides some increase in throughput-in this case, we almost exactly obtain Intel's claimed benefits of "up to 30%" [24].

Threaded Experimental Results on Larger Sparse Random Graphs
Experimental results on larger, sparser random graphs are presented in Table 4 on the next page.Here, G(n, p) refers to an Erdős-Réyni random graph with n vertices and edges between each pair of vertices chosen independently with probability p.We see that for G(1000, 0.1), where the average sequential runtime is 18 ms, we fail to produce any speedup with 4 threads, and we introduce an increasingly large slowdown as the number of threads rises.With G(1000, 0.2), where the average sequential runtime is 65 ms, we do manage to produce a speedup of 2.2 with 4 threads, but this falls to 1.8 with 24 threads.For G(1000, 0.3) and G(3000, 0.1), where sequential runtimes are under a second, our speedups are modest.This is due to the overheads involved in creating and joining threads and the initial sequential portion of the algorithm whose effects cannot be ignored at this scale.For the remaining problems, where sequential runtimes are longer, our results are better: we achieve average speedups between 3.9 to 4.0 from 4 threads, between 7.1 and 7.9 from 8 threads, between 9.9 and 11.3 from 12 threads and between 11.6 and 15.6 from 24 threads.
We caution that the sequential algorithm we used is not the best choice for sufficiently large and sparse graphs.The benefits of using a bitset encoding decrease as graphs become increasingly sparse, and eventually, it becomes a penalty rather than an improvement [12].The bitset encoding also forces an adjacency matrix representation of the graph, which requires quadratic memory, even for sparse graphs.For dealing with huge, but very sparse, "real world" graphs, where the difficulty is primarily due to the size of the graphs rather than the computational complexity, we would recommend considering a different algorithm, which can use an adjacent list representation.speedups here, too (One may ask why we do not then use more threads than we have cores, to make this more likely to happen.Such an approach would sometimes be of benefit, but only if we are prepared to accept the possibility of a considerable slowdown in other cases.). Figure 3.Total CPU time spent (i.e., runtimes multiplied by the number of threads) for "san400 0.9 1" from DIMACS with varying numbers of threads.Also shown is the total CPU time taken to find a maximum clique, but not to prove its optimality.A linear speedup would give a horizontal line, and downwards sloping lines are superlinear.

Conclusion and Future Work
Despite some of the more pessimistic claims that have been made in the literature regarding the suitability of sequential maximum clique algorithms for parallelisation, we have shown that making use of multi-core parallelism for hard clique problems is possible and worthwhile.This is important for two reasons.Firstly, existing work on local parallelism using bitset encodings has produced a speedup of between two and twenty over the basic algorithm [10,12].We have shown that a further speedup of around the same magnitude is possible by making use of the resources offered by today's multi-core processors.
Secondly, we have shown that superlinearity can happen in practice and not just as a rare event.Furthermore, when it does happen, the effects can be extremely significant-we are able to make some "hard" instances "easy".We conjecture that this is due to the method used to split the work, where we use parallelism to avoid an overly strong commitment to an early heuristic.Had we not used this approach, we expect we would be repeating Lai and Sahni's claim that "our experimental results indicate that such anomalous behaviour will be rarely witnessed in practice" [19].
We expect that this work can be extended in the future.Recently, Batsyn et al. have proposed a number of improvements to Tomita's algorithms [29].These changes are all compatible with our approach and also appear to be amenable to bit-parallelisation.We expect that by combining this improved algorithm with bit parallelism and extending our global parallelism approach to be usable with multiple many-core Intel Xeon Phi processors, we may be able to solve yet more of the remaining open DIMACS problems.Additionally, of course, we will be interested to find out whether other hard problems may be tackled using a similar approach.

Figure 1 .
Figure 1.A small graph, together with a constructive colouring.

Algorithm 3 :
A threaded algorithm to deliver a maximum clique threadedMaxClique :: (Graph G) → Set of Integer begin shared C max ← ∅ shared q ← an empty Queue of (Set, Set) permute G so that the vertices are in non-increasing degree order launch the populating thread do expand(G, q, ∅, V(G), C max ) launch multiple worker threads do while there is work left do (C, P ) ← dequeue(q) expand(G, q, C, P, C max ) join all threads return C max (unpermuted) expand :: (Graph G, Queue of (Set, Set) q, Set C, Set P , Set C max ) begin populate ← true if we are the populating thread, and |C| = 1, otherwise false (colour, order) ← colourOrder(G, P )

Figure 2 .
Figure 2. Work splitting and queueing mechanism for our parallel algorithm.Nodes correspond to a call to expand.Work donation occurs once when the donating worker's position is at the node marked .
but not prove optimality

Table 3 .
Experimental results for BHOSLIB instances.Shown are the size of a maximum clique, then sequential runtimes and the number of search nodes (calls to expand).Next is parallel runtimes, speedups and efficiency using 4 to 24 threads on a 12-core hyper-threaded system.Superlinear speedups and efficiencies greater than 100% are shown in bold.