Faster and Better Nested Dissection Orders for Customizable Contraction Hierarchies

Abstract: Graph partitioning has many applications. We consider the acceleration of shortest path queries in road networks using Customizable Contraction Hierarchies (CCH). CCH preprocessing is based on computing a nested dissection order by recursively dividing the road network into parts. Recently, with FlowCutter and Inertial Flow, two flow-based graph bipartitioning algorithms have been proposed for road networks. While FlowCutter achieves high-quality results and thus fast query times, it is rather slow. Inertial Flow is particularly fast due to the use of geographical information while still achieving acceptable quality. We combine the techniques of both algorithms to achieve more than six times faster preprocessing than FlowCutter and even slightly better quality. We show that using 16 cores of a shared-memory machine, this preprocessing needs four minutes on the Europe road network.


Introduction
The goal of graph partitioning is to divide a graph into a given number of roughly equally sized parts by removing a small number of edges or nodes. Graph partitioning has many practical applications such as accelerating matrix multiplication, dividing compute workloads, image processing, VLSI design and, the focus of this work, accelerating shortest path computations in road networks. For an overview of the state of the art in graph partitioning we refer the reader to a survey article [1].
Modern speedup techniques for shortest path computation in road networks usually achieve fast queries through three phases: an expensive preprocessing phase, which builds a metric-independent index data structure; a customization phase, which incorporates the metric (e.g., travel time, walking distance, or real-time traffic information) into the index; and a query phase, which uses the index to answer queries very quickly. For such three-phase approaches, the preprocessing phase typically includes a graph partitioning step, e.g., a hierarchy of nested partitions or a nested dissection [2] order.
Contraction Hierarchies simulate contracting all nodes in a given order and insert shortcut arcs between the neighbors of the contracted node. These represent paths via the contracted nodes. Shortest s-t path queries are answered by, e.g., a bidirectional Dijkstra search [3] from s and t, which only considers shortcut and original arcs to higher-ranked nodes. Thus, nodes which lie on many shortest paths should be ranked highly in the order. Customizable Contraction Hierarchies [4] use contraction orders computed via recursive balanced node separators (nested dissection) in order to achieve a logarithmic search space depth with few added shortcuts. Node separators are considered to lie on many shortest paths, as any path between the components crosses the separator. The weights of the contraction hierarchy can then be quickly customized to any metric, which allows, e.g., incorporating real-time traffic information. The running time needed for the customization and the shortest path queries depends on the quality of the calculated order. Previously proposed partitioning tools for computing separators in road networks include FlowCutter [5], Inertial Flow [6], KaHiP [7], Metis [8], PUNCH [9] and Buffoon [10]. KaHiP and Metis are general-purpose graph partitioning tools. PUNCH and Buffoon are special-purpose partitioners, which aim to use geographical features of road networks such as rivers or mountains. Rivers and mountains form very small cuts and were dubbed natural cuts in [9]. PUNCH identifies and deletes natural cuts, then contracts the remaining components, and subsequently runs a variety of highly randomized local search algorithms. Buffoon incorporates the idea of natural cuts into KaHiP, running its evolutionary multilevel partitioner instead of the flat local searches of PUNCH. In [5] it was shown that FlowCutter is also able to identify and leverage natural cuts. Inertial Flow is another special-purpose partitioner, which directly uses the geographic embedding of the road network.
We combine the idea of Inertial Flow to use geographic coordinates with the incremental cut computations of FlowCutter. This allows us to compute a series of cuts with suitable balances much faster than FlowCutter while still achieving high quality. In an extensive experimental evaluation, we compare our new algorithm InertialFlowCutter to the state of the art. FlowCutter is the previously best method for computing CCH orders. InertialFlowCutter computes slightly better CCH orders than FlowCutter and is a factor of 6.1 and 6.9 faster on the road networks of the USA and Europe, respectively, our two most relevant instances. Using 16 cores of a shared-memory machine we can compute CCH orders for these instances in four minutes.
In Section 2 we briefly present the existing Inertial Flow and FlowCutter algorithms and describe how we combined them. In Section 3 we describe the setup and results of our experimental study. We conclude with a discussion of our results and future research directions in Section 4.

Materials and Methods
After introducing preliminaries, we describe the existing bipartitioning algorithms FlowCutter and Inertial Flow on a high level, before discussing how to combine them into our new algorithm InertialFlowCutter. We refer the interested reader to [5] for implementation details and a more in-depth discussion of the FlowCutter algorithm. Then we discuss our application Customizable Contraction Hierarchies (CCH), what makes a good CCH order, and how we use recursive bisection to compute them. This paper recreates the experiments from [5] and reuses much of the same setup. Therefore, there is substantial content overlap. For self-containedness, we repeat the parts we use and clearly state our new contributions.

Preliminaries
An undirected graph G = (V, E) consists of a set of nodes V and a set of edges E ⊆ $\binom{V}{2}$. A directed graph G = (V, A) has directed arcs A ⊆ V × V instead of undirected edges. It is symmetric iff for every arc (x, y) ∈ A the reverse arc (y, x) ∈ A exists. For ease of notation, we do not distinguish between undirected and symmetric graphs in this paper, and we use them interchangeably, whichever better suits the description. Let n := |V| denote the number of nodes and let m := |E| denote the number of edges of an undirected graph. All graphs in this paper contain neither self-loops (x, x) nor multi-edges.
For a node set U ⊆ V, G[U] denotes the node-induced subgraph, i.e., the graph with nodes U and all arcs/edges of G with endpoints in U. The degree deg(x) = |{(x, y) ∈ A}| is the number of outgoing arcs of x. A path is a sequence of edges such that consecutive edges overlap in a vertex. A graph is called k-connected iff there are k node-disjoint paths between every pair of nodes. The k-connected components of a graph are the node-induced subgraphs which are inclusion-maximal regarding k-connectivity. 1-connected components are called connected components, 2-connected components are called biconnected components.

Separators and Cuts
We often use the terms cut and bipartition interchangeably. Sometimes we say a bipartition is induced by a set of cut edges.
A node separator partition is a partition of V into three disjoint sets (Q, V_1, V_2) such that there is no edge between V_1 and V_2. We call Q the separator and V_1, V_2 the blocks or components of the separator. |Q| is the separator size. For ε ∈ [0, 1], a cut or separator is ε-balanced if max(|V_1|, |V_2|) ≤ ⌈(1 + ε) n/2⌉. We often call ε the imbalance, as larger values correspond to less balanced cuts. The balanced graph bipartitioning [balanced node separator] problem is to find an ε-balanced cut [separator] of minimum size.
Let S, T ⊂ V be two fixed, disjoint, non-empty subsets of V. An edge cut [node separator] is an S-T edge cut [node separator] if S ⊆ V_1 and T ⊆ V_2.

Maximum Flows
A flow network N = (V, A, S, T, c) is a simple symmetric directed graph (V, A) with two disjoint non-empty terminal node sets S, T ⊂ V, also called the source and target node set, as well as a capacity function c : A → ℝ≥0. A flow in N is a function f : A → ℝ subject to the capacity constraint f(a) ≤ c(a) for all arcs a, flow conservation ∑_{(u,v)∈A} f((u, v)) = 0 for all non-terminal nodes v, and skew symmetry f((u, v)) = −f((v, u)) for all arcs (u, v). In this paper we consider only unit flows and unit capacities, i.e., f : A → {−1, 0, 1}, c : A → {0, 1}. The value of a flow |f| := ∑_{s∈S, (s,u)∈A} f((s, u)) is the amount of flow leaving S. The residual capacity r_f(a) := c(a) − f(a) is the additional amount of flow that can pass through a without violating the capacity constraint. The residual network with respect to f is the directed graph N_f = (V, A_f) where A_f := {a ∈ A | r_f(a) > 0}. An augmenting path is an S-T path in N_f. A node v is called source-reachable if there is a path from S to v in N_f. We denote the set of source-reachable nodes by S_r, and define the set of target-reachable nodes T_r analogously. The flow f is a maximum flow if |f| is maximal among all possible flows in N. This is the case iff there is no augmenting path in N_f. The well-known max-flow-min-cut theorem [11] states that the value of a maximum flow equals the capacity of a minimum S-T edge cut. (S_r, V \ S_r) is the source-side cut and (V \ T_r, T_r) is the target-side cut of a maximum flow.

FlowCutter
FlowCutter is an algorithm for the balanced graph bipartitioning problem. The idea of its core algorithm is to solve a sequence of incremental max flow problems, which induce cuts with monotonically increasing cut size and balance, until the latest cut induces an ε-balanced bipartition. The flow problems are incremental in the sense that the terminal nodes S, T of the previous flow problem are subsets of the terminals in the next flow problem. This nesting allows us to reuse the flow computed in previous iterations.
Given starting terminal nodes s, t, we set S := {s}, T := {t} and compute a maximum S-T flow. Then we transform the source-reachable nodes S_r to sources if |S_r| ≤ |T_r|, or the target-reachable nodes T_r to targets otherwise. Assume |S_r| ≤ |T_r| without loss of generality. Now S induces a minimum S-T cut C_S. If C_S is ε-balanced, we terminate. Otherwise we transform one additional node, called the piercing node, to a source. The piercing node is chosen from the nodes incident to the cut C_S and not in S. This step is called piercing the cut C_S. It ensures we find a different cut in the next iteration. Subsequently, we augment the previous flow to a maximum flow which considers the new source node. These steps are repeated until the latest cut induces an ε-balanced bipartition.
A significant detail of the piercing step is that piercing nodes which are not reachable from the opposite side are preferred. Choosing such nodes for piercing does not create augmenting paths. Thus the cut size does not increase in the next iteration. This is called the avoid-augmenting-paths heuristic. A secondary distance-based piercing heuristic is used to break ties when the avoid-augmenting-paths heuristic gives multiple choices. It chooses the node p which minimizes dist(p, t) − dist(s, p), where dist is the hop distance, precomputed via Breadth-First-Search from s and t. Roughly speaking, this attempts to prevent the cut sides from meeting before perfect balance. It also has a geometric interpretation, which is explained in [5].
We choose the starting terminal nodes s and t uniformly at random. Experiments [5] indicate that 20 terminal pairs are sufficient to obtain high quality partitions of road networks.
For computing maximum flows, we use the basic Ford-Fulkerson algorithm [11], with Pseudo-Depth-First-Search for finding augmenting paths. Pseudo-Depth-First-Search directly marks all adjacent nodes as visited when processing a node. It can be implemented like Breadth-First-Search by using a stack instead of a queue.
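The following is a minimal sketch of Pseudo-Depth-First-Search as described above, not the FlowCutter implementation. It assumes the residual network is given as a plain adjacency dictionary `adj`; the function name `pseudo_dfs_path` and the data layout are hypothetical choices for illustration.

```python
def pseudo_dfs_path(adj, sources, targets):
    """Find some path from a source to a target, or return None.

    adj: dict mapping each node to a list of successors (assumed to be the
    arcs with positive residual capacity). As in BFS, a node is marked as
    visited the moment it is discovered, but the frontier is a stack.
    """
    targets = set(targets)
    parent = {s: None for s in sources}  # doubles as the visited set
    stack = list(sources)
    while stack:
        u = stack.pop()
        if u in targets:
            path = []
            while u is not None:  # walk parent pointers back to a source
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u     # mark on discovery, not when popped
                stack.append(v)
    return None
```

Replacing `stack.pop()` with a FIFO queue turns the same routine into Breadth-First-Search, which is the relationship the paragraph above points out.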
A major advantage of FlowCutter over other partitioning tools is the fact that it computes multiple cuts, which form a Pareto cutset after filtering dominated cuts. By this, we mean for every pair of cuts C_1, C_2 in the Pareto cutset, the cut C_1 either has fewer edges or has better balance than C_2. This means that we do not need to determine the maximum imbalance a priori, but we can select a good trade-off between cut size and imbalance from the Pareto cutset.
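Filtering dominated cuts can be sketched as follows. This is an illustration only, assuming each cut is summarized by a hypothetical pair `(cut_size, imbalance)` where smaller is better in both components; the function name `pareto_cutset` is not taken from any existing code.

```python
def pareto_cutset(cuts):
    """Return the non-dominated cuts, sorted by increasing cut size.

    cuts: iterable of (cut_size, imbalance) pairs. A cut is dropped if
    another cut is at least as good in both criteria.
    """
    front = []
    best_imbalance = float("inf")
    # After sorting, a cut survives only if it is strictly better balanced
    # than every cut that is at most as large.
    for size, imbalance in sorted(set(cuts)):
        if imbalance < best_imbalance:
            front.append((size, imbalance))
            best_imbalance = imbalance
    return front
```

In the resulting list, cut size increases while imbalance strictly decreases, which matches the trade-off described above.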

Inertial Flow
Given a line l ∈ ℝ², Inertial Flow orthogonally projects the nodes onto l, according to their geographical coordinates. The nodes are sorted by order of appearance on l. For a parameter α ∈ [0, 0.5] the first α · n nodes are chosen as S. Analogously, the last α · n nodes are chosen as T. In the next step, a maximum S-T flow is computed from which a minimum S-T cut is derived. Instead of line, we use the term direction. In [6], α = 0.2 and four directions are used: West-East, South-North, Southwest-Northeast and Southeast-Northwest. This simple approach works surprisingly well for road networks.
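The terminal selection of Inertial Flow can be sketched as below. This is a simplified illustration: the function name `inertial_flow_terminals`, the coordinate dictionary, and rounding α · n down to an integer are assumptions, and the subsequent maximum flow computation is omitted.

```python
def inertial_flow_terminals(coords, direction, alpha=0.2):
    """Choose source and target node sets by projecting onto a direction.

    coords: dict node -> (x, y); direction: (dx, dy) vector of the line.
    Returns (S, T) as sets of nodes.
    """
    dx, dy = direction
    # Sorting by the dot product with the direction vector gives the order
    # in which the orthogonal projections appear along the line.
    order = sorted(coords, key=lambda v: coords[v][0] * dx + coords[v][1] * dy)
    k = int(alpha * len(order))  # rounding down is an assumption
    return set(order[:k]), set(order[len(order) - k:])
```

For example, with ten nodes placed along the x-axis and the West-East direction (1, 0), α = 0.2 selects the two westernmost nodes as S and the two easternmost nodes as T.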

Combining Inertial Flow and FlowCutter into InertialFlowCutter
One drawback of Inertial Flow is the restriction to one value of α. We enhance FlowCutter by initializing S and T in the same way as Inertial Flow, however with a smaller parameter α than proposed for Inertial Flow. Additionally, we pierce cuts with multiple nodes from the Inertial Flow order at once. We call this bulk piercing. This way, we enumerate multiple Inertial Flow cuts simultaneously, without having to restart the flow computations. Furthermore, we can skip some of the first, highly imbalanced cuts of FlowCutter that are irrelevant for our application.
We introduce three additional parameters γ_a, γ_o ∈ (0, 0.5] and δ ∈ (0, 1) to formalize bulk piercing. Let L be a permutation of the nodes, ordered by projection according to a direction. For the source side, we use bulk piercing as long as S contains at most γ_a · n nodes. Further, we limit ourselves to piercing the first γ_o · n nodes of L. The parameter δ influences the step size. The idea is to decrease the step size as our cut becomes more balanced. When we decide to apply bulk piercing, we settle the next δ · ((1 − δ)/2 · n − |S|) nodes to S, when piercing the source side. To enforce the limit set by γ_o, we pierce fewer nodes if necessary. For the target side, we apply this analogously starting from the end of the order. If bulk piercing cannot be applied, we revert to the standard FlowCutter method of selecting single piercing nodes incident to the cut. Additionally, we always prioritize the avoid-augmenting-paths heuristic over bulk piercing.
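One source-side bulk piercing step might look as follows. This sketch reads the step size as δ · ((1 − δ)/2 · n − |S|); rounding up, the function name `bulk_piercing_batch`, and the position bookkeeping via `next_pos` are assumptions made for illustration, and the avoid-augmenting-paths priority is not modeled.

```python
import math

def bulk_piercing_batch(order, source_side, next_pos, gamma_a, gamma_o, delta):
    """Return (batch, new_next_pos) for one source-side bulk piercing step.

    order: nodes sorted by projection (the permutation L).
    source_side: set of nodes currently in S.
    next_pos: index in order of the first node not yet bulk pierced.
    An empty batch means bulk piercing does not apply and the caller
    should fall back to single piercing nodes.
    """
    n = len(order)
    limit = int(gamma_o * n)  # only the first gamma_o * n nodes of L
    if len(source_side) > gamma_a * n or next_pos >= limit:
        return [], next_pos
    # The step shrinks as S grows, i.e. as the cut becomes more balanced.
    step = max(0, math.ceil(delta * ((1 - delta) / 2 * n - len(source_side))))
    end = min(next_pos + step, limit)  # pierce fewer nodes if necessary
    batch = [v for v in order[next_pos:end] if v not in source_side]
    return batch, end
```

The target side would use the same routine on the reversed order.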

Running Multiple InertialFlowCutter Instances
To improve solution quality, we run q ∈ ℕ instances of InertialFlowCutter with different directions. An instance is called a cutter. We use the directions (cos(ϕ), sin(ϕ)) for ϕ = kπ/q and k ∈ {0, …, q − 1}. To include the directions proposed in [6], q should be a multiple of 4. To improve running time, we run cutters simultaneously in an interleaved fashion as already proposed in [5]. We always schedule the cutters with the currently smallest flow value to either push one additional unit of flow or derive a cut. For the latter, we improve the balance by piercing the cut as long as this does not create an augmenting path. One stand-alone cutter runs in O(cm), where c is the size of the largest output cut. Roughly speaking, this stems from performing one graph traversal, e.g., Pseudo-DFS, per unit of flow. The exact details can be found in [5]. Flow-based execution interleaving ensures that no cutter performs more flow augmentations than the other cutters. Thus, the running time for q cutters is O(qcm), where c is the size of the largest found cut among all cutters. Note that we specifically avoid computing some cuts that the stand-alone cutters would find.
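The direction vectors can be generated directly from the formula above. The helper name `cutter_directions` is hypothetical.

```python
import math

def cutter_directions(q):
    """Directions (cos(phi), sin(phi)) for phi = k * pi / q, k = 0, ..., q - 1."""
    return [(math.cos(k * math.pi / q), math.sin(k * math.pi / q))
            for k in range(q)]
```

For q = 4 this yields West-East, Southwest-Northeast, South-North and Southeast-Northwest (up to floating-point rounding), i.e. the four directions of [6]. Only half a turn is covered because a direction and its opposite give the same node order with S and T swapped.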
Note that it is important in the case of InertialFlowCutter to actually employ flow-based interleaving and not just run a cutter until the next cut is found, as after a bulk piercing step the next cut might be significantly larger. Consider the simple example with q = 2, where the second cutter immediately finds a perfectly balanced cut with cut size c but the first cutter only finds one cut with cut size C ≫ c. If the first cutter runs until a cut is found, we invest Cm work, but should only have invested cm. For road networks and FlowCutter, this difference is insignificant in practice, as the cut increases by just one most of the time.
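Flow-based interleaving can be illustrated with a toy scheduler. The class `ToyCutter` is a hypothetical stand-in that only counts flow units; a real cutter would push one unit of flow or derive a cut in `advance`.

```python
class ToyCutter:
    """Stand-in for a cutter: needs a fixed number of flow units to finish."""

    def __init__(self, name, units_needed):
        self.name = name
        self.flow = 0
        self.units_needed = units_needed

    @property
    def finished(self):
        return self.flow >= self.units_needed

    def advance(self):
        self.flow += 1  # placeholder for one flow augmentation


def run_interleaved(cutters):
    """Always advance an unfinished cutter with the smallest flow value."""
    trace = []
    active = [c for c in cutters if not c.finished]
    while active:
        cutter = min(active, key=lambda c: c.flow)
        cutter.advance()
        trace.append(cutter.name)
        active = [c for c in active if not c.finished]
    return trace
```

With this rule no unfinished cutter ever falls more than one flow unit behind another, so a cutter heading for a very large cut cannot consume work far beyond what the other cutters have spent.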

Customizable Contraction Hierarchies
A Customizable Contraction Hierarchy (CCH) is an index data structure which allows fast shortest path queries and fast adaptation to new metrics in road networks. It is used in three phases: a preprocessing phase, which only uses the network topology; a faster customization phase, which adapts the index to the weights of the edges; and a query phase, which quickly answers shortest path queries. The preprocessing phase simulates contracting all nodes in a given order and inserts shortcut arcs between all neighbors of the contracted node. The customization phase assigns correct weights to shortcuts by processing all arcs (u, v) in ascending order of the rank of u, i.e., the position of u in the order. To process an arc (u, v), it enumerates all triangles u, w, v where w has lower rank than u and v, and updates the weight of (u, v) if the path (u, w, v) is shorter. There are two different algorithms for s-t queries. The first, basic query algorithm performs bidirectional Dijkstra search from s and t and relaxes only arcs to higher-ranked nodes. The second query algorithm uses the elimination tree of a CCH to avoid priority queues, which are typically a bottleneck. In the elimination tree, the parent of a node is its lowest-ranked upward neighbor. The ancestors of a node v are exactly the nodes in the upward search space of v in the basic query [12]. For the s-t query, the outgoing arcs of all nodes on the path from s to the root and all incoming arcs of all nodes on the path from t to the root are relaxed. The node z minimizing the distance from s to z plus the distance from z to t determines the distance between s and t.
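The preprocessing and customization phases can be sketched for undirected weights as follows. This is a didactic illustration, not RoutingKit's implementation: the function names `contract` and `customize` and the dictionary-based storage are assumptions. Instead of looking up lower triangles per arc, the customization loop starts from the lowest-ranked node of each triangle, which visits the same triangles in a valid order.

```python
import math

def contract(n, edges, order):
    """Topology-only preprocessing: simulate contraction, add shortcuts.

    Returns (rank, up) where up[x] is the set of higher-ranked neighbors
    of x in the CCH (original edges plus shortcuts).
    """
    rank = [0] * n
    for r, v in enumerate(order):
        rank[v] = r
    up = [set() for _ in range(n)]
    for u, v in edges:
        lo, hi = (u, v) if rank[u] < rank[v] else (v, u)
        up[lo].add(hi)
    for x in order:
        nbrs = sorted(up[x], key=rank.__getitem__)
        for i, a in enumerate(nbrs):
            up[a].update(nbrs[i + 1:])  # neighbors of x become a clique
    return rank, up

def customize(rank, up, order, weighted_edges):
    """Assign weights; keys are (lower-ranked node, higher-ranked node)."""
    weight = {(lo, hi): math.inf for lo in range(len(up)) for hi in up[lo]}
    for u, v, w in weighted_edges:
        key = (u, v) if rank[u] < rank[v] else (v, u)
        weight[key] = min(weight[key], w)
    for x in order:  # ascending rank, so arcs leaving x are already final
        nbrs = sorted(up[x], key=rank.__getitem__)
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                via_x = weight[(x, a)] + weight[(x, b)]
                weight[(a, b)] = min(weight[(a, b)], via_x)
    return weight
```

For the path 0 - 1 - 2 with weights 1 and 2 and the order (1, 0, 2), contracting node 1 first creates the shortcut {0, 2}, which customization sets to weight 3.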
The query complexity is linear in the number of arcs incident to nodes on the paths from s and t to the root. Similarly, the customization running time depends on the number of triangles in the CCH. Fewer shortcuts result in less memory consumption and faster queries. We aim to minimize these metrics by computing high quality contraction orders.

Nested Dissection Orders For Road Networks
The framework to compute contraction orders is the same as for FlowCutter in [5]. For self-containedness we repeat it here. We only exchange the partitioning algorithm.

Recursive Bisection
We compute contraction orders via recursive bisection, using node separators instead of edge cuts. This method is also called nested dissection [2]. For a graph G, we compute a node separator partition (Q, V_1, V_2), recursively compute orders for G[V_1] and G[V_2], and return the order of G[V_1] followed by the order of G[V_2] followed by Q. Q can be in an arbitrary order. We opt for the input order. Recursion stops once the graphs are trees or cliques. For cliques, any order is optimal. For trees, we use an algorithm to compute an order with minimal elimination tree depth in linear time [13,14].
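The recursion can be sketched as follows. The separator routine is passed in as a parameter; `middle_node_separator` is a toy stand-in that only works for path graphs, and the base case is simplified (the text stops at trees and cliques and uses a dedicated tree algorithm). All names are hypothetical.

```python
def nested_dissection_order(nodes, adj, find_separator):
    """Order of G[nodes]: order of G[V1], then order of G[V2], then Q.

    find_separator(nodes, adj) must return (Q, V1, V2) for the subgraph
    induced by nodes.
    """
    nodes = list(nodes)
    if len(nodes) <= 2:  # simplified base case
        return nodes
    Q, V1, V2 = find_separator(nodes, adj)
    if len(V1) == len(nodes) or len(V2) == len(nodes):
        return nodes     # no progress, stop instead of recursing forever
    in_Q = set(Q)
    return (nested_dissection_order(V1, adj, find_separator)
            + nested_dissection_order(V2, adj, find_separator)
            + [v for v in nodes if v in in_Q])  # separator in input order


def middle_node_separator(nodes, adj):
    """Toy separator for a path graph whose nodes are numbered along the path."""
    nodes = sorted(nodes)
    mid = len(nodes) // 2
    return [nodes[mid]], nodes[:mid], nodes[mid + 1:]
```

On the path with nodes 0 to 6, the top-level separator {3} is placed last and the separators {1} and {5} of the two halves directly precede their parent block.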

Separators
InertialFlowCutter computes edge cuts. We use a standard construction [15] to model node capacities as edge capacities in flow networks, so that node separators correspond to edge cuts. It expands the undirected input graph G = (V, E) into a directed graph G′ = (V′, A′). For every node v ∈ V, there is an in-node v_i and an out-node v_o in V′, joined by a directed arc (v_i, v_o), called the bridge arc of v. Further, for every edge {u, v} ∈ E there are two directed external arcs (u_o, v_i) and (v_o, u_i) ∈ A′. Since we restrict ourselves to unit capacity flow networks, we cannot use infinite capacity for external arcs and our cuts contain both bridge arcs and external arcs. Bridge arcs directly correspond to a node in the separator. From the external cut arcs, the incident node on the larger side of the cut is included in the separator.
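The expansion and the mapping from cut arcs back to separator nodes can be sketched as below. The tuple encoding `("in", v)` and `("out", v)`, the function names, and the assumption that every cut arc is given with its tail on the source side are illustrative choices, not part of the original implementation.

```python
def expand_graph(nodes, edges):
    """Arcs of the expanded directed graph, all with unit capacity."""
    arcs = [(("in", v), ("out", v)) for v in nodes]  # bridge arcs
    for u, v in edges:
        arcs.append((("out", u), ("in", v)))         # external arcs
        arcs.append((("out", v), ("in", u)))
    return arcs


def separator_from_cut(cut_arcs, source_side_is_larger):
    """Translate cut arcs of the expanded graph into separator nodes.

    cut_arcs: arcs oriented from the source side to the target side.
    """
    separator = set()
    for (_, tail), (_, head) in cut_arcs:
        if tail == head:        # bridge arc (v_in, v_out): take v itself
            separator.add(tail)
        else:                   # external arc: take the endpoint on the
            separator.add(tail if source_side_is_larger else head)  # larger side
    return separator
```

Since the input graph has no self-loops, an arc whose endpoints carry the same node is always a bridge arc.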

Choosing Cuts from the Pareto Cutset
FlowCutter and InertialFlowCutter yield a sequence of non-dominated cuts with monotonically increasing cut size and balance, whereas other partitioners yield a single cut for some prespecified imbalance. We need to choose one cut, to recurse on the sides of the corresponding separator. The expansion of a cut is its cut size divided by the number of nodes on the smaller side. This gives a certain trade-off between cut size and balance. We choose the cut with minimum expansion and ε < 0.6, i.e., at least 20% of the nodes on the smaller side. While this approach is certainly not optimal, it works well enough. It is not clear how to choose the optimum cut without considering the whole hierarchy of cuts in deeper levels of recursion.
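The selection rule can be sketched as follows, assuming each cut is summarized by a hypothetical pair `(cut_size, smaller_side_size)` and reading the balance condition as "at least 20% of the nodes on the smaller side". The function name `choose_cut` is not taken from any existing code.

```python
def choose_cut(cuts, n, min_fraction=0.2):
    """Pick the cut with minimum expansion among sufficiently balanced cuts.

    cuts: iterable of (cut_size, smaller_side_size); n: number of nodes.
    Returns None if no cut is balanced enough.
    """
    feasible = [c for c in cuts if c[1] >= min_fraction * n]
    if not feasible:
        return None
    # expansion = cut size divided by the size of the smaller side
    return min(feasible, key=lambda c: c[0] / c[1])
```

For n = 100 and the cuts (2, 5), (6, 20), (9, 45), (12, 50), the first cut is too imbalanced and the remaining expansions are 0.3, 0.2 and 0.24, so the cut (9, 45) is chosen.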

Special Preprocessing
Road networks contain many nodes of degree 1 or 2. The graph size can be drastically reduced by eliminating them in a preprocessing step that is performed only once. First we compute the largest biconnected component B and remove all edges between B and the rest of the graph G. The remaining graph consists of B and many tiny, often tree-like components. We compute orders for the components separately and concatenate them in an arbitrary order. The order for B is placed after the orders of the smaller components.
A degree-2-chain is a path (x, y_1, …, y_k, z) where all deg(y_i) = 2 but deg(x) > 2 and deg(z) ≠ 2. We split the graph into two graphs G_{≥3} and G_{≤2} with degrees at least 3 and at most 2, by computing all degree-2-chains in linear time and splitting along them. If deg(z) > 2, we insert an edge between x and z since z is in G_{≥3}. We compute contraction orders for the connected components of G_{≤2} separately and concatenate them in an arbitrary order. Since these are paths, we can use the algorithm for trees. The order for G_{≥3} is placed after the one for G_{≤2}.

Parallelization
Recursive bisection is straightforward to parallelize by computing orders on the separated blocks independently. This only employs parallelism after the first separators have been found. Therefore, we additionally parallelize InertialFlowCutter. The implementation of FlowCutter [16] contains a simple parallelization that lets all cutters with minimum cut progress to the next cut in a parallel for loop.
We employ a more sophisticated parallelization, as waiting after every flow unit incurs too much idle time. Recall that we interleave cutter execution based on flow, not on cuts. We employ task-based parallelism both for the nested dissection and for individual cuts. For q cutters, we create q tasks and leave it up to the non-preemptive task scheduler how many of them are launched in parallel. If fewer than q tasks are running simultaneously, tasks switch between cutters to advance the cutters with the currently smallest flow values. If all q tasks are running, each task advances a cutter. This switching mechanism is more light-weight, and in particular incurs almost no overhead when q tasks are available.
For every cutter, we store two atomic flags: an active flag which indicates whether the cutter is not finished, and an acquired flag which indicates whether a task currently holds this cutter. In the beginning every cutter is active and not acquired. In a running task, we acquire an active cutter with minimum flow which has not been acquired. If this is not possible, we terminate the task. Otherwise, we check whether the cutter is finished via a callback function and deactivate it, if so. If not, we push one unit of flow or derive a cut. If we find a cut, we report it via a second callback function. Finally we release the cutter and repeat. Since we create q tasks and we release the previous cutter before trying to acquire a new one, it is sufficient to try acquiring every active cutter once, and terminating the task if unsuccessful.
This scheme guarantees O(qcm/k) span and O(qcm) work, for k ≤ q cores executing in parallel. Additionally, the overhead for synchronizing cutter acquisition with atomic flags is insignificant compared to the synchronization overhead of the straightforward parallelization, which synchronizes all cutters after every unit of flow.
Depending on whether we compute top-level cuts or separators for nested dissection, the callback functions do different things. For cuts, we report non-dominated Pareto cuts and deactivate a cutter once it reaches ε-balance. For separators in nested dissection, we report cuts that improve expansion and have at least 20% of the nodes on the smaller side. We deactivate a cutter once it cannot find cuts with smaller expansion. Note that due to the parallelization, cuts are not necessarily reported in order of increasing cut size and also dominated cuts may be reported.

Setup
In Section 3.6 we perform a parameter study based on CCH performance, to obtain reasonable parameters for InertialFlowCutter. The parameters are tuned for CCH performance, not top-level cuts. Our remaining experiments follow the setup in [5], comparing FlowCutter, KaHiP, Metis and Inertial Flow to InertialFlowCutter, regarding CCH performance as well as top-level cut sizes for different imbalances. Our benchmark set consists of the road networks of Colorado, California and Nevada, the USA and Western Europe, see Table 1, made available during the DIMACS implementation challenge on shortest paths [17].
The CCH performance experiments compare the different partitioners based on the time to compute a contraction order, the median running time of nine customization runs, the average time of 10

CCH implementation
We use the CCH implementation in RoutingKit [19]. There are different CCH customization and query variants. We use basic customization with upper triangles instead of lower triangles, no witness searches, no precomputed triangles, no SSE, and no parallelization. For queries we use elimination tree search. There has been a recent, very simple improvement [20], which drastically accelerates elimination tree search for short-range queries. It is not implemented in RoutingKit, but random s-t queries tend to be long-range, so the effect would be negligible for our experiments.

Partitioner Implementations
For all partitioners already included in [5], we re-state their reported running times and achieved cuts. Note that we use the same machine as [5]. As we use a different CCH implementation than [5], we executed all customizations and queries again based on the same node orders.
In [5] the KaHiP versions 0.61 and 1.00 are used. We add the latest KaHiP version 2.11, which is available on GitHub [21]. We refer to the three KaHiP variants as K0.61, K1.00 and K2.11. For the CCH order experiments we keep versions K0.61 and K1.00 but omit them for the top-level cut experiments because K2.11 is better for top-level cuts.
In [5], Metis 5.1.0 has been used, which is still the latest version available from the authors' website [22]. We denote Metis by M in our tables.
The Inertial Flow implementation used in [5] uses the Dinic flow algorithm [23] and the four directions proposed in [6]. We denote Inertial Flow by I in our tables.
Implementations of Buffoon [10] and PUNCH [9] are not publicly available, but [5] concluded that the quality of PUNCH is similar to KaHiP, based on similar top-level cuts on the Europe and USA road networks. Therefore, these are excluded in our experiments.
We now discuss the different node ordering setups used in the experiments. For the node order computation with Metis, the tool offered by Metis has been used in [5].
For Inertial Flow and K1.00, in [5], a nested dissection implementation has been used, which computes one edge cut per level and recurses until components are trees or cliques, which are solved directly. Separators are derived by picking the nodes incident to one side of the edge cut. We use the same implementation for K2.11. For comparability with [4], an older nested dissection implementation has been used for K0.61, which on every level repeatedly computes edge cuts until no smaller cut was found for ten consecutive iterations.
For InertialFlowCutter, we employ the same setup that has been used for FlowCutter [5], which, in addition to special cases for cliques and trees, also includes the specialized preprocessing described in Section 2.7. We tried to employ these techniques for KaHiP 2.11. While this made order computation faster, the order quality was much worse regarding all criteria.
Our nested dissection implementation is based on the implementation in the FlowCutter repository. We made minor changes and parallelized it.
Starting with version 1.00, KaHiP includes a more sophisticated multilevel node separator algorithm [24]. It was omitted from the experiments in [5] because it took 19 hours to compute an order for the small California graph, using one separator per level, and did not finish in reasonable time on the larger instances. Therefore we still exclude it.

Order Experiments
In this section, we compare the different partitioners with respect to the quality of computed CCH orders and running time of the preprocessing. Table 2 contains a large collection of metrics and measurements for the four road networks of California, Colorado, Europe and USA.

Quality
Over all nodes v, we report the average and maximum number of ancestors in the elimination tree, as well as the number of arcs incident to the ancestors. These metrics assess the search space sizes of an elimination tree query. Further, we report the number of arcs in the CCH, i.e., shortcut and original arcs, the number of triangles and an upper bound on the treewidth, which we obtain by using the CCH order as elimination ordering. A CCH is essentially a chordal supergraph of the input. Thus CCHs are closely related to tree decompositions and elimination orderings. The relation between tree decompositions and Contraction Hierarchies is further explained in [25]. A low treewidth usually corresponds to good performance with respect to the other metrics. However, as the treewidth is defined by the largest bag in the tree decomposition, which may depend on the size of a few separators and disregards the size of all smaller separators, this is not always consistent. For example, for the California road network, F3 produces the order with the smallest treewidth and also the smallest maximum search space, but on average F20 and the InertialFlowCutter variants still achieve better results. In the context of shortest path queries, a better average is preferable to a slightly reduced maximum.
On all graphs, the InertialFlowCutter variants and the FlowCutter variants F20 and F100 yield the fastest queries, fastest customizations, and smallest values for all metrics. Their search space sizes are rather similar. On the more relevant continental-size networks USA and Europe, InertialFlowCutter is ahead in terms of metrics, as well as query and customization times, but only slightly. The only exception is the small California network, where F3 is ahead of F20, F100 and InertialFlowCutter. This is somewhat surprising and is an indicator that the selection of Pareto cuts with smallest expansion is not ideal. F20, F100 and InertialFlowCutter should be considered tied for rank 1 in order quality. The differences are so minor that they might be random fluctuations. The different KaHiP variants and Inertial Flow compute the next best orders, while Metis is ranked last by a large margin.
The ratio between maximum and average search space size is most strongly pronounced for Inertial Flow. This indicates that Inertial Flow works well for most separators but the quality degrades for a few. InertialFlowCutter resolves this problem.
There is an interesting difference in the number of cutters necessary for good CCH orders with InertialFlowCutter and FlowCutter. In [5], F20 is the recommended configuration. It is almost never beneficial to use 100 cutters instead of 20. However, using just 3 seems insufficient to get rid of bad random choices. For InertialFlowCutter it seems almost irrelevant whether 4 or more cutters are used. This is also confirmed by the top-level cut experiments in Section 3.5. It seems the Inertial Flow guidance is sufficiently strong to eliminate bad random choices. Only on Europe are the queries slower with 4 cutters, which is why we recommend using 8 cutters. The better query running times justify the twice as long preprocessing.

Preprocessing Time
Previously, CCH performance came at the cost of high preprocessing time. We compute slightly better CCH orders than FlowCutter in a much shorter time.
KaHiP is by far the slowest, followed by FlowCutter, then InertialFlowCutter using 8 or more cutters. IFC4 and Inertial Flow have similar running times. This is because even though IFC4 enumerates multiple cuts instead of computing just one, we found that Ford-Fulkerson is in fact faster than Dinic. For the sake of comparability, we report the numbers of [5] for Inertial Flow.
The different KaHiP variants are slow for different reasons. As already mentioned, K0.61 computes at least 10 cuts, as opposed to K1.00 and K2.11. K1.00 is slow because the running time for ε ≥ 0.2 increases unexpectedly, according to [5]. This seems to be somewhat resolved for K2.11 on the deeper levels of the nested dissection, since the orders are computed much faster. As we will see in Section 3.5, computing single top-level cuts with K2.11 is still slow, and even slower than the numbers reported for K1.00 in [5].
Using 16 cores and IFC8, we compute a CCH order of Europe in just 242 seconds, compared to 2258 seconds sequential running time on that machine. This corresponds to a speedup of 9.3 over the sequential version. Note that due to using 8 cutters, at most 8 threads work on a single separator. Therefore, in particular for the top-level separator, at most 8 of the 16 cores are used. The top-level separator alone needs about 50 seconds using 8 cores. Due to unfortunate scheduling and unbalanced separators, it also happens at later stages that a single separator needs to be computed before any further tasks can be created. Using 8 cores, we get a speedup of 6.8 for the Europe network, which is much better relative to the number of cores. Up to four cores, we see an almost perfect speedup for all but the smallest road network. This is because some cutters need less running time than others. Thus there is actually less potential for parallelism than the number of cutters suggests.
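The speedup figures can be reproduced with a few lines of Python. The parallel efficiency (speedup divided by the number of cores) is not reported in the tables; it is added here only as an illustrative measure, and the function names are hypothetical.

```python
def speedup(sequential_seconds: float, parallel_seconds: float) -> float:
    """Speedup of a parallel run over the sequential run."""
    return sequential_seconds / parallel_seconds


def efficiency(speedup_value: float, cores: int) -> float:
    """Parallel efficiency: the fraction of the ideal speedup that is achieved."""
    return speedup_value / cores


# Europe, IFC8: 2258 s sequential and 242 s with 16 cores (numbers from the text).
s16 = speedup(2258, 242)
e16 = efficiency(s16, 16)

# The text reports a speedup of 6.8 with 8 cores.
e8 = efficiency(6.8, 8)

print(f"16 cores: speedup {s16:.1f}, efficiency {e16:.2f}")
print(f" 8 cores: speedup 6.8, efficiency {e8:.2f}")
```

Under this measure, the 8-core run uses its cores considerably better than the 16-core run, which matches the observation that at most 8 threads can work on a single separator.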

Pareto Cut Experiments
In Tables 4, 5, 6 and 7 we report the found cuts for various values of ε for all considered partitioners and road networks. We also report the actually achieved imbalance as well as the running time. We report ε = 0.0 only if perfect balance was achieved; otherwise, if the rounded value is 0.0, we report < 0.1%. On none of the graphs were Metis and KaHIP able to achieve perfect balance when perfect balance was requested. We mark this by striking out the respective values. For KaHiP this is due to our use of its library interface, which does not support enforcing perfect balance. Metis simply rejects ε = 0. Perfect balance is not actually useful for the application. We solely include it to analyze the different Pareto cuts.
Note that for FlowCutter and InertialFlowCutter, the running time always includes the computation of all more imbalanced cuts. That is, to generate the full set of cuts, only the running time of the perfectly balanced cut is needed, while for all other partitioners the sum of all reported running times is needed.
In terms of running time, Metis wins, but almost all of its reported cuts are larger than the cuts reported by the other partitioners. Inertial Flow is also quite fast but, due to its design, produces cuts that are much more balanced than desired and thus cannot achieve cuts as small as those of the other partitioners.
KaHIP achieves exceptionally small, highly balanced cuts on the Europe road network. On the other road networks it is similar to or worse than F20 in terms of cut size. This is due to the special geography of the Europe road network. It excludes large parts of Eastern Europe, which is why there is a cut of size 2 that separates Norway, Sweden, and Finland from the rest of Europe. This is the cut found at ε = 90%. For ε = 10%, KaHiP computes a cut with 112 edges, which separates the European mainland from the Iberian peninsula, Britain, Scandinavia minus Denmark, Italy and Austria. The Alps separate Italy from the rest of Europe. Britain is only connected via ferries, and the Iberian peninsula is separated by the Pyrénées. One side of the cut is not connected because the only ferry between Britain and Scandinavia runs between Britain and Denmark. FlowCutter is unable to find cuts with disconnected sides without a modified initialization. By handpicking terminals for FlowCutter, a similar cut with only 87 edges and 15% imbalance, which places Austria with the mainland instead, is found in [5]. However, it turns out that the FlowCutter CCH order using the 87 edge cut as a top-level separator is not actually better than plain FlowCutter. This indicates that it does not matter at what level of recursion the different cuts are found. For large imbalances, KaHIP seems unable to leverage the additional freedom to achieve the much smaller but more unbalanced cuts, like the ones reported by InertialFlowCutter and FlowCutter. This has already been observed for previous versions of KaHIP [5].
In terms of running time, KaHIP and F20 are the slowest algorithms. InertialFlowCutter is in all three configurations an order of magnitude faster than F20. For ε down to 10%, the three variants report almost the same cuts. Apart from the very imbalanced ε = 90% cuts, the cuts are also at most one edge worse than F20. Only for more balanced cuts do more cutters give a significant improvement. Here, in particular on the Europe road network, F20 is also significantly better than InertialFlowCutter. In the range between ε = 60% and ε = 10%, which is most relevant for our application, there is thus no significant difference between F20 and InertialFlowCutter, regardless of the number of cutters. This indicates that on the top level, the first four directions seem to cover most cuts already. On the other hand, for highly balanced cuts, the geographic initialization does not help much, as can be seen from the much worse cuts for InertialFlowCutter. Here, just having more cutters seems to help. For ε = 90%, IFC4 and Inertial Flow actually compute the same cuts, modulo the slightly better balance for IFC4 due to the avoid-augmenting-paths heuristic. This is because IFC4 initially fixes α = 5% of nodes on each side, which corresponds to ε = 90% just from initialization, and thus Inertial Flow also uses α = 5% for this imbalance. Here, we can clearly see the performance advantage of Ford-Fulkerson over Dinic's flow algorithm. The difference is most pronounced on the USA network, which is slightly surprising, as the theoretical advantage of Dinic's algorithm only starts to matter with decently large cut sizes and the USA network has the largest cut out of the four road networks at ε = 90%.
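The correspondence between α = 5% and ε = 90% can be checked with a short sketch. It assumes the common definition that a bipartition of n nodes is ε-balanced if its larger side contains at most (1 + ε) · n / 2 nodes, ignoring rounding; this definition is not restated in this section, and the function names are hypothetical.

```python
def imbalance(side_a: int, side_b: int) -> float:
    """Achieved imbalance eps of a bipartition with the given side sizes.

    Assumed definition: max(side_a, side_b) = (1 + eps) * n / 2.
    """
    n = side_a + side_b
    return 2 * max(side_a, side_b) / n - 1


def imbalance_bound_from_alpha(alpha: float) -> float:
    """Largest possible imbalance if a fraction alpha of the nodes is fixed on each side.

    The larger side can contain at most (1 - alpha) * n nodes, so
    (1 + eps) / 2 = 1 - alpha, which gives eps = 1 - 2 * alpha.
    """
    return 1 - 2 * alpha


eps_bound = imbalance_bound_from_alpha(0.05)  # about 0.9, i.e. 90%
eps_example = imbalance(95, 5)                # 95% of the nodes on one side
```

With α = 0.05 the bound evaluates to 0.9, so every cut found after this initialization already satisfies ε = 90%.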

Parameter Configuration
In this section, we tune the parameters α, δ, γ_a, γ_o of InertialFlowCutter. Our goal is to achieve much faster order computation without sacrificing CCH performance. Recall that α is the fraction of nodes initially fixed on each side, δ is, roughly speaking, a step size, γ_o is the threshold up to how many nodes on a side of the projection we perform bulk piercing, and similarly γ_a for how many settled nodes on a side. Table 8 shows a large variety of tested parameter combinations for InertialFlowCutter with 8 directions on the road network of Europe. We select the parameter set α = 0.05, δ = 0.05, γ_a = 0.4, γ_o = 0.25 based on query performance. The best entries per column are highlighted in bold. Further, color shades are scaled between values in the columns. Darker shades correspond to lower values, which are better for every measure.
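To illustrate the role of α, the following minimal sketch shows an Inertial Flow style initialization: nodes are projected onto a direction, and the first and last α · n nodes in projection order become the initial source and target sets. The function name, the coordinate representation and the rounding are assumptions made for illustration. Bulk piercing with δ, γ_o and γ_a is not modeled here.

```python
from typing import List, Sequence, Tuple


def initial_terminals(
    coords: Sequence[Tuple[float, float]],
    direction: Tuple[float, float],
    alpha: float,
) -> Tuple[List[int], List[int]]:
    """Return (source_nodes, target_nodes) for one projection direction.

    coords[v] is the (x, y) position of node v. Nodes are ordered by their
    scalar projection onto `direction`; the first and last floor(alpha * n)
    nodes (at least one each) are fixed as initial terminals.
    """
    dx, dy = direction
    order = sorted(
        range(len(coords)),
        key=lambda v: coords[v][0] * dx + coords[v][1] * dy,
    )
    k = max(1, int(alpha * len(coords)))
    return order[:k], order[-k:]


# Toy example: 20 nodes on a horizontal line, projected onto the x-axis.
coords = [(float(i), 0.0) for i in range(20)]
source, target = initial_terminals(coords, (1.0, 0.0), alpha=0.25)
```

Larger values of α fix more nodes up front and thus leave less work for the flow computations, which is consistent with the faster order computation observed for larger α.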
First, we consider the top part of Table 8, where we fix α to 0.05 and try different combinations of δ, γ_o, γ_a. While the number of triangles and customization times are correlated, the top configurations for these measures are, interestingly, not the same. The variations in search space sizes, customization time (27 ms) and query time (3 µs) are marginal. Therefore we settle on the configuration δ = 0.05, γ_o = 0.25, γ_a = 0.4, which simultaneously yields the fastest query and order times.
In the bottom part of Table 8 we try different values of α with the best choices for the other parameters. As expected, larger values for α accelerate order computation and slightly slow down queries.
In summary, InertialFlowCutter is relatively robust to parameter choices other than α, which means users do not need to invest much effort in parameter tuning.

Discussion
We have presented InertialFlowCutter, an algorithm that exploits geographical information to quickly compute high-quality bipartitions of road networks. Our experiments show that we are able to compute nested dissection orders as used for CCHs more than six times faster than the previous state-of-the-art algorithm, FlowCutter. Using 16 cores, we can compute a nested dissection order of the Europe road network in four minutes. This makes CCHs even more attractive to be applied in practice.
An open question is how to transfer the ideas of large initial terminal node sets and piercing multiple nodes simultaneously to graphs without geographical information. As FlowCutter also achieved quite good results on general graphs, albeit with slow running times [5], this might be an interesting direction for future research.

Table 1. Benchmark road networks.

... random s-t queries, as well as the criteria introduced in Section 2.6. Unless explicitly stated as parallel, all reported running times are sequential on an Intel Xeon E5-1630 v3 Haswell processor clocked at 3.7 GHz with 10 MB L3 cache and 128 GB DDR4 RAM (2133 MHz). We additionally report running times for computing contraction orders in parallel on a shared-memory machine with two 8-core Intel Xeon Gold 6144 Skylake CPUs, clocked at 3.5 GHz with 24.75 MB L3 cache and 192 GB DDR4 RAM (2666 MHz). InertialFlowCutter is implemented in C++ and the code is compiled with g++ version 8.2 with optimization level 3. We use Intel's Threading Building Blocks library for shared-memory parallelism. Our implementation and evaluation setup are available on GitHub [18].

Table 3. Running times in seconds of IFC8, using up to 16 cores of the Skylake CPU.

Table 5. California and Nevada top-level cuts.

Table 8. CCH performance of different parameter configurations of IFC8 on Europe. Bold values are the best in their category. Darkness of shading indicates better values.