This paper is an extended version of our paper published in the proceedings of the 14th International Symposium on Experimental Algorithms (SEA 2015).
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Linear system solving is a main workhorse in applied mathematics. Recently, theoretical computer scientists have contributed sophisticated algorithms for solving linear systems with symmetric diagonally dominant (SDD) matrices in provably nearly-linear time. These algorithms are very interesting from a theoretical perspective, but their practical performance was unclear. Here, we address this gap. We provide the first implementation of the combinatorial solver by Kelner et al. (STOC 2013), which is appealing for implementation due to its conceptual simplicity. The algorithm exploits the fact that a Laplacian matrix (which is SDD) corresponds to a graph; solving symmetric Laplacian linear systems amounts to finding an electrical flow in this graph with the help of cycles induced by a spanning tree with the low-stretch property. The results of our experiments are ambivalent. While they confirm the predicted nearly-linear running time, the constant factors make the solver much slower than basic methods with higher asymptotic complexity for inputs of reasonable size. We were also not able to use the solver effectively as a smoother or preconditioner. Moreover, while spanning trees with lower stretch do reduce the solver's running time, we again observe a discrepancy between theory and practice: in our experiments, simple spanning tree algorithms perform better than those with a guaranteed low stretch. We expect that our results provide insights for future improvements of combinatorial linear solvers.
Solving square linear systems
Spielman and Teng [
Symmetric matrices that are diagonally dominant (SDD matrices) have many applications, not only in applied mathematics, such as elliptic PDEs [
Spielman and Teng’s seminal paper [
Spielman and Teng’s algorithm crucially uses the low-stretch spanning trees first introduced by Alon et al. [
It should also be noted that there are a few methods available for the problem with fast empirical running times, but with no equivalent guarantee on the theoretical worst-case running time: combinatorial multigrid (CMG) [
Although several extensions and simplifications to the Spielman–Teng nearly-linear time solver [
In this paper, which extends the previous conference version [
We consider undirected simple weighted graphs
Some conventions used throughout the paper: Every function that is parametrized by a single graph will implicitly use
A cycle in a graph is usually defined as a simple path that returns to its starting point, and a graph is called Eulerian if there is a cycle that visits every edge exactly once. In this work, we will interpret cycles somewhat differently: We say that a cycle in
In a spanning tree (ST)
We can regard
The idea of the algorithm is to start with any valid flow and successively adjust the flow, such that every cycle has potential zero. We need to transform the flow back to potentials at the end, but this can be done consistently, as all potential drops along cycles are zero.
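For intuition, the following minimal sketch (our own simplified setup, not the full solver) shows a single cycle repair on a triangle with unit resistances: tree edges {0,1} and {1,2}, off-tree edge {2,0}, and demand b = (1, 0, −1) initially routed along the tree path 0→1→2. One repair of the single basis cycle makes its potential drop zero.

```cpp
#include <cmath>

// Sketch of one KOSZ-style cycle repair on a toy triangle (our simplified
// setup, not the full solver). All edge resistances are 1; the basis cycle
// is 2 -> 0 -> 1 -> 2. Returns the potential drop of the cycle after repair.
double repairTriangleCycle() {
    double r01 = 1, r12 = 1, r20 = 1;
    // Initial flow routes the demand b = (1, 0, -1) along the tree path
    // 0 -> 1 -> 2; the off-tree edge carries no flow yet.
    double f01 = 1, f12 = 1, f20 = 0;   // f20: flow in direction 2 -> 0
    // Potential drop around the basis cycle 2 -> 0 -> 1 -> 2:
    double drop = r20 * f20 + r01 * f01 + r12 * f12;
    double rCycle = r20 + r01 + r12;
    // Repair: push -drop / rCycle units of flow around the cycle.
    double delta = -drop / rCycle;
    f20 += delta; f01 += delta; f12 += delta;
    // The repaired cycle has potential drop zero: the flow is electrical.
    return r20 * f20 + r01 * f01 + r12 * f12;
}
```

After this single repair, flow conservation still holds at every node, and the zero potential drop means the flow on this toy instance is already the electrical flow.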
Regarding the crucial question of what flow to start with and how to choose the cycle to be repaired in each iteration, Kelner et al. [
The solver described in Algorithm 1 is actually just the SimpleSolver in the Kelner et al. [
The improved running time of their FullSolver to compute an
While Algorithm 1 provides the basic idea of the KOSZ solver, it leaves open several implementation decisions we had to make and that we elaborate on in this section.
As suggested by the convergence result in Theorem 1, the KOSZ solver crucially depends on low-stretch spanning trees. The notion of stretch was introduced by Alon et al. [
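For intuition, in the unweighted case the stretch of a non-tree edge {u, v} is simply the length of the unique tree path between u and v. A minimal sketch (our own helper, assuming the tree is given as an adjacency list) computes this path length by BFS in the tree:

```cpp
#include <vector>
#include <queue>

// Sketch (our simplification, unweighted case): the stretch of a non-tree
// edge {u, v} equals the length of the unique tree path between u and v.
// treeAdj is the adjacency list of the spanning tree.
int treeDistance(const std::vector<std::vector<int>>& treeAdj, int u, int v) {
    std::vector<int> dist(treeAdj.size(), -1);
    std::queue<int> q;
    dist[u] = 0;
    q.push(u);
    while (!q.empty()) {
        int w = q.front(); q.pop();
        if (w == v) return dist[w];
        for (int x : treeAdj[w])
            if (dist[x] < 0) { dist[x] = dist[w] + 1; q.push(x); }
    }
    return -1;  // unreachable; cannot happen in a spanning tree
}
```

The average stretch then divides the total over all non-tree edges by the number of edges; in the weighted case, resistances along the tree path are summed and divided by the edge's own resistance.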
To test how dependent the algorithm is on the stretch of the spanning tree (ST), we also look at a special ST for
First note that, by the recursive construction, the total stretch of the four subgrids remains the same if such a subgrid is treated separately. Moreover, the stretches of the
Since the number of edges is
In case of a square grid (
Since every basis cycle contains exactly one off-tree edge, the flows on off-tree edges can simply be stored in a single vector. To be able to efficiently get the potential drop of every basis cycle and to be able to add a constant amount of flow to it, the core problem is to efficiently store and update flows in
We can simplify the operations by fixing
The itemized two-node operations can then be supported with
The trivial implementation of (2) directly stores the flows in the tree and implements each operation in (2) with a single traversal from the node
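The trivial implementation can be sketched as follows (hypothetical interface of our own; resistances are omitted for brevity): the flow on each tree edge is stored at its child endpoint, and both operations simply walk up to the root.

```cpp
#include <vector>
#include <utility>

// Sketch of the naive O(tree height) flow structure (hypothetical interface,
// resistances omitted): the tree is rooted, parent[root] == root, and
// flow[v] stores the flow on the edge from v to parent[v].
struct NaiveTreeFlow {
    std::vector<int> parent;
    std::vector<double> flow;

    explicit NaiveTreeFlow(std::vector<int> p)
        : parent(std::move(p)), flow(parent.size(), 0.0) {}

    // Add alpha to the flow on every edge of the path v -> root.
    void update(int v, double alpha) {
        for (; parent[v] != v; v = parent[v]) flow[v] += alpha;
    }

    // Sum of flows on the path v -> root.
    double query(int v) const {
        double s = 0.0;
        for (; parent[v] != v; v = parent[v]) s += flow[v];
        return s;
    }
};
```

With this interface, query(u) − query(v) yields the (signed) flow sum along the tree path from u to v, since the contributions above their lowest common ancestor cancel; with resistances included in query, this becomes the potential drop of the tree part of a basis cycle.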
While the data structure presented above allows fast repairs for short basis cycles, the worst-case time is still in
Now, consider a tree
Then, we can compute
The
In preliminary experiments to evaluate the flow data structures, we observed that the cost of querying the LCA-based data structure (LCAFlow) depends strongly on the structure of the spanning tree used, while the costs induced by the logarithmic-time data structure (LogFlow) stay nearly the same. Similarly, the cost of LCAFlow grows far more with the size of the graph than that of LogFlow, and LogFlow wins for the larger graphs in both classes. For these reasons, we only use LogFlow in later results.
Given
The easiest way to select a cycle, in turn, is to choose an off-tree edge uniformly at random in
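The two selection rules can be sketched as follows (function names are ours): uniform selection picks any off-tree edge with equal probability, while weighted selection picks edge e with probability proportional to a per-edge weight, such as one derived from its stretch.

```cpp
#include <random>
#include <vector>

// Sketch of the two cycle-selection rules (hypothetical helper functions).
// Uniform: every off-tree edge is equally likely.
int selectUniform(int numOffTreeEdges, std::mt19937& rng) {
    std::uniform_int_distribution<int> dist(0, numOffTreeEdges - 1);
    return dist(rng);
}

// Weighted: edge i is chosen with probability weights[i] / sum(weights),
// e.g., with weights derived from the edges' stretch values.
int selectWeighted(const std::vector<double>& weights, std::mt19937& rng) {
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

Precomputing the weighted distribution once is cheap; the per-iteration cost of both rules is negligible compared to the cycle repair itself.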
For convenience, we summarize the implementation choices for Algorithm 1 in
We implemented the KOSZ solver in C++ using NetworKit [
We mainly use two graph classes for our tests: (i) rectangular
For both classes of graphs, we consider both unweighted and weighted variants (uniform random weights in
In the description of the solver, so far, we did not state our termination condition; Kelner et al. [
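A residual-based termination test can be sketched as follows (our own hedged sketch, not the exact criterion of the original analysis): stop once the relative residual ||b − Lx|| / ||b|| drops below a tolerance, applying L through a matrix-free multiply.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Sketch of a residual-based termination test (our own formulation):
// converged iff ||b - L x|| <= tol * ||b||, with L applied via the
// assumed callback matVec (e.g., a Laplacian matrix-vector product).
template <class MatVec>
bool converged(const MatVec& matVec, const std::vector<double>& x,
               const std::vector<double>& b, double tol) {
    std::vector<double> Lx = matVec(x);
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        double r = b[i] - Lx[i];
        num += r * r;
        den += b[i] * b[i];
    }
    return std::sqrt(num) <= tol * std::sqrt(den);
}
```

In the solver, the residual need not be recomputed from scratch in every iteration; checking it only every few hundred iterations keeps the overhead small.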
CPU performance characteristics, such as the number of executed floating-point operations (FLOP), are measured with the PAPI library [
Papp [
In
In all cases, the solver converges exponentially, but the convergence speed crucially depends on the solver settings. If we select cycles by their stretch, the order of the convergence speeds is the same as the order of the stretches of the ST (compare
Across all of our experiments, we were not able to detect any correlation between the improvement made by a cycle repair and the stretch of the cycle. Therefore, we cannot fully explain the different speeds of uniform cycle selection and stretch cycle selection. For the grid, stretch cycle selection wins, while Barabási–Albert graphs favor uniform cycle selection. Another interesting observation is that most of the convergence speeds stay constant after an initial fast improvement down to a residual of about one. That is, there is no significant change of behavior or periodicity. Even though we can hugely improve convergence by choosing the right settings, even the best convergence is still very slow; e.g., we need about six million iterations (≈3000 sparse matrix-vector multiplications (SpMVs) in time comparison) on a Barabási–Albert graph with 25,000 nodes and 100,000 edges in order to reach a residual of
Now that we know which settings of the algorithm yield the best performance for 2D grids and Barabási–Albert graphs, we proceed by looking at how the performance with these settings behaves asymptotically and how it compares to conjugate gradient (CG) without preconditioning, a simple and popular iterative solver (often used in its preconditioned form). Since KOSZ turns out to be not competitive, we do not need to compare it to more sophisticated algorithms.
In
The results for the Barabási–Albert graphs are basically the same (and hence, not shown in detail): Even though the growth is approximately linear from about 400,000 nodes, there is still a large gap between KOSZ and CG since the constant factor is enormous. Furthermore, the results for the number of FLOP are again much better than the results for the other performance counters.
In conclusion, although we have nearlylinear growth, even for 1,000,000 graph nodes, the KOSZ algorithm is still not competitive with CG because of huge constant factors, in particular a large number of iterations and memory accesses.
The convergence of most iterative linear solvers on a linear system
In iterative methods, we usually do not explicitly compute
For the CG method, we see that, unfortunately, the more iterations we use, the more slowly the methods converge. Since the cycle repairs depend crucially on the right-hand side and the solver is probabilistic, using the Laplacian solver as the preconditioner means that the preconditioner matrix is not fixed, but changes from iteration to iteration. Axelsson and Vassilevski [
We conclude that KOSZ is not suitable as a preconditioner for common iterative methods. It would be an interesting extension to check whether the solver works in a specialized variable-step method.
One way of combining the good qualities of two different solvers is smoothing. Smoothing means dampening the high-frequency components of the error, which is usually done in combination with another solver that dampens the low-frequency error components. It is known that in CG and most other solvers, the low-frequency components of the error converge very quickly, while the high-frequency components converge slowly. Thus, we are interested in finding an algorithm that dampens the high-frequency components, a good smoother. This smoother does not necessarily need to reduce the error; it just needs to make its frequency distribution more favorable. Smoothers are particularly often applied at each level of multigrid or multilevel schemes [
To test whether the Laplacian solver is a good smoother, we start with a fixed
In the solver, we start with a flow that is nonzero only on the ST. Therefore, the flow values on the ST are generally larger at the start than in later iterations, where the flow will be distributed among the other edges. Since we construct the output vector by taking potentials on the tree, after one iteration,
While testing the Laplacian solver in a multigrid scheme could be worthwhile, the bad initial vector creates robustness problems when applying the Richardson iteration multiple times with a fixed number of iterations of our solver. In informal tests, multiple Richardson steps led to ever-increasing errors without improved frequency behavior, unless our solver already yields an almost perfect vector in a single run.
The nearly-linear running time of the Laplacian solver was proven in the RAM machine model. To get good practical performance on modern out-of-order superscalar computers, one also has to take their complex execution behavior into account, e.g., the cache hierarchy.
One particular problem indicated by our experiments is that the number of cache misses increases in the LogFlow data structure when a bad spanning tree is used. Note that querying and updating the flow with this data structure corresponds to a dot product and an addition, respectively, of a dense vector and a sparse vector. The sparse vectors are stored as sequences of pairs of indices (into the dense vector) and values. Thus, the cache behavior depends on the distribution of the indices, which is determined by the subtree decomposition of the spanning tree and the order of the subtrees.
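The query access pattern can be sketched as follows (our own illustration of the representation described above): each query is a dot product between the dense flow vector and a sparse (index, value) vector, so cache locality hinges on where the sparse indices fall within the dense vector.

```cpp
#include <vector>
#include <utility>
#include <cstddef>

// Sketch of the access pattern of a LogFlow query (our illustration):
// a dot product between a dense vector and a sparse vector stored as
// (index, value) pairs. The indirect access dense[idx] is where cache
// misses occur when the indices are spread widely.
double sparseDenseDot(const std::vector<double>& dense,
                      const std::vector<std::pair<int, double>>& sparse) {
    double sum = 0.0;
    for (const auto& [idx, val] : sparse)
        sum += dense[idx] * val;   // indirect access into the dense vector
    return sum;
}
```

If most indices cluster near the front of the dense vector, consecutive queries touch only a few cache lines; with indices scattered across a long vector, nearly every access can miss.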
We managed to consistently improve the running time by about 6% by performing the decomposition in breadth-first search order, so that the indices are grouped together at the front of the vector; note that the decomposition itself only depends on the spanning tree, so this reordering is free. Furthermore, we saved an additional 10% of time by using 256-bit AVX instructions to perform four double-precision operations at a time in LogFlow, but this vectorized implementation still uses (vectorized) indirect accesses.
In our experiments, we observe about 5% cache misses when using the minimum-weight ST on the 2D grid, compared with 1% when using CG. In contrast, the special ST yields competitive cache behavior. Not surprisingly, since the Barabási–Albert graph has a much more complex structure, its cache misses using the sparse matrix representation increase to 5%. In contrast, the cache misses with LogFlow improve for larger graphs, since the diameter of the spanning tree is smaller than on grids and the decomposition thus groups most indices at the start of the vector.
From the benchmarks, we can infer that the micro-performance in terms of cache misses suffers from indirect accesses, just as with the usual sparse matrix representations. Furthermore, the micro-performance crucially depends on the quality of the spanning tree. For good spanning trees or more complex graphs, the micro-performance of the Laplacian solver is competitive with CG.
At the time of writing the conference version of this paper, we provided the first comprehensive experimental study of a Laplacian solver with provably nearly-linear running time. In the meantime, our results regarding KOSZ have been confirmed and in some aspects extended [
Our study supports the theoretical result that the convergence of KOSZ crucially depends on the stretch of the chosen spanning tree, with low stretch generally resulting in faster convergence. This particularly suggests that it is crucial to build algorithms that yield spanning trees with lower stretch. Since we have confirmed and extended Papp’s observation that algorithms with provably low stretch do not yield good stretch in practice [
Even though KOSZ proves to grow nearly linearly as predicted by theory, the constant seems to be too large to make it competitive without major changes in the algorithm, even compared to the CG method without a preconditioner. Hence, we can say that the running time is nearly linear indeed and, thus, fast in the
This work was partially supported by the Ministry of Science, Research and the Arts Baden-Württemberg under the grant “Parallel Analysis of Dynamic Networks—Algorithm Engineering of Efficient Combinatorial and Numerical Methods” and by DFG Grant ME 3619/31.
D.H., D.L. and H.M. designed the research project. D.H. performed the implementation and the experiments, partially assisted by M.W. All authors contributed to the design and evaluation of the experiments. D.H. and H.M. wrote the paper with the assistance of D.L. and M.W.
The authors declare no conflict of interest.
Transformation into an electrical network.
Special spanning tree (ST) with
Average stretch
Convergence of the residual. Terminate when the residual is
Asymptotic behavior for 2
Convergence of the residual when using the Laplacian solver as a preconditioner on an unweighted 100 × 100 grid.
The Laplacian solver with the special ST as a smoother on a
Summary of the components of the KOSZ solver.

Spanning tree
  Dijkstra: no stretch bound,
  Kruskal: no stretch bound,
  Elkin et al. [
  Abraham and Neiman [

Initialize cycle selection
  Uniform
  Weighted

Initialize flow
  LCA flow
  Log flow

Iterations
  Select a cycle
    Uniform
    Weighted
  Repair cycle
    LCA flow
    Log flow

Complete solver
Improved solver