# Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation


## Abstract


## 1. Introduction

- Compressed Sparse Column (CSC), used for efficient and nuanced implementation of core arithmetic operations such as matrix multiplication and addition, as well as efficient reading of individual elements;
- Red-Black Tree (RBT), used for both robust and efficient incremental construction of sparse matrices (i.e., construction via setting individual elements one-by-one, not necessarily in order);
- Coordinate list (COO), used for low-maintenance and straightforward implementation of relatively complex and/or lesser-used sparse matrix functionality.

## 2. Functionality

In the C++ language, overloaded operators (such as `*` and `+`) [2] are exploited to allow mathematical operations with matrices to be expressed in a concise and easy-to-read manner, in a similar fashion to the proprietary MATLAB language. For example, given sparse matrices `A`, `B` and `C`, a mathematical expression can be written as:

`sp_mat D = 0.5 * (A + B) * C.t();`

where `sp_mat` is our sparse matrix class. Figure 1 contains a complete C++ program which briefly demonstrates the usage of the sparse matrix class, while Table 1 lists a subset of the available functionality.

## 3. Template-Based Optimisation of Compound Expressions

As an example of a compound expression, consider `trace(A.t() * B)`, which often appears as a fundamental quantity in semidefinite programs [20]. These computations are used in a wide variety of fields, most notably machine learning [21,22,23]. A brute-force implementation would evaluate the transpose first, `A.t()`, and store the result in a temporary matrix `T1`. The next operation would be a time-consuming matrix multiplication, `T1 * B`, with the result stored in another temporary matrix `T2`. The trace operation (sum of diagonal elements) would then be applied to `T2`. The explicit transpose, full matrix multiplication and creation of the temporary matrices are suboptimal from an efficiency point of view: the trace operation requires only the diagonal elements of the `A.t() * B` expression.

To avoid such inefficiencies, the sparse matrix class employs two lightweight template classes, `Op` and `Glue`, where `Op` objects are used for representing unary operations, while `Glue` objects are used for representing binary operations. The objects are lightweight, as they do not store actual sparse matrix data; instead, they store only references to matrices and/or other `Op` and `Glue` objects. Ternary and more complex operations are represented through combinations of `Op` and `Glue` objects. The exact type of each `Op` and `Glue` object is automatically inferred from a given mathematical expression through template meta-programming.

For example, the expression `A.t()` is automatically converted to an instance of the lightweight `Op` object with the following type:

`Op<sp_mat, op_trans>`

Here, `Op<...>` indicates that `Op` is a template class, with the items between `<` and `>` specifying template parameters. In this case, the `Op<sp_mat, op_trans>` object type indicates that a reference to a matrix is stored and that a transpose operation is requested. In turn, the compound expression `A.t() * B` is converted to an instance of the lightweight `Glue` object with the following type:

`Glue< Op<sp_mat, op_trans>, sp_mat, glue_times>`

The `Glue` object type in this case indicates that a reference to the preceding `Op` object is stored, a reference to a matrix is stored, and a matrix multiplication operation is requested. In other words, when a user writes the expression `trace(A.t() * B)`, the C++ compiler is induced to represent it internally as `trace(Glue< Op<sp_mat, op_trans>, sp_mat, glue_times>(A,B))`.

There are several forms of the `trace()` function, one of which is automatically chosen by the C++ compiler to handle the `Glue< Op<sp_mat, op_trans>, sp_mat, glue_times>` expression. That specific form of `trace()` takes references to the `A` and `B` matrices, and executes a partial matrix multiplication to obtain only the diagonal elements of the `A.t() * B` expression. All of this is accomplished without generating temporary matrices. Furthermore, as the `Glue` and `Op` objects only hold references, they are in effect optimised away by modern C++ compilers [6]: the resultant machine code appears as if the `Glue` and `Op` objects never existed in the first place.

Template-based optimisation is also applied to other functions, such as the `diagmat()` function, which obtains a diagonal matrix from a given expression. For example, in the expression `diagmat(A + B)`, only the diagonal components of the `A + B` expression are evaluated.
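The delayed-evaluation machinery described above can be illustrated with a minimal, self-contained sketch. It mirrors the `Op`/`Glue` design, but uses a toy dense matrix class and our own names rather than the library's actual internals:

```cpp
#include <cassert>
#include <vector>

// Toy dense matrix (column-major) standing in for sp_mat.
struct mat {
  int n_rows, n_cols;
  std::vector<double> data;
  mat(int r, int c) : n_rows(r), n_cols(c), data(r * c, 0.0) {}
  double& at(int r, int c)       { return data[c * n_rows + r]; }
  double  at(int r, int c) const { return data[c * n_rows + r]; }
};

struct op_trans   {};  // tag type: transpose requested
struct glue_times {};  // tag type: multiplication requested

// Op<T, op_type>: lightweight wrapper holding only a reference, no matrix data.
template <typename T, typename op_type>
struct Op { const T& m; explicit Op(const T& m_) : m(m_) {} };

// Glue<T1, T2, glue_type>: holds references to its two operands.
template <typename T1, typename T2, typename glue_type>
struct Glue {
  const T1& a; const T2& b;
  Glue(const T1& a_, const T2& b_) : a(a_), b(b_) {}
};

// t() builds an Op instead of materialising the transpose.
Op<mat, op_trans> t(const mat& m) { return Op<mat, op_trans>(m); }

// operator* builds a Glue instead of performing the multiplication.
Glue<Op<mat, op_trans>, mat, glue_times>
operator*(const Op<mat, op_trans>& a, const mat& b)
{
  return Glue<Op<mat, op_trans>, mat, glue_times>(a, b);
}

// Specialised trace(): trace(A^T * B) equals the sum of element-wise
// products of A and B, so neither the transpose nor the full product
// is ever formed, and no temporary matrices are created.
double trace(const Glue<Op<mat, op_trans>, mat, glue_times>& expr)
{
  const mat& A = expr.a.m;
  const mat& B = expr.b;
  double acc = 0.0;
  for (int c = 0; c < A.n_cols; ++c)
    for (int r = 0; r < A.n_rows; ++r)
      acc += A.at(r, c) * B.at(r, c);
  return acc;
}
```

With these definitions, writing `trace(t(A) * B)` causes the compiler to select the specialised overload purely from the inferred `Op`/`Glue` types; the wrappers hold only references and are typically optimised away entirely.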

## 4. Storage Formats for Sparse Data

- Flexible ad-hoc construction and element-wise modification of sparse matrices via unordered insertion of elements, where each new element is inserted at a random location.
- Incremental construction of sparse matrices via quasi-ordered insertion of elements, where each new element is inserted at a location that is past all the previous elements according to column-major ordering.
- Multiplication of dense vectors with sparse matrices.
- Multiplication of two sparse matrices.
- Operations involving bulk coordinate transformations, such as flipping matrices column- or row-wise.

#### 4.1. Compressed Sparse Column

- The values array, which is a contiguous array of N floating point numbers holding the non-zero elements.
- The rows array, which is a contiguous array of N integers holding the corresponding row indices (i.e., the $n$th entry contains the row of the $n$th element).
- The col_offsets array, which is a contiguous array of n_cols+1 integers holding offsets to the values array, with each offset indicating the start of elements belonging to each column.
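Element access under this layout can be sketched as follows. The struct and function names here are illustrative, not the class internals, and we use a linear scan for clarity (a real implementation may instead binary-search the column's slice, since row indices within a column are sorted):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CSC container: three contiguous arrays as described above.
struct csc_matrix {
  std::vector<double>      values;       // N non-zero elements, column-major
  std::vector<std::size_t> rows;         // row index of each element
  std::vector<std::size_t> col_offsets;  // n_cols+1 offsets into values
};

// Read element (r, c): only the slice of values belonging to
// column c needs to be examined.
double get(const csc_matrix& m, std::size_t r, std::size_t c)
{
  for (std::size_t i = m.col_offsets[c]; i < m.col_offsets[c + 1]; ++i)
    if (m.rows[i] == r) return m.values[i];
  return 0.0;  // element not stored, hence an implicit zero
}
```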

#### 4.2. Red-Black Tree

#### 4.3. Coordinate List Representation

- The values array, which is a contiguous array of N floating point numbers holding the non-zero elements of the matrix.
- The rows array, a contiguous array of N integers holding the row index of the corresponding value.
- The columns array, a contiguous array of N integers holding the column index of the corresponding value.

As an example, we reimplemented the matrix transpose to operate directly on the CSC format, replacing an earlier COO-based implementation derived from the sort-based `TRANSP` algorithm. This resulted in considerable speedups, due to no longer requiring the time-consuming sort operation. We verified that the new CSC-based implementation is correct by comparing its output against the previous COO-based implementation on a large set of test matrices.
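A sort-free CSC transpose can be sketched with a counting pass over the row indices, in the style of a counting sort; the struct and function names below are ours, not the library's internals:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative CSC container (values, row indices, column offsets).
struct csc {
  std::vector<double>      values;
  std::vector<std::size_t> rows;
  std::vector<std::size_t> col_offsets;
};

// Transpose without sorting: count entries per input row (= output
// column), prefix-sum the counts into offsets, then scatter each
// element directly into its final position.
csc transpose(const csc& in, std::size_t n_rows /* rows of the input */)
{
  const std::size_t N       = in.values.size();
  const std::size_t in_cols = in.col_offsets.size() - 1;

  csc out;
  out.values.resize(N);
  out.rows.resize(N);
  out.col_offsets.assign(n_rows + 1, 0);

  for (std::size_t i = 0; i < N; ++i)          // count per output column
    ++out.col_offsets[in.rows[i] + 1];
  for (std::size_t c = 0; c < n_rows; ++c)     // prefix sum -> offsets
    out.col_offsets[c + 1] += out.col_offsets[c];

  std::vector<std::size_t> next(out.col_offsets.begin(),
                                out.col_offsets.end() - 1);
  for (std::size_t c = 0; c < in_cols; ++c)
    for (std::size_t i = in.col_offsets[c]; i < in.col_offsets[c + 1]; ++i)
    {
      const std::size_t dest = next[in.rows[i]]++;
      out.values[dest] = in.values[i];
      out.rows[dest]   = c;  // input column becomes output row
    }
  return out;
}
```

Because the input is traversed in column-major order, each output column is filled with its row indices already ascending, so no sort is needed at any point.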

The COO format is used for the straightforward implementation of lesser-used functionality, such as `reverse()`, for flipping matrices column- or row-wise, and `repelem()`, where a matrix is generated by replicating each element several times from a given matrix. While it is certainly possible to adapt these functions to use the more complex CSC format directly, at the time of writing we spent our time-constrained efforts on optimising and debugging the more commonly-used parts of the sparse matrix class.

## 5. Automatic Conversion between Storage Formats

#### 5.1. Conversion between COO and CSC
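When the COO elements are already stored in column-major order (as Figure 4 assumes), conversion to CSC is cheap: the values and row indices carry over verbatim, and only the column offsets must be computed via a counting pass followed by a prefix sum. A minimal sketch, with our own struct names rather than the class internals:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative COO container: parallel arrays of values and coordinates.
struct coo {
  std::vector<double>      values;
  std::vector<std::size_t> rows, columns;
};

// Illustrative CSC container, as described in Section 4.
struct csc {
  std::vector<double>      values;
  std::vector<std::size_t> rows, col_offsets;
};

csc coo_to_csc(const coo& in, std::size_t n_cols)
{
  csc out;
  out.values = in.values;  // column-major order assumed: copy verbatim
  out.rows   = in.rows;

  out.col_offsets.assign(n_cols + 1, 0);
  for (std::size_t c : in.columns)           // count elements per column
    ++out.col_offsets[c + 1];
  for (std::size_t c = 0; c < n_cols; ++c)   // prefix sum -> offsets
    out.col_offsets[c + 1] += out.col_offsets[c];
  return out;
}
```

The reverse direction (CSC to COO) is equally direct: the column index of the elements in slice `[col_offsets[c], col_offsets[c+1])` is simply `c`.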

#### 5.2. Conversion between CSC and RBT

#### 5.3. Practical Considerations

## 6. Empirical Evaluation

- Unordered element insertion into a sparse matrix, where the elements are inserted at random locations in random order.
- Quasi-ordered element insertion into a sparse matrix, where each new inserted element is at a random location that is past the previously inserted element, under the constraint of column-major ordering.
- Calculation of $\operatorname{trace}(A^T B)$, where A and B are randomly-generated sparse matrices.
- Obtaining a diagonal matrix from the $(A+B)$ expression, where A and B are randomly-generated sparse matrices.

Figure 7 shows the wall-clock time taken to calculate the expressions `trace(A.t()*B)` and `diagmat(A+B)`, with and without the aid of the automatic template-based optimisation of compound expressions described in Section 3. For both expressions, employing expression optimisation led to a considerable reduction in wall-clock time. As the density increased (i.e., more non-zero elements), more time was saved via expression optimisation.

For the `trace(A.t()*B)` expression, the expression optimisation computed the trace by omitting the explicit transpose operation and performing a partial matrix multiplication to obtain only the diagonal elements. In a similar fashion, the expression optimisation for the `diagmat(A+B)` expression directly generated the diagonal matrix by performing a partial matrix addition, where only the diagonal elements of the two matrices were added. As well as avoiding a full matrix addition, this also avoided generating a temporary intermediary matrix to hold the complete result.
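The partial addition just described can be sketched in a few lines; dense storage and our own function name are used purely for brevity:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Compute only the diagonal of A + B: each operand is read solely
// along its main diagonal, and no full-size temporary is formed.
std::vector<double>
diag_of_sum(const std::vector<std::vector<double>>& A,
            const std::vector<std::vector<double>>& B)
{
  const std::size_t n = std::min(A.size(), B.size());
  std::vector<double> d(n);
  for (std::size_t i = 0; i < n; ++i)
    d[i] = A[i][i] + B[i][i];  // touch only n elements, not n*n
  return d;
}
```

For an n-by-n matrix, this reads 2n elements instead of performing a full addition over all stored elements plus the allocation of an intermediate result.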

## 7. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Nunez-Iglesias, J.; van der Walt, S.; Dashnow, H. Elegant SciPy: The Art of Scientific Python; O’Reilly Media: Sebastopol, CA, USA, 2017.
2. Stroustrup, B. The C++ Programming Language, 4th ed.; Addison-Wesley: Boston, MA, USA, 2013.
3. Sanderson, C.; Curtin, R. Armadillo: A template-based C++ library for linear algebra. J. Open Source Softw. **2016**, 1, 26.
4. Liniker, P.; Beckmann, O.; Kelly, P.H. Delayed Evaluation, Self-optimising Software Components as a Programming Model. In Proceedings of the European Conference on Parallel Processing, Paderborn, Germany, 27–30 August 2002; Volume 2400, pp. 666–673.
5. Abrahams, D.; Gurtovoy, A. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond; Addison-Wesley Professional: Boston, MA, USA, 2004.
6. Vandevoorde, D.; Josuttis, N.M. C++ Templates: The Complete Guide, 2nd ed.; Addison-Wesley: Boston, MA, USA, 2017.
7. Saad, Y. SPARSKIT: A Basic Tool Kit for Sparse Matrix Computations; NASA Ames Research Center: Mountain View, CA, USA, 1990.
8. Eaton, J.W.; Bateman, D.; Hauberg, S.; Wehbring, R. GNU Octave 4.2 Reference Manual; Samurai Media Limited: Surrey, UK, 2017.
9. Davis, T.A.; Rajamanickam, S.; Sid-Lakhdar, W.M. A survey of direct methods for sparse linear systems. Acta Numerica **2016**, 25, 383–566.
10. MathWorks. MATLAB Documentation—Accessing Sparse Matrices. Available online: https://www.mathworks.com/help/matlab/math/accessing-sparse-matrices.html (accessed on 18 July 2019).
11. Iglberger, K.; Hager, G.; Treibig, J.; Rüde, U. Expression templates revisited: A performance analysis of current methodologies. SIAM J. Sci. Comput. **2012**, 34, C42–C69.
12. Duff, I.S.; Erisman, A.M.; Reid, J.K. Direct Methods for Sparse Matrices, 2nd ed.; Oxford University Press: Oxford, UK, 2017.
13. Bai, Z.; Demmel, J.; Dongarra, J.; Ruhe, A.; van der Vorst, H. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide; SIAM: Philadelphia, PA, USA, 2000.
14. Sanderson, C.; Curtin, R. A User-Friendly Hybrid Sparse Matrix Class in C++. In Proceedings of the International Congress on Mathematical Software, South Bend, IN, USA, 24–27 July 2018; Volume 10931, pp. 422–430.
15. Lehoucq, R.B.; Sorensen, D.C.; Yang, C. ARPACK Users’ Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods; SIAM: Philadelphia, PA, USA, 1998.
16. Li, X.S. An overview of SuperLU: Algorithms, implementation, and user interface. ACM Trans. Math. Softw. **2005**, 31, 302–325.
17. Mernik, M.; Heering, J.; Sloane, A.M. When and how to develop domain-specific languages. ACM Comput. Surv. **2005**, 37, 316–344.
18. Scherr, M.; Chiba, S. Almost first-class language embedding: Taming staged embedded DSLs. In Proceedings of the ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, Pittsburgh, PA, USA, 26–27 October 2015; pp. 21–30.
19. Veldhuizen, T.L. C++ Templates as Partial Evaluation. In Proceedings of the ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation, San Antonio, TX, USA, 22–23 January 1999; pp. 13–18.
20. Vandenberghe, L.; Boyd, S. Semidefinite Programming. SIAM Rev. **1996**, 38, 49–95.
21. Boumal, N.; Voroninski, V.; Bandeira, A. The non-convex Burer–Monteiro approach works on smooth semidefinite programs. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2757–2765.
22. El Ghaoui, L.; Lebret, H. Robust solutions to least-squares problems with uncertain data. SIAM J. Matrix Anal. Appl. **1997**, 18, 1035–1064.
23. Lanckriet, G.R.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the Kernel Matrix with Semidefinite Programming. J. Mach. Learn. Res. **2004**, 5, 27–72.
24. Mittal, S. A Survey of Recent Prefetching Techniques for Processor Caches. ACM Comput. Surv. **2016**, 49, 35:1–35:35.
25. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms, 3rd ed.; MIT Press: Cambridge, MA, USA, 2009.
26. Bank, R.E.; Douglas, C.C. Sparse matrix multiplication package (SMMP). Adv. Comput. Math. **1993**, 1, 127–137.
27. Anderson, E.; Bai, Z.; Bischof, C.; Blackford, S.; Demmel, J.; Dongarra, J.; du Croz, J.; Greenbaum, A.; Hammarling, S.; McKenney, A.; et al. LAPACK Users’ Guide; SIAM: Philadelphia, PA, USA, 1999.
28. Davis, T.A. Direct Methods for Sparse Linear Systems; SIAM: Philadelphia, PA, USA, 2006.
29. Gilbert, J.R.; Li, X.S.; Ng, E.G.; Peyton, B.W. Computing row and column counts for sparse QR and LU factorization. BIT Numer. Math. **2001**, 41, 693–710.
30. George, A.; Ng, E. On the complexity of sparse QR and LU factorization of finite-element matrices. SIAM J. Sci. Stat. Comput. **1988**, 9, 849–861.
31. St. Laurent, A. Understanding Open Source and Free Software Licensing; O’Reilly Media: Sebastopol, CA, USA, 2008.
32. Curtin, R.; Edel, M.; Lozhnikov, M.; Mentekidis, Y.; Ghaisas, S.; Zhang, S. mlpack 3: A fast, flexible machine learning library. J. Open Source Softw. **2018**, 3, 726.
33. Bhardwaj, S.; Curtin, R.; Edel, M.; Mentekidis, Y.; Sanderson, C. ensmallen: A flexible C++ library for efficient function optimization. In Proceedings of the Workshop on Systems for ML and Open Source Software at NIPS/NeurIPS, Montreal, QC, Canada, 7 December 2018.
34. Eddelbuettel, D.; Sanderson, C. RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Comput. Stat. Data Anal. **2014**, 71, 1054–1063.

**Figure 2.** Illustration of sparse matrix representations: (**a**) example sparse matrix with 5 rows, 4 columns and 6 non-zero values, shown in traditional mathematical notation; (**b**) corresponding Compressed Sparse Column (CSC) representation; (**c**) corresponding Red-Black Tree (RBT) representation, where each node is expressed as $(i, v)$, with $i$ indicating a linearly-encoded matrix location and $v$ indicating the value held at that location; (**d**) corresponding Coordinate list (COO) representation. Following C++ convention [2], we use zero-based indexing.

**Figure 3.** Left panel: a Python program using the SciPy toolkit, requiring explicit conversions between sparse format types to achieve efficient execution; if an unsuitable sparse format is used for a given operation, SciPy will emit TypeError or SparseEfficiencyWarning. Right panel: a corresponding C++ program using the sparse matrix class, with the format conversions automatically done by the class.

**Figure 4.** Algorithms for: (**a**) conversion from COO to CSC; (**b**) conversion from CSC to COO. Matrix elements in COO format are assumed to be stored in column-major ordering. All arrays and matrix locations use zero-based indexing. N indicates the number of non-zero elements, while n_cols indicates the number of columns. Details for the CSC and COO arrays are given in Section 4.

**Figure 5.** Algorithms for: (**a**) conversion from CSC to RBT; (**b**) conversion from RBT to CSC. All arrays and matrix locations use zero-based indexing. N indicates the number of non-zero elements, while n_rows and n_cols indicate the number of rows and columns, respectively. Details for the CSC arrays are given in Section 4.

**Figure 6.** Wall-clock time taken to insert elements into a 10,000 × 10,000 sparse matrix to achieve various densities of non-zero elements. In (**a**), the elements are inserted at random locations in random order. In (**b**), the elements are inserted in a quasi-ordered fashion, where each newly inserted element is at a random location that is past the previously inserted element, using column-major ordering.

**Figure 7.** Wall-clock time taken to calculate the expressions (**a**) `trace(A.t()*B)` and (**b**) `diagmat(A + B)`, where `A` and `B` are randomly-generated sparse matrices with a size of 10,000 × 10,000 and various densities of non-zero elements. The expressions were calculated with and without the aid of the template-based optimisation of compound expressions described in Section 3. As per Table 1, `X.t()` returns the transpose of matrix `X`, while `diagmat(X)` returns a diagonal matrix constructed from the main diagonal of `X`.

**Table 1.** Subset of available functionality for the sparse matrix class, with brief descriptions. Optional additional arguments have been omitted for brevity. See http://arma.sourceforge.net/docs.html for more detailed documentation.

| Function | Description |
|---|---|
| `sp_mat X(1000,2000)` | Declare sparse matrix with 1000 rows and 2000 columns |
| `sp_cx_mat X(1000,2000)` | As above, but use complex elements |
| `X(1,2) = 3` | Assign a value of 3 to the element at location (1,2) of matrix X |
| `X = 4.56 * A` | Multiply matrix A by a scalar |
| `X = A + B` | Add matrices A and B |
| `X = A * B` | Multiply matrices A and B |
| `X( span(1,2), span(3,4) )` | Provide read/write access to a submatrix of X |
| `X.diag(k)` | Provide read/write access to diagonal k of X |
| `X.print()` | Print matrix X to the terminal |
| `X.save(filename, format)` | Store matrix X as a file |
| `speye(rows, cols)` | Generate a sparse matrix with values on the main diagonal set to one |
| `sprandu(rows, cols, density)` | Generate a sparse matrix with random non-zero elements |
| `sum(X, dim)` | Sum of elements in each column (dim = 0) or row (dim = 1) |
| `min(X, dim); max(X, dim)` | Obtain the extremum value in each column (dim = 0) or row (dim = 1) |
| `X.t()` or `trans(X)` | Return the transpose of matrix X |
| `kron(A, B)` | Kronecker tensor product of matrices A and B |
| `repmat(X, rows, cols)` | Replicate matrix X in a block-like fashion |
| `norm(X, p)` | Compute the p-norm of vector or matrix X |
| `normalise(X, p, dim)` | Normalise each column (dim = 0) or row (dim = 1) to unit p-norm |
| `trace(A.t() * B)` | Compute the trace of $A^T B$ without explicit transpose and multiplication |
| `diagmat(A + B)` | Obtain the diagonal matrix of $A + B$ without full matrix addition |
| `eigs_gen(eigval, eigvec, X, k)` | Compute the k largest eigenvalues and eigenvectors of matrix X |
| `svds(U, s, V, X, k)` | Compute k singular values and singular vectors of matrix X |
| `x = spsolve(A, b)` | Solve the sparse system Ax = b for x |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sanderson, C.; Curtin, R. Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation. *Math. Comput. Appl.* **2019**, *24*, 70.
https://doi.org/10.3390/mca24030070
