User-Friendly Sparse Matrices with Hybrid Storage and Template-Based Expression Optimisation

Despite the importance of sparse matrices in numerous fields of science, software implementations remain difficult to use for non-expert users, generally requiring the understanding of underlying details of the chosen sparse matrix storage format. In addition, to achieve good performance, several formats may need to be used in one program, requiring explicit selection and conversion between the formats. This can be both tedious and error-prone, especially for non-expert users. Motivated by these issues, we present a user-friendly sparse matrix class for the C++ language, with a high-level application programming interface deliberately similar to the widely used MATLAB language. The class internally uses two main approaches to achieve efficient execution: (i) a hybrid storage framework, which automatically and seamlessly switches between three underlying storage formats (compressed sparse column, Red-Black tree, coordinate list) depending on which format is best suited and/or available for specific operations, and (ii) template-based meta-programming to automatically detect and optimise execution of common expression patterns. To facilitate relatively quick conversion of research code into production environments, the class and its associated functions provide a suite of essential sparse linear algebra functionality (eg., arithmetic operations, submatrix manipulation) as well as high-level functions for sparse eigendecompositions and linear equation solvers. The latter are achieved by providing easy-to-use abstractions of the low-level ARPACK and SuperLU libraries. The source code is open and provided under the permissive Apache 2.0 license, allowing unencumbered use in both open-source projects and closed-source commercial products.


Introduction
Recent decades have seen the frontiers of scientific computing increasingly push towards the use of larger and larger datasets. In fact, frequently the data to be represented is so large that it cannot fully fit into working memory. Fortunately, in many cases the data has many zeros and can be represented in a compact manner, allowing users to work with sparse matrices of extreme size with few non-zero elements. However, converting code from using dense matrices to using sparse matrices-a common task when scaling code to larger data-is not always straightforward.
Existing open-source frameworks may provide several separate sparse matrix classes, each with their own data storage format. For instance, SciPy [23] has 7 sparse matrix classes: bsr matrix, coo matrix, csc matrix, csr matrix, dia matrix, dok matrix, and lil matrix. Each storage format is best suited for efficient execution of a specific set of operations (eg., matrix multiplication vs. incremental matrix construction). Other frameworks may provide only one sparse matrix class, with severe runtime penalties if it is not used in the right way. This can be challenging and bewildering for users who simply want to create and use sparse matrices, and do not have the time, expertise, or desire to understand the advantages and disadvantages of each format. To achieve good performance, several formats may need to be used in one program, requiring explicit selection and conversion between the formats. This plurality of sparse matrix classes complicates the programming task, increases the likelihood of bugs, and adds to the maintenance burden. Motivated by the above issues, we present a user-friendly sparse matrix class for the C++ language [29], with a high-level application programming interface (function syntax) that is deliberately similar to MATLAB. The sparse matrix class uses a hybrid storage framework, which automatically and seamlessly switches between three data storage formats, depending on which format is best suited and/or available for specific operations: • Compressed Sparse Column (CSC), used for efficient and nuanced implementation of core arithmetic operations such as matrix multiplication and addition, as well as efficient reading of individual elements; • Red-Black Tree (RBT), used for both robust and efficient incremental construction of sparse matrices (i.e., construction via setting individual elements one-by-one, not necessarily in order); • Coordinate List (COO), used for low-maintenance and straightforward implementation of relatively complex and/or lesser-used sparse matrix functionality.
The COO format is important to point out, as the source code for the sparse matrix class is distributed and maintained as part of the open-source Armadillo library [25]. Due to its simpler nature, the COO format facilitates functionality contributions from time-constrained and/or nonexpert users, as well as reducing maintenance and debugging overhead for the library maintainers.
To further promote efficient execution, the sparse matrix class implements a delayed evaluation 1 approach [19] based on C++ features such as operator overloading [29] and template metaprogramming [1,31]. In contrast to simply using brute-force evaluation of mathematical expressions, the delayed evaluation framework allows automatic compile-time analysis of such expressions, which in turns allows for automatic detection and optimisation of common expression patterns.
Overall, the sparse matrix class provides an intuitive interface that is very close to a typical dense matrix API; this can help with rapid transition of dense-specific code to sparse-specific code. In addition, we demonstrate that the overhead of the hybrid format is minimal, and that the format is able to choose the optimal representation for a variety of sparse linear algebra tasks. This makes the format and implementation suitable for real-world prototyping and production usage.
Although there are many other sparse matrix implementations in existence, to our knowledge ours is the first to offer a unified interface with automatic format switching under the hood. Most toolkits are limited to either a single format or multiple formats the user must manually convert between. As mentioned earlier, SciPy contains no fewer than seven formats, and the comprehensive SPARSKIT package [24] contains 16. In these toolkits the user must manually convert between formats. On the other hand, both MATLAB and GNU Octave [11] contain sparse matrix implementations, but they supply only the CSC format [9], meaning that users must write their code in special ways to ensure its efficiency [20].
We continue the paper as follows. In Section 2 we overview the functionality provided by the sparse matrix class and its associated functions. The delayed evaluation approach is overviewed in Section 3. In Section 4 we describe the underlying storage formats used by the class, and the scenarios that each of the formats is best suited for. In Section 5 we discuss the costs for switching between the formats. Section 6 provides an empirical evaluation showing the advantages of the hybrid storage framework and the delayed evaluation approach. The salient points and avenues for further exploitation are summarised in Section 7. This paper is a revised and extended version of our earlier work [26].
To allow prototyping directly in C++ as well as to facilitate relatively quick conversion of research code into production environments, the sparse matrix class and its associated functions provide a userfriendly suite of essential sparse linear algebra functionality, including fundamental operations such as addition, matrix multiplication and submatrix manipulation. Various sparse eigendecompositions and linear equation solvers are also provided. C++ language features such as overloading of operators (eg., * and +) [29] are exploited to allow mathematical operations with matrices to be expressed in a concise and easy-to-read manner. For example, given sparse matrices A, B, and C, a mathematical expression such as where sp mat is our sparse matrix class. Low-level details such as memory management are hidden, allowing the user to concentrate effort on mathematical details. Table 1 lists a subset of the available functionality, while Figure 1 contains a complete C++ program which briefly demonstrates usage of the sparse matrix class. Sparse eigendecompositions and linear equation solutions are accomplished through integration with low-level routines in the de facto standard ARPACK [17] and SuperLU libraries [18]. The resultant high-level functions automatically take care of the cumbersome and errorprone low-level management required with these libraries.
In effect, the aggregate of the sparse matrix class, operator overloading and associated functions on sparse matrices is an instance of a Domain Specific Language (sparse linear algebra) embedded within the host C++ language [21,27]. This allows complex algorithms relying on sparse matrices to be easily developed and integrated within a larger C++ program, making the sparse matrix class directly useful in application/product development.

Template-Based Optimisation of Compound Expressions
The sparse matrix class uses a delayed evaluation approach, allowing several operations to be combined to reduce the amount of computation and/or temporary objects. In contrast to brute-force evaluations, delayed evaluation can provide considerable performance improvements as well as reduced memory usage [32]. The delayed evaluation machinery is accomplished through template metaprogramming [1,31], where a type-based signature of a compound expression (set of consecutive mathematical operations) is automatically constructed. The C++ compiler is then automatically induced to detect common expression patterns at compile time, followed by selecting the most computationally efficient implementations. As an example of the possible efficiency gains, let us consider the expression trace(A.t() * B), which often appears as a fundamental quantity in semidefinite programs [30]. These computations are thus used in a wide variety of diverse fields, most notably machine learning [5,13,16]. A brute-force implementation would evaluate the transpose first, A.t(), and store the result in a temporary matrix T1. The next operation would be a time consuming matrix multiplication, T1 * B, with the result stored in another temporary matrix T2. The trace operation (sum of diagonal elements) would then be applied on T2. The explicit transpose, full matrix multiplication and creation of the temporary matrices is suboptimal from an efficiency point of view, as for the trace operation we require only the diagonal elements of the A.t() * B expression.
Template-based expression optimisation can avoid the unnecessary operations. Let us declare two lightweight objects, Op and Glue, where Op objects are used for representing unary operations, while Glue objects are used for representing binary operations. The objects are lightweight as they do not store actual sparse matrix data; instead the objects only store references to matrices and/or other Op and Glue objects. Ternary and more complex operations are represented through combinations of Op and Glue objects. The exact type of each Op and Glue object is automatically inferred from a given mathematical expression through template meta-programming.
In our example, the expression A.t() is automatically converted to an instance of the lightweight Op object with the following type: Op<sp mat, op trans> where Op<...> indicates that Op is a template class, with the items between '<' and '>' specifying template parameters. In this case the Op<sp mat, op trans> object type indicates that a reference to a matrix is stored and that a transpose operation is requested. In turn, the compound expression A.t() * B is converted to an instance of the lightweight Glue object with the following type: Glue< Op<sp mat, op trans>, sp mat, glue times> where the Glue object type in this case indicates that a reference to the preceding Op object is stored, a reference to a matrix is stored, and that a matrix multiplication operation is requested. In other words, when a user writes the expression trace(A.t() * B), the C++ compiler is induced to represent it internally as trace(Glue< Op<sp mat, op trans>, sp mat, glue times>(A,B)).
There are several implemented forms of the trace() function, one of which is automatically chosen by the C++ compiler to handle the Glue< Op<sp mat, op trans>, sp mat, glue times> expression. The specific form of trace() takes references to the A and B matrices, and executes a partial matrix multiplication to obtain only the diagonal elements of the A.t() * B expression. All of this is accomplished without generating temporary matrices. Furthermore, as the Glue and Op objects only hold references, they are in effect optimised away by modern C++ compilers [31]: the resultant machine code appears as if the Glue and Op objects never existed in the first place.
The template-based delayed evaluation approach has also been employed for other functions, such as the diagmat() function, which obtains a diagonal matrix from a given expression. For example, in the expression diagmat(A + B), only the diagonal components of the A + B expression are evaluated.

Underlying Sparse Storage Formats
The three underlying storage formats (CSC, RBT, COO) were chosen to give overall efficient execution across several use cases, as well as minimising the difficulty of implementation and code maintenance burden where possible. Specifically, our focus is on the following main use cases: (i) flexible ad-hoc construction and element-wise modification of sparse matrices via unordered insertion of elements, where each new element is inserted at a random location; (ii) incremental construction of sparse matrices via quasi-ordered insertion of elements, where each new element is inserted at a location that is past all the previous elements according to columnmajor ordering; (iii) multiplication of dense vectors with sparse matrices; (iv) multiplication of two sparse matrices; (v) operations involving bulk coordinate transformations, such as flipping matrices column-or rowwise.
Below we briefly describe each storage format and its benefits and limitations. We use N to indicate the number of non-zero elements of the matrix, while n rows and n cols indicate the number of rows and columns, respectively. (c) corresponding RBT representation, where each node is expressed by (i, v), with i indicating a linearly encoded matrix location and v indicating the value held at that location; (d) corresponding COO representation. Following C++ convention [29], we use zero-based indexing.

Compressed Sparse Column (CSC)
The CSC format [24] uses column-major ordering where the elements are stored column-by-column, with consecutive non-zero elements in each column stored consecutively in memory. Three arrays are used to represent a sparse matrix: (i) the values array, which is a contiguous array of N floating point numbers holding the non-zero elements, (ii) the rows array, which is a contiguous array of N integers holding the corresponding row indices (ie., the n-th entry contains the row of the n-th element), and (iii) the col offsets array, which is a contiguous array of n cols+1 integers holding offsets to the values array, with each offset indicating the start of elements belonging to each column.
Following C++ convention [29], all arrays use zero-based indexing, ie., the initial position in each array is denoted by 0. For consistency, element locations within a matrix are also encoded as starting at zero, ie., the initial row and column are both denoted by 0. Furthermore, the row indices for elements in each column are kept sorted in ascending manner. In many applications, sparse matrices have more non-zero elements than the number of columns, leading to the col offsets array being typically much smaller than the values array.  Figure 2 The CSC format is well-suited for efficient sparse linear algebra operations such as vector-matrix multiplication. This is due to consecutive non-zero elements in each column being stored next to each other in memory, which allows modern CPUs to speculatively read ahead elements from the main memory into fast cache memory [22]. The CSC format is also suited for operations that do not change the structure of the matrix, such as element-wise operations on the non-zero elements (eg., multiplication by a scalar). The format also affords relatively efficient random element access; to locate an element (or determine that it is not stored), a single lookup to the beginning of the desired column can be performed, followed by a binary search [6] through the rows array to find the element.
While the CSC format provides a compact representation yielding efficient execution of linear algebra operations, it has two main disadvantages. The first disadvantage is that the design and implementation of efficient algorithms for many sparse matrix operations (such as matrix-matrix multiplication) tend to be non-trivial [3,24]. This stems not only from the sparse nature of the data, but also due to the need to (i) explicitly keep track of the column offsets, and (ii) keep the row indices for elements in each column sorted in ascending manner. In our experience, the process of designing and implementing efficient matrix processing algorithms in CSC is a time-consuming affair -it is both finicky and prone to subtle bugs.
The second disadvantage of CSC is the computational effort required to insert a new element [9]. In the worst-case scenario, memory for three new larger-sized arrays (containing the values and locations) must first be allocated, the position of the new element determined within the arrays, data from the old arrays copied to the new arrays, data for the new element placed in the new arrays, and finally the memory used by the old arrays deallocated. As the number of elements in the matrix grows, the entire process becomes slower.
There are opportunities for some optimisation, such as using oversized storage to reduce memory allocations, where a new element past all the previous elements can be readily inserted. However, this does not help when a new non-zero element is inserted between two existing non-zero elements. It is also possible to perform batch insertions with some speedup by first sorting all the elements to be inserted and then merging with the existing data arrays. While the above approaches can be effective, they require the user to explicitly deal with cumbersome low-level storage details instead of focusing on high-level functionality.
The CSC format was chosen over the related Compressed Sparse Row (CSR) format [24] for two main reasons: (i) to ensure compatibility with external libraries such as the SuperLU solver [18], and (ii) to ensure consistency with the surrounding infrastructure provided by the Armadillo library, which uses column-major dense matrix representation to take advantage of low-level functions provided by LAPACK [2].

Red-Black Tree (RBT)
To address the efficiency problems with element insertion at arbitrary locations, we first represent each element as a 2-tuple, l = (index, value), where index encodes the location of the element as index = row + column × n rows. Zero-based indexing is used. This encoding implicitly assumes column-major ordering of the elements. Secondly, rather than using a simple linked list or an array based representation, the list of the tuples is stored as a Red-Black Tree (RBT), a self-balancing binary search tree [6].
Briefly, an RBT is a collection of nodes, with each node containing the 2-tuple described above and links to two children nodes. There are two constraints: (i) each link points to a unique child node, and (ii) there are no links to the root node. The index within each 2-tuple is used as the key to identify each node. An example of this structure for a simple sparse matrix is shown in Figure 2(c). The ordering of the nodes and height of the tree (number of node levels below the root node) is controlled so that searching for a specific index (ie., retrieving an element at a specific location) has worst-case complexity of O(log N ). Insertion and removal of nodes (ie., matrix elements), also has the worst-case complexity of O(log N ). If a node to be inserted is known to have the largest index so far (eg., during incremental matrix construction), the search for where to place the node can be omitted, which in practice can considerably speed up the insertion process.
With the above element encoding, traversing an RBT in an ordered fashion (from the smallest to largest index) is equivalent to reading the elements in column-major ordering. This in turn allows for quick conversion of matrix data stored in RBT format into CSC format. The location of each element is simply decoded via row = (index mod n rows), and column = index/n rows , where, for clarity, z is the integer version of z, rounded towards zero. These operations are accomplished via direct integer arithmetic on CPUs.
In our hybrid storage framework, the RBT format is used for incremental construction of sparse matrices, either in an ordered or unordered fashion, and a subset of element-wise operations (such as inplace addition of values to specified elements). This in turn enables users to construct sparse matrices in the same way they might construct dense matrices-for instance, a loop over elements to be inserted without regard to storage format.
While the RBT format allows for fast element insertion, it is less suited than CSC for efficient linear algebra operations. The CSC format allows for exploitation of fast caches in modern CPUs due to the consecutive storage of non-zero elements in memory [22]. In contrast, accessing consecutive elements in the RBT format requires traversing the tree (following links from node to node), which in turn entails accessing node data that is not guaranteed to be consecutively stored in memory. Furthermore, obtaining the column and row indices requires explicit decoding of the index stored in each node, rather than a simple lookup in the CSC format.

Coordinate List Representation (COO)
The Coordinate List (COO) is a general concept where a list L = (l 1 , l 2 , · · · , l N ) of 3-tuples represents the non-zero elements in a matrix. Each 3-tuple contains the location indices and value of the element, ie., l = (row, column, value). The format does not prescribe any ordering of the elements, and a simple linked list [6] can be used to represent L. However, in a computational implementation geared towards linear algebra operations [24], L is often represented as a set of three arrays: (i) the values array, which is a contiguous array of N floating point numbers holding the non-zero elements of the matrix; (ii) the rows array, a contiguous array of N integers holding the row index of the corresponding value; and (iii) the columns array, a contiguous array of N integers holding the column index of the corresponding value.
As per the CSC format, all arrays use zero-based indexing, ie., the initial position in each array is 0. The array-based representation of COO is related to CSC, with the main difference that for each element the column indices are explicitly stored. This leads to the primary advantage of the COO format: it can greatly simplify the implementation of matrix processing algorithms. It also tends to be a natural format many non-expert users expect when first encountering sparse matrices. However, due to the explicit representation of column indices, the COO format contains redundancy and is hence less efficient (spacewise) than CSC for representing sparse matrices. An example of this is shown in Figure 2(d).
To contrast the differences in effort required in implementing matrix processing algorithms in CSC and COO, let us consider the problem of sparse matrix transposition. When using the COO format this is trivial to implement: simply swap the rows array with the columns array and then re-sort the elements so that column-major ordering is maintained. However, the same task for the CSC format is considerably more specialised: an efficient implementation in CSC would likely use an approach such as the elaborate TRANSP algorithm by Bank and Douglas [3], which is described through a 47-line pseudocode algorithm with annotations across two pages of text.
Our initial implementation of sparse matrix transposition used the COO based approach. COO was used simply due to shortage of available time for development and the need to flesh out other parts of sparse matrix functionality. When time allowed, we reimplemented sparse matrix transposition to use the abovementioned TRANSP algorithm. This resulted in considerable speedups, due to no longer requiring the time-consuming sort operation. We verified that the new CSC-based implementation is correct by comparing its output against the previous COO-based implementation on a large set of test matrices.
The relatively straightforward nature of COO format hence makes it well-suited for: (i) functionality contributed by time-constrained and/or non-expert users, (ii) relatively complex and/or less-common sparse matrix operations, and (iii) verifying the correct implementation of algorithms in the more complex CSC format. The volunteer driven nature of the Armadillo project makes its vibrancy and vitality depend in part on contributions received from users and the maintainability of the codebase. The number of core developers is small (ie., the authors of this paper), and hence difficult-to-understand or difficult-to-maintain code tends to be avoided, since the resources are simply not available to handle that burden.
The COO format is currently employed for less-commonly used tasks that involve bulk coordinate transformations, such as reverse() for flipping matrices column-or row-wise, and repelem(), where a matrix is generated by replicating each element several times from a given matrix. While it is certainly possible to adapt these functions to directly use the more complex CSC format, at the time of writing we have spent our time-constrained efforts on optimising and debugging more commonly used parts of the sparse matrix class.

Automatically Switching Between Storage Formats
To avoid the problems associated with selection and manual conversion between formats, our sparse matrix class uses a hybrid storage framework that automatically and seamlessly switches between the data storage formats described in Section 4. By default, matrix elements are stored in CSC format. When required, data in CSC format is internally converted to either the RBT or COO format, on which an operation or set of operations is performed. The matrix is automatically converted ('synced') back to the CSC format the next time an operation requiring the CSC format is performed.
The actual underlying storage details and conversion operations are completely hidden from the user, who may not necessarily be knowledgeable about (or care to learn about) sparse matrix storage formats. This allows for simplified user code, which in turn increases readability and lowers maintenance. In contrast, other toolkits without automatic format conversion can cause either slow execution (as a non-optimal storage format might be used), or require many manual conversions. As an example, Figure 3 shows a short Python program using the SciPy toolkit [23] and a corresponding C++ program using the hybrid sparse matrix class. Manually initiated format conversions are required for efficient execution in the SciPy version; this causes both development time and code required to increase. If the user does not carefully consider the type of their sparse matrix at all times, they are likely to write inefficient code. In contrast, in the C++ program the format conversion is done automatically and behind the scenes.
A potential drawback of the automatic conversion between formats is the added computational cost. However, it turns out that COO/CSC conversions can be done in time that is linear in the number of non-zero elements in the matrix, and that CSC/RBT conversions can be done at worst in log-linear time. Since most sparse matrix operations are more expensive (eg., matrix multiplication), the conversion overhead turns out to be mostly negligible in practice. Below we present straightforward algorithms for conversion and note their asymptotic complexity in terms of the O notation [6]. This is followed by discussing practical considerations that are not directly taken into account by the O notation.

Conversion Between COO and CSC
Since the COO and CSC formats are quite similar, the conversion algorithms are straightforward. In fact the only parts of the formats to be converted are the columns and col offsets arrays with the rows and values arrays remaining unchanged.
The algorithm for converting COO to CSC is given in Figure 4 N + 2n cols). As in most applications the number of non-zero elements will be considerably greater than the number of columns, the overall asymptotic complexity in these cases is O (N ). The corresponding algorithm for converting CSC to COO is shown in Figure 4(b). In essence the col offsets array is unpacked into a columns array with length N . As such, the asymptotic complexity of this operation is O(N ).

Conversion Between CSC and RBT
The conversion between the CSC and RBT formats is also straightforward and can be accomplished using the algorithms shown in Figure 5. In essence, the CSC to RBT conversion involves encoding the 1 proc CSC to RBT 2 input: N , n rows, n cols (integer scalars) 3 input: values, rows, col offsets (CSC arrays) 4 declare red-black tree T 5 forall j ∈ [0, n cols): insert node l into T 12 output: T (red-black tree) 1 proc RBT to CSC 2 input: N , n rows, n cols (integer scalars) 3 input: T (red-black tree) 4 allocate array values with length N 5 allocate array row indices with length N 6 allocate array col offsets with length n cols + 1 7 forall j ∈ [0, n cols]: col offsets[j] ← 0 8 k ← 0 9 foreach node l ∈ T, where l = (index,value): location of each matrix element to a linear index, followed by inserting a node with that index and the corresponding element value into the RBT. The worst-case complexity for inserting all elements into an RBT is O(N · log N ). However, as the elements in the CSC format are guaranteed to be stored according to column-major ordering (as per Section 4.1), and the location encoding assumes column-major ordering (as per Section 4.2), the insertion of a node into an RBT can be accomplished without searching for the node location. While the worst-case cost of O(N · log N ) is maintained due to tree maintenance (ie., controlling the height of the tree) [6], in practice the amortised insertion cost is typically lower due to avoidance of the search.
Converting an RBT to CSC involves traversing through the nodes of the tree from the lowest to highest index, which is equivalent to reading the elements in column-major format. The value stored in each node is hence simply copied into the corresponding location in the CSC values array. The index stored in each node is decoded into row and column indices, as per Section 4.2, with the CSC row indices and col offsets arrays adjusted accordingly. The worst-case cost for finding each element in the RBT is O(log N ), which results in the asymptotic worst-case cost of O(N · log N ) for the whole conversion. However, in practice most consecutive elements are in nearby nodes, which on average reduces the number of traversals across nodes, resulting in considerably lower amortised conversion cost.

Practical Considerations
Since the conversion algorithms given in Figures 4 and 5 are quite straightforward, the O notation does not hide any large constant factors. For COO/CSC conversions the cost is O(N ), while for CSC/RBT conversions the worst-case cost in O(N ·log N ). In contrast, many mathematical operations on sparse matrices have much higher computational cost than the conversion algorithms. Even simply adding two sparse matrices can be much more expensive than a conversion. Although the addition operation still takes O(N ) time (assuming N is identical for both matrices), there is a lot of hidden constant overhead, since the sparsity pattern of the resulting matrix must be computed first [24]. A similar situation applies for multiplication of two sparse matrices, which for square matrices takes O(N + n cols) time [8], but in practice tends to be much slower due to the many passes and extra overhead of computing the output sparsity structure [3].
Sparse matrix factorisations are much more expensive, meaning that any conversion overhead is essentially negligible. A sparse LU factorisation is superlinear [15] as well as other factorisations like the Cholesky factorisation, which costs O(n cols 3/2 ) time [14]. Other factorisations and higher-level operations exhibit similar complexity characteristics. Given this, the cost of format conversions is heavily outweighed by the user convenience that they allow.

Empirical Evaluation
To empirically demonstrate the usefulness of the hybrid storage framework and the template-based expression optimisation mechanism, we have performed a set of experiments, measuring the wall-clock time (elapsed real time) required for: (i) unordered element insertion into a sparse matrix, where the elements are inserted at random locations in random order; (ii) quasi-ordered element insertion into a sparse matrix, where each new inserted element is at a random location that is past the previously inserted element, under the constraint of columnmajor ordering; (iii) calculation of trace(A T B), where A and B are randomly generated sparse matrices; (iv) obtaining a diagonal matrix from the (A+B) expression, where A and B are randomly generated sparse matrices.
In all cases the sparse matrices have a size of 10,000×10,000, with four settings for the density of non-zero elements: 0.01%, 0.1%, 1%, 10%. The experiments were done on a machine with an Intel Xeon E5-2630L CPU running at 2 GHz, using the GCC v5.4 compiler. Each experiment was repeated 10 times, and the average wall-clock time is reported. The wall-clock time measures the total time taken from the start to the end of each run, and includes necessary overheads such as memory allocation. Figure 6 shows the average wall-clock time taken for element insertion done directly using the underlying storage formats (ie., CSC, COO, RBT, as per Section 4), as well as the hybrid approach which uses RBT followed by conversion to CSC. The CSC and COO formats use oversized storage as a form of optimisation (as mentioned in Section 4.1), where the underlying arrays are grown in chunks of 1024 elements in order to reduce both the number of memory reallocations and array copy operations due to element insertions.
In all cases bar one, the RBT format is the quickest for insertion, generally by one or two orders of magnitude. The conversion from RBT to CSC adds negligible overhead. For the single case of quasi-ordered insertion to reach the density of 0.01%, the COO format is slightly quicker than RBT. This is due to the relatively simple nature of the COO format, as well as the ordered nature of the element insertion where the elements are directly placed into the oversized COO arrays (ie., no sorting required). Furthermore, due to the very low density of non-zero elements and the chunked nature of COO array growth, the number of reallocations of the COO arrays is relatively low. In contrast, inserting a new element into RBT requires the allocation of memory for a new node, and modifying the tree to append the node. For larger densities (≥ 0.1%), the COO element insertion process quickly becomes more time consuming than RBT element insertion, due to to an increased amount of array reallocations and the increased size of the copied arrays. Compared to COO, the CSC format is more complex and has the additional burden of recalculating the column offsets (col offsets) array for each inserted element.  Figure 6: Wall-clock time taken to insert elements into a 10,000×10,000 sparse matrix to achieve various densities of non-zero elements. In (a), the elements are inserted at random locations in random order. In (b), the elements are inserted in a quasi-ordered fashion, where each new inserted element is at a random location that is past the previously inserted element, using column-major ordering.  , where A and B are randomly generated sparse matrices with a size of 10,000×10,000 and various densities of non-zero elements. The expressions were calculated with and without the aid of the template-based optimisation of compound expression described in Section 3. As per Table 1, X.t() returns the transpose of matrix X, while diagmat(X) returns a diagonal matrix constructed from the main diagonal of X. Figure 7 shows the wall-clock time taken to calculate the expressions trace(A.t()*B) and diagmat(A+B), with and without the aid of the automatic template-based optimisation of compound expression described in Section 3. For both expressions, employing expression optimisation leads to considerable reduction in the wall-clock time. As the density increases (ie., more non-zero elements), more time is saved via expression optimisation.
For the trace(A.t()*B) expression, the expression optimisation computes the trace by omitting the explicit transpose operation and performing a partial matrix multiplication to obtain only the diagonal elements. In a similar fashion, the expression optimisation for the diagmat(A+B) expression directly generates the diagonal matrix by performing a partial matrix addition, where only the diagonal elements of the two matrices are added. As well as avoiding full matrix addition, the generation of a temporary intermediary matrix to hold the complete result of the matrix addition is also avoided.
Motivated by a lack of easy-to-use tools for sparse matrix development, we have proposed and implemented a sparse matrix class in C++ that internally uses a hybrid storage framework. The framework automatically and seamlessly switches between several underlying formats, depending on which format is best suited and/or available for specific operations. This allows the user to write sparse linear algebra without requiring to consider the intricacies and limitations of various storage formats. In addition, a template meta-programming framework is used to automatically optimise several common expression patterns, resulting in faster execution.
The source code for the sparse matrix class and its associated functions is included in recent releases of the cross-platform and open-source Armadillo linear algebra library [25], available from http://arma.sourceforge.net. The code is provided under the permissive Apache 2.0 license [28], allowing unencumbered use in both open-source projects and commercial closed-source products.
The sparse matrix class has already been successfully used in open-source projects such as the mlpack library for machine learning [7], and the ensmallen library for mathematical function optimisation [4]. In both cases the sparse matrix class is used to allow various algorithms to be run on either sparse or dense datasets. Furthermore, bi-directional bindings for the class are provided to the R environment via the Rcpp bridge [12].
Future avenues for exploration include integrating more specialised matrix formats in order to automatically speed up specific operations. For example, the Skyline formats [10] are useful for Cholesky factorisation and related operations, while the compressed diagonal storage format [24] can be used for operations on the main diagonal.