SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

: Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientiﬁc and engineering applications. This paper proposes SURAA (translates to speed in arabic), a novel method for SpMV computations on graphics processing units (GPUs).The novelty lies in the way we group matrix rows into different segments, and adaptively schedule various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row ( npr ) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic Parallelism available in Nvidia GPUs is utilized to execute the group containing segments with the largest mean npr , providing improved load balancing and coalesced memory access, and hence more efﬁcient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high npr variance matrices from 13 diverse domains. SURAA outperforms the other tools by delivering 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV related challenges on GPUs including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.

Sparse Matrix-Vector (SpMV) multiplication is the key operation of iterative methods for solving linear equation systems [50]. Iterative solvers such as Jacobi and Gauss-Seidel mainly consist of multiple iterations of Sparse Matrix-Vector computations, whereas a single iteration in Krylov solvers includes several SpMV operations. Sparse Matrix-Vector (SpMV) Multiplication is a Level-2 Basic Linear Algebra Subprograms (BLAS) operation between a sparse matrix and a dense vector and can be represented mathematically as y = A × x + b [51]. Other sparse computations such as eigenvalue problems (Ax = λx) also require a large number of SpMV operations.
Implementation of parallel iterative solvers on different kinds of graphics processing units (GPUs) has numerous issues. Specialized storage structures are used to improve the performance of SpMV. These structures have design issues for translating it to GPUs. The major issues include coalesced memory access to both the sparse matrix A and the vector y, load balance among array threads and warps, thread divergence among a warp of threads, performance variance based on the structure of the sparse matrices, and the amount of memory access required for computations. These issues are due to the irregular sparsity structure of the matrix, which results in poor performance. Researchers have suggested several storage formats to improve SpMV on GPUs (see Section 3). Leading sparse libraries such as CUSP [52] and cuSPARSE [53] provide a number of sparse storage techniques including Coordinate (COO), Compressed Sparse Rows (CSR), ELLPACK (ELL), Hybrid (HYB), and Diagonal (DIA), and associated BLAS operations. However, the performance depends upon the sparsity structure and, due to the diversity in the sparsity pattern, they do not perform efficiently for all matrices. The burden of using an appropriate format for a given matrix usually lies with the user. Nevertheless, SpMV computations deliver low throughput due to the reasons mentioned above.
In this article, we propose SURAA (an Arabic word meaning speed), a novel method for SpMV computations on GPUs that addresses the major challenges of SpMV computations on GPUs, i.e., load balanced execution, coalesced memory access, and redundant computations. The novelty lies in the way we group matrix rows into different segments, and the way we dynamically schedule various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman-Diaconis rule [54,55]. To the best of our knowledge, no other work exists that used the Freedman-Diaconis rule for SpMV computations the way we used, or used adaptive kernels the way we have done in this paper. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic parallelism available in Nvidia GPUs (Major Version 3.5, NVIDIA Corporation, Santa Clara, CA, USA) is utilized to execute the group containing segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Note that the row segments in SURAA are dynamically scheduled by the dynamic kernel and are executed at the same time using the hardware streams on the GPU. The use of multiple streams provides higher performance by hiding the kernel launch overhead. This could be considered as reducing the kernel launch overhead, with the exception that the number of dynamic kernels that can be executed on GPU in parallel depends on the number of available Streaming Processors because the number is fixed for a given GPU. The number of segments that are added to the dynamically executed group is tunable. In our experiments, we find that depending upon the structure of the matrices, tuning the size of dynamically executed group enhances the performance of the scheme; see Section 4.4.
We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP) tools using widely used benchmarks comprising 26 matrices from 13 diverse domains. A total of seven other SpMV storage and computation techniques have been compared with SURAA. We have used three widely used performance metrics to evaluate SURAA and other techniques; throughput in Giga Floating Point Operations per second(GFLOP/s), speedup, and effective memory bandwidth (GB/s). Each scheme is also evaluated in terms of their performance with increasing npr variance (variance in the number of non-zeros per row) and nnz (the total number of non-zeros in the matrix).
On average, SURAA outperforms all other seven SpMV computation methods by achieving an average speedup of 40× for SELL-P, 29.89× against WPK1, 12.98× for CSR, 11.33× for BSR, 1.50× for HYB, 1.15× for WPK2, and 1.1× for CSR5. SURAA provides the best throughput for 20 matrices out of the 26 matrices in the benchmark dataset. The average collective speedup of SURAA against all the other schemes is 13.99×. The highest speedups achieved by SURAA against other schemes for any single matrix are 562×, 531.56×, 147×, 26.70×, 3.06×, 1.97×, and 1.94× against SELL-P, WPK1, CSR, BSR, HYB, WPK2, and CSR5, respectively. We repeat for clarity that these are SURAA's peak performance values against specific storage schemes for any single matrix. SURAA provides the highest mean throughput (GFLOP/s) of 15.75, followed by 14.28 for CSR5. SURAA also leads the table with the highest max throughput value of 36.70 GFLOP/s, followed by 26.8 for SELL-P. SURAA also provides the highest effective memory bandwidth of 226.25 GB/s compared to the second best 160.75 GB/s for SELL-P. SURAA also demonstrates a much lower setup overhead, equal to five SpMV execution time, compared to 69 and 95.5 SpMV executions for SELL-P and CSR5.
An important aspect of our analysis reveals (see Section 5.4, Page 24) that SURAA comparatively delivers higher performance with the increase in the npr variance and nnz. SURAA achieved the highest throughput (GFLOP/s) among all the schemes for the highest npr variance matrices.
We note that the speedup of SURAA against CSR5 and WPK2 was small, however, SURAA provided the highest throughput for 20 out of 26 matrices. More importantly, none of the schemes except SURAA delivered consistently high performance on all the performance benchmarks. SURAA code can be found at the GitHub online repository (see: http://github.com/stormvirux/suraa). cyan The rest of the paper is organized as follows. In Section 2, we discuss the background materials related to this paper. In Section 3, we discuss the state of the art. In Section 4, we describe our proposed technique. The experimental results from the implementation and analysis is given in Section 5. Section 6 concludes and summarizes future work.

Sparse Storage Schemes
Several sparse storage schemes have been proposed by researchers to improve the performance of sparse matrix computations. Many of them have been designed for matrices with particular sparsity structure or from a given application domain-for example, see [56]-which proposes a compact representation of sparse matrices arising from Markov Chains. However, there are a few general storage formats that are frequently used to store sparse matrices. Most of the sparse matrices are stored using one of these formats. We shall have a brief discussion regrading these formats in the following subsections. Figure 1 shows an example matrix A that we shall use to illustrate various sparse storage schemes. Implementation of these storage mechanisms and associated SpMVs can be found in cuSPARSE [53] and CUSP [52] libraries. The coordinate (COO) [57] format is a simple sparse storage scheme. Three arrays, row, col, and val are used to store the row indices, column indices, and values of the nonzero elements in the matrix. Figure 2 illustrates the COO representation of the example matrix A. Each GPU thread will compute the product of a non-zero with the associated element in vector x. Further reduction is required to sum the partial products to achieve the final results. The reduction procedure creates an overhead that slows down the SpMV.

Compressed Sparse Row (CSR)
The compressed sparse row (CSR) format like the COO format stores the column indices and nonzero values in arrays named col and val explicitly. However, a third array of row pointers, ptr, is used to store the row pointers. For an m × n matrix, ptr has length m + 1 and stores the offset into the ith row in ptr[i]. The last entry in ptr, which would otherwise correspond to the (m + 1)-st row, stores nnz, the number of nonzeros in the matrix. Figure 3 illustrates the CSR representation of an example matrix. SpMV for CSR based matrix can be performed using two techniques, CSR scalar and CSR vector [58]. In CSR scalar, one thread is assigned per row for SpMV operation. Each thread will compute the products and sum up the products for each row. However, due to workload imbalance and non-coalesced memory access, CSR scalar has poor performance. CSR vector is an improvement to CSR scalar, in which a warp (32 threads) is assigned to a row to perform SpMV. However, variance in the number of non-zero elements per row results in idle threads, which causes load imbalance, which leads to poor performance.

Hybrid ELL/COO (HYB)
The hybrid format was proposed by Bell and Garland to eliminate the issues with ELL [58]. The major disadvantage of ELL was the performance degradation on varying number of non-zeros per row of the matrix. The HYB storage scheme divides the matrix into two parts, a dense part and a sparse part. In the dense part, we store the typical number of non-zeros per row in the ELL storage format and the rest of entries belonging to rows with higher non-zeros in the COO format. The size of the ELL matrix needs to be determined normally from the input matrix. We can add a kth column to the ELL part if at least one third part of the matrix rows contains k non-zeroes. The remaining non-zeros in the rows are stored using the COO format. Another way to compute k is to use the frequency of the non-zeros per row as seen in CUSP implementation. The remaining non-zeros in the rows are stored using the COO format. An example of HYB storage for a sparse matrix A is shown in Figure 4. Hence, this storage technique requires two 2D arrays of size m × k for storing the ELLAPCK part, where k is the number of selected columns, and three separate arrays to store the COO part. If the actual matrix is of size m × n, then the value array of the COO part would have a dimension of (n − k).

Dynamic Parallelism
In this section, we give a general introduction to dynamic parallelism which was introduced in CUDA 5. It introduced nested parallelism with the help of nested kernel calls from the GPUs. In the earlier GPU parallel computing model, kernels were invoked in sequence from the host side. In order to achieve higher performance, each of these kernels had to expose ample parallelism.
Dynamic parallelism has been known to provide excellent performance for irregular applications such as graph-based applications and simulations [59,60]. Many graph-based applications produce dynamic sparse matrices/graphs that change during the execution. Graph algorithms such as PageRank, Hyperlink-Induced Topic Search (HITS), Random walk with restart is dominated by SpMV and has been shown to perform well with Dynamic Parallelism. The use of Dynamic parallelism simplifies the code and results in code near to the high-level description, which simplifies the development of algorithms.
Dynamic parallelism provides the ability to launch kernels from a kernel on the device. Each thread of the main kernel can launch a new kernel. A coarse-grained kernel can launch finer-grained kernels to do work where needed, which reduces the execution and data transfer as launch configuration decisions are made on runtime from the device. Algorithms requiring constructs that are difficult to program using a single level of parallelism, such as recursion, irregular loop structure, etc. [61], can be programmed in a finer way using Dynamic Parallelism.
Dynamic parallelism is only supported by devices of compute capability 3.5 and higher. It has an overhead for invocation similar to general CUDA kernels. An overhead is caused by runtime's execution tracking and management software and may result in decreased performance. The overhead is incurred when the applications link against device run-time library [61]. It could be possible to hide kernel launch overhead by a developer, but, in our work, we do not try to explicitly overlap or hide the kernel launch overhead. In this paper, it is not our intention to discuss, compare and/or quantify dynamic parallelism overhead with the counterpart approaches. Some of the works on the comparison and discussion of dynamic parallelism overhead are [59,[62][63][64].

Literature Survey
There are many studies that have attempted to improve the performance of Sparse Matrix vector multiplications on GPUs. Many of these have been developed to improve the performance of a specific sparsity structure. In this section, we shall discuss some of the approaches for SpMV computations on GPUs. Jagged Array Diagonals (JAD) format [65,66] can be considered as an optimization of ELL format on GPUs. Initial preprocessing is required to store a sparse matrix in JAD format. JAD provides better performance in SpMV kernel than ELL, but the occupancy in JAD is lesser than ELL. Abu-Sufah and Abdel Karim [67] introduced an improvement to the Transpose Jagged Diagonal Storage (TJDS) by the use of blocking technique, which is called Blocked Transpose Jagged Diagonal Storage (BTJDS). This technique does not achieve a good memory bandwidth utilization as compared to HYB and ELL.
Choi et al. [68] implemented the blocked version of CSR algorithm (BCSR) on GPU. Furthermore, they also propose a second sparse storage structure, which is the blocked version of ELL called the blocked ELLPACK (BELLPACK). ELL-R was introduced in [69] to improve the performance of the ELL storage format on GPUs. It has an additional data structure that helps in reducing the performance degradation due to thread divergence caused by non-zero entries in the ELL storage format. Kreutzer et al. [70] propose a sparse matrix storage scheme to reduce the space overhead of extra padded zeroes in the ELL format and improve the performance of SpMV on GPUs. The major disadvantage of this technique is that the multiplication by permutation destroys the dense blocks and diagonal elements which prevents cache reuse and degrades the performance.
ELLR-T is based on GPU kernel modification of ELL-R [69,71]. In ELL-R format, each row is handled by a thread during SpMV, whereas in ELLR-T, T number of threads operate on the row at the same time, similar to vector CSR (In vector CSR 32 threads operate on a row). Here, T can be 1, 2, 4, 8, 16, or 32. However, in this technique, the authors do not propose a way for selecting best T based on the input sparse matrix, rather they leave it to the user to select by experimentation. Dziekonski et al. [72] proposed a sparse matrix storage scheme called Sliced ELLR-T. The storage scheme was designed specifically to solve complex-valued sparse linear equations in computational electromagnetics. The proposed technique is based on Sliced ELL [73] and ELLR-T storage schemes. The amount of space required by this scheme is less than ELL-R and ELLR-T but more than CSR. The bigger the matrix, the better the performance of the SpMV kernel using this scheme.
Padded Sliced ELLPACK (SELL-P) [74] is derived from SELLPACK [73] and SELL-C-σ [75], wherein the rows with a similar number of nonzero elements are gathered into slices, thus minimizing the storage overhead as compared to ELLPACK scheme. By assigning multiple threads per row, as in ELLR-T [69], they modify SELL-C-σ by padding rows with zeroes such that the row length of each slice becomes a multiple of the number of threads assigned to each row. The implementation of SELL-P is available in the MAGMA library. The implementation of the sorting in the MAGMA library is performed on the CPU, which increases the overhead.
All the schemes discussed until now are focused on the storage of the sparse matrices and they mostly use just one kernel or fixed number of threads as opposed to SURAA.
Maggioni and Berger-Wolf [76] proposed a storage scheme based on ELL called Adaptive ELL that balances the non-zero rows to GPU warps. In this technique, the computations on the nonzero elements are distributed on the warps (hardware level blocks) adaptively. Using a best-fit heuristic greedy policy, they try to fill the warps with the rows. They solve an optimization problem of maximizing the efficiency of each warp and minimizing the load unbalance among the warps. This is a finer level (lower level) optimization unlike our technique, which uses multiple parallel kernels and variable threads to perform scheduling on sorted segments and dynamic parallelism. Adaptive ELL also requires four extra data structures. Of these four extra data structures, three data structures will have the size of total working warps in the GPU and one will have the size of the number of rows in the matrix. Older GPUs, such as Tesla K20 and Tesla K40, have lower memory and hence this technique will lead to difficulty in processing larger matrices. However, many of the newer GPUs such as Pascal P100 and Volta V100 have 12 to 16 GBs of DRAM and can also access memory on other GPUs over high bandwidth buses.
Ashari et al. [77] propose a sparse storage scheme based on CSR called Adaptive Compressed Sparse Row (ACSR). In this scheme, the rows are grouped into bins, such that a bin consists of rows with an equal number of non-zeros per row. The rows of the sparse matrix are moved into bins depending on the number of non-zeros per row. The binning is done based on powers of 2. Since the binning is not done on consecutive rows, it is termed as inter-binning. Each bin is either executed by a bin specific kernel or a row specific dynamic kernel. However, as their work is based on CUDA 5.0, in their technique, dynamic parent kernel does not process more than 2048 rows. Our work exploits virtual pooling and execute rows greater than 2048. SURAA is also different from their work in that it achieves coalesced memory access due to reordering the similar loads (rows with similar npr) together in the memory rather than logically performing inter-binning. Moreover, we differ in the way we assemble matrix rows into different segments, and segments into groups, as we use the Freedman-Diaconis rule [54,55] to achieve a relatively balanced segment size. Consequently, we achieve higher levels of load balancing due to the way we devise row segments and allocate them to dynamic kernels. This decreases the idle number of threads and leads to better memory coalescing. The row segments in SURAA are dynamically scheduled by the dynamic kernel and are executed at the same time using the hardware streams on the GPU. The use of multiple streams provides higher performance by hiding the kernel launch overhead. However, the number of dynamic kernels that can be executed on GPU in parallel depends on the number of available Streaming Processors because these are fixed for a given GPU.
ELL-WARP is a variant of ELLPACK-R and padded JAD (JDS) developed for finite element unstructured grids by Wong et al. [78]. They propose two kernels, ELL WARP(K1) also known as WPK1 and ELL WARP v2 (K2) which is also known as WPK2. Both these kernels initially sort the rows by length. WPK1 then arranges rows into groups of warp size and pads accordingly and then it reorders the data within each warp in a column-major coalesced pattern. However, in WPK2, the rows are subdivided recursively beginning when the number of elements a thread should execute exceeds the prescribed threshold. The number of elements a thread should execute are continued to be subdivided until the number reaches below the threshold. This is done to efficiently process the larger rows as compared to WPK1.
Lu and Vinter [79] propose a storage format for sparse matrix-vector multiplication called CSR5. The 5 in the name indicates the number of data structures used to store the sparse matrices. The designed data storage scheme is insensitive to the sparsity of the data structure provides similar performances for sparse as well as dense matrices. This technique uses a combination of row block method [77] and segmented sum method [80]. The design objective is improved load balancing and reduction in overhead due to memory access and synchronization. The matrix transpose operations and the complexity causes overhead. Moreover, higher meta-data requirement increases the required memory for the representation. However, CSR5 does provide improved performance over the earlier SpMV techniques.
Hou et al. [81] proposed an auto-tuning framework for AMD APU platforms to find appropriate binning scheme and select appropriate kernel for each bin. The process of grouping rows with similar number of nonzeros together is referred to as binning by the authors. The proposed binning method is a coarse grained method with different granularities. The auto tuning technique selects the best granularity for a given matrix using machine learning techniques (C5.0). In their binning strategy they combine multiple neighboring rows to be a single virtual row. The granularity or the number of neighboring rows selected is determined by the auto-tuner. The autotuner hence needs to be trained with appropriate data for the proper feature selection and prediction which is non-trivial. Moreover, for irregular matrices with highly varying non-zero per row, a fixed granularity will require each bin requiring a different kernel. Executing large number of kernels at the same time (which is subject to the device capability) requires multiple streams which is hardware dependent and hence might degrade the performance if the number of bins increases. Unlike this work, we do not use machine learning; instead, we segment the matrix on the basis of Freedman-Diaconis rule [54,55]. The other major difference is the use of dynamic parallelism in our work and the way we adaptively schedule different kernels with varying number of threads, which addresses memory coalescing, thread divergence, and load balancing. Flegar and Anzt [82] propose a load-balanced GPU kernel based on COO for computing SpMV product. The proposed kernel use a "warp vote function" along with a warp level segmented scan to avoid atomic collisions. It also uses an "oversubscribing" parameter to determine the number of threads allocated per each physical core. The three arrays in the COO scheme is sorted with respect to the row index to decrease the number of atomic operations such that elements with same row index require only one atomic operation.
The major issues that need to be addressed are improved coalesced memory access to the matrix, better load balanced execution of SpMV, higher memory bandwidth utilization, preprocessing requirements, utilization and improvement in cache usage, and reduction of thread divergence. A comparative analysis of the related approaches for SpMV computations with respect to these issues have been provided in Table 1. The schemes are listed in chronological order of their publications. We observe that no single work has comprehensively addressed all the issues.
From the discussions, it is clear that due to the variance in the number of nonzeros per row, and differences in the sparsity structure, a single kernel with fixed number of threads would not be able to guarantee the best performance for various matrix sizes, application domains, and sparsity structures. Hence, in this work, we propose the SURAA method that balances the load efficiently and achieves coalesced memory access to attain the best overall performance. This is due to the fact that sparse matrices have a range of diverse sparsity structures, and dynamic kernels (together with Freedman-Diaconis (FD) rule based segmenting) allow us to dynamically schedule the matrix rows based on the matrix properties enabling improved memory and computation layout, and in turn providing improved load balancing, memory coalescing, and thread divergence.  4 Transposed Jagged Array Diagonals. 5 Padded Jagged Diagonals. 6 Ellpack Rows Thread. 7 Sliced Ellpack. 8 Sliced Ellpack Rows-Thread. 9 Sliced Ellpack-C. 10 Sliced Ellpack-P. 11 Ellpack. 12 Compressed Sparse Row.

SURAA: The Proposed Method and Tool
Load-balanced execution and coalesced memory access are the two crucial factors that we consider to improve the performance of SpMV. A load-balanced execution balances the workload of the threads and eradicates idle threads. When a group of consecutive threads accesses consecutive memory locations, we call it coalesced memory access. Coalesced memory access tremendously improves the throughput of data access from global memory and hence improves the SpMV performance.
The two major CSR based algorithms are the CSR scalar and the vector CSR (VCSR). The issue with scalar CSR is that it does not correctly balance workload as one thread is assigned per row. This results in imbalanced workload for the threads in a warp. The vector CSR rectifies this issue by assigning threads equal to the number of warps (32 threads) equal to closest power of the mean number of the non-zero per row on the GPU. Matrices that have a uniform number of non-zeros per row and are a multiple of a warp accelerates the SpMV. However, there are many matrices with less number of nnz per row, especially the matrices with the number of non-zeros per row less than sixteen. These matrices are not efficiently executed by VCSR as many GPU threads remain idle while the other threads in a warp are under execution. Therefore, warps that are assigned to rows with a substantial number of nonzeroes per row lag behind other warps with a fewer number of the non-zero per row.
Hence, it is essential to consider the sparsity structure and coalesced memory access to derive benefit from the merits and avoid the flaws of both the scalar and vector CSR techniques. To this end, we develop a load balanced dynamic technique based on the CSR storage format called SURAA. SURAA uses sorting, segmentation, asynchronous kernels, streams, and dynamic parallelism. Sorting is used to bring all the rows with similar non-zero elements together and segmentation is used to divide the work into manageable blocks to distribute the workload evenly. Dynamic kernel enables us to launch multiple child kernels from within the parent kernel which can efficiently execute the sorted segments.
The terminologies and the data structures used are illustrated in Table 2. The format is very similar to the CSR format, except that we introduce an array named perm to store the sorted indices. perm is of size m. A temporary vector, nprv, of size m is generated to speed up the pre-processing (sorting), which stores the number of non zero elements per row. In the rest of the paper, we shall use the terminologies defined in the Table 2 for our discussions. Table 2. The terminology and data structures used for matrix representation.

Structure Details
x The vector for multiplication m The number of rows in the matrix n The number of columns in the matrix y The result vector nnz The total number of non zeros npr The number of non zeros per row nprv Vector for storing the non zeros per row val Vector to store the non zeros of the matrix row_ptr Vector to store the row pointers to the matrix col_idx Vector to store the column indices to the non zeros perm Vector for storing the indices of the sorted matrix The SURAA tool is divided into two phases-(a) the setup phase and (b) the execution phasecomprising a total of six algorithms (details to follow). Algorithm 1 shows our main driver program which invokes both the setup phase and the execution phase. We shall discuss the driver algorithm and the two phases in the next subsections.

Algorithm 1 SURAA_Master_Program
Input: A csr (val, row_ptr, col_idx), vector : x, vector : perm, vector : nprv Output: vector : y 1: function MAIN 2: for Each segment s i ∈ G 3 do 4: end for Asynchronous returns for the kernel 6: for Each segment s j ∈ G 2 do 7: Thread per blocks as a multiple of 32 10: vectorSpMV << grid s j , stream j >> (s j , Y) 11: end for Asynchronous returns for the kernel 12:

Setup Phase
Algorithm 2 depicts the setup phase where the matrix is sorted, segmented, and grouped into various groups. The matrix is sorted row by row in the descending order of the number of non-zero elements per row. The rows with the same number of number of non-zeros are further sorted based on the column index of the first non-zero element, which further balances the load of vector x. The sorting can be done on the GPUs or the CPUs. In our work, we use a GPU based parallel radix sort to perform the sort. The function radixSort_Key(...) is used to perform the sorting of the nprv vector. It sorts the nprv vector and stores the permutations of the original nprv vector in the perm vector and the associated npr values in the nprSort vector. The input matrix is then rearranged using the perm vector.

Algorithm 2 SURAA_Setup_Phase
Input: A csr (val, row_ptr, col_idx), vector : x, vector : perm, vector : nprv Output: set : G 1 , set : G 2 , set : G for Each segment s do 10: if 32 < nprv[s i ] < max_child_threads then 11: else 15: end if 17: end for 18: end function Secondly, the sorted matrix is divided into various segments. The division into segments is based on the Freedman-Diaconis rule [54,55], which is given below: where Seg_size is the size of each segment, IQR is the Interquartile range, nprv has been defined already, and m is the number of rows in the matrix. Moreover, IQR(nprv) = q 34 − q 12 , where q 34 and q 12 are the third and first quartiles of nprv, respectively. In our algorithm, we calculate the quartiles of the sorted npr array (nprSort). Since the matrix is pre-sorted, we can get the optimal size of the segments as well as the number of segments using the Freedman-Diaconis rule for each matrix with O(1) complexity.
In the third step, the segments are allocated to various groups depending upon the average number of non-zero elements per row in each segment. This is also depicted in the left part of Figure 5. The number 0 to N indicates the npr for each row and the numbers 0 to M indicate the number of rows in the matrix. The red dashed line in the figure indicates the size of the warp (0 to 31) for GPUs, which is used as a threshold in our method. Since the matrix is sorted in the descending order of npr, we add each segment s i from the beginning of the matrix to Group 1 (G 1 ) until the mean number of non-zero per row for a segment becomes less than 32 or the total number of rows in G 1 exceeds max_child_threads (The number of child grids in the queue ready for execution). Since the segment size might not be a multiple of group size, if n × segment_size > max_child_threads, we will only include (n − 1) × segment_size > max_child_threads. The rest of the segments which has mean npr greater than or equal to 32 is allocated to G 2 . The remaining segments that have a mean number of non-zero per row less than 32 are added to G 3 .
The last segment might not have the same number of rows as the other segments and hence is executed separately. All the segments in G 3 are executed asynchronously on different streams using a scalar CSR kernel, whereas, all the segments in G 2 are executed by a modified vector CSR kernel asynchronously. Finally, we employ dynamic parallelism based kernel to run the segments in G 1 , which we shall discuss in greater details in the following subsections.
After setting up the matrix, we proceed to the execution phase where we schedule and allocate various segments to different kernels.  Figure 5 further illustrates the allocation of various segments to different kernels. The segments with rows containing larger npr are executed as a part of the dynamic kernel, whereas the other segments are executed based on the average length of the rows in the segment. The segments that form the part of the dynamic kernel have been shown in red. In the figure, the segments from Seg 1 to Seg l have npr less than or equal to max_child_threads (The blue dashed line shows the max_child_threads parameter) and hence will be grouped together in G 1 , which shall be executed by the dynamic kernel. The segment Seg k has mean npr less than 32 and hence is executed by the scalar CSR based kernel. The rest of the segments, between Seg 2 and Seg k are grouped together into G 2 and executed by the vector kernel.

Dynamic Parallel Kernel
Algorithm 3 is the parent algorithm for the dynamic kernel that will execute on the segments of G 1 . This parent kernel assigns child kernels, depicted in Algorithm 4, to each row. Each thread of the parent kernel will work on a row by invoking child kernels. Multiple warps (depending upon the row size) of the child algorithm will execute the row assigned to them. As Multiple warps are involved for a row in the child kernel, we require a reduction within each warp as well as among the multiple warps. We use the CUDA function _ sh f l_down_sync to compute the partial sums of the threads within a warp. The function _ sh f l_down_sync is faster than the _ sh f l_down function and is available in the CUDA version 9.0 and later. Each warp stores their respective partial sums in the shared memory (shared_mem) and a final reduction of the shared memory is performed to get the sum.
The size of the grids for the child kernels is dependent upon the number of non-zeros in each row. In our example shown in Figure 5, the parent kernel will then invoke child kernels for all the segments from Seg 1 to Seg 2 , i.e., for each row in G 1 , a child kernel is assigned to process the row. ChildKernel <<< Row 0 >>> executes the first row in the first segment and ChildKernel <<< Row 1 >>> executes on the second row of the first segment and so on throughout G 1 . A segment specific vector kernel is used to execute each segment at the same time using various cuda streams for each segment in G 2 . In addition, the last two segments, Seg k−1 and Seg k , are added to G 3 and executed using a scalar CSR based algorithm. In the Child kernel, initially we calculate the thread Id (tid), lane Id (lid), and warp Id (wid) to access the data structures. Each child thread computes the partial sum and stores in the sum vector. Segmented parallel reduction using the shuffle instruction (__sh f l_down_sync) is used to accumulate the partial results. The partial sum is then stored in the shared memory (shared_mem) for processing of all the relevant threads. We perform one more reduction to accumulate the final sum and store it in the result vector (y).

Scalar Kernel
Algorithm 5 illustrates the segment based scalar SpMV execution. SpMV for the segments in G 3 is done by this algorithm. In this algorithm, one thread is assigned per each row. For each row, the threads will compute the products of the non-zero elements with the corresponding vector x. Each thread will finally sum the products of each row and write to the corresponding row in the result vector y. The kernel is quite handy in handling the product of the smaller rows as compared to other kernels. For each segment, this kernel is launched with a total of ((segment_size + 256 − 1)/256) blocks. The total number of threads per block can be varied as required to achieve optimal performance. Algorithm 6 illustrates the segment based vector SpMV algorithm. In this algorithm, a total of Y threads operate on each row. We assign Y as 32, 64, or 128 depending on npr. Y would be the smallest multiple of 32 greater than npr. The calculation of Y can be seen in lines 7, 8, and 9 in Algorithm 1. As in the Child kernel (see the explanation of Algorithm 4), we calculate the thread Id (tid), lane Id (lid), and warp Id (wid) to access the data structures. A total of Y threads are assigned to each row. These Y threads will then use segmented parallel reduction using the shuffle instruction (__sh f l_down_sync) to accumulate the partial results. The partial sum is then stored in the shared memory (shared_mem) for processing of all the relevant threads. We perform one more reduction to accumulate the final sum and store it in the result vector (y). This is very similar to the CSR Vector scheme. The kernels for each segment is launched with (((segment_size * Y) + 256 − 1)/256) blocks.

A Note on max_child_threads
The parameter max_child_threads in our code is a tunable parameter. It is the number of child grids that are pending, running, suspended, or waiting in the queue ready for execution. Exceeding the max_child_threads in earlier CUDA versions between 5.0 and 5.9 results in discarding the kernel. Such kernels were not executed. max_child_threads signifies the size of the buffer that tracks the running kernels as well as the kernels to be executed. The maximum child kernels can be modified in such scenarios using cudaDeviceSetLimit() which takes in two parameters, cudaLimitDevRuntimePendingLaunchCount and x, where x is the new number of child kernels. Changing the Pending Launch Count changes the size of the buffer. However, this leads to degradation in performance for older CUDA devices. The default space of the buffer is set as 2048 by default.
However, from CUDA 6.0 onwards, in addition to the fixed sized pool of size cudaLimitDevRuntimePendingLaunchCount, a variable sized virtualized pool has been added to the buffer. The fixed sized pool is used until it is filled. After that, the virtualized pool is used. The cost associated with the virtualized pool is slightly higher, but in the overall scenario for SpMV we find that use of virtualized pool contributes to better performance until a threshold. Wrong pool size affects the performance of SpMV. In our experiments, we tried 1024, 2048, 4096, and 8192 as pool sizes. We will discuss the effects of pool sizes on our benchmark suite in Section 5.

Section Summary and Clarifications
We would like to clarify here that in using multiple dynamic kernels we do not imply that a single kernel with a fixed number of threads will not be able to guarantee a better or the best performance. The SURRA Master Program (Algorithm 1) first launches the Scalar (Algorithm 5) and Vector kernels (Algorithm 6) asynchronously. Subsequently, it invokes the dynamic parallelism kernel (Algorithm 3). Dynamic parallelism enables executing the child kernels (which are managing different rows: Algorithm 4) with different thread configurations, determined dynamically, and this is not possible if we were to use a static kernel.
We reiterate that the number of threads for the dynamic kernels are calculated during the execution, dynamically, using the Max_Child_Threads parameter and the properties of the matrix being multiplied to a vector. If we were to reuse the Algorithm 2 for the setup phase and then spawn three separate asynchronous, non-dynamic, kernels, we will not be able to dynamically adjust the configuration of each thread that is spawned by the kernels. The dynamic kernel allows us to spawn a different number of child kernels, up to Max_Child_Threads (as discussed in Section 4.4), each having its own thread configuration parameters and this is not possible with a non-dynamic kernel. Consequently, it will not allow us to achieve higher performance as we are able to in the current case (for the results, see Section 5).
In summary, the six algorithms, their descriptions, and the other discussions in this section show that we have utilized streams, warps, warp divergence, and dynamic kernels to efficiently use the GPU resources and launched the maximum possible number of concurrently executing kernels.

Results and Analysis
In this section, we evaluate the performance of SURAA against seven well-known and widely used SpMV storage and computation techniques. It comprises seven subsections. The details of the experimental testbed including the storage schemes used are given in Section 5.1. The details of the dataset comprising 26 matrices for comparatively benchmarking SURAA and other schemes are provided in Section 5.2. Section 5.3 analyzes in detail the comparative performance of SURAA against the seven other schemes using three performance metrics, namely throughput, speedup and effective memory bandwidth. Section 5.4 provides a comparative analysis of SURAA with other schemes with respect to npr variance of all the 26 matrices in our dataset. Section 5.6 compares the setup cost or overhead of SURAA with other SpMV computation techniques. Section 5.5 discusses the parametric configuration for SURAA's performance in terms of the number of maximum child threads (max_child_threads) that can be invoked by the dynamic kernel. Finally, Section 5.7 relates SURAA with our broader work on developing sparse iterative solvers.

Experimental Testbed
The platform used for the experiments is a part of the 230 TFLOPS Aziz supercomputer (ranked number 360 in June 2015 among the Top500 machines). It consists of 496 nodes and 11,904 cores in total. Each node contains two Intel Xeon E5-2695v2 processors (Intel Corporation, Santa Clara, CA, USA), each with 12 Cores (2.4 GHz). In addition, 380 of these nodes have 96 GB Random Access Memory, while each of the rest 112 nodes contain 256 GB. This sums up to a total of 66 TB memory. It also has two Nvidia K20X GPU (Kepler) accelerator nodes and two Intel Phi 5110P MIC accelerator nodes. It is controlled by 64-bit Red Hat Enterprise Linux v6.5 (Red Hat, Riyadh, Saudi Arabia). We used the K20 nodes to execute our code. We used CUDA toolkit 9.0 with C++ for the compilation of our code. The code was compiled using nvcc − O3 − arch = sm_35. We compare our scheme with the following seven SpMV storage and computation techniques: 1. The best CSR-based SpMV [58] from cuSPARSE v9.0 and CUSP v0.5.1 [52]. 2. The best HYB [58] from the above two libraries. 3. BSR format from cuSPARSE. The block size for a given matrix was obtained from the experiments based on the performance. The best performing block size was selected for each matrix. 4. The CSR5 storage technique (see [79] for details of the scheme). 5. The SELL-P [74] sparse storage scheme, which is a padded version of SELL-C-σ scheme [75].
The implementation of the SELL-P scheme that we have used in this paper is from the MAGMA library [83,84]. The best block sizes for the scheme were selected from our experiments based on the best performance. 6. The WPK1 and WPK2 storage techniques by Wong et al. [78]. Table 3 illustrates the selected sparse matrices in our benchmark suite. We have selected 26 matrices for the benchmarking. Fifteen of these matrices are selected based on the recommendations by [85], which are the most frequently used matrices by the researchers for SpMV and iterative solvers. The other included matrices have been widely used (see e.g., [67,79]). All the selected matrices are from the University of Florida Sparse Matrix Collection [86]. The chosen matrices are from diverse applications that include computational fluid dynamics, bioengineering, thermal, quantum chemistry, circuit simulation, power and web search. Figure 6 indicates the normalized mean and standard deviation of the non zeros per row for the matrices in our Benchmark suite. The red dots in the figure indicates the normalized mean and the bars indicate the normalized standard deviation. We can observe the irregularity in the number of non-zeros per row for these selected matrices. The high standard deviation for most of the matrices indicates the challenge in designing a uniform storage format and SpMV algorithm.

SpMV Performance
In this subsection, we evaluate the performance of SURAA in terms of three main performance metrics: (1) Throughput (GFLOP/s), (2) Average Speedup, and (3) Effective Memory Bandwidth (GB/s). We first define these three performance metrics in the this section.
Next, the throughput and speedup are evaluated in Section 5.3.1; it provides a detailed comparative analysis of SURAA including individual performance for all matrices in the dataset, average performance over all the matrices, best performance for any single matrix in the dataset, and a detailed discussion of comparison of SURAA with each of the other seven techniques (see the paragraph headings, WPK1 & WPK2, HYB, etc.).
The aggregate throughput and speedup are analyzed in Section 5.3.2. The effective memory bandwidth performance is compared in Section 5.3.3.
The throughput in GFLOP/s is calculated based on the effective flop count in an SpMV execution, as given below in Equation (2): In the above equation, nnz denotes the total number of non-zero elements in the given matrix and time denotes the execution time in seconds.
The average speedup is calculated using the standard method depicted in Equation (3) Here, in the above equation, SURAA i indicates the throughput in GFLOP/s for each matrix from our benchmark suite of 26 matrices and re f erence_scheme i indicates the throughput (in GFLOP/s) of the reference scheme against which we make comparison, i.e., CSR, HYB, BSR, CSR5, SELL-P, WPK1, or WPK2.
The effective memory bandwidth in GB/s is calculated as in Equation (4): In the above equation, brw = bytes_read + bytes_written, with bytes_read and bytes_written representing the number of bytes read and written during a single kernel execution time. Figure 7 illustrates the double precision performance for the 26 benchmark matrices on an Nvidia K20 GPU. Each matrix was executed 1000 times using each sparse storage technique and their average was used to calculate throughout (GFLOP/s) using Equation (2). These throughput values for all the eight storage schemes are depicted as a bar chart in Figure 7.

Average Throughput Performance
We observe in Figure 7 that, overall, our proposed technique performs better than all the compared schemes, CSR, HYB, BSR, CSR5, SELL-P, WPK1, and WPK-2 (the SURAA throughput is represented by dark blue color bar and it is higher than the other bars in the figure). SURAA provides higher throughput than any other technique for 20 out of 26 matrices (we will come back to the remaining six matrices in a moment). On average, SURAA outperform all seven other SpMV computation methods by achieving an average speedup of 40× against SELL-P, 29.89× against WPK1, 12.98× for CSR, 11.33× for BSR, 1.50× for HYB, 1.15× for WPK2, and 1.1× for CSR5.

Best Performance for Individual Matrices
Note in Figure 7 that the highest speedups achieved by SURAA against other schemes for any single matrix are 562× (matrix ASIC_680ks), 531.56× (for the matrix ASIC_680ks), 147× (FullChip), 26.70× (flickr), 3.06× (matrix Torso1), 1.97× (torso1), and 1.94× (rail4284) against SELL-P, WPK1, CSR, BSR, HYB, WPK2, and CSR5, respectively. We repeat for clarity that these are SURAA's peak performance values against specific storage schemes for any single matrix. Even for blocked matrices, such as ASIC_680ks and TSOPF_RS_b2383, we gain higher performance compared to the BSR technique, which is expected to perform better due to the block structure. Clearly, SURAA outperforms SELL-P, WPK1, CSR, BSR, and HYB by a fairly large margins; however, its performance gain over WPK2 and CSR5 is small.
Note that stating the highest gain against a particular scheme for a given matrix is meant to understand the performances of individual schemes against SURAA. It does not imply that SURAA has equally higher gain over other schemes. For example, the 562× speedup over SELL-P for the matrix ASIC_680ks only applies to SELL-P, and Figure 7 clearly depicts that SURAA's performance is slightly higher (1.06×) than the next best scheme CSR5. Comparing each scheme with every other scheme will lengthen the paper, may confuse the reader, and some may find such analysis inappropriate.
We now dig deeper into the throughput performance of all the schemes by further elaboration.
HYB SURAA outperforms the HYB scheme, overall, although by relatively smaller margins (average speedup: 1.50×, max: 3.06×). The highest speedup achieved by SURAA against HYB for a single matrix from the 26 matrices is 3.06× for the matrix torso1; see Figure 7. The HYB scheme outperforms the SURAA method for two matrices, thermal2 and amazon0312. We noted that this is also the case for a few other schemes, and we will discuss them in respective paragraphs. Further investigation reveals that SURAA does not utilize the dynamic parallelism for these two matrices because the largest rows for these matrices have only 11 and 10 non-zeros, respectively, and the average number of non-zeros is 7 and 8, respectively. Hence, SURAA uses the scalar CSR kernel for these two matrices (thermal2 and amazon0312). We also observe (see Table 3 and Figure 6) that both of these matrices have low npr variance and standard deviation, and therefore do not benefit from the adaptive kernels in SURAA. We will see in Section 5.4 that SURAA provides higher differential performance for matrices with higher npr variance (a behavior which is highly desirable). The performance aspects of SURAA for low npr matrices will be investigated in the future to enable SURAA gain higher performance, also, for low npr variance matrices. CSR SURAA outperforms the CSR scheme, overall, by moderate margins (average speedup: 12.98×, max: 147×). Note in Figure 7 that CSR delivers moderate performance compared to SURAA and other schemes. However, its overall performance degradation (12.98×) comes from its poor performance for two matrices, ASIC_680k and FullChip. SURAA achieves a higher throughput against CSR for all the 26 matrices except one matrix, amazon0312 (0.91×). The reasons for this low performance of SURAA for this particular matrix have been already discussed alongside the HYB scheme. SELL-P SURAA outperforms the SELL-P scheme, overall, by fairly large margins (average speedup: 40×, max: 562×). The high speedup of SURAA against SELLP is particularly attributed to the poor performance of SELLP for two matrices, ASIC_680k (speedup, 562×) and FullChip (speedup, 378×); see Figure 7. SELL-P mostly has produced very low throughput for high npr variance matrices such as ASIC_680k, FullChip, rail4284, and others (see Figures 6 and 7). This behavior will be investigated further in Section 5.4. SURAA achieves a higher throughput against SELL-P for all the 26 matrices except four matrices, thermal2 (0.71×), amazon0312 (0.72×), Andrews (0.79×), and Thermomech_TK (0.70×). The first two of these matrices have been mentioned already while discussing the HYB performance. The reason for this low performance, as before, is that SURAA did not invoke dynamic parallel kernel for these four matrices because these matrices have low average nonzero per row (see Table 3 and Figure 6). We have mentioned this earlier that SURAA does not invoke dynamic kernel for matrices which have average npr value less than 32. A possible modification in the future to improve the SURAA performance for such matrices is either to optimize our CSR scalar kernel or replace just the SURAA CSR scalar kernel with the cuSPARSE CSR kernel, which is very likely to make SURAA faster than all the techniques for all the matrices. Another possibility is to tune the parameter that invokes the dynamic kernel to an optimum value or to embed further intelligence into the kernel invocation process.

CSR5
SURAA outperforms the CSR5 scheme, overall, although by relatively smaller margins (average speedup: 1.1×, max: 1.94×). The highest speedup achieved for SURAA against CSR5 for any single matrix among the 26 matrices of our dataset is 1.94× (for matrix rail4284). SURAA achieves a higher throughput against CSR5 for all the 26 matrices except for four matrices, thermal2 (0.71×), amazon0312 (0.78×), in-2004 (0.63×), and bundle1 (0.90×). The reasons for low performance of SURAA for the first two matrices have already been discussed while discussing the HYB performance (i.e., npr ≤ 32). The average npr for the third matrix, in-2004, is 12.23, and it has a medium npr variance; see Table 3 and Figure 6. Therefore, the reasons for the low performance of SURAA for this matrix are the same, i.e., npr ≤ 32 implies no invocation of dynamic kernel in SURAA. The potential solutions to address these limitations of SURAA have also been mentioned earlier. The npr variance of the fourth matrix, bundle1, is high, with an average npr of 72.84. This matrix has neither low average npr nor low npr variance. However, bundle1 is a small (10,581 rows), square, symmetric matrix and the difference in SURAA and CSR5 performance is quite small, 9.30 versus 10.30 GFLOP/s, respectively. Future work will focus on further investigation of SURAA's performance to improve its behavior in such cases.

WPK1 and WPK2
SURAA outperforms the WPK1 scheme by large margin (average speedup: 29.89×, max: 531.56×), but its performance over WPK2 is small (average speedup: 1.15×, max: 1.97×). Our experiments suggest that WPK1 does not provide good results for asymmetric matrices (see Figure 7). There are a total of 16 asymmetric matrices in our dataset (see Table 3) and we noted that WPK1 produced relatively low throughput for all these matrices. Two of these 16 matrices, Freescale and Fullchip, are exceptions because WPK1 failed to execute SpMV for these two matrices and caused a segmentation error. Particularly, for three matrices, ASIC_680ks, 12month1, and rail4284, WPK1 produced very low throughput results, all less than 0.04 GFLOP/s. On the contrary, WPK2 produced good results. Although it never produced the highest throughput for any of the benchmark matrices, it did produce high throughput comparable to the second high performing scheme CSR5. Overall, it produced the third best average performance after SURAA and CSR5 with small margins. SURAA achieved a higher throughput against WPK1 for all the 26 matrices except for three matrices, amazon0312 (0.70×), Andrews (0.74×), and Thermomech_TK (0.96×). SURAA also achieved a higher throughput against WPK2 for all the 26 matrices except Andrews (0.75×) and thermal2 (0.96×). These four matrices belong to the set of the six matrices where SURAA did not produce the best performance and we have discussed the reasons for SURAA's poor performance for these matrices in the earlier paragraphs. Note that WPK2 failed to execute on ten matrices from the dataset, of which seven are from circuit simulation and directed graph, and the remaining three were rectangular matrices. On inspecting the WPK1 and WPK2 GitHub Web Page [87], we found that the developers have reported code error issues on the page as follows, "trying to debug issues with large unsymmetric matrices-now mainly thrust errors and a few numerical errors to debug on certain matrices". Note that, in computing speedups of SURAA against other schemes, we did not include the results of the matrices where WPK1 and WPK2 schemes failed to execute. Table 4 provides for all eight schemes the statistical information such as mean, standard deviation (sd), variance (var), maximum (max), and minimum (min) of the GFLOP/s metric of the SpMV kernel for the 26 benchmark matrices. The values are listed in descending order of the mean value. We observe that SURAA provides the highest mean GFLOP/s of 15.75. The next best scheme in terms of the mean throughput is CSR5, it has the second highest mean GFLOP/s, 14.28, followed by 13.65 for WPK2. In terms of the max throughout, SURAA again leads the table with the highest value of 36.7077 GFLOP/s, followed by SELL-P with the second best value of 26.8 GFLOP/s, followed by CSR5. Based on the mean values, WPK2 comes third after SURAA and CSR5. As noted earlier, WPK1 and WPK2 were unable to execute some of the matrices in the benchmark dataset and their results have not been included in computing mean and other qualities in the table. In order to calculate the aggregate performance of SURAA, we calculate aggregate speedup as in Equation (5). Figure 8 indicates the aggregate GFLOP/s achieved for SURAA and other tools. These GFLOP/s are aggregated over all the matrices for each scheme and do not imply to be the GFLOP/s for a single GPU at one time. The intention of providing the aggregate throughput is to highlight the aggregate performance gain or loss that could be achieved by the use of a single SpMV computation scheme for matrices from various domains. SURAA has the highest throughput compared to all the other seven techniques, i.e., CSR, HYB, BSR, CSR5, WPK1, WPK2, and SELL-P. For our benchmark suite of 26 matrices, we achieve an aggregate throughput of 409.7 GFLOP/s compared to the next best CSR5 (371.37 GFLOP/s), HYB (273.38 GFLOP/s), CSR (255.42 GFLOP/s), WPK2 (218.51 GFLOP/s), SELL-P (204.56 GFLOP/s), WPK1 (165.163 GFLOP/s), and BSR (70.751 GFLOP/s). The reason for low aggregate throughput for WPK1 anf WPK2 is that, as noted earlier, they were unable to execute some of the matrices in the benchmark dataset. It means that SURAA has an aggregate speedup of 1.1× compared to CSR5, 1.49× compared to HYB, 1.60× compared to CSR, 1.87× compared with WPK2, 2.0× compared to SELL-P, 2.47× compared to WPK1, and 5.79× compared to BSR for the given 26 matrices.

Effective Memory Bandwidth
Effective memory bandwidth utilization indicates the processor usage efficiency by the kernals [88]. Specifically, effective memory bandwidth denotes the rate at which the bytes are read and written for a given kernel. The general formula to calculate the effective memory bandwidth was given in Equation (4).
We did not find the specific formulas in the literature for calculating the effective memory bandwidth for each of the schemes that we have considered in this paper. Therefore, we looked at each of the schemes, its memory requirements, and devised equations to calculate the number of bytes read and written, brw (see Equation (4)), based on their storage requirements. Equation (4) was then used to calculate the effective memory bandwidth, e_bandwidth, in GB/s. Each scheme was executed 1000 times for each matrix in our dataset to get the average execution time for each of the 26 benchmark matrices. For further details on calculating effective memory bandwidth, see Nvidia documentation at https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/.
The formulas for the CSR and HYB schemes were reported in [58]) as given below: The effective memory bandwidth for the SELL-P scheme was reported in [75], but no formula was given. After consulting [74,75], we devised the formula to calculate brw for SELL-P as below: where num_seg is the number of segments in the matrix, as determined by the SELL-P scheme, seg_size is the size of the segments that is fixed, and max_npr_seg i is the maximum number of nonzero elements in any one row of segment i. It was difficult for us to calculate these numbers and therefore we approximated e_brw_SELLP as below: The effective memory bandwidth for the CSR5 scheme has not been reported before. After consulting [79] where this scheme was first introduced, we devised a formula to calculate brw for CSR5 as below: where σ is a parameter that determines the column width of the blocks in their data structure. It can assume three different values depending on the value of anpr of a matrix. These values are 4, 32, or anpr. The effective memory bandwidth for WPK1 and WPK2 was reported by Wong et al. [78], but it was on a different GPU device (Nvidia GeForce GTX 480) and for different matrices. They did not give formulas to calculate the effective memory bandwidth. We looked at their paper and have derived the following formula to calculate the brw: where num_warp is the number of warps in the matrix, as determined by the WPK1 and WPK2 schemes, warp_size is the size of the warps which is fixed, and max_npr_warp i is the maximum number of nonzero elements in any one row of warp i. It was difficult for us to calculate these numbers and therefore we approximated e_brw_WPKx as below: The effective memory bandwidth for the SURAA scheme can be calculated using the brw formula as below: e_brw_SURAA = (2m + 1 + nnz) × 4 + (nnz + m + n) × 8. Figure 9 plots the calculated effective memory bandwidth for double precision SpMV for CSR, HYB, SURAA, CSR5, SELL-P, WPK1, and WPK2. The maximum theoretical memory bandwidth of K20M is 208 GBytes/s. On average, SURAA delivers a bandwidth of 50.2% of the theoretical maximum as compared to 44.41%, 43.7%, 34.62%, 30.15%, 28.1%, and 24.02% of the theoretical maximum for CSR5, WPK2, HYB, CSR, WPK1, and SELL-P, respectively.
The highest effective memory bandwidths delivered by SURAA were 226.25 (torso1), 164.03 (TSOPF_RS_b2383), and 140.76 (rail4284) GB/s which are 109%, 78.9%, and 67.7% of the peak, respectively. A higher bandwidth, more than the theoretical bandwidth, signifies an efficient use of the cache and shared memory for SURAA when executing the matrix torso1. A bandwidth, for SpMV computations, higher than the peak memory bandwidth has also been reported earlier; see, e.g., [58,74].   Figure 10 illustrates the relationship between the obtained throughput for the seven schemes, SURAA, CSR5, CSR, HYB, WPK2, SELL-P, and WPK1, with respect to the variance of the non-zero elements of all the 26 matrices in our dataset. The symbols indicate the actual GFLOP/s obtained and the straight lines indicate the smoothened GFLOP/s. We observe that SURAA provides better throughput with the increase in the variance of nonzero elements per row. On extrapolating the GFLOP/s, we observe that larger variations in nonzero elements per row will achieve higher GFLOP/s for SURAA compared to the next best CSR5, CSR, and HYB (note the upward slope for these schemes); and this behavior is highly desirable because higher variations usually decrease SpMV performance. We also observe that larger variations decrease the performance of the WPK2, WPK1, and SELL-P techniques.

SURAA: Comparative Performance against High npr Variance
Similarly, Figure 11 depicts the speedup of SURAA against CSR5, HYB, SELL-P, WPK1, WPK2, and CSR, with respect to the variance of the nonzero elements per row for all the 26 matrices in the benchmark suite. As in Figure 10, the symbols indicate the actual SURAA speedups obtained and the straight lines indicates the smoothed speedups. Note that the npr variance and the speedups are plotted in log10. We observe in the figure that the speedup of SURAA over SELL-P, WPK1, CSR, HYB, WPK2, and CSR5 is increasing with the increase in the variance of the number of non-zeros per row. As seen and discussed before, this indicates an excellent behavior for a sparse matrix computation tool. The rate of increase in speedup of SURAA against SELL-P, WPK1 and CSR is relatively high compared to the rest of the schemes and this has made the plots for HYB, WPK2, and CSR5 appear as straight lines. To allow a closer look at the SURAA performance, Figure 12 depicts the same graphs as in Figure 11 but excludes the SELL-P, WPK1, and CSR data. The speedup scale now does not use log10. It can now be observed that the SURAA speedup against CSR5 and HYB show an increase for an increasing npr variance though the rate is lower than the other SpMV computation schemes. The highest rate of increase of SURAA speedup is over the HYB scheme followed by WPK2 and CSR5.   Figure 13 illustrates the GFLOP/s performance of BSR, SELL-P, WPK1, WPK2, CSR, HYB, CSR5, and SURAA, respectively, against the npr variance and the total number of nonzero elements, nnz.
Note the colour legend in the figure depicting colours ranging from blue to green, yellow, orange and red, representing GFLOP/s in increasing order from near zero (SELL-P) to maximum 36.70 (SURAA). More yellow/orange/red means higher throughput. Blue and green colours represent lower throughput.
BSR provides a low performance for the majority of the cases as shown by the blue shades that cover majority of the figure. The highest GFLOP/s results attained by BSR are 12.95 and 12.75, for the matrices nemeth23 and TSOPF_RS_b2383, which have a block structure.
SELL-P shows a performance in the range of 10 to 26.8 GFLOP/s for smaller nnz values and npr variance varying from low to medium. However, a higher npr variance results in poor performance with GFLOP/s in the range of 0 to 5. We had discussed this earlier while evaluating SELL-P individual performance. A look at Figures 6 and 7 will reveal that SELL-P mostly has produced very low throughput for high npr variance matrices such as ASIC_680k. WPK1 delivers relatively low performance against the increasing npr variance as evident by mostly green and blue colors though some yellow patches can be seen towards lower and upper middle (medium npr variance and low/high nnz) sections of the plot representing higher GFLOP/s of 17.5. WPK2 also delivers low performance against the increasing npr variance. This can be seen by mostly green and blue colors though the yellow patches are larger for WPK2 compared to the WPK1. This seems surprising at the first look because WPK2 has delivered a superior performance overall after SURAA and CSR5 (max throughput is 22.78; see Table 4). A closer look at Figures 6 and 7 reveals that both WPK1 and WPK2 did not successfully execute some of the high npr variance matrices including spal_004, 12month1, and rail4284.
CSR provides average performance (depicted by the yellow color; see the GFLOP/s legend) in two scenarios: (1) for matrices with low npr variance and low nnz and (2) for medium npr variance and large nnz. For high variance scenarios, CSR provides low performance, indicated by the blue and green shaded regions in the figure. CSR performance for low variance scenarios is also poor. The blue color indicates that at high npr variance and high nnz CSR provides low performance.
For HYB, we note a general trend that the performance increases with the increasing nnz and the npr variance. However, we do note light green shades towards the right of the plot which indicate slightly poor performance. For the majority of the part, the HYB performance is almost constant within the range of 10 to 18 GFlops.
CSR5 shows a general trend of an increasing performance as the nnz and npr variance increases. It provides an average performance of 15 to 23 GFLOP/s as seen by the yellowish shade in the figure. Low to average performance (5 to 15 GFLOP/s) can be observed in three scenarios: (1) high npr variance and medium to high nnz, (2) low nnz and medium npr variance, and (3) low npr variance and low to medium nnz.
Finally, SURAA shows an excellent behavior in that it delivers high performance with the increase in the npr variance and nnz. The patch in red color on the right of the plot shows the highest performance for the associated range of npr variance and nnz depicted by the red color. SURAA achieves the highest GFLOP/s of 36.70 among all the schemes as can also be seen in Figure 7.

SURAA: Parametric Configuration
The matrix, spal_004, is a rectangular matrix with fewer rows and a large number of dense columns. It also has a fairly high npr variance. Due to the high density, it is highly suitable for dynamic parallelism. In the results presented in this paper, the parameter max_child_threads is set to 8192 for spal_004. During the course of conducting our experiments, we found that configuring this parameter to 2048, 4096, and 8192 varies the performance of the SURAA tool. The typical trend was that the highest value provided the best performance for spal_004 due to the increase in dynamic parallelism. To illustrate, spal_004 achieved 19.5 GFLOP/s in our experiments with 8192 kernels, while it provided 13.60 GFLOP/s when the parameter was set to 2048. This indicates that tweaking of the max_child_threads affects the performance of the SpMV computation. However, larger values for max_child_threads cause larger overheads due to the spawning of the dynamic kernels, and, therefore, setting it to higher values should be done on a necessity basis. In our benchmark dataset, for 23 of the 26 matrices, there are no more than 2048 rows, in total, whose npr value is greater than 32. The three matrices are spal_004, rail4284 and 12month1, with spal_004 the biggest exception containing 8192 rows which have more than 32 npr. Therefore, we have set max_child_threads to 2048 for all the matrices except for the spal_004 matrix. An automatic configuration of the tool for this parameter can be done during the set up phase by calculating the number of rows with npr ≥ 32 and setting the max_child_threads parameter to an appropriate value. In future work, we will further investigate this behavior.

Preprocessing Cost
The preprocessing time introduced due to the conversion of the input matrix in the base format (usually CSR/COO) to a more advanced format could be more than three or four orders of magnitude higher than the time required for executing a single sparse matrix vector multiplication [77]. Applications that utilize the same sparse matrix for multiple executions of SpMV operations such as iterative linear solvers amortize the preprocessing cost. However, when the iterations are small in the linear iterative solver or when the matrix structure changes frequently as in dynamic graph applications such as PageRank, the preprocessing cost can adversely affect the performance gains obtained while using a specific storage format.
The major operation involved in converting CSR to SURAA is sorting the number of nonzero elements in each row. The sorting process in SURAA by default is performed on the GPU side unlike the SELL-P and CSR5 conversion routines in the MAGMA library. Technically executing on the CPU is not as bad as it sounds, rather the time it takes depends upon the type of operations involved. Conversions involving auxillary structures are difficult to implement on the device side. We use the radix sort by the key algorithm provided by the thrust library on the device side for faster sorting of the non-zero elements per row. The rest of the processing such as application of FD rule and grouping for execution are not time intensive.
In our experiments, for the purpose of a fair comparison, we assume that the initial format of all the matrices before the conversion is CSR. Use of CSR/COO as an initial stage for the data structures have been discussed in [85]. Since the conversion cost between COO and CSR is negligible and SURAA loads matrices as CSR by default, we selected CSR as the initial state of the storage format. Moreover, our conversion time includes the sorting time (required also by SELL-P), data transfer time from the host to device, as well as the time for other processing required for each format. Figure 14 depicts the normalized overhead time taken for SELL-P, CSR-5, and SURAA for our 26 benchmark matrices. The time is obtained empirically by executing the codes for SURAA, SELL-P and CSR-5. We can observe that SURAA has the lowest conversion overhead compared to SELL-P and CSR-5. For a better understanding, we calculate the ratio of the measured conversion time to a single SpMV execution (rounded to the next integer), which denotes the number of spare matrix vector operations that could have been possible in the overhead time. Figure 15 presents the ratio of the overhead in format conversion to the time for a single SpMV execution for SURAA, SELL-P, and CSR5. For the benchmark set of 26 matrices, the average ratio is 69, 5, and 96 for SELL-P, SURAA, and CSR5, respectively. This implies that on an average 69, 5, and 96 SpMV execution could have been performed in the time taken for format conversion. CSR5 has a higher overhead than SELL-P while converting graph based applications especially circuit simulation based matrices. The lowest ratio we obtained for SURAA is 0.26 (12month1), which is rounded to 0 and the largest ratio is 14 (wb-edu). Figure 16 illustrates the absolute time taken for device based radix sort used in SURAA for sorting the matrix based on the number of nonzero values. We can observe that the graph based matrices such as FullChip, wb-edu, and Freescale take more time than other matrices. The average time for sorting the tested matrices is less than 3 ms.

Sparse Iterative Solver Performance
Iterative solution of sparse linear equation systems can be considered as of prime importance due to its application in various important areas such as solving finite differences of PDEs, high accuracy surface modeling, finding steady-state probability vector of Markov chains, webranking, inventory control and manufacturing systems, queuing systems, fault modeling, and weather forecasting. This is a main focus of our research. Iterative methods converge based on the properties of the matrices. Many matrices take more than 1000 iterations to converge. The gain reported in the paper will lead to much larger time savings.

Conclusions
In this paper, we presented and evaluated SURAA, a new load balanced scheme for SpMV storage and computations that employ dynamic parallel kernel for the segments with larger rows, CSR vector kernel for segments with number of non-zeros per row greater than or equal to 32, and CSR scalar kernel for rows length shorter than 32. To further balance the load and reduce idle threads, the threads are distributed adaptively to the rows based on their lengths. We also achieve a coalesced memory access and accelerate the SpMV computations for a range of matrices with diverse sparsity features. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman-Diaconis (FD) rule. To the best of our knowledge, no other work exists that used the FD rule for SpMV computations the way we used it, or used adaptive kernels the way we have done in this paper.
We conducted experiments to assess the performance of SURAA on K20 GPUs and compared with the best techniques from the best CUSP (open source) and cuSPARSE (commercial) libraries. A benchmark dataset was formulated using widely used benchmarks comprising 26 matrices from 13 diverse domains. SURAA was compared with a total of seven other SpMV storage and computation techniques using three widely used performance metrics; throughput (GFLOP/s), speedup, and effective memory bandwidth (GB/s).
SURAA outperformed, overall, all other seven SpMV computation methods with the average speedup varying between 40× for SELL-P and 1.15× for WPK2 and 1.1× for CSR5, with the average collective speedup of 13.99×. SURAA provided the best throughput for 20 out of the 26 benchmark matrices. The highest speedups achieved by SURAA against other schemes for any single matrix varied between 562× against SELL-P and 1.94× against CSR5. SURAA provided the highest mean throughput (GFLOP/s) of 15.75, followed by 14.28 by CSR5, and the highest max throughput of 36.70 GFLOP/s followed by 26.8 by SELL-P. SURAA also provided the highest effective memory bandwidth of 226.25 GB/s compared to the second best 160.75 GB/s by SELL-P. Moreover, SURAA has shown a much lower setup overhead, equal to five SpMV execution time, compared to 69 and 95.5 SpMV executions for SELL-P and CSR5. The performance analysis of SURAA revealed that it consistently delivers the highest throughput among all the compared schemes with the increase in the npr variance and nnz, a property which is highly desirable.
While the speedup of SURAA against CSR5 and WPK2 was small, SURAA provided the highest throughput for 20 out of 26 matrices. More importantly, none of the schemes except SURAA delivered consistently high performance on all of the performance benchmarks. SURAA code can be found at the github online repository (see http://github.com/stormvirux/suraa).
In our future work, we shall strive to achieve parametric improvements for the various parameters that have been used in our work such as the number of segments and the number of child threads for parent kernel and use larger datasets to get a better understanding. We will also look to further optimize the scalar, vector, and dynamic kernels. We have achieved significant progress in addressing the major challenges of GPU SpMV computing including coalesced memory access and load balancing. Future work will look into further improving our method based on these challenges. SURAA did not perform for low npr variance matrices as well as it did for the high npr variance matrices. Future work will also look into improving this behavior of the tool. We will investigate further optimization of scalar vector and dynamic kernels using significantly larger data sets.