# Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU


## Abstract


## 1. Introduction

#### 1.1. GPU

#### 1.2. CUDA

- Faster downloads and readbacks to and from the GPU.
- Scattered reads.
- Unified memory (CUDA 6.0 and above).
- Full support for integer and bitwise operations.
- Unified virtual memory (CUDA 4.0 and above).
- Shared memory—CUDA exposes a fast shared memory region that can be shared among threads.

- The CPU allocates storage on the GPU (using cudaMalloc()).
- The CPU copies input data from the CPU to the GPU (using cudaMemcpy()).
- The CPU launches multiple copies of the kernel on parallel GPU threads to process the data residing on the GPU. A kernel is a function (a serial program) that runs on the GPU; it is the CPU that launches the kernel across the parallel threads.
- The CPU copies results back from the GPU to the CPU (using cudaMemcpy()).
- Use or display the output.

```
__global__ void AddArrays(int *d_Arr1, int *d_Arr2, int *d_sum, unsigned int Length)
{
    unsigned int Index = (blockDim.x * blockIdx.x) + threadIdx.x;
    if (Index < Length)
    {
        d_sum[Index] = d_Arr1[Index] + d_Arr2[Index];
    }
}
```

- It takes time to copy data from the CPU to the GPU and vice versa.
- Not all threads can run simultaneously on the GPU; the degree of concurrency depends on the architecture of the GPU.
- Kernel threads access global memory, which is implemented in Dynamic Random Access Memory (DRAM) and is therefore slow, with a lookup latency.

#### 1.3. Sparse Matrix-Vector Multiplication (SpMV)

#### 1.4. The Schrödinger Equation

- h: Planck’s constant (6.62607004 × 10^{−34} J·s).
- $\overline{\mathrm{h}}$ = h/2π: The reduced Planck constant.
- m: Mass.
- ∇^{2}: The Laplacian (second derivative).
- V: Potential energy.
- ψ: Wavefunction.

**F C = S C E**

- F: The Fock matrix.
- C: The matrix of MO coefficients (representing the wavefunction).
- S: The overlap matrix.
- E: The orbital energies (a diagonal matrix).

**E_{corr} = ε_{o} − E_{o}**

- E_{corr}: The electron correlation energy.
- ε_{o}: The true (exact) ground state energy.
- E_{o}: The Hartree-Fock (HF) energy.

## 2. Configuration Interaction (CI)

#### 2.1. The CI Matrix Elements

The CI matrix elements H_{ij} can be evaluated by the strategy employed for calculating the energy of a single determinant used for deriving the HF equations. This involves expanding the determinants in a sum of products of MOs, thereby making it possible to express the CI matrix elements in terms of MO integrals. There are some general features that make many of the CI matrix elements equal to zero.

As an example, consider H_{2}O with a 6-31G(d) basis. There are 10 electrons and 38 spin-MOs, of which 10 are occupied and 28 are empty. Even when the symmetry of H_{2}O is employed, there is still a total of 7,536,400 singlet CSFs with A1 symmetry. Table 1 displays the number of singlet configuration state functions (CSFs) as a function of excitation level for H_{2}O with a 6-31G(d) basis.

#### 2.2. The CI Matrix

- |Φ_{o}>: The CI wavefunction.
- |Ψ_{o}>: The Hartree-Fock wavefunction.
- C: A coefficient applied to a Slater determinant (Ψ).
- $\sum _{\mathrm{a}\mathrm{r}}{\mathrm{C}}_{\mathrm{a}}^{\mathrm{r}}$|${\mathsf{\Psi}}_{\mathrm{a}}^{\mathrm{r}}$>: All the possible single excitations. An electron is excited from orbital a to orbital r.
- $\sum _{\mathrm{a}\mathrm{b}\mathrm{r}\mathrm{s}}{\mathrm{C}}_{\mathrm{a}\mathrm{b}}^{\mathrm{r}\mathrm{s}}$|${\mathsf{\Psi}}_{\mathrm{a}\mathrm{b}}^{\mathrm{r}\mathrm{s}}$>: All the possible double excitations.
- …: All the way up to n-fold excitations (exciting n electrons).

#### 2.3. The Proposed Work

## 3. Common Formats

#### 3.1. The CSR (Compressed Sparse Row) or CRS (Compressed Row Storage) Format

- The **Value** vector: Contains all the non-zero entries.
- The **Column** vector: Contains the column index of each non-zero entry.
- The **RowPtr** vector: Contains the index of the first non-zero entry of each row in the Value vector. The number of non-zero entries in the sparse matrix is appended as the last element of the RowPtr vector.

- Value = 1, 2, 3, 4, 5, 6, 7, 8
- Column = 0, 2, 4, 1, 4, 2, 3, 4
- RowPtr = 0, 1, 3, 5, 6, 7, 8
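To make the layout concrete, the three vectors above can be regenerated with a single row-by-row scan. The C sketch below is ours, not the paper's code; the 6 × 5 dense matrix is reconstructed from the Value/Column/RowPtr vectors (the paper's original example figure is not reproduced here), and the function name `dense_to_csr` is our own:

```c
#include <assert.h>

#define ROWS 6
#define COLS 5
#define NNZ  8

/* 6 x 5 dense matrix reconstructed from the Value/Column/RowPtr vectors above. */
static const double Mat[ROWS][COLS] = {
    {1, 0, 0, 0, 0},
    {0, 0, 2, 0, 3},
    {0, 4, 0, 0, 5},
    {0, 0, 6, 0, 0},
    {0, 0, 0, 7, 0},
    {0, 0, 0, 0, 8},
};

static double Value[NNZ];
static int Column[NNZ];
static int RowPtr[ROWS + 1];

/* Scan the matrix row by row, recording each non-zero entry and the start
   of every row; the total non-zero count is appended as the last element. */
static int dense_to_csr(void)
{
    int nnz = 0;
    for (int i = 0; i < ROWS; i++) {
        RowPtr[i] = nnz;                 /* first non-zero of row i */
        for (int j = 0; j < COLS; j++)
            if (Mat[i][j] != 0.0) {
                Value[nnz] = Mat[i][j];
                Column[nnz] = j;
                nnz++;
            }
    }
    RowPtr[ROWS] = nnz;                  /* appended total nnz */
    return nnz;
}
```

Running the scan reproduces Value = 1…8, Column = 0, 2, 4, 1, 4, 2, 3, 4, and RowPtr = 0, 1, 3, 5, 6, 7, 8.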

**Algorithm 1** The SpMV kernel for the CSR format.

```
SpMV_CSR(Value, Column, Vector, RowPtr, Result)
  for i ← 0 to ROWS - 1 do
    Start ← RowPtr[i]
    End ← RowPtr[i + 1]
    for j ← Start to End - 1 do
      Temp ← Temp + (Value[j] * Vector[Column[j]])
    Result[i] ← Temp
    Temp ← 0.00
```
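Algorithm 1 translates almost line for line into serial C. A minimal sketch using the example vectors from this section (the all-ones input vector and the function name `spmv_csr` are our own illustrative choices):

```c
#include <assert.h>

#define ROWS 6
/* The example CSR vectors from Section 3.1. */
static const double Value[8]  = {1, 2, 3, 4, 5, 6, 7, 8};
static const int    Column[8] = {0, 2, 4, 1, 4, 2, 3, 4};
static const int    RowPtr[7] = {0, 1, 3, 5, 6, 7, 8};

/* Serial CSR SpMV, following Algorithm 1: one accumulator per row,
   with RowPtr delimiting each row's slice of Value/Column. */
static void spmv_csr(const double *vector, double *result)
{
    for (int i = 0; i < ROWS; i++) {
        double temp = 0.0;
        for (int j = RowPtr[i]; j < RowPtr[i + 1]; j++)
            temp += Value[j] * vector[Column[j]];
        result[i] = temp;
    }
}
```

With an input vector of all ones, each output element is simply the sum of the non-zeros in that row.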

#### 3.2. The ELLPACK (ELL) Format

- The **NonZerosEntries** matrix: All the non-zero entries.
- The **Column** matrix: The column index of each non-zero entry.

**Algorithm 2** The SpMV kernel for the ELLPACK format (thread per row).

```
SpMV_ELLPACK(NonZerosEntries, Column, Vector, Result, MaxNonZeros)
  for r1 ← 0 to ROWS – 1 do
    for r2 ← 0 to MaxNonZeros – 1 do
      if Column[r1][r2] = -1 then
        exit loop
      Temp ← Temp + (NonZerosEntries[r1][r2] * Vector[Column[r1][r2]])
    Result[r1] ← Temp
    Temp ← 0.00
```
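A serial C sketch of Algorithm 2, using ELLPACK storage for the same 6 × 5 example matrix as in Section 3.1 (the padded arrays, the −1 sentinel values, and the function name `spmv_ellpack` are our own illustrative assumptions):

```c
#include <assert.h>

#define ROWS  6
#define MAXNZ 2   /* length of the longest non-zero row */

/* ELLPACK storage for the Section 3.1 example matrix: every row is
   padded to MAXNZ entries; -1 in ColumnIdx marks padding. */
static const double NonZerosEntries[ROWS][MAXNZ] = {
    {1, 0}, {2, 3}, {4, 5}, {6, 0}, {7, 0}, {8, 0},
};
static const int ColumnIdx[ROWS][MAXNZ] = {
    {0, -1}, {2, 4}, {1, 4}, {2, -1}, {3, -1}, {4, -1},
};

/* Serial ELLPACK SpMV following Algorithm 2: one pass per row,
   stopping early at the -1 padding sentinel. */
static void spmv_ellpack(const double *vector, double *result)
{
    for (int r1 = 0; r1 < ROWS; r1++) {
        double temp = 0.0;
        for (int r2 = 0; r2 < MAXNZ; r2++) {
            if (ColumnIdx[r1][r2] == -1)
                break;
            temp += NonZerosEntries[r1][r2] * vector[ColumnIdx[r1][r2]];
        }
        result[r1] = temp;
    }
}
```

The output matches the CSR kernel on the same matrix, which illustrates that the two formats encode identical data with different padding trade-offs.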

#### 3.3. The ELLPACK-R (ELL-R) Format

**Algorithm 3** The SpMV kernel for the ELLPACK-R format (thread per row).

```
SpMV_ELLPACK_R(NonZerosEntries, Column, Vector, NonZerosCount, Result)
  for r1 ← 0 to ROWS – 1 do
    End ← NonZerosCount[r1]
    for r2 ← 0 to End – 1 do
      Temp ← Temp + (NonZerosEntries[r1][r2] * Vector[Column[r1][r2]])
    Result[r1] ← Temp
    Temp ← 0.00
```

#### 3.4. The Sliced ELLPACK Format

**Algorithm 4** The SpMV kernel for the Sliced ELLPACK format (thread per row).

```
SpMV_SlicedELLPACK(NonZerosEntries, Column, Vector, Result, Rows, Cols)
  for r1 ← 0 to Rows – 1 do
    for r2 ← 0 to Cols – 1 do
      if Column[r1][r2] = -1 then
        exit loop
      Temp ← Temp + (NonZerosEntries[r1][r2] * Vector[Column[r1][r2]])
    Result[r1] ← Temp
    Temp ← 0.00
```

#### 3.5. The Sliced ELLPACK-R Format

**Algorithm 5** The SpMV kernel for the Sliced ELLPACK-R format (thread per row).

```
SpMV_SlicedELLPACK_R(NonZerosEntries, Column, Vector, NonZerosCount, Result, Rows)
  for r1 ← 0 to Rows – 1 do
    End ← NonZerosCount[r1]
    for r2 ← 0 to End – 1 do
      Temp ← Temp + (NonZerosEntries[r1][r2] * Vector[Column[r1][r2]])
    Result[r1] ← Temp
    Temp ← 0.00
```

#### 3.6. The Compressed Sparse Blocks (CSB) Format

The CSB format partitions the sparse matrix into n^{2}/z^{2} equal-sized z × z square blocks using a block size parameter z. The CSB format consists of the following:

- The **Value** vector: The Value vector is of length nnz; it stores all the non-zero elements of the sparse matrix.
- The **row_idx** and **col_idx** vectors: They track the row index and the column index of each non-zero entry inside the Value vector with regard to the block, not the entire matrix. Therefore, row_idx and col_idx range from 0 to z – 1.
- The **Block_ptr** vector: It stores the index of the first non-zero entry of each block inside the Value vector.
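The block-relative indexing can be sketched in a few lines of C. The 4 × 4 matrix, the choice z = 2, the row-major block order, and the function name `build_csb` below are all our own illustrative assumptions (the CSB paper leaves the intra-block ordering flexible):

```c
#include <assert.h>

#define N 4
#define Z 2   /* block size parameter */

/* Hypothetical 4 x 4 matrix used only to illustrate CSB. */
static const double A[N][N] = {
    {1, 0, 0, 2},
    {0, 3, 0, 0},
    {0, 0, 0, 4},
    {5, 0, 6, 0},
};

static double Value[N * N];
static int RowIdx[N * N], ColIdx[N * N];
static int BlockPtr[(N / Z) * (N / Z) + 1];

/* Scan the (N/Z)^2 blocks in row-major order; row/column indices are
   stored relative to the block, so they range from 0 to Z - 1. */
static int build_csb(void)
{
    int nnz = 0, b = 0;
    for (int bi = 0; bi < N / Z; bi++)
        for (int bj = 0; bj < N / Z; bj++) {
            BlockPtr[b++] = nnz;       /* first non-zero of this block */
            for (int r = 0; r < Z; r++)
                for (int c = 0; c < Z; c++)
                    if (A[bi * Z + r][bj * Z + c] != 0.0) {
                        Value[nnz] = A[bi * Z + r][bj * Z + c];
                        RowIdx[nnz] = r;   /* block-relative row */
                        ColIdx[nnz] = c;   /* block-relative column */
                        nnz++;
                    }
        }
    BlockPtr[b] = nnz;                 /* sentinel: total non-zeros */
    return nnz;
}
```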

## 4. The Proposed Model

The CI matrix can be of size 10^{7} by 10^{7} or even more. Generally speaking, the CI matrix includes two main regions, namely, the Reference region and the Expansion Space region. The Reference region occupies almost 10% of the whole CI matrix with sparsity ranging from 70 to 80%. The Reference region starts out from the left side of the CI matrix. The Expansion Space region occupies the remaining space of the CI matrix with extremely high sparsity, which is around 98 to 99%.

#### 4.1. The Proposed Storage Format

**Algorithm 6** The proposed format.

```
CreateFormat(Mat, NonZerosEntries, ELLPACK_Column, Value, CSR_Column,
             AllNonZerosCount, StartArr, EndArr)
  for i ← 0 to ROWS – 1 do
    c ← 0
    GotIt ← 0
    for j ← 0 to COLS – 1 do
      if Mat[i][j] ≠ 0.00 then
        if c < BOUNDARY then
          NonZerosEntries[r][c] ← Mat[i][j]
          ELLPACK_Column[r][c] ← j
          c ← c + 1
        else
          Value[Index] ← Mat[i][j]
          CSR_Column[Index] ← j
          if GotIt = 0 then
            GotIt ← 1
            Start ← Index
          Index ← Index + 1
    r ← r + 1
    if AllNonZerosCount[i] > BOUNDARY then
      End ← Index
      StartArr[i] ← Start
      EndArr[i] ← End
```
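A serial C sketch of the format-creation step: the first BOUNDARY non-zeros of each row go to the ELLPACK arrays, and any overflow goes to the CSR arrays, with StartArr/EndArr delimiting each overflowing row's slice of Value. The 4 × 6 matrix, the BOUNDARY value, and the identifier spellings below are our own illustrative assumptions:

```c
#include <assert.h>

#define ROWS 4
#define COLS 6
#define BOUNDARY 2   /* hypothetical split between the two regions */

/* Hypothetical matrix: rows 0 and 2 overflow BOUNDARY. */
static const double Mat[ROWS][COLS] = {
    {1, 2, 3, 4, 0, 0},
    {0, 5, 0, 0, 0, 0},
    {6, 0, 7, 0, 8, 9},
    {0, 0, 0, 0, 0, 10},
};

static double NonZerosEntries[ROWS][BOUNDARY]; /* ELLPACK (Reference) region */
static int    EllColumn[ROWS][BOUNDARY];
static double Value[ROWS * COLS];              /* CSR (Expansion Space) region */
static int    CsrColumn[ROWS * COLS];
static int    AllNonZerosCount[ROWS], StartArr[ROWS], EndArr[ROWS];

static void create_format(void)
{
    int index = 0;
    for (int i = 0; i < ROWS; i++) {
        int c = 0, start = index;
        for (int j = 0; j < BOUNDARY; j++)
            EllColumn[i][j] = -1;              /* padding sentinel */
        for (int j = 0; j < COLS; j++) {
            if (Mat[i][j] == 0.0) continue;
            if (c < BOUNDARY) {                /* first BOUNDARY nnz: ELLPACK */
                NonZerosEntries[i][c] = Mat[i][j];
                EllColumn[i][c] = j;
                c++;
            } else {                           /* overflow: CSR */
                Value[index] = Mat[i][j];
                CsrColumn[index] = j;
                index++;
            }
            AllNonZerosCount[i]++;
        }
        if (AllNonZerosCount[i] > BOUNDARY) {  /* only overflowing rows */
            StartArr[i] = start;
            EndArr[i] = index;
        }
    }
}
```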

The total memory consumption of the proposed format is:

(ROWS × BOUNDARY × sizeof(double)) + (ROWS × BOUNDARY × sizeof(int)) + (CSRNonZeros × sizeof(double)) + (CSRNonZeros × sizeof(unsigned int))

- **ROWS**: The number of rows in the matrix.
- **BOUNDARY**: A divider between the Reference region (ELLPACK format) and the Expansion Space region (CSR format). It marks the end of the Reference region and the start of the Expansion Space region.
- **CSRNonZeros**: The number of non-zero entries in the Expansion Space region.

For comparison, the memory consumption of the CSR format is:

(AllNonZeros × sizeof(double)) + (AllNonZeros × sizeof(unsigned int)) + ((ROWS + 1) × sizeof(unsigned int))

- **AllNonZeros**: The number of non-zero entries in the matrix.
- **ROWS**: The number of rows in the matrix.

The memory consumption of the ELLPACK format is:

(ROWS × MaxNonZeros × sizeof(double)) + (ROWS × MaxNonZeros × sizeof(int))

- **ROWS**: The number of rows in the matrix.
- **MaxNonZeros**: The length of the longest non-zero entry row in the matrix.
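The three memory formulas can be checked with a few lines of C. The helper names and the example parameter values in the comments are our own; the formulas themselves are the ones stated above (the byte counts assume the common sizeof(double) = 8, sizeof(int) = sizeof(unsigned int) = 4):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes used by the proposed hybrid format. */
static size_t proposed_bytes(size_t rows, size_t boundary, size_t csr_nnz)
{
    return rows * boundary * sizeof(double) + rows * boundary * sizeof(int)
         + csr_nnz * sizeof(double) + csr_nnz * sizeof(unsigned int);
}

/* Bytes used by CSR: Value + Column + RowPtr. */
static size_t csr_bytes(size_t all_nnz, size_t rows)
{
    return all_nnz * sizeof(double) + all_nnz * sizeof(unsigned int)
         + (rows + 1) * sizeof(unsigned int);
}

/* Bytes used by ELLPACK: padded value and column matrices. */
static size_t ellpack_bytes(size_t rows, size_t max_nnz)
{
    return rows * max_nnz * sizeof(double) + rows * max_nnz * sizeof(int);
}
```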

#### 4.2. The Developed SpMV Kernel

**Algorithm 7** The SpMV kernel for the proposed format.

```
SpMV_Hybrid_ELLPACKandCSR(NonZerosEntries, ELLPACK_Column, Value, CSR_Column,
                          StartArr, EndArr, AllNonZerosCount, Vector, Result)
  Thread_id ← (blockDim.x × blockIdx.x) + threadIdx.x
  Warp_id ← Thread_id / 32
  Th_Wa_Id ← Thread_id % 32
  define ShMem1[32]                  // Shared memory
  define ShMem2[32]                  // Shared memory
  initialize ShMem1
  initialize ShMem2
  if Warp_id < ROWS then
    ShMem1[threadIdx.x] ← 0.00
    for r2 ← 0 + Th_Wa_Id to BOUNDARY – 1 STEP 32 do
      ShMem1[threadIdx.x] ← ShMem1[threadIdx.x] +
          (NonZerosEntries[Warp_id][r2] × Vector[ELLPACK_Column[Warp_id][r2]])
    // Warp-level reduction for the ELLPACK format:
    if Th_Wa_Id < 16 then ShMem1[threadIdx.x] ←(SynchronousAdd) ShMem1[threadIdx.x] + ShMem1[threadIdx.x + 16]
    if Th_Wa_Id < 8 then ShMem1[threadIdx.x] ←(SynchronousAdd) ShMem1[threadIdx.x] + ShMem1[threadIdx.x + 8]
    if Th_Wa_Id < 4 then ShMem1[threadIdx.x] ←(SynchronousAdd) ShMem1[threadIdx.x] + ShMem1[threadIdx.x + 4]
    if Th_Wa_Id < 2 then ShMem1[threadIdx.x] ←(SynchronousAdd) ShMem1[threadIdx.x] + ShMem1[threadIdx.x + 2]
    if Th_Wa_Id < 1 then ShMem1[threadIdx.x] ←(SynchronousAdd) ShMem1[threadIdx.x] + ShMem1[threadIdx.x + 1]
    if AllNonZerosCount[Warp_id] > BOUNDARY then
      Start ← StartArr[Warp_id]
      End ← EndArr[Warp_id]
      ShMem2[threadIdx.x] ← 0.00
      for j ← Start + Th_Wa_Id to End – 1 STEP 32 do
        ShMem2[threadIdx.x] ← ShMem2[threadIdx.x] + (Value[j] × Vector[CSR_Column[j]])
      // Warp-level reduction for the CSR format:
      if Th_Wa_Id < 16 then ShMem2[threadIdx.x] ←(SynchronousAdd) ShMem2[threadIdx.x] + ShMem2[threadIdx.x + 16]
      if Th_Wa_Id < 8 then ShMem2[threadIdx.x] ←(SynchronousAdd) ShMem2[threadIdx.x] + ShMem2[threadIdx.x + 8]
      if Th_Wa_Id < 4 then ShMem2[threadIdx.x] ←(SynchronousAdd) ShMem2[threadIdx.x] + ShMem2[threadIdx.x + 4]
      if Th_Wa_Id < 2 then ShMem2[threadIdx.x] ←(SynchronousAdd) ShMem2[threadIdx.x] + ShMem2[threadIdx.x + 2]
      if Th_Wa_Id < 1 then ShMem2[threadIdx.x] ←(SynchronousAdd) ShMem2[threadIdx.x] + ShMem2[threadIdx.x + 1]
    if Th_Wa_Id = 0 then               // Writing the results
      Result[Warp_id] ← Result[Warp_id] + ShMem1[threadIdx.x] + ShMem2[threadIdx.x]
```
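The arithmetic performed by one warp can be checked serially: each row sums an ELLPACK part (the first BOUNDARY non-zeros) and, only for overflowing rows, a CSR part. In this C sketch (our own, with the per-lane partial sums and the warp reduction collapsed into plain loops; the small hard-coded arrays describe a hypothetical 4 × 6 matrix, not the paper's data):

```c
#include <assert.h>

#define ROWS 4
#define BOUNDARY 2

/* Hybrid-format arrays for a hypothetical 4 x 6 matrix. */
static const double NonZerosEntries[ROWS][BOUNDARY] = {{1, 2}, {5, 0}, {6, 7}, {10, 0}};
static const int    EllColumn[ROWS][BOUNDARY]       = {{0, 1}, {1, -1}, {0, 2}, {5, -1}};
static const double Value[4]     = {3, 4, 8, 9};
static const int    CsrColumn[4] = {2, 3, 4, 5};
static const int    AllNonZerosCount[ROWS] = {4, 1, 4, 1};
static const int    StartArr[ROWS] = {0, 0, 2, 0};
static const int    EndArr[ROWS]   = {2, 0, 4, 0};

/* Serial sketch of Algorithm 7: each GPU warp owns one row ("Warp_id").
   The -1 check replaces multiplying by the padded zeros. */
static void spmv_hybrid(const double *vector, double *result)
{
    for (int row = 0; row < ROWS; row++) {
        double sum = 0.0;
        for (int r2 = 0; r2 < BOUNDARY; r2++)    /* ELLPACK (Reference) part */
            if (EllColumn[row][r2] != -1)
                sum += NonZerosEntries[row][r2] * vector[EllColumn[row][r2]];
        if (AllNonZerosCount[row] > BOUNDARY)    /* CSR (Expansion Space) part */
            for (int j = StartArr[row]; j < EndArr[row]; j++)
                sum += Value[j] * vector[CsrColumn[j]];
        result[row] += sum;                      /* accumulate, as in the kernel */
    }
}
```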

## 5. System Configuration

Each compute node has either an NVIDIA K20m GPU or an Intel^{®} Xeon Phi™ co-processor. The 16 compute nodes that have NVIDIA K20m GPUs can be used to process CUDA code. Each compute node has the following configuration:

- NVIDIA K20m GPU card or Intel^{®} Xeon Phi™ co-processor card.
- 64 GB of memory.
- 150 GB RAID 1 7.2 K RPM storage hard drives.
- Red Hat Enterprise Linux (RHEL) 7.0 operating system.

## 6. The Results

- cudaDeviceProp DeviceProp_var;
- cudaGetDeviceProperties(&DeviceProp_var, DevNo);
- Print DeviceProp_var.warpSize.

The CI matrix can be of size 10^{7} by 10^{7} or even more. CI sparse matrices have the same structure in terms of having two regions (the Reference region and the Expansion Space region) with two different sparsity settings. We used 20 matrices for testing: the first 10 matrices were 32,768 by 32,768, and the other 10 matrices were 1,048,576 by 1,048,576. The first 10 matrices have different numbers of non-zero elements, as illustrated in Table 3. The sparsity results in Table 4 look close because these matrices have a huge number of elements (1,073,741,824); when the number of non-zero elements is divided by that number to obtain the total sparsity of each CI sparse matrix, the resulting percentages come out very close, which holds true for most CI matrices.
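The sparsity arithmetic is easy to verify: for Matrix 1, dividing its 31,141,367 non-zeros (Table 3) by the 1,073,741,824 elements of a 32,768 × 32,768 matrix and subtracting from one reproduces the 0.970997 total sparsity reported in Table 4 (the helper name `sparsity` is our own):

```c
#include <assert.h>

/* Total sparsity = 1 - nnz / (rows * cols). */
static double sparsity(double nnz, double elements)
{
    return 1.0 - nnz / elements;
}
```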

## 7. Conclusions

## 8. Future Work

Quaternions take the form Q = a + bi + cj + dk, with a, b, c, d ∈ R and i^{2} = −1, j^{2} = −1, k^{2} = −1, ijk = −1. They extend complex numbers. When we move from complex numbers to quaternions, the objects no longer form a field; algebraically they form a (noncommutative) division ring. In particular, we lose commutativity: for example, ij = k whilst ji = −k. Subsequently, we will compare the proposed model (quaternion version) to other quaternion models. Of course, the comparison will be based on the two crucial key factors: the amount of memory used, and performance.

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Mei, X.; Chu, X. Dissecting GPU Memory Hierarchy through Microbenchmarking. IEEE Trans. Parallel Distrib. Syst. **2017**, 28, 72–86.
- Nickolls, J.; Buck, I.; Garland, M.; Skadron, K. Scalable Parallel Programming with CUDA. In Proceedings of the ACM SIGGRAPH 2008 Classes, Los Angeles, CA, USA, 11–15 August 2008.
- NVIDIA Corporation. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model. In NVIDIA CUDA Programming Guide; Version 2.0; NVIDIA: Santa Clara, CA, USA, 2008.
- NVIDIA Developer’s Guide, About CUDA. Available online: https://developer.nvidia.com/about-cuda (accessed on 2 October 2017).
- Kulkarni, M.A.V.; Barde, P.C.R. A Survey on Performance Modelling and Optimization Techniques for SpMV on GPUs. Int. J. Comput. Sci. Inf. Technol. **2014**, 5, 7577–7582.
- Baskaran, M.M.; Bordawekar, R. Optimizing Sparse Matrix-Vector Multiplication on GPUs. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 2009; IBM Research Report RC24704.
- Bell, N.; Garland, M. Efficient Sparse Matrix-Vector Multiplication on CUDA. In Proceedings of the 2010 2nd International Conference on Education Technology and Computer, Shanghai, China, 22–24 June 2010.
- Guo, D.; Gropp, W.; Olson, L.N. A Hybrid Format for Better Performance of Sparse Matrix-Vector Multiplication on a GPU. Int. J. High Perform. Comput. Appl. **2016**, 30, 103–120.
- Echenique, P.; Alonso, J.L. A Mathematical and Computational Review of Hartree-Fock SCF Methods in Quantum Chemistry. Mol. Phys. **2007**, 105, 3057–3098.
- Anthony, B.M.; Stevens, J. Virtually Going Green: The Role of Quantum Computational Chemistry in Reducing Pollution and Toxicity in Chemistry. Phys. Sci. Rev. **2017**, 2.
- Siegbahn, P.E.M. The Configuration Interaction Method. Eur. Summer Sch. Quantum Chem. **2000**, 1, 241–284.
- Alavi, A. Introduction to Full Configuration Interaction Quantum Monte Carlo with Applications to the Hubbard Model. Autumn Sch. Correl. Electrons **2016**, 6, 1–14.
- Aktulga, H.M.; Buluç, A.; Williams, S.; Yang, C. Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, USA, 19–23 May 2014.
- Zardoshti, P.; Khunjush, F.; Sarbazi-Azad, H. Adaptive Sparse Matrix Representation for Efficient Matrix-Vector Multiplication. J. Supercomput. **2016**, 72, 3366–3386.
- Ren, Z.; Ye, C.; Liu, G. Application and Research of C Language Programming Examination System Based on B/S. In Proceedings of the 2010 Third International Symposium on Information Processing, Qingdao, China, 15–17 October 2010.
- Bell, N.; Garland, M. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, OR, USA, 14–20 November 2009.
- Dziekonski, A.; Lamecki, A.; Mrozowski, M. A Memory Efficient and Fast Sparse Matrix Vector Product on a GPU. Prog. Electromagn. Res. **2011**, 116, 49–63.
- Ye, F.; Calvin, C.; Petiton, S.G. A Study of SpMV Implementation Using MPI and OpenMP on Intel Many-Core Architecture. In High Performance Computing for Computational Science—VECPAR 2014; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2015; pp. 43–56.
- Maggioni, M.; Berger-Wolf, T. AdELL: An Adaptive Warp-Balancing ELL Format for Efficient Sparse Matrix-Vector Multiplication on GPUs. In Proceedings of the 2013 42nd International Conference on Parallel Processing, Lyon, France, 1–4 October 2013.
- Reguly, I.; Giles, M. Efficient Sparse Matrix-Vector Multiplication on Cache-Based GPUs. In Proceedings of the 2012 Innovative Parallel Computing (InPar), San Jose, CA, USA, 13–14 May 2012.
- Buluç, A.; Fineman, J.T.; Frigo, M.; Gilbert, J.R.; Leiserson, C.E. Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, Calgary, AB, Canada, 11–13 August 2009.
- Aktulga, H.M.; Afibuzzaman, M.; Williams, S.; Buluç, A.; Shao, M.; Yang, C.; Ng, E.G.; Maris, P.; Vary, J.P. A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations. IEEE Trans. Parallel Distrib. Syst. **2016**, 28, 1550–1563.
- CUDA Toolkit Documentation, nvprof. Available online: https://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview (accessed on 19 July 2018).
- CUDA Toolkit Documentation, cuSPARSE. Available online: https://docs.nvidia.com/cuda/cusparse/index.html (accessed on 16 July 2018).
- Liu, W.; Vinter, B. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing, Newport Beach, CA, USA, 8–11 June 2015.

**Figure 1.**The Graphics Processing Unit (GPU) memory hierarchy [1].

**Figure 7.**Performance with block size of 32 using 10 different 1,048,576 by 1,048,576 CI sparse matrices.

| Excitation Level (n) | Total Number of CSFs |
|---|---|
| 1 | 71 |
| 2 | 2556 |
| 3 | 42,596 |
| 4 | 391,126 |
| 5 | 2,114,666 |
| 6 | 7,147,876 |
| 7 | 15,836,556 |
| 8 | 24,490,201 |
| 9 | 29,044,751 |
| 10 | 30,046,752 |

| Name | Tesla K20m |
|---|---|
| Major revision number | 3 |
| Minor revision number | 5 |
| Maximum memory pitch | 2,147,483,647 |
| Clock rate | 705,500 |
| Texture alignment | 512 |
| Concurrent copy and execution | Yes |
| Kernel execution timeout enabled | No |
| Number of multiprocessors | 13 |
| The maximum number of threads per multiprocessor | 2048 |
| The maximum number of threads per block | 1024 |
| The maximum sizes of each dimension of a block (x, y, z) | 1024, 1024, 64 |
| The maximum sizes of each dimension of a grid (x, y, z) | 2,147,483,647, 65,535, 65,535 |
| The total number of registers available per block | 65,536 |
| The total amount of shared memory per block (Bytes) | 49,152 |
| The total amount of constant memory (Bytes) | 65,536 |
| The total amount of global memory (Bytes), (Gigabytes) | 4,972,937,216, 4.631409 |
| Warp size (Threads) | 32 |

| | Ref. Non-Zeros | Exp. Space Non-Zeros | Total Non-Zeros | Length of the Longest Non-Zero Row–Row No. |
|---|---|---|---|---|
| Matrix 1 | 21,463,040 | 9,678,327 | 31,141,367 | 1074–5314 |
| Matrix 2 | 21,463,040 | 9,673,031 | 31,136,071 | 1069–6144 |
| Matrix 3 | 21,463,040 | 9,666,827 | 31,129,867 | 1078–22,358 |
| Matrix 4 | 21,463,040 | 9,670,529 | 31,133,569 | 1068–21,549 |
| Matrix 5 | 21,463,040 | 9,669,733 | 31,132,773 | 1060–4233 |
| Matrix 6 | 21,463,040 | 9,673,530 | 31,136,570 | 1069–28,394 |
| Matrix 7 | 21,463,040 | 9,666,857 | 31,129,897 | 1079–21,836 |
| Matrix 8 | 21,463,040 | 9,670,234 | 31,133,274 | 1077–20,148 |
| Matrix 9 | 21,463,040 | 9,661,419 | 31,124,459 | 1067–12,787 |
| Matrix 10 | 21,463,040 | 9,670,316 | 31,133,356 | 1064–25,678 |

| | Ref. Region Spar. (%) | Exp. Space Region Spar. (%) | Total Spar. (%) |
|---|---|---|---|
| Matrix 1 | 0.800110 | 0.989985 | 0.970997 |
| Matrix 2 | 0.800110 | 0.989990 | 0.971002 |
| Matrix 3 | 0.800110 | 0.989997 | 0.971008 |
| Matrix 4 | 0.800110 | 0.989993 | 0.971005 |
| Matrix 5 | 0.800110 | 0.989994 | 0.971005 |
| Matrix 6 | 0.800110 | 0.989990 | 0.971002 |
| Matrix 7 | 0.800110 | 0.989997 | 0.971008 |
| Matrix 8 | 0.800110 | 0.989993 | 0.971005 |
| Matrix 9 | 0.800110 | 0.990002 | 0.971013 |
| Matrix 10 | 0.800110 | 0.989993 | 0.971005 |

| | CSR | ELL | Sliced ELLPACK | Pro. Model |
|---|---|---|---|---|
| Matrix 1 | 356.44 | 407.25 | 397.13 | 356.76 |
| Matrix 2 | 356.46 | 398.25 | 395.06 | 356.70 |
| Matrix 3 | 356.39 | 405.75 | 404.44 | 356.63 |
| Matrix 4 | 356.49 | 403.88 | 397.03 | 356.67 |
| Matrix 5 | 356.39 | 399.38 | 395.06 | 356.66 |
| Matrix 6 | 356.38 | 399.75 | 397.59 | 356.70 |
| Matrix 7 | 356.45 | 400.88 | 396.94 | 356.63 |
| Matrix 8 | 356.51 | 401.25 | 396.56 | 356.67 |
| Matrix 9 | 356.43 | 402.75 | 398.25 | 356.57 |
| Matrix 10 | 356.44 | 405 | 403.13 | 356.67 |
| Average | 356.44 | 402.41 | 398.12 | 356.67 |

| Block Size | Min | Max | Avg |
|---|---|---|---|
| 16 | 4.8877 ms | 4.8877 ms | 4.8877 ms |
| 16 | 4.8884 ms | 4.8884 ms | 4.8884 ms |
| 16 | 4.8891 ms | 4.8891 ms | 4.8891 ms |
| 16 | 4.8882 ms | 4.8882 ms | 4.8882 ms |
| 16 | 4.8870 ms | 4.8870 ms | 4.8870 ms |
| Average | | | 4.8881 ms |

| Block Size | Min | Max | Avg |
|---|---|---|---|
| 32 | 3.2847 ms | 3.2847 ms | 3.2847 ms |
| 32 | 3.2838 ms | 3.2838 ms | 3.2838 ms |
| 32 | 3.2843 ms | 3.2843 ms | 3.2843 ms |
| 32 | 3.2834 ms | 3.2834 ms | 3.2834 ms |
| 32 | 3.2859 ms | 3.2859 ms | 3.2859 ms |
| Average | | | 3.2844 ms |

| Block Size | Min | Max | Avg |
|---|---|---|---|
| 64 | 3.5451 ms | 3.5451 ms | 3.5451 ms |
| 64 | 3.5424 ms | 3.5424 ms | 3.5424 ms |
| 64 | 3.5437 ms | 3.5437 ms | 3.5437 ms |
| 64 | 3.5410 ms | 3.5410 ms | 3.5410 ms |
| 64 | 3.5448 ms | 3.5448 ms | 3.5448 ms |
| Average | | | 3.5434 ms |

| | cuSPARSE | CSR5 | The Prop. Model |
|---|---|---|---|
| Matrix 1 | 12.6567 | 7.3672 | 3.3661 |
| Matrix 2 | 12.8334 | 8.1694 | 3.3573 |
| Matrix 3 | 12.1227 | 7.8954 | 3.4127 |
| Matrix 4 | 13.1043 | 8.2654 | 3.5434 |
| Matrix 5 | 13.4352 | 8.1489 | 3.8167 |
| Matrix 6 | 12.3672 | 8.2336 | 3.1884 |
| Matrix 7 | 13.2559 | 8.3962 | 3.3655 |
| Matrix 8 | 13.1655 | 7.3672 | 3.3784 |
| Matrix 9 | 12.8985 | 8.5934 | 4.1836 |
| Matrix 10 | 13.3568 | 7.8349 | 3.3471 |
| Average | 12.9196 | 8.0272 | 3.3661 |

**Table 10.**Performance with block size of 16 using 10 different 1,048,576 by 1,048,576 CI sparse matrices.

| | cuSPARSE | CSR5 | The Prop. Model |
|---|---|---|---|
| Matrix 1 | 11.9564 | 7.8736 | 3.1583 |
| Matrix 2 | 12.3787 | 8.2376 | 2.9672 |
| Matrix 3 | 12.4513 | 9.5462 | 3.1987 |
| Matrix 4 | 13.5645 | 7.9456 | 2.7761 |
| Matrix 5 | 12.6587 | 7.9348 | 3.3762 |
| Matrix 6 | 12.7457 | 7.1764 | 2.9376 |
| Matrix 7 | 12.4874 | 7.3672 | 2.8729 |
| Matrix 8 | 12.4138 | 7.6327 | 3.6534 |
| Matrix 9 | 12.0845 | 8.3432 | 3.6249 |
| Matrix 10 | 13.0198 | 7.7482 | 3.2653 |
| Average | 12.5761 | 7.9806 | 3.1831 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mahmoud, M.; Hoffmann, M.; Reza, H.
Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU. *Computation* **2018**, *6*, 45.
https://doi.org/10.3390/computation6030045
