# Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)

## Abstract

## 1. Introduction

## 2. Dataset, Sparsity Features, and Performance Metrics

#### 2.1. Dataset and Sparsity Features

#### 2.2. Performance Metrics

## 3. Compressed Sparse Row (CSR)

#### 3.1. Execution Time

#### 3.2. GPU Throughput

#### 3.3. GPU Utilization

## 4. ELLPACK (ELL)

#### 4.1. Execution Time

#### 4.2. GPU Throughput

#### 4.3. GPU Utilization

## 5. Hybrid ELL/COO (HYB)

#### 5.1. Execution Time

#### 5.2. GPU Throughput

#### 5.3. GPU Utilization

## 6. Compressed Sparse Row 5 (CSR5)

#### 6.1. Execution Time

#### 6.2. GPU Throughput

#### 6.3. GPU Utilization

## 7. SpMV Performance on GPUs (Summary)

## 8. The Proposed Scheme (HCGHYB)

#### 8.1. HCGHYB: Motivation and Description

#### 8.2. HCGHYB: Performance Analysis

#### 8.2.1. Execution Time

#### 8.2.2. GPU Throughput

#### 8.2.3. GPU Utilization

## 9. Conclusions and Future Work

**Figure 1.** Compressed sparse row (CSR) execution time against: (**a**) nonzero elements in the matrices ($\mathit{nnz}$) and (**b**) variance of nonzero elements per row ($\mathit{npr}\ \mathit{variance}$).

**Figure 2.** CSR giga floating point operations per second (GFLOPS) against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 3.** CSR achieved occupancy against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 4.** CSR instructions per warp (IPW) against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 5.** CSR warp efficiency against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 6.** ELLPACK (ELL) execution time against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 7.** ELL GFLOPS against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 8.** ELL achieved occupancy against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 9.** ELL instructions per warp (IPW) against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 10.** ELL warp efficiency against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 11.** Hybrid ELL/COO (HYB) execution time against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 12.** HYB GFLOPS against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 13.** HYB achieved occupancy against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 14.** HYB instructions per warp (IPW) against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 15.** HYB warp efficiency against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 16.** Compressed sparse row 5 (CSR5) execution time against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 17.** CSR5 GFLOPS against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 18.** CSR5 achieved occupancy against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 19.** CSR5 instructions per warp (IPW) against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 20.** CSR5 warp efficiency against: (**a**) $\mathit{nnz}$ and (**b**) $\mathit{npr}\ \mathit{variance}$.

**Figure 22.** Heterogeneous CPU–GPU hybrid (HCGHYB) GPU throughput compared to CSR, ELL, HYB, and CSR5.

Matrix Name | Rows | Columns | $\mathit{nnz}$ | $\mathit{npr}\ \mathit{variance}$ | $\mathit{anpr}$ | $\mathit{maxnpr}$ | $\mathit{distavg}$ | Application Domain |
---|---|---|---|---|---|---|---|---|
bayer04 | 20,545 | 20,545 | 159,082 | 8.2 | 7 | 34 | 4115.99 | Chemical Simulation |
ch7-8-b5 | 141,120 | 141,120 | 846,720 | 0 | 6 | 6 | 40,549.39 | Combinatorics |
copter2 | 55,476 | 55,476 | 407,714 | 3.55 | 7 | 20 | 28,217.87 | Computational Fluid Dynamics |
fd15 | 11,532 | 11,532 | 44,206 | 1.65 | 3 | 6 | 2690.89 | Materials |
Fp | 7548 | 7548 | 848,553 | 207.83 | 112 | 957 | 6388.57 | Electromagnetics |
lhr10 | 10,672 | 10,672 | 232,633 | 26.37 | 21 | 63 | 3380.56 | Chemical Simulation |
lp_stocfor3 | 16,675 | 23,541 | 76,473 | 3.34 | 4 | 15 | 3123.99 | Linear Programming |
mark3jac120 | 54,929 | 54,929 | 342,475 | 4.36 | 6 | 44 | 1960.54 | Economics |
Meg4 | 5860 | 5860 | 26,324 | 16.66 | 4 | 1193 | 1758.92 | Circuit Simulation |
poli4 | 15,575 | 15,575 | 33,074 | 8.93 | 2 | 491 | 261.04 | Economics |
poli_large | 33,833 | 33,833 | 73,249 | 7.57 | 2 | 304 | 248.12 | Economics |
sinc18 | 16,428 | 16,428 | 973,826 | 34.32 | 59 | 111 | 4369.81 | Materials |
Tols4000 | 4000 | 4000 | 8784 | 5.92 | 2 | 90 | 1130.87 | Computational Fluid Dynamics |
TSOPF_RS_b300_c2 | 28,338 | 28,338 | 2,943,887 | 102.4 | 103 | 209 | 25,564.97 | Power Network |
Tuma2 | 12,992 | 12,992 | 28,440 | 1.2 | 2 | 5 | 4226.74 | 2D/3D |
xenon2 | 157,464 | 157,464 | 3,866,688 | 4.11 | 24 | 27 | 4934.59 | Materials |
Zd_Jac6 | 22,835 | 22,835 | 1,711,983 | 175.49 | 74 | 1050 | 3436.54 | Chemical Simulation |
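
As an illustration of how the per-row features in the table above relate to a stored matrix, the following minimal sketch derives $\mathit{nnz}$, $\mathit{anpr}$, $\mathit{maxnpr}$, and $\mathit{npr}\ \mathit{variance}$ from a CSR row-pointer array. The function name `sparsity_features` and the sample array are hypothetical, and the use of the population variance is an assumption, since the exact formula is not given here. $\mathit{distavg}$ is left out because its definition does not appear in this section.

```python
from statistics import mean, pvariance


def sparsity_features(row_ptr):
    """Return nnz, anpr, maxnpr and npr variance from a CSR row-pointer array."""
    # npr[i] is the number of stored nonzeros in row i.
    npr = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    return {
        "nnz": row_ptr[-1] - row_ptr[0],
        "anpr": mean(npr),
        "maxnpr": max(npr),
        # Assumption: population variance; the paper's formula may differ.
        "npr_variance": pvariance(npr),
    }


# Hypothetical 4-row matrix with 2, 0, 3 and 1 nonzeros per row.
features = sparsity_features([0, 2, 2, 5, 6])
print(features)
```

The $\mathit{anpr}$ column in the table appears to be reported as a whole number, so a value computed this way may need rounding before it is compared with the table.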

Format | nnz | npr variance | distavg | anpr | maxnpr |
---|---|---|---|---|---|
CSR | medium | high | medium | medium | high |
ELL | medium | high | low | medium | high |
HYB | medium | medium | low | medium | medium |
CSR5 | medium | low | low | medium | low |
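
The ratings in the table above are not explained in this section. One plausible reading of the high $\mathit{maxnpr}$ rating for ELL is padding: ELL stores every row with the length of the longest row, so a single long row inflates storage and work for all rows. The sketch below estimates the fraction of ELL slots that hold real nonzeros, using the rows, $\mathit{nnz}$, and $\mathit{maxnpr}$ values of two matrices from the dataset table. The helper `ell_fill_ratio` is hypothetical and assumes a plain ELL layout without any row reordering or slicing.

```python
def ell_fill_ratio(rows, nnz, maxnpr):
    """Fraction of ELL slots that hold real nonzeros; the rest is padding."""
    slots = rows * maxnpr  # plain ELL stores maxnpr entries for every row
    return nnz / slots


# (rows, nnz, maxnpr) copied from the dataset table.
matrices = {
    "ch7-8-b5": (141120, 846720, 6),
    "Meg4": (5860, 26324, 1193),
}

ratios = {name: ell_fill_ratio(*vals) for name, vals in matrices.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

For ch7-8-b5, where every row has the same length, no slot is wasted. For Meg4, fewer than 1% of the slots hold real data, which is the kind of case that hybrid formats such as HYB are designed to handle by moving the long rows out of the ELL part.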

