Author Contributions
Conceptualization, Z.S.; Methodology, Z.S.; Software, Z.S. and X.S.; Validation, Z.S. and X.S.; Formal analysis, Z.S. and X.S.; Investigation, Z.S. and X.S.; Resources, Z.S.; Data curation, Z.S.; Writing—original draft, Z.S.; Writing—review & editing, Z.S., Y.Z. and X.S.; Visualization, Z.S.; Supervision, Z.S. and Y.Z.; Project administration, Z.S.; Funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Illustration of four common sparse matrix storage formats: COO, CSR, CSC, and BSR. Each of the four blocks visualizes the indexing and data-layout pattern of one format.
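A minimal sketch of the four formats in Figure 1, using scipy.sparse for illustration (the matrix and block size here are placeholders, not the paper's data):

```python
# Build one random sparse matrix and materialize it in the four storage
# formats shown in Figure 1 (COO, CSR, CSC, BSR).
import scipy.sparse as sp

A = sp.random(1000, 1000, density=0.01, format="coo", random_state=0)

formats = {
    "COO": A.tocoo(),                  # explicit (row, col, value) triplets
    "CSR": A.tocsr(),                  # compressed row pointers + column indices
    "CSC": A.tocsc(),                  # compressed column pointers + row indices
    "BSR": A.tobsr(blocksize=(4, 4)),  # dense blocks of a fixed size
}
for name, M in formats.items():
    print(name, M.nnz, type(M).__name__)
```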
Figure 2.
Reproducibility and label-consistency issues for CPUs (left) and GPUs (right). In both cases, two rounds of execution (each averaging 50 runs) show that the optimal format for the same matrix may change across rounds. Moreover, for matrices such as mesh1em1.mtx, multiple formats perform nearly equally well, indicating that assigning a single “best format” can be insufficient.
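A hedged sketch of the measurement idea behind Figure 2: average many SpMV runs per format and treat formats within a small tolerance of the best time as co-optimal instead of forcing a single "best" label. The 50-run averaging matches the caption; the 5% tolerance and helper names are illustrative assumptions.

```python
# Average SpMV time per format over repeated runs, then collect near-ties.
import time
import numpy as np
import scipy.sparse as sp

def mean_spmv_time(M, x, runs=50):
    t0 = time.perf_counter()
    for _ in range(runs):
        M @ x
    return (time.perf_counter() - t0) / runs

A = sp.random(5000, 5000, density=0.002, format="csr", random_state=1)
x = np.ones(A.shape[1])

times = {
    "CSR": mean_spmv_time(A.tocsr(), x),
    "CSC": mean_spmv_time(A.tocsc(), x),
    "COO": mean_spmv_time(A.tocoo(), x),
}
best = min(times.values())
co_optimal = [f for f, t in times.items() if t <= 1.05 * best]  # assumed 5% tolerance
print(times, co_optimal)
```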
Figure 3.
Overall workflow of the proposed framework. The pipeline starts with device selection (CPU/GPU), followed by feature extraction and model prediction (executed in parallel, as highlighted by the blue dashed box). Depending on whether the original format is optimal, the matrix is either executed directly or converted to a new format before SpMV.
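A hedged sketch of the Figure 3 control flow with toy stand-ins; the real framework dispatches to CPU/GPU kernels and trained models, whereas here every branch runs on the CPU via scipy, and the thresholds and function names are illustrative, not the paper's API.

```python
# Toy version of the pipeline: select device, predict a format, convert if
# needed, then run SpMV in the chosen format.
import numpy as np
import scipy.sparse as sp

def select_device(A):
    # Stand-in for Model S: prefer the GPU only for sufficiently large inputs.
    return "GPU" if A.nnz > 100_000 else "CPU"

def predict_format(A, device):
    # Stand-in for the per-device format classifier.
    return "CSC" if device == "GPU" else "CSR"

CONVERTERS = {"CSR": lambda A: A.tocsr(), "CSC": lambda A: A.tocsc(),
              "COO": lambda A: A.tocoo(), "BSR": lambda A: A.tobsr()}

def spmv_pipeline(A, x, original_format="CSR"):
    device = select_device(A)
    fmt = predict_format(A, device)
    if fmt != original_format:
        A = CONVERTERS[fmt](A)   # pay the conversion cost once
    return A @ x                 # SpMV in the chosen format

A = sp.random(2000, 2000, density=0.01, format="csr", random_state=0)
y = spmv_pipeline(A, np.ones(2000))
```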
Figure 4.
(1) Feature importance ranking of log-transformed attributes for CPU vs. GPU selection using XGBoost. (2) Distribution of CPU vs. GPU execution preference across different hardware platforms: (2a) AMD EPYC 7713 (CPU) and NVIDIA A40 (GPU); (2b) Intel Xeon Gold 6230 @ 2.10 GHz (CPU) and NVIDIA A40 (GPU).
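A hedged sketch of the Figure 4(1) analysis: train an XGBoost classifier on log-transformed structural features and read back per-feature importances. The synthetic data, the toy labeling rule, and the feature subset are placeholders for the paper's dataset.

```python
# Feature importance from an XGBoost classifier over log10-scaled features.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
names = ["rows", "cols", "nnz", "mean_rd", "std_rd"]
X = np.log10(rng.integers(1, 10**6, size=(500, len(names))).astype(float))
y = (X[:, 2] > 4.5).astype(int)          # toy rule: large nnz prefers the GPU

model = XGBClassifier(n_estimators=100, max_depth=4)
model.fit(X, y)

for name, score in sorted(zip(names, model.feature_importances_),
                          key=lambda p: -p[1]):
    print(f"{name:8s} {score:.3f}")
```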
Figure 5.
Three-dimensional distributions of matrix formats on CPUs in the feature space (log10(nnz), log10(rows), log10(cols)). (a) Before clustering on AMD EPYC 7713; (b) After clustering on AMD EPYC 7713; (c) Before clustering on Intel Xeon Gold 6230; (d) After clustering on Intel Xeon Gold 6230. Dynamic clustering drives samples to stable fixed-label centroids and clarifies inter-class boundaries.
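A minimal sketch of the 3D feature space used in Figures 5 and 6, i.e., one point (log10(nnz), log10(rows), log10(cols)) per matrix; the example matrix is a placeholder.

```python
# Compute the 3D log-scaled coordinates of a matrix in the clustering space.
import numpy as np
import scipy.sparse as sp

def log_features(A):
    rows, cols = A.shape
    return np.array([np.log10(A.nnz), np.log10(rows), np.log10(cols)])

A = sp.random(4000, 3000, density=0.005, format="csr", random_state=0)
print(log_features(A))   # one point in the 3D clustering space
```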
Figure 6.
Three-dimensional feature space distribution of formats on GPUs (left: before clustering; right: after clustering). Only CSR and CSC remain, with CSC dominating (∼80%). After clustering, some CSR points are absorbed into the CSC cluster in the upper-right region. COO is unsuitable for GPUs due to its high atomic operation and synchronization costs, while BSR/DIA suffer from high conversion overhead in general datasets.
Figure 7.
Illustration of the clustering process for dynamic labels in a 3D feature space. Left: a multi-label matrix A is initially assigned to its nearest fixed-label centroid. Middle: a local radius r (dashed circle) filters dispersed points; only one neighboring point lies within the radius. Right: A obtains a stable assignment to the label with the majority and the smallest mean distance.
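A hedged sketch of the Figure 7 assignment rule for a multi-label matrix A: consider the labeled points within a local radius r of A's feature vector, then pick the candidate label that holds the majority among those neighbors, breaking ties by the smallest mean distance. The radius value, data, and function names are assumptions for illustration.

```python
# Dynamic-label assignment: radius filter, then majority vote with a
# mean-distance tie-break among the candidate labels.
import numpy as np

def assign_dynamic_label(a, points, labels, candidates, r=0.5):
    d = np.linalg.norm(points - a, axis=1)
    near = d <= r                              # local radius filters dispersed points
    best, best_key = None, None
    for lab in candidates:
        mask = near & (labels == lab)
        if not mask.any():
            continue
        key = (-mask.sum(), d[mask].mean())    # majority first, then mean distance
        if best_key is None or key < best_key:
            best, best_key = lab, key
    return best

points = np.array([[3.0, 2.9, 3.1], [3.2, 3.0, 3.0], [5.0, 4.8, 4.9]])
labels = np.array(["CSR", "CSR", "CSC"])
print(assign_dynamic_label(np.array([3.1, 3.0, 3.0]),
                           points, labels, candidates=["CSR", "CSC"]))
```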
Figure 8.
Evaluation of Model S for hardware selection. (Left) Margin distribution validates that hardware decisions are stable and consistent. (Right) Feature-plane distribution shows the separation between CPU- and GPU-preferred matrices.
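A hedged sketch of the margin statistic plotted in Figure 8 (left), assuming the margin of a hardware decision is the gap between the classifier's two class probabilities; the exact definition follows the paper's Model S.

```python
# Decision margin per matrix from the hardware classifier's probabilities.
import numpy as np

def decision_margin(proba):
    # proba: (n_samples, 2) array of [P(CPU), P(GPU)] per matrix
    return np.abs(proba[:, 1] - proba[:, 0])

proba = np.array([[0.92, 0.08], [0.55, 0.45], [0.20, 0.80]])
print(decision_margin(proba))   # large values = stable, consistent decisions
```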
Figure 9.
Decomposition of Figure 5b into individual format clusters in CPU feature space. (a) COO, (b) CSR, (c) CSC, (d) BSR. The clear separability among formats explains the higher prediction accuracy observed on CPU platforms.
Figure 10.
Scatter plots over 1786 matrices: (left) feature-extraction time vs. conversion/build time (used as a proxy); (right) feature-extraction time vs. execution time (minimal SpMV time). Each dot is one matrix; the solid line denotes the y = x reference. Pearson r values are shown in the panel titles. All times are measured with the same timing source under the end-to-end accounting (selection, features, prediction, conversion, execution).
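A hedged sketch of the end-to-end accounting behind Figure 10: time each stage with the same clock and correlate feature-extraction time with conversion and execution time across matrices. The stage boundaries, matrix sizes, and loop count are illustrative stand-ins for the 1786-matrix dataset.

```python
# Per-stage timing with a single clock, plus Pearson correlations.
import time
import numpy as np
import scipy.sparse as sp
from scipy.stats import pearsonr

def timed(fn, *args):
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

t_feat, t_conv, t_exec = [], [], []
for seed in range(20):                                   # stand-in for 1786 matrices
    A = sp.random(3000, 3000, density=0.002, format="csr", random_state=seed)
    x = np.ones(A.shape[1])
    _, tf = timed(lambda M: (M.nnz, M.shape, np.diff(M.indptr).std()), A)
    B, tc = timed(A.tocsc)                               # conversion/build proxy
    _, te = timed(lambda: B @ x)                         # SpMV execution
    t_feat.append(tf); t_conv.append(tc); t_exec.append(te)

r_conv, _ = pearsonr(t_feat, t_conv)
r_exec, _ = pearsonr(t_feat, t_exec)
print("r(feat, conv):", r_conv, " r(feat, exec):", r_exec)
```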
Table 1.
List of extracted structural features.
| Abbreviation | Full Name/Description |
|---|---|
| rows | Number of rows of the matrix |
| cols | Number of columns of the matrix |
| nnz | Number of non-zero elements |
| ndiags | Number of occupied diagonals |
| mean_rd | Mean number of non-zeros per row |
| max_rd | Maximum number of non-zeros per row |
| min_rd | Minimum number of non-zeros per row |
| std_rd | Standard deviation of non-zeros per row |
| mean_cd | Mean number of non-zeros per column |
| max_cd | Maximum number of non-zeros per column |
| min_cd | Minimum number of non-zeros per column |
| std_cd | Standard deviation of non-zeros per column |
| ER_RD | Row efficiency ratio |
| ER_CD | Column efficiency ratio |
| row_bounce | Average difference of non-zeros between adjacent rows |
| col_bounce | Average difference of non-zeros between adjacent columns |
| density | Matrix density (nnz / (rows × cols)) |
| row_var | Variance of normalized row non-zeros |
| max_mu | Difference between maximum and mean row non-zeros |
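A minimal sketch, under assumed naming, of how several Table 1 features can be computed from a CSR matrix; the paper's extractor covers the full list and the exact efficiency-ratio and variance definitions.

```python
# Structural feature extraction from a sparse matrix in CSR form.
import numpy as np
import scipy.sparse as sp

def structural_features(A):
    A = A.tocsr()
    rows, cols = A.shape
    rd = np.diff(A.indptr)                      # non-zeros per row
    cd = np.diff(A.tocsc().indptr)              # non-zeros per column
    coo = A.tocoo()
    return {
        "rows": rows, "cols": cols, "nnz": A.nnz,
        "ndiags": np.unique(coo.col - coo.row).size,
        "mean_rd": rd.mean(), "max_rd": rd.max(),
        "min_rd": rd.min(), "std_rd": rd.std(),
        "mean_cd": cd.mean(), "std_cd": cd.std(),
        "row_bounce": np.abs(np.diff(rd)).mean(),
        "col_bounce": np.abs(np.diff(cd)).mean(),
        "density": A.nnz / (rows * cols),
        "max_mu": rd.max() - rd.mean(),
    }

A = sp.random(1000, 800, density=0.01, format="csr", random_state=0)
print(structural_features(A))
```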
Table 2.
Hardware configuration of experimental platforms.
| Component | Server 1 | Server 2 |
|---|---|---|
| CPU | AMD EPYC 7713 (64 cores, 1.7 GHz) | Intel Xeon Gold 6230 (20 cores, 2.1 GHz) |
| Sockets × Cores/Socket × Threads/Core | 2 × 64 × 2 (=256 logical) | 2 × 20 × 2 (=80 logical) |
| CPU threads used (OpenMP) | 256 | 80 |
| GPU | NVIDIA A40 (48 GB) × 2 | NVIDIA A40 (48 GB) × 2 |
| GPU SMs (per A40) | 84 | 84 |
| GPU peak memory bandwidth (per A40) | 696 GB/s | 696 GB/s |
| CUDA Driver | 550.144.03 | 550.144.03 |
| CUDA Version | 12.4 | 12.4 |
| Memory | 512 GB DDR4 | 512 GB DDR4 |
| OS | Ubuntu 22.04 LTS | Ubuntu 22.04 LTS |
Table 3.
Dataset profile over 1786 matrices: summary statistics for rows (n), columns (m), and nonzeros (nnz).
| | Min | P25 | Median | P75 | Max | Mean |
|---|---|---|---|---|---|---|
| rows (n) | 32 | 1007.5 | 4004 | 16,141.75 | 683,446 | 18,419.82 |
| cols (m) | 32 | 1182 | 4938 | 17,670.5 | 683,446 | 21,248.38 |
| nnz | 48 | 10,440.75 | 46,366.5 | 207,284.5 | 28,715,634 | 306,433.42 |
Table 4.
Classification accuracy (%) of different models before and after dynamic labeling and clustering. Accuracy is defined in Equation (9).
| Method | CPU1: DecisionTree | CPU1: 2_Layer_ANN | CPU1: XGBoost | CPU2: DecisionTree | CPU2: 2_Layer_ANN | CPU2: XGBoost | GPU: DecisionTree | GPU: 2_Layer_ANN | GPU: XGBoost |
|---|---|---|---|---|---|---|---|---|---|
| traditional | 79.47 | 83.40 | 86.92 | 76.45 | 80.61 | 81.34 | 92.34 | 93.75 | 91.71 |
| dynamic | 88.20 | 90.06 | 89.71 | 84.27 | 86.83 | 85.76 | 98.27 | 97.92 | 98.95 |
Table 5.
Accuracy comparison between 2D and 3D clustering under different numbers of features. Accuracy is defined in Equations (10a) and (10b).
| Platform | Dim | 2f | 3f | 4f | 5f | 6f | 7f | 8f |
|---|---|---|---|---|---|---|---|---|
| CPU1 | 2D | 85.11% | 86.14% | 85.42% | 83.27% | 85.75% | 86.04% | 84.82% |
| CPU1 | 3D | 88.75% | 90.06% | 90.32% | 89.75% | 88.59% | 89.34% | 89.50% |
| CPU2 | 2D | 85.67% | 86.25% | 86.75% | 85.90% | 85.21% | 84.95% | 85.69% |
| CPU2 | 3D | 85.39% | 86.83% | 86.02% | 86.98% | 84.95% | 86.07% | 84.94% |
| GPU | 2D | 97.71% | 97.80% | 97.54% | 98.08% | 98.19% | 97.82% | 97.93% |
| GPU | 3D | 97.64% | 97.92% | 98.03% | 98.26% | 97.88% | 97.93% | 98.15% |
Table 6.
Performance comparison of GPUs versus CPUs on small and large matrices. Values larger than 1 indicate slowdown on small matrices, while speedup is reported for large matrices. Ratios are defined in Equations (12a) and (12b).
| Platform | Small Matrices (GPU Slowdown): Average | Small Matrices (GPU Slowdown): Maximum | Large Matrices (GPU Speedup): Average | Large Matrices (GPU Speedup): Maximum |
|---|---|---|---|---|
| Server 1 (AMD EPYC 7713 + A40) | | | | |
| Server 2 (Intel Xeon Gold 6230 + A40) | | | | |
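A hedged sketch of the Table 6 ratios, assuming Equations (12a) and (12b) reduce to simple time ratios between the two devices; the sample values are illustrative only.

```python
# GPU-vs-CPU ratios for small and large matrices.
def gpu_slowdown(t_gpu, t_cpu):
    # > 1 means the GPU is slower than the CPU (typical for small matrices)
    return t_gpu / t_cpu

def gpu_speedup(t_gpu, t_cpu):
    # > 1 means the GPU is faster than the CPU (typical for large matrices)
    return t_cpu / t_gpu

print(gpu_slowdown(2.4e-4, 6.0e-5), gpu_speedup(1.1e-3, 5.5e-3))
```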
Table 7.
Acceleration over baseline CSR after Model S format prediction. CSR acceleration is defined in Equation (13).
| Platform | CPU: Average | CPU: Maximum | GPU: Average | GPU: Maximum |
|---|---|---|---|---|
| Server 1 (AMD EPYC 7713 + A40) | | | | |
| Server 2 (Intel Xeon Gold 6230 + A40) | | | | |
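A hedged sketch of the Table 7 metric, assuming Equation (13) defines CSR acceleration as the CSR baseline time divided by the time achieved with the format chosen by Model S; overhead accounting follows the paper, and the numbers here are illustrative.

```python
# Acceleration of the predicted format over the CSR baseline.
def csr_acceleration(t_csr_baseline, t_predicted_format):
    return t_csr_baseline / t_predicted_format

print(csr_acceleration(3.2e-4, 2.5e-4))   # > 1 means the predicted format wins
```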
Table 8.
Robustness test on CPU1 across five rounds. Accuracy and speedup are compared among traditional, 2D, and 3D clustering labels.
| Round | Accuracy (%): Traditional | Accuracy (%): 2D | Accuracy (%): 3D | Speedup: Traditional | Speedup: 2D | Speedup: 3D |
|---|---|---|---|---|---|---|
| 1 | 74.33 | 82.41 | 81.15 | 0.81× | 1.17× | 1.15× |
| 2 | 73.29 | 80.34 | 82.10 | 0.76× | 1.10× | 1.10× |
| 3 | 70.82 | 81.63 | 84.82 | 0.92× | 1.02× | 1.26× |
| 4 | 69.16 | 76.27 | 83.87 | 1.03× | 0.97× | 1.20× |
| 5 | 72.67 | 77.47 | 84.80 | 0.95× | 1.00× | 1.23× |
Table 9.
Robustness test on CPU2 across five rounds. Accuracy and speedup are compared among traditional, 2D, and 3D clustering labels.
| Round | Accuracy (%): Traditional | Accuracy (%): 2D | Accuracy (%): 3D | Speedup: Traditional | Speedup: 2D | Speedup: 3D |
|---|---|---|---|---|---|---|
| 1 | 71.20 | 79.92 | 82.33 | 1.00× | 1.13× | 1.16× |
| 2 | 67.52 | 80.43 | 80.06 | 1.02× | 1.02× | 1.10× |
| 3 | 74.36 | 83.78 | 84.65 | 0.95× | 1.18× | 1.16× |
| 4 | 70.05 | 78.60 | 81.17 | 0.82× | 1.07× | 1.00× |
| 5 | 66.39 | 81.97 | 85.31 | 0.79× | 1.01× | 1.20× |
Table 10.
Robustness test on GPU across five rounds. Accuracy and speedup are compared among traditional, 2D, and 3D clustering labels.
| Round | Accuracy (%): Traditional | Accuracy (%): 2D | Accuracy (%): 3D | Speedup: Traditional | Speedup: 2D | Speedup: 3D |
|---|---|---|---|---|---|---|
| 1 | 91.20 | 95.30 | 96.10 | 1.03× | 1.05× | 1.25× |
| 2 | 89.76 | 96.75 | 97.24 | 1.12× | 1.20× | 1.16× |
| 3 | 74.36 | 94.87 | 97.52 | 0.92× | 1.17× | 1.14× |
| 4 | 70.05 | 95.81 | 95.43 | 0.83× | 1.27× | 1.22× |
| 5 | 66.39 | 97.19 | 96.70 | 1.05× | 1.21× | 1.31× |