Abstract
Sparse matrix–vector multiplication (SpMV) is a fundamental kernel in high-performance computing (HPC) whose efficiency depends heavily on the storage format across central processing unit (CPU) and graphics processing unit (GPU) platforms. Conventional supervised approaches often use execution time as training labels, but our experiments on 1786 matrices reveal two issues: labels are unstable across runs due to execution-time variability, and single-label assignment overlooks cases where multiple formats perform similarly well. We propose a dynamic labeling strategy that assigns a single label when the fastest format shows clear superiority, and multiple labels when performance differences are small, thereby reducing label noise. We further extend feature analysis to multi-dimensional structural descriptors and apply clustering to refine label distributions and enhance prediction robustness. Experiments demonstrate 99.2% accuracy in hardware (CPU/GPU) selection and up to 98.95% accuracy in format prediction, with up to 10% robustness gains over traditional methods. Under cost-aware, end-to-end evaluation that accounts for feature extraction, prediction, conversion, and kernel execution, CPUs achieve speedups up to 3.15× and GPUs up to 1.94× over a CSR baseline. Cross-round evaluations confirm stability and generalization, providing a reliable path toward automated, cross-platform SpMV optimization.
1. Introduction
Sparse matrix–vector multiplication (SpMV) is a fundamental computational kernel that is widely used in scientific computing, graph analysis [1,2,3,4], and machine learning [5,6,7,8,9]. The efficiency of SpMV is strongly influenced by the chosen storage format, such as Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), and Block Sparse Row (BSR). Each format exhibits different performance according to the sparsity pattern, the hardware platform, and the memory hierarchy [10,11], making the automatic selection of an optimal format a challenging problem [9,12,13]. Practically, robust automatic format/device selection reduces manual tuning and prevents negative returns due to format-conversion overheads on heterogeneous CPU–GPU systems. As illustrated in Figure 1, these formats adopt different indexing strategies, which directly affect memory access efficiency and computational performance.
Figure 1.
Illustration of four common sparse matrix storage formats: COO, CSR, CSC, and BSR. The figure contains four blocks that visualize the indexing and data-layout pattern of each format.
Over the past decade, significant research has been devoted to optimizing SpMV performance across CPUs and GPUs. Early studies investigated static heuristics and handcrafted rules for format selection, but these approaches often failed to generalize to diverse datasets. More recent works have explored adaptive and machine learning-based methods. For instance, Zhao et al. proposed bridging deep learning models with sparse matrix format selection to improve portability and robustness [13]. Similarly, Zhou et al. developed DTSpMV, an adaptive framework for graph analysis that dynamically selects formats at runtime [14]. In parallel, Stylianou and Weiland emphasized the importance of performance portability by leveraging dynamic sparse matrices in portable linear algebra operations [15].
In addition to adaptive frameworks, surveys have systematically reviewed the landscape of SpMV optimization, highlighting the trade-offs among computation, memory access, and preprocessing overhead [7,9,10]. Other works, such as that by Wang et al., focused on overlapping memory copy and computation to further enhance GPU SpMV performance [16], while Ashoury et al. presented Auto-SpMV, a fully automated optimization pipeline for GPU kernels [17]. Moreover, Chen et al. proposed hybrid storage formats for CPUs that adaptively combine different representations to maximize throughput [18]. Our prior work [19] was the first to introduce a dynamic-labeling-based classification method (DyLaClass), which achieved high classification accuracy and robust performance on CPU platforms. Collectively, these studies underline the necessity of flexible and adaptive strategies for exploiting heterogeneous architectures.
Despite these advances, two persistent challenges remain. First, the reproducibility of “best format” labels is problematic: repeated experiments on the same matrix may yield different optimal formats due to the variability of the execution time. Second, current methods often assign a single best format, overlooking cases where multiple formats perform similarly well. These issues are illustrated in Figure 2, where two rounds of execution (each round averages 50 runs) demonstrate that the optimal formats of four real matrices can change between rounds. Furthermore, for matrices such as mesh1em1.mtx, several formats achieve nearly identical performance, highlighting that assigning only one “best format” may oversimplify the problem and reduce prediction robustness.
Figure 2.
Reproducibility and label-consistency issues for CPUs (left) and GPUs (right). In both cases, two rounds of execution (each averaging 50 runs) show that the optimal format for the same matrix may change across rounds. Moreover, for matrices such as mesh1em1.mtx, multiple formats perform nearly equally well, indicating that assigning a single “best format” can be insufficient.
To address these limitations, our work makes the following contributions:
- (1)
- We generalize dynamic labeling to a true multi-label setting. For matrix $A$ and format set $\mathcal{F}$, we define $L(A) = \{f \in \mathcal{F} : T_e(f) \le (1+\theta)\,T_{\min}\}$, which admits any number of near-optimal formats within threshold $\theta$ (not just one runner-up).
- (2)
- We propose clustering-based consolidation in feature spaces beyond 2D. Besides the 2D plane, we use 3D (and optionally higher-dimensional) spaces formed by the top F-score features to refine label distributions. Fixed-label samples provide class centroids; multi-label samples are assigned to the nearest centroid among their candidate labels, reducing label noise and improving classifier stability.
- (3)
- We extend the study from CPUs to a unified CPU–GPU pipeline. We first select the device (Model S) and then the format, explicitly accounting for prediction time ($T_p$), conversion time ($T_c$), and execution time ($T_e$), with feature extraction time ($T_{fe}$) executed in parallel with Model S. On a dataset of 1786 matrices, we show that multi-label dynamic labeling plus feature-based clustering substantially improves prediction stability and accuracy on both CPUs and GPUs.
2. Materials and Methods
Before diving into the details of dynamic labeling, feature extraction, and classification, it is essential to provide an overview of the entire experimental workflow. Figure 3 presents the overall framework: starting from an input sparse matrix, the system first decides whether the computation should be executed on the CPU or GPU. Next, feature extraction is performed to assist format prediction. To reduce overhead, feature extraction is executed in parallel with the Model S prediction (indicated by the dashed box in the figure). If the original format of the input matrix is already the optimal one, the system proceeds directly to computation; otherwise, the matrix is converted into the predicted format before the SpMV execution. This workflow integrates format prediction, dynamic labeling, and clustering-based optimization into a unified framework, providing the foundation for the subsequent sections.
Figure 3.
Overall workflow of the proposed framework. The pipeline starts with device selection (CPU/GPU), followed by feature extraction and model prediction (executed in parallel, as highlighted by the blue dashed box). Depending on whether the original format is optimal, the matrix is either executed directly or converted to a new format before SpMV.
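As a concrete illustration of the scheduling in Figure 3, the following Python sketch overlaps feature extraction with the Model S device decision using a thread pool. All callables (`extract_features`, `model_s`, `model_p`, `convert`, `spmv`) are hypothetical stand-ins for the pipeline components, not our exact implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(matrix, model_s, model_p, extract_features, convert, spmv):
    """Sketch of the cost-aware pipeline in Figure 3 (all callables are illustrative)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Feature extraction (T_fe) overlaps the Model S decision (T_s),
        # so their joint latency contribution is max(T_fe, T_s).
        feat_future = pool.submit(extract_features, matrix)
        device = model_s.predict_device(matrix)   # assumes cheap metadata only (rows, cols, nnz)
        features = feat_future.result()
    fmt = model_p.predict_format(features, device)   # T_p
    if fmt != matrix.format:                         # skip conversion if already optimal
        matrix = convert(matrix, fmt)                # T_c
    return spmv(matrix, device)                      # T_e
```

The early exit when the input format is already optimal mirrors the branch in Figure 3 that avoids paying $T_c$ unnecessarily.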
2.1. Problem Formulation
Sparse matrix–vector multiplication (SpMV) is highly sensitive to the choice of storage format (e.g., COO/CSR/CSC/BSR) and to the target device (CPU or GPU). We model the end-to-end latency by explicitly accounting for (i) device selection, (ii) feature extraction, (iii) format prediction, (iv) format conversion, and (v) SpMV execution. Importantly, the feature extraction time and the device selection time proceed in parallel in our pipeline.
For a given sparse matrix $A$, the total latency is

$$T(A) = \max(T_s, T_{fe}) + T_p + T_c + T_e \quad (1)$$

where
- $T_s$: device selection time (Model S; CPU vs. GPU);
- $T_{fe}$: feature extraction time;
- $T_p$: prediction time of the format-selection model (Model P);
- $T_c$: format conversion time (if needed);
- $T_e$: SpMV execution time.
Compared with formulations that sum $T_{fe}$ and $T_s$, our pipeline executes $T_{fe}$ and $T_s$ in parallel, reducing their contribution to $\max(T_s, T_{fe})$ and reflecting the actual scheduling in Figure 3.
Let $\mathcal{F}$ be the set of formats and $F$ be the feature space. For a candidate format $f \in \mathcal{F}$ and feature vector $x(A) \in F$, the end-to-end cost becomes

$$T(A, f, s) = \max(T_s, T_{fe}) + T_p + T_c(f, s) + T_e(f, s) \quad (2)$$

where $T_p$ depends on $x(A)$ (Model P uses the extracted features), while $T_c$ and $T_e$ depend on $f$ and the selected device $s$. The optimization target is then

$$f^*(A) = \arg\min_{f \in \mathcal{F}} T(A, f, s) \quad (3)$$
In the sequel, we use $T_s$, $T_{fe}$, $T_p$, $T_c$, $T_e$, and $T$ for brevity to denote the corresponding terms in Equation (2).
2.2. Design of Model S
The first step of our framework is to determine whether a sparse matrix should be executed on the CPU or GPU. Small matrices tend to favor CPUs because feature-extraction and format-conversion overheads are comparable to execution time, whereas large matrices benefit from GPU throughput; Model S encodes this regime-aware decision. This decision is crucial because the efficiency of sparse matrix–vector multiplication (SpMV) depends not only on the selected storage format but also on the underlying hardware platform. Executing on the wrong platform may result in significant overhead due to suboptimal utilization of memory bandwidth, cache hierarchy, or parallelism. Therefore, designing an effective model to select the appropriate hardware platform, referred to as Model S, is a fundamental component of our framework.
To construct Model S, we collected execution time data for 1786 sparse matrices on both CPU and GPU platforms. The execution time includes four components: feature extraction ($T_{fe}$), the prediction overhead of Model S ($T_s$), format conversion ($T_c$), and SpMV execution time ($T_e$). Since $T_{fe}$ and $T_s$ (Model S inference) are executed in parallel, their contribution is considered as $\max(T_{fe}, T_s)$ rather than a simple sum. The overall time cost is thereby defined as

$$T = \max(T_{fe}, T_s) + T_c + T_e \quad (4)$$

where the role of Model S is to minimize $T$ by predicting whether a given matrix should be executed on the CPU or GPU.
The impact of hardware selection is illustrated by two representative cases. For the small matrix mesh1em1.mtx, COO execution on the GPU takes nearly 200 times longer than on the CPU due to the dominant cost of data transfer. In contrast, for the large matrix nd6k.mtx, CSC execution on the CPU is nearly 8 times slower than on the GPU, where massive parallelism compensates for conversion overhead. These examples clearly demonstrate the necessity of Model S to select the proper platform according to matrix characteristics.
To identify the key determinants of hardware selection, we applied XGBoost for feature importance analysis across log-transformed structural attributes. Figure 4(1) shows that the number of nonzeros (nnz) and the number of rows (rows) are the dominant features, followed by row bounce and row variance. These features effectively describe the sparsity pattern and workload distribution, both of which strongly impact hardware efficiency.
Figure 4.
(1) Feature importance ranking of log-transformed attributes for CPU vs. GPU selection using XGBoost. (2) Distribution of $\log_{10}(\mathrm{nnz})$ for CPU vs. GPU execution preference across different hardware platforms: (2a) AMD EPYC 7713 (CPU) and NVIDIA A40 (GPU); (2b) Intel(R) Xeon(R) Gold 6230 CPU @ 2.10 GHz (CPU) and NVIDIA A40 (GPU).
To further validate the impact of nnz, we compared CPU and GPU performance across different hardware platforms. Figure 4(2) illustrates the density distribution of $\log_{10}(\mathrm{nnz})$. Figure 4(2a) corresponds to AMD EPYC 7713 (CPU) with NVIDIA A40 (GPU), while Figure 4(2b) corresponds to an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10 GHz with NVIDIA A40 (GPU). A clear boundary can be observed in both cases: smaller matrices are consistently more efficient on CPUs, whereas larger matrices benefit from GPU acceleration. Although the exact threshold varies across platforms, the overall trend is robust.
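A minimal sketch of this importance analysis with the XGBoost Python API follows; the hyperparameters and the helper name `rank_device_features` are illustrative assumptions, not the exact configuration used in our experiments.

```python
import numpy as np
from xgboost import XGBClassifier

def rank_device_features(X, y, feature_names):
    """X: (n_matrices, n_features) log-transformed structural attributes.
    y: 0 if the CPU is faster end-to-end, 1 if the GPU is faster."""
    clf = XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
    clf.fit(X, y)
    order = np.argsort(clf.feature_importances_)[::-1]  # most important first
    return [(feature_names[i], float(clf.feature_importances_[i])) for i in order]
```

Applied to our data, a ranking of this kind places nnz and rows at the top, consistent with Figure 4(1).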
Once the hardware platform is determined by Model S, the next challenge is to select the most suitable storage format for SpMV. However, repeated experiments reveal that the reproducibility of the “best format” is problematic, motivating us to propose a dynamic labeling strategy. This strategy, combined with clustering-based refinement, will be introduced in the following sections.
2.3. Dynamic Multi-Labeling and Classification
Generalized Dynamic Labels
In our previous work, we categorized matrices into fixed-label matrices (FLMs) and mixed-label matrices (MLMs).
Let $L(A)$ denote the label set of matrix $A$, determined by its performance across multiple storage formats. If $|L(A)| = 1$, $A$ is considered a fixed-label matrix (FLM), where a single format significantly outperforms the others. If $|L(A)| > 1$, $A$ is categorized as a multi-label matrix (MLM), where multiple formats achieve comparable execution times within a given threshold. This definition provides the foundation for our dynamic labeling strategy.
To illustrate, consider two representative matrices as shown in Figure 2. For freeFlyingRobot.mtx, the CSR format exhibits a total execution time that is much smaller than those of the other three formats; thus its label set satisfies $L(A) = \{\mathrm{CSR}\}$, making it a fixed-label matrix (FLM). In contrast, for mesh1em1.mtx, after two rounds of experiments (each round averaging 50 runs), the total execution times of COO, CSR, CSC, and BSR are extremely close. In this case, $L(A)$ includes all four formats, making it a multi-label matrix (MLM).
In our previous work, only the two fastest formats among the multi-label candidates were preserved, but in this study we generalize the scheme to allow more than two labels, thereby accommodating cases where three or even all four formats perform similarly well. For visualization in Figure 5 and Figure 6, we plot the three axes as $\log_{10}(\mathrm{nnz})$, $\log_{10}(\mathrm{rows})$, and $\log_{10}(\mathrm{cols})$; the actual clustering operates on standardized (z-score) features. The clustering visualization belongs to the preprocessing/label-generation stage and is not part of end-to-end timing.
Figure 5.
Three-dimensional distributions of matrix formats on CPUs in the feature space (log10(nnz), log10(rows), log10(cols)). (a) Before clustering on AMD EPYC 7713; (b) After clustering on AMD EPYC 7713; (c) Before clustering on Intel Xeon Gold 6230; (d) After clustering on Intel Xeon Gold 6230. Dynamic clustering drives samples to stable fixed-label centroids and clarifies inter-class boundaries.
Figure 6.
Three-dimensional feature space distribution of formats on GPUs (left: before clustering; right: after clustering). Only CSR and CSC remain, with CSC dominating (∼80%). After clustering, some CSR points are absorbed into the CSC cluster in the upper-right region. COO is unsuitable for GPUs due to its high atomic operation and synchronization costs, while BSR/DIA suffer from high conversion overhead in general datasets.
In this work, we extend the scheme to a generalized multi-mixed label strategy, where all formats with execution times within the defined threshold are retained as valid labels. Formally, let $\mathcal{F}$ denote the set of supported formats (e.g., COO, CSR, CSC, BSR), and let $T_e(f)$ denote the execution time of format $f \in \mathcal{F}$ for a given matrix. We define the minimum execution time as

$$T_{\min} = \min_{f \in \mathcal{F}} T_e(f) \quad (5)$$

A format $f$ is selected as a valid label if its execution time satisfies

$$T_e(f) \le (1 + \theta)\, T_{\min} \quad (6)$$

where $\theta$ is the predefined threshold.
When only one format meets Equation (6), the matrix is considered as a fixed-label matrix (FLM). When multiple formats satisfy the condition, the matrix is considered as a mixed-label matrix (MLM), extended here to allow multiple valid labels beyond two.
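The label-set construction of Equations (5) and (6) reduces to a few lines of Python; the sketch below is illustrative, and the default value of `theta` is an assumption rather than the threshold tuned in our experiments.

```python
def dynamic_labels(times, theta=0.1):
    """Return the valid label set L(A) per Equations (5)-(6).

    times: dict mapping format name -> measured mean execution time T_e(f)
    theta: relative near-optimality threshold (illustrative default)
    """
    t_min = min(times.values())                      # Equation (5)
    labels = {f for f, t in times.items()
              if t <= (1.0 + theta) * t_min}         # Equation (6)
    kind = "FLM" if len(labels) == 1 else "MLM"
    return labels, kind

# Example: nearly tied timings yield a multi-label matrix (MLM)
labels, kind = dynamic_labels({"COO": 1.02, "CSR": 1.00, "CSC": 1.05, "BSR": 1.40})
```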
2.4. Dynamic Labeling and Clustering
Given the set-valued labels defined in Section 2.3, we cluster in 2D/3D feature space around fixed-label centroids and assign multi-label matrices to their nearest centroid. In addition to the traditional 2D plane formed by $\log_{10}(\mathrm{rows})$ and $\log_{10}(\mathrm{nnz})$, we also consider 3D feature spaces composed of the top-ranked features identified by F-score analysis (e.g., $\log_{10}(\mathrm{nnz})$, $\log_{10}(\mathrm{rows})$, and $\log_{10}(\mathrm{cols})$). This clustering step reduces label noise, aligns multi-label matrices with stable fixed clusters, and improves classification robustness.
Figure 7 illustrates this process. In the left subfigure, multi-label matrix A (white circle) initially possesses multiple candidate labels due to similar execution times. Through clustering, it is assigned to the nearest fixed-label centroid (colored points). The middle subfigure demonstrates the use of a local radius to filter out dispersed points: the local radius, denoted by r, is drawn as the dashed circle, ensuring that anomalies outside this radius do not distort the clustering. The right subfigure shows how A eventually converges toward the cluster with the majority of consistent labels, thereby improving stability and reducing noise.
Figure 7.
Illustration of the clustering process for dynamic labels in a 3D feature space. Left: a multi-label matrix A is initially assigned to its nearest fixed-label centroid. Middle: a local radius r (dashed circle) filters dispersed points; only one neighboring point lies within the radius. Right: A obtains a stable assignment to the label with the majority and the smallest mean distance.
Formally, the local radius $r$ in a 3D (or higher-dimensional) feature space is defined as

$$r = \alpha \sqrt{\sum_{f \in F} \Bigl( \max_{B \in N} x_f(B) - \min_{B \in N} x_f(B) \Bigr)^2} \quad (7)$$

where $x_f(B)$ is the value of feature $f$ for matrix $B$, $N$ is the set of all matrices in the feature space, and $\alpha$ is a scaling factor. This formulation generalizes naturally to 3D or higher dimensions depending on the chosen feature set $F$.
In our experiments, we empirically selected the value of $\alpha$ that provides the best trade-off between robustness and sensitivity, effectively reducing the influence of dispersed points.
Finally, the refined labels are used for classification. Given matrix $A$ with feature vector $x(A)$ and candidate labels $L(A)$, the classification decision is defined as

$$\hat{l}(A) = \arg\min_{l \in L(A)} d\bigl(x(A), \mu_l\bigr) \quad (8)$$

where $\mu_l$ is the centroid of class $l$, and $d(\cdot,\cdot)$ is the Euclidean distance in the chosen feature space. By considering all candidate labels, this approach avoids the false labeling that can occur when only the top-performing format is used as ground truth, ensuring robust and reliable classification.
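A NumPy sketch of this consolidation step follows, under the assumption that fixed-label matrices define the class centroids and each multi-label matrix is resolved among its own candidates per Equation (8); the fallback branch is an illustrative guard, not part of the formal definition.

```python
import numpy as np

def consolidate_labels(X_flm, y_flm, X_mlm, candidate_sets):
    """Resolve each multi-label matrix to its nearest candidate centroid (Equation (8)).

    X_flm, y_flm:   features and single labels of fixed-label matrices (FLMs)
    X_mlm:          features of multi-label matrices (MLMs)
    candidate_sets: the label set L(A) of each multi-label matrix
    """
    y_flm = np.asarray(y_flm)
    centroids = {c: X_flm[y_flm == c].mean(axis=0) for c in np.unique(y_flm)}
    resolved = []
    for x, candidates in zip(X_mlm, candidate_sets):
        dists = {c: np.linalg.norm(x - centroids[c])     # Euclidean distance d(x, mu_l)
                 for c in candidates if c in centroids}
        # fall back to an arbitrary candidate if none has a fixed-label centroid
        resolved.append(min(dists, key=dists.get) if dists else next(iter(candidates)))
    return resolved
```

In practice, the local-radius filter of Equation (7) would be applied before the centroid assignment to exclude dispersed neighbors, as depicted in the middle panel of Figure 7.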
2.5. Feature Set
In this work, we employ a set of structural features extracted from sparse matrices to guide format prediction and clustering. Table 1 summarizes the abbreviations and full names of these features.
Table 1.
List of extracted structural features.
These features capture both global characteristics (such as matrix dimensions, density, and diagonal structure) and local statistical properties (such as row/column distributions and variations). By integrating them into the learning and clustering framework, we provide a richer representation of sparsity patterns, thereby enabling more accurate predictions of the optimal format across CPU and GPU platforms.
2.6. 3D Feature Space Before/After Dynamic Clustering (CPU)
We leverage the feature set introduced earlier and the labels refined by dynamic labeling and clustering to build a 3D predictive view. Concretely, we project matrices into a three-dimensional space using the top-correlated features $\log_{10}(\mathrm{nnz})$, $\log_{10}(\mathrm{rows})$, and $\log_{10}(\mathrm{cols})$, and visualize the class separation before and after clustering. As shown in Figure 5, panels (a) and (c) depict the raw distributions prior to clustering, while panels (b) and (d) show the results after applying the dynamic clustering procedure. The top row corresponds to an AMD EPYC 7713 CPU, and the bottom row corresponds to an Intel Xeon Gold 6230 CPU. Before clustering, samples of different formats overlap substantially, leading to fuzzy boundaries. After clustering, points aggregate around stable fixed-label centroids, the inter-class margins become clearer, and label noise is reduced, resulting in more robust and discriminative separation in the 3D feature space.
2.7. GPU-Side 3D Distributions
On the GPU platform, dynamic labeling and clustering exhibit a distribution pattern that is distinctly different from the CPU side, as shown in Figure 6. Initially, formats such as COO, BSR, and DIA are almost entirely eliminated, leaving only CSR and CSC. Among them, CSC dominates with approximately 80% of the samples. After clustering (right), a small portion of CSR samples in the upper-right region of the 3D feature space merge into the CSC cluster, resulting in clearer class boundaries.
The underlying causes are strongly related to the implementation characteristics of each format on GPUs: (i) COO, while storage-simple, requires atomic additions during SpMV, leading to high thread conflicts and synchronization overhead, which makes it significantly slower than CSR/CSC for large matrices; (ii) BSR/DIA can be efficient for highly structured matrices, but in general datasets their conversion costs are large and optimized GPU kernels are limited, so they rarely emerge as the best choice; (iii) CSR/CSC benefit from mature GPU implementations and better exploit memory bandwidth and thread parallelism. Among them, CSC tends to perform better for large, uniformly distributed sparsity patterns, which explains its dominance.
This result highlights the strong dependency of format selection on hardware characteristics and shows that optimal formats on GPUs differ significantly from those on CPUs.
3. Results
Scope of this section. We first summarize the experimental platforms and datasets, and then evaluate (i) the hardware-selection accuracy of Model S, (ii) the format-prediction accuracy and robustness under dynamic multi-labeling with clustering, and (iii) the end-to-end performance via slowdown/speedup ratios.
3.1. Experimental Platform
All experiments were conducted on two servers equipped with heterogeneous CPU processors but with identical GPU resources to ensure fair comparison. Each server was configured with NVIDIA GPUs, including GeForce RTX 3090 (24 GB), RTX 3060 (12 GB), and professional-grade A40 (48 GB). For reproducibility, all SpMV kernels were executed on the same GPU architecture to eliminate hardware bias. Table 2 summarizes the hardware setup. Our experimental dataset consists of 1786 sparse matrices obtained from the University of Florida Sparse Matrix Collection (UFSMC, also known as TAMU) [20].
Table 2.
Hardware configuration of experimental platforms.
The software stack includes Python 3.8/3.9, CuPy (v12.3.0), NumPy (v1.24.4), and SciPy (v1.11.3) under conda-managed environments. All experiments were conducted using CUDA toolkit 12.4 and NVIDIA driver 550.144.03 (NVIDIA Corporation, Santa Clara, CA, USA) to ensure reproducibility.
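Given this stack, per-format timings of the kind used to build labels (averages over 50 runs per round; see Section 1) can be collected along the following lines. The harness below is an illustrative CPU sketch using SciPy, not our exact measurement code; the GPU path would use the corresponding CuPy sparse classes.

```python
import time
import numpy as np
import scipy.sparse as sp

def time_formats(A_csr, n_runs=50):
    """Average SpMV time per storage format for one matrix (CPU sketch)."""
    x = np.ones(A_csr.shape[1])
    results = {}
    for name, conv in [("COO", A_csr.tocoo), ("CSR", A_csr.tocsr),
                       ("CSC", A_csr.tocsc), ("BSR", A_csr.tobsr)]:
        M = conv()                        # conversion cost is timed separately (T_c)
        t0 = time.perf_counter()
        for _ in range(n_runs):
            _ = M @ x                     # SpMV kernel (T_e)
        results[name] = (time.perf_counter() - t0) / n_runs
    return results
```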
We summarize the corpus characteristics over 1786 matrices in Table 3, reporting the distributions of rows (n), columns (m), and nonzeros (nnz).
Table 3.
Dataset profile over 1786 matrices: summary statistics for rows (n), columns (m), and nonzeros (nnz).
3.2. Model S Evaluation
To evaluate the effectiveness of Model S in hardware selection, we first verify its ability to distinguish whether a given sparse matrix should be executed on the CPU or GPU. The decision outcome is represented by Rule: Rule = 0 (CPU) indicates that the matrix is classified as better suited for CPU execution, while Rule = 1 (GPU) indicates that the GPU is predicted to be the more efficient platform.
Figure 8 illustrates the evaluation results. The left part shows the distribution of performance margins, defined as $\log_{10}(T_{\mathrm{CPU}}) - \log_{10}(T_{\mathrm{GPU}})$, grouped by Rule choice. A margin near zero indicates that both hardware platforms yield similar performance, while negative or positive values correspond to performance advantages on CPUs or GPUs, respectively. The results demonstrate that when Model S selects the GPU (Rule = 1), the margin is generally positive, meaning that the GPU offers a significant speedup, whereas when the CPU is chosen (Rule = 0), the performance gap remains very small, confirming that Model S makes stable and reliable hardware decisions. It is also noteworthy that the margin distribution on the GPU side exhibits a wider variance, indicating that execution time differences are more sensitive to changes in matrix scale and structural features when running on GPUs.
Figure 8.
Evaluation of Model S for hardware selection. (Left) Margin distribution validates that hardware decisions are stable and consistent. (Right) Feature-plane distribution shows the separation between CPU- and GPU-preferred matrices.
The right part further illustrates the decision boundary by showing the distribution of Rule = 0 and Rule = 1 over the feature planes spanned by $(\log_{10}(\mathrm{rows}), \log_{10}(\mathrm{nnz}))$ and $(\log_{10}(\mathrm{cols}), \log_{10}(\mathrm{nnz}))$. The figure reveals a clear separation: matrices with smaller sizes and lower nnz tend to favor CPU execution, while larger-scale matrices shift toward GPUs, demonstrating the effectiveness of the selected features in guiding the hardware decisions.
In this experiment, the hardware selection of Model S is guided by structural features extracted from the input matrices, including the number of rows ($n$), the number of columns ($m$), the number of nonzeros ($nnz$), density, row- and column-wise statistics (mean, max, min, and standard deviation), and derived metrics such as effective ratios ($nnz/n$, $nnz/m$) and row/column balance. Among these, $nnz$ and $n$ were found to be the most influential, consistent with the observation that smaller, sparser matrices are better handled by CPUs, while larger, denser matrices benefit more from GPU acceleration.
Hardware-Selection Accuracy (Definition)
For completeness, we formalize the metric used for hardware decisions as follows:

$$\mathrm{Acc}_S = \frac{1}{|\mathcal{M}|} \sum_{A \in \mathcal{M}} \mathbb{1}\bigl[\hat{s}(A) = s^*(A)\bigr] \quad (9)$$

Here, $s^*(A)$ denotes the device that minimizes the end-to-end latency in Equation (1), and $\hat{s}(A)$ is the prediction.
These features form the input space for Model S to predict the appropriate hardware platform. Overall, Model S achieves a prediction accuracy of 98.2%, demonstrating its reliability in guiding hardware selection.
3.3. Evaluation Metrics
We report format-prediction accuracy (single-label and dynamic-label), robustness across repeated rounds, and speedup/slowdown ratios. The definition of hardware-selection accuracy is given at the end of Section 3.2 (Equation (9)). Let $\mathcal{M}$ denote the set of test matrices.
3.3.1. Format-Prediction Accuracy
In the single-label setting, the ground truth for $A \in \mathcal{M}$ is the single fastest format $f^*(A) = \arg\min_{f \in \mathcal{F}} T_e(f)$. In the dynamic-label setting, the ground truth is a near-optimal set $L(A)$ (Section 2.3). Denote $\hat{f}(A)$ as the predicted format; a prediction counts as correct when $\hat{f}(A) = f^*(A)$ (single-label) or $\hat{f}(A) \in L(A)$ (dynamic-label):

$$\mathrm{Acc}_P = \frac{1}{|\mathcal{M}|} \sum_{A \in \mathcal{M}} \mathbb{1}\bigl[\hat{f}(A) \in L(A)\bigr] \quad (10)$$
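Both settings reduce to a per-matrix membership check, as in this sketch (function name illustrative):

```python
def format_accuracy(preds, truths, dynamic=False):
    """preds: predicted format per matrix.
    truths: best format f*(A) (single-label) or near-optimal set L(A) (dynamic-label)."""
    hits = [(p in t) if dynamic else (p == t) for p, t in zip(preds, truths)]
    return sum(hits) / len(hits)
```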
3.3.2. Robustness Across Rounds
We train on the clustered labels from the initial round and evaluate the same trained model on $R$ additional rounds (new matrices/perturbed datasets). Robustness is summarized by the mean accuracy across rounds:

$$\overline{\mathrm{Acc}} = \frac{1}{R} \sum_{r=1}^{R} \mathrm{Acc}^{(r)} \quad (11)$$
3.3.3. Speedup/Slowdown Ratios
For the CPU–GPU contrast on the same matrix, the ratios are defined as follows:

$$\mathrm{Slowdown}_{\mathrm{GPU}} = \frac{T_{\mathrm{GPU}}}{T_{\mathrm{CPU}}}, \qquad \mathrm{Speedup}_{\mathrm{GPU}} = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}} \quad (12)$$

Accordingly, the values in the “Small Matrices” column are instances of $\mathrm{Slowdown}_{\mathrm{GPU}}$, whereas the “Large Matrices” column reports $\mathrm{Speedup}_{\mathrm{GPU}}$.
3.3.4. Acceleration over CSR
Acceleration over the default CSR baseline is defined as follows:

$$\mathrm{Speedup}_{\mathrm{CSR}} = \frac{T_{\mathrm{CSR}}}{T_{\mathrm{pred}}} \quad (13)$$

where $T_{\mathrm{pred}}$ denotes the end-to-end latency after hardware selection and format prediction.
3.4. Prediction Accuracy Analysis
Table 4 reports the classification accuracy of three models (DecisionTree, 2-layer ANN, and XGBoost) across two CPU platforms and one GPU platform, evaluated under both traditional single-label and proposed dynamic labeling schemes.
Table 4.
Classification accuracy (%) of different models before and after dynamic labeling and clustering. Accuracy is defined in Equation (10).
As shown in Table 4, dynamic labeling and clustering significantly improve prediction accuracy across all platforms and models. On CPU1, the accuracy of DecisionTree increased from 79.47% to 88.20%, while the 2-layer ANN improved from 83.40% to 90.06%. Similarly, CPU2 saw an average improvement of 6–8% across all models. On GPUs, the gains were even more remarkable: the DecisionTree accuracy rose from 92.34% to 98.27%, and XGBoost achieved the highest accuracy of 98.95%. Overall, dynamic labeling combined with clustering enhanced model robustness and reduced label noise, with accuracy improvements ranging from 5% to 10%.
These quantitative improvements are consistent with the visual evidence. As shown in Figure 5, dynamic labeling and clustering lead to clearer boundaries in the 3D feature space $(\log_{10}(\mathrm{nnz}), \log_{10}(\mathrm{rows}), \log_{10}(\mathrm{cols}))$. The per-format decomposition of Figure 5b, presented in Figure 9, further illustrates that the COO, CSR, CSC, and BSR formats occupy distinct and compact regions. This clear separation explains why CPU-side classification accuracy is significantly improved, as also reflected in Table 4.
Figure 9.
Decomposition of Figure 5b into individual format clusters in CPU feature space. (a) COO, (b) CSR, (c) CSC, (d) BSR. The clear separability among formats explains the higher prediction accuracy observed on CPU platforms.
As shown in Table 5, incorporating three-dimensional (3D) feature spaces significantly improves classification accuracy for CPU platforms compared to traditional two-dimensional (2D) clustering. For CPU1, the accuracy increases from 86.14% (2D, 3f) to 90.06% (3D, 3f), while for CPU2 the improvement is from 86.25% (2D, 3f) to 86.83% (3D, 3f). This demonstrates that 3D clustering is more effective in capturing the intrinsic structure of feature distributions, leading to more stable label assignments.
In contrast, GPU classification results show relatively small differences between 2D and 3D clustering, with accuracies already exceeding 97% across all feature combinations. This suggests that GPU label distributions are more concentrated, and additional feature dimensions provide only marginal improvements.
Overall, these results indicate that the benefit of 3D clustering is more pronounced on CPU platforms, where execution time variability and feature heterogeneity are greater. On GPU platforms, the already stable label distribution limits the additional gains from higher-dimensional feature spaces. Across additional rounds, the mean accuracy remains stable, indicating reproducibility under dynamic labeling; we therefore report the multi-round mean as the robustness measure.
3.5. Acceleration Effect Analysis
Before reporting ratios, Figure 10 visualizes how feature-extraction time ($T_{fe}$) relates to conversion/build time ($T_c$, proxy; left) and to execution time ($T_e$, minimal SpMV kernel time; right) over 1786 matrices. $T_{fe}$ is strongly correlated with both $T_c$ and $T_e$ (Pearson $r$ values are reported in the panel titles), reflecting their common dependence on matrix size/density. For small matrices, $T_{fe}$ is often of the same order as $T_e$ and thus non-negligible in the end-to-end latency; as matrices grow larger, $T_e$ increases more rapidly than $T_{fe}$, and $T_{fe}$ becomes a small fraction of $T$. With this end-to-end accounting in place, we next report slowdown/speedup relative to CSR and contrast our cost-aware pipeline with traditional kernel-time-only selectors, to clarify improvements in end-to-end latency.
Figure 10.
Scatter plots over 1786 matrices: (left) feature-extraction time $T_{fe}$ vs. conversion/build time ($T_c$, proxy); (right) $T_{fe}$ vs. execution time ($T_e$, minimal SpMV time). Each dot is one matrix; the solid line denotes $y = x$. Pearson $r$ values are shown in the panel titles. All times are measured with the same timing source under the end-to-end accounting (selection, features, prediction, conversion, execution).
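For reference, correlations of this kind can be recomputed with NumPy, assuming per-matrix timing arrays `t_fe`, `t_c`, and `t_e` (names illustrative); whether the figure's $r$ is computed on raw or log-transformed times is an assumption here, and this sketch uses the log scale suggested by the orders-of-magnitude spread.

```python
import numpy as np

def pearson_log(t_a, t_b):
    """Pearson r between two per-matrix timing arrays on a log10 scale."""
    return float(np.corrcoef(np.log10(t_a), np.log10(t_b))[0, 1])

# e.g., pearson_log(t_fe, t_c) for the left panel, pearson_log(t_fe, t_e) for the right
```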
While the previous section focused on prediction accuracy, here we analyze the acceleration effect of hardware selection across small and large sparse matrices. Table 6 reports the performance overhead or speedup of GPU execution compared with CPUs across two platforms. For small matrices, GPU execution is inefficient due to memory-transfer overheads: on both Server 1 (AMD EPYC 7713, Advanced Micro Devices, Inc., Santa Clara, CA, USA + NVIDIA A40, NVIDIA Corporation, Santa Clara, CA, USA) and Server 2 (Intel Xeon Gold 6230, Intel Corporation, Santa Clara, CA, USA + NVIDIA A40), the GPU is on average slower than the CPU, with pronounced worst-case slowdowns (Table 6).
In contrast, for large matrices, GPU execution significantly outperforms CPUs on both servers, and the advantage is even larger on Server 2, which achieves the higher average and maximum speedups.
These results demonstrate that the benefit of GPU computation strongly depends on matrix scale and characteristics. Hence, accurate hardware decision-making through Model S is crucial to avoid performance degradation on small matrices and to fully leverage GPU acceleration on large ones.
3.6. Acceleration over Baseline CSR
Beyond analyzing the raw CPU–GPU trade-offs, we further compare the acceleration achieved by Model S relative to the default CSR format. Table 7 summarizes the average and maximum speedups on both CPUs and GPUs across the two experimental platforms.
Table 7.
Acceleration over baseline CSR after Model S format prediction. CSR acceleration is defined in Equation (13).
On CPUs, both servers achieve average speedups above the CSR baseline, with maximum speedups reaching up to 3.15×; on GPUs, both servers likewise improve on CSR, with maximum speedups of up to 1.94× (Table 7).
These results indicate that even relative to the widely adopted CSR baseline, our hardware-aware format prediction framework delivers consistent and measurable improvements, validating the effectiveness of Model S.
To further evaluate the robustness of our method, we trained the model using the clustered labels from the initial round and tested it on five additional rounds of experiments. Table 8, Table 9, Table 10 summarize the results for CPU1, CPU2, and GPU platforms, respectively. Across all platforms, dynamic labeling with clustering (2D and 3D) consistently outperforms the traditional approach. On CPU1, 3D clustering improves accuracy by up to 14% (round 4) and achieves speedups as high as 1.26×. CPU2 exhibits a similar trend, though with slightly smaller gains. On the GPU platform, the effect is most pronounced: accuracy consistently exceeds 95%, and 3D clustering delivers speedups up to 1.31×. These results demonstrate the robustness of the proposed method—models trained with initial clustered labels generalize well to new rounds of data—and, taken together, they substantiate our accuracy–robustness–acceleration objectives under end-to-end accounting; Section 4 discusses practical implications, limitations, and future directions.
Table 8.
Robustness test on CPU1 across five rounds. Accuracy and speedup are compared among traditional, 2D, and 3D clustering labels.
Table 9.
Robustness test on CPU2 across five rounds. Accuracy and speedup are compared among traditional, 2D, and 3D clustering labels.
Table 10.
Robustness test on GPU across five rounds. Accuracy and speedup are compared among traditional, 2D, and 3D clustering labels.
4. Discussion
This study revisits SpMV format prediction on heterogeneous (CPU–GPU) platforms through four practical challenges that often cause instability or misleading conclusions: (C1) label instability across hardware/workloads, where the “best” format may fluctuate across repeated runs [12,13]; (C2) missing end-to-end cost accounting beyond kernel time (i.e., overlooking feature extraction and format conversion costs); (C3) conflating small- vs. large-matrix regimes in CPU–GPU comparisons; and (C4) the lack of a unified CPU–GPU pipeline that yields consistent device/format decisions across devices.
Recent work strengthens format selection and autotuning but typically addresses only subsets of the above challenges. GPU-centric predictors or CPU-only heuristics emphasize kernel execution while under-accounting conversion/feature costs [9,14,16,17,21,22,23]. Within the MDPI literature, Chen et al. propose an adaptive hybrid format strategy for multi-core SIMD CPUs that learns when to combine/choose formats [18], and Sanderson & Curtin show automatic switching among sparse formats within Armadillo, highlighting engineering trade-offs of conversion and storage [24]. The Overhead-Conscious Method (OCM) explicitly models format-conversion overhead and shows that keeping CSR can avoid negative returns at low iteration counts [25,26]; however, OCM relies on single-label procedures and does not include the feature-extraction cost, which may dominate for small matrices. By contrast, our pipeline jointly reasons about hardware selection (CPU vs. GPU), low-cost features, dynamic multi-labeling with clustering, format conversion, and execution in a unified heterogeneous setting.
Section 3 provides end-to-end evidence for these choices. The hardware selector (Model S) attains 98.2% hardware-selection accuracy (Section 3.2). Dynamic multi-labeling with clustering improves format-prediction accuracy by 5–10% over the single-label baseline (Section 3.4). Robustness is evaluated as multi-round reproducibility: the same trained model is tested on $R$ additional rounds and we report the mean accuracy across rounds (Section 3.6). To avoid regime conflation, we report $\mathrm{Slowdown}_{\mathrm{GPU}}$ for small matrices and $\mathrm{Speedup}_{\mathrm{GPU}}$ for large matrices, and we quantify acceleration over CSR as $\mathrm{Speedup}_{\mathrm{CSR}}$ (Sections 3.5 and 3.6).
Aligned with the four challenges above—and relative to OCM [25,26] and CPU-only/GPU-only predictors [9]—this work (a) optimizes end-to-end latency with explicit feature and conversion cost accounting, (b) stabilizes labels via dynamic multi-labeling + clustering (capturing near-optimal sets), and (c) separates small vs. large regimes in CPU–GPU comparisons; we further develop a unified heterogeneous pipeline with regime-aware reporting.
Explicitly modeling label uncertainty improves reproducibility and reduces overfitting to noisy measurements. Higher-dimensional feature clustering uncovers structure–hardware affinities: large matrices with high nonzeros-per-row density benefit more from GPU acceleration, whereas small matrices tend to favor CPUs due to transfer/conversion overheads. These observations are consistent with prior CPU/GPU-focused studies and the MDPI literature on adaptive/hybrid formats [18,24].
We currently evaluate four storage formats and do not cover distributed or multi-GPU settings. ELL width is chosen heuristically, and the near-optimal threshold $\theta$ may affect the dynamic-label set on edge cases. The UF/SMC corpus composition (sizes, sparsity ranges, application tags) may also introduce bias in format prevalence.
The pipeline can be integrated into heterogeneous HPC/ML/graph-analytics stacks to automate device/format choice with cost awareness. Promising directions include runtime adaptation with online re-labeling, expanding the format set (e.g., DIA/SELL-C/hybrid variants), learned cost models for features/conversion, and evaluation in distributed/multi-GPU environments; unified intermediate representations such as UniSparse [27] may further ease cross-format transformations.
Study Limitations
While the proposed pipeline achieves reproducible accuracy and end-to-end gains, several limitations should be noted.
- (1)
- Threshold/radius selection under distribution shift. We do not claim a universally optimal threshold or local-radius setting for dynamic/mixed labeling and clustering. Different matrix corpora can induce different label distributions, and clustering entails randomness (e.g., initialization, sampling). A principled, data-adaptive choice would require a separate methodological study and a large-scale sensitivity analysis, which are beyond the scope of this work.
- (2)
- Stochasticity of clustering and training. Label construction and learning involve stochastic procedures that can introduce run-to-run variance. We control randomness (fixed seeds, repeated trials) and report robust statistics, but a full variance decomposition across seeds/initializations is not included.
- (3)
- Scope and coverage. This study currently evaluates four storage formats and does not cover distributed or multi-GPU settings. Although feature-extraction and format-conversion costs are explicitly considered in our end-to-end evaluation, learning-based cost models for these components have not yet been integrated into the unified framework. This is a natural extension of the pipeline for future work.
5. Conclusions
In this paper, we take an end-to-end, cost-aware view of SpMV format prediction on heterogeneous (CPU–GPU) platforms. We unify hardware selection, low-cost feature extraction, dynamic multi-labeling with clustering, format conversion, and execution into a single pipeline, and evaluate it with regime-aware metrics.
Our contributions are threefold: (i) a dynamic multi-labeling strategy with clustering that stabilizes labels and improves format-prediction accuracy over a single-label baseline; (ii) a unified, cost-aware CPU–GPU pipeline that explicitly accounts for feature- and conversion-costs in addition to kernel time; and (iii) a regime-aware reporting protocol that separates small- and large-matrix cases (GPU slowdown for small; GPU speedup for large) and quantifies acceleration over CSR.
This design has a practical impact for heterogeneous HPC/ML/graph workloads, providing reproducible, cost-aware automation of device/format choices and clarifying when CPUs or GPUs are preferable and which formats dominate under realistic overheads.
Future work will extend the pipeline along several directions: runtime adaptation with online re-labeling; broader format sets (e.g., DIA/SELL-C/hybrid variants); learned cost models for features and conversion; and evaluation in distributed/multi-GPU settings, potentially aided by unified intermediate representations such as UniSparse.
Author Contributions
Conceptualization, Z.S.; Methodology, Z.S.; Software, Z.S. and X.S.; Validation, Z.S. and X.S.; Formal analysis, Z.S. and X.S.; Investigation, Z.S. and X.S.; Resources, Z.S.; Data curation, Z.S.; Writing—original draft, Z.S.; Writing—review & editing, Z.S., Y.Z. and X.S.; Visualization, Z.S.; Supervision, Z.S. and Y.Z.; Project administration, Z.S.; Funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding. The APC was funded by the authors.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Song, X.; Zou, Y.; Shi, Z.; Liu, Z. GIMS: Image Matching System Based on Adaptive Graph Construction and Graph Neural Network. Neural Netw. 2025, 193, 108030. [Google Scholar] [PubMed]
- Song, X.; Zou, Y.; Shi, Z. CaPGNN: Optimizing Parallel Graph Neural Network Training with Joint Caching and Resource-Aware Graph Partitioning. arXiv 2025, arXiv:2508.13716. [Google Scholar]
- Huang, G.; Dai, G.; Wang, Y.; Yang, H. Ge-spmm: General-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks. In Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 9–19 November 2020; pp. 1–12. [Google Scholar]
- Buluç, A.; Gilbert, J.; Shah, V. Implementing Sparse Matrices for Graph Algorithms. In Graph Algorithms in the Language of Sparse Linear Algebra; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2011. [Google Scholar]
- Song, X.; Zou, Y.; Shi, Z.; Yang, Y. Image matching and localization based on fusion of handcrafted and deep features. IEEE Sens. J. 2023, 23, 22967–22983. [Google Scholar] [CrossRef]
- Ahmad, M.; Sardar, U.; Batyrshin, I.; Hasnain, M.; Sajid, K.; Sidorov, G. A Machine Learning-Based Threads Configuration Tool for SpMV Kernels on GPU. Information 2024, 15, 685. [Google Scholar] [CrossRef]
- Bell, N.; Garland, M. Efficient sparse matrix-vector multiplication on CUDA. In Proceedings of the 2009 International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2009), Portland, OR, USA, 14–20 November 2009; pp. 1–11. [Google Scholar]
- Buluc, A.; Gilbert, J.R. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM J. Sci. Comput. 2012, 34, C170–C191. [Google Scholar] [CrossRef]
- Dufrechou, E.; Buluc, A.; Williams, S. Machine learning models for selecting sparse matrix format in GPU-based SpMV. Int. J. High Perform. Comput. Appl. 2021, 35, 475–488. [Google Scholar]
- Vuduc, R.W.; Demmel, J.W.; Yelick, K.A. Performance modeling for sparse matrix–vector multiplication. In Proceedings of the SC’02: International Conference for High Performance Computing, Networking, Storage and Analysis, Baltimore, MD, USA, 16 November 2002; pp. 1–15. [Google Scholar]
- Im, E.-J.; Yelick, K.A.; Vuduc, R.W. Optimizing sparse matrix–vector multiplication on emerging multicore platforms. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’07), San Jose, CA, USA, 14–17 March 2007; pp. 21–30. [Google Scholar]
- Zhou, W.; Li, J.; Song, S.; Yang, R.; Hu, C.; Sun, X.; Garraghan, P.; Wo, T.; Wen, Z.; Li, C. Performance-aware sparse matrix format selection based on modeling and prediction. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 2560–2574. [Google Scholar]
- Zhao, S.; Gao, X.; Xue, W.; Zhao, Y.; Li, J.; Liao, C.; Shen, X. Bridging the Gap between Deep Learning and Sparse Matrix Formats. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT 2017), Portland, OR, USA, 9–13 September 2017; pp. 933–942. [Google Scholar]
- Zhou, T.; Xiao, G.; Chen, Y.; Li, K. DTSpMV: An Adaptive SpMV Framework for Graph Analysis on GPUs. In Proceedings of the 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Hainan, China, 18–20 December 2022. [Google Scholar]
- Stylianou, C.; Weiland, M. Exploiting Dynamic Sparse Matrices for Performance Portable Linear Algebra Operations. arXiv 2022, arXiv:2209.06478. [Google Scholar]
- Zeng, G.; Zou, Y. Leveraging Memory Copy Overlap for Efficient SpMV on GPUs. Electronics 2023, 12, 3687. [Google Scholar] [CrossRef]
- Ashoury, A.; Elghazawi, T. AutoSpMV: An Automated Framework for Sparse Matrix-Vector Multiplication on GPUs. In Proceedings of the 2023 IEEE International Conference on Cluster Computing, Santa Fe, NM, USA, 31 October–3 November 2023; pp. 103–115. [Google Scholar]
- Chen, S.; Fang, J.; Xu, C.; Wang, Z. Adaptive Hybrid Storage Format for Sparse Matrix–Vector Multiplication on Multi-Core CPUs. Appl. Sci. 2022, 12, 9812. [Google Scholar] [CrossRef]
- Shi, Z.; Zou, Y. DyLaClass: Dynamic Labeling Based Classification for Optimal Sparse Matrix Format Selection in Accelerating SpMV. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 2624–2639. [Google Scholar] [CrossRef]
- Davis, T.A.; Hu, Y. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 2011, 38, 1–25. [Google Scholar] [CrossRef]
- Gao, X.; Chen, Z.; Wang, L. Adaptive Format and Kernel Selection for Efficient SpMV on GPUs. J. Parallel Distrib. Comput. 2024, 181, 22–34. [Google Scholar]
- Ahmad, S.; Khan, M. Information-Theoretic Feature Selection for Sparse Matrix Format Prediction. Information 2024, 15, 112. [Google Scholar]
- Ahmad, S. Hybrid Feature–Model Co-Design for Sparse Matrix Computation on Heterogeneous Platforms. Information 2025, 16, 18. [Google Scholar]
- Sanderson, C.; Curtin, R. Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation. Math. Comput. Appl. 2019, 24, 70. [Google Scholar] [CrossRef]
- Zhao, Y.; Zhou, W.; Shen, X.; Yiu, G. Overhead-conscious format selection for SpMV-based applications. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, 21–25 May 2018; pp. 950–959. [Google Scholar]
- Zhou, W.; Zhao, Y.; Shen, X.; Chen, W. Enabling runtime SpMV format selection through an overhead conscious method. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 80–93. [Google Scholar] [CrossRef]
- Liu, J.; Zhao, Z.; Ding, Z.; Brock, B.; Rong, H.; Zhang, Z. UniSparse: An intermediate language for general sparse format customization. Proc. ACM Program. Lang. 2024, 8, 99. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).