Elegante+: A Machine Learning-Based Optimization Framework for Sparse Matrix–Vector Computations on the CPU Architecture
Abstract
1. Introduction
- Building on our previous research, which introduced Elegante, a machine learning tool that predicts a near-optimal thread count for SpMV computations on shared-memory architectures, we now present Elegante+, an enhanced machine learning-based approach that predicts the best scheduling policy to further optimize SpMV execution. Unlike manual trial-and-error tuning, Elegante+ was evaluated against OpenMP's default scheduling policy and demonstrated improved performance across multiple thread counts.
- Elegante+ was trained and tested on a diverse dataset of nearly 100 real-world matrices drawn from 44 different application domains, including linear programming, 2D/3D modeling, computer graphics, computer vision, and computational fluid dynamics (CFD).
- Elegante+ evaluates and optimizes SpMV execution, empowering users to make informed decisions regarding scheduling policies, thereby maximizing computational efficiency across various architectures.
- Through extensive benchmarking, Elegante+ provides actionable insights for HPC practitioners, facilitating improved workflow optimization and enhanced computational performance in fields ranging from machine learning to large-scale scientific simulations.
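The scheduling-policy selection described above maps onto OpenMP's `schedule(runtime)` mechanism, which reads the `OMP_SCHEDULE` environment variable at launch. The sketch below illustrates the idea only; the function names are hypothetical and not from the paper. It pairs a serial CSR SpMV kernel with a helper that exports a predicted policy so that an OpenMP kernel compiled with `schedule(runtime)` would pick it up.

```python
import os

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for a matrix stored in CSR form (row_ptr, col_idx, vals)."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # Iterate over the non-zeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

def apply_predicted_schedule(policy, chunk=None):
    """Hypothetical glue code: export the predicted policy so an OpenMP
    loop declared schedule(runtime) adopts it (per the OpenMP spec)."""
    assert policy in ("static", "dynamic", "guided", "auto")
    os.environ["OMP_SCHEDULE"] = policy if chunk is None else f"{policy},{chunk}"

# 2x3 example matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
row_ptr, col_idx, vals = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
apply_predicted_schedule("guided")
y = spmv_csr(row_ptr, col_idx, vals, [1.0, 1.0, 1.0])  # [3.0, 3.0]
```

In the actual framework the kernel would be a parallel OpenMP loop in C/C++; the Python kernel here only pins down the arithmetic being scheduled.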
2. Related Survey
3. Methodology and Design
3.1. Construction of Dataset
3.2. Data Labeling, Training, and Testing
3.3. Feature Scaling
3.4. Model Evaluation Phase
4. Results and Analysis
4.1. Software and Hardware
4.2. Execution Time Analysis of SpMV Computation
4.3. Acceleration
4.4. Predictive Analysis
4.5. Performance Gain
5. Limitations of the Proposed Solution
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Hu, Y.; Du, Y.; Ustun, E.; Zhang, Z. GraphLily: Accelerating graph linear algebra on HBM-equipped FPGAs. In Proceedings of the 2021 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Munich, Germany, 1–4 November 2021.
- Yesil, S.; Heidarshenas, A.; Morrison, A.; Torrellas, J. Speeding up SpMV for power-law graph analytics by enhancing locality & vectorization. In Proceedings of SC20: The International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 9–19 November 2020.
- Sun, M.; Li, Z.; Lu, A.; Li, Y.; Chang, S.E.; Ma, X.; Lin, X.; Fang, Z. FILM-QNN: Efficient FPGA acceleration of deep neural networks with intra-layer, mixed-precision quantization. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, New York, NY, USA, 27 February–1 March 2022.
- Bell, N.; Garland, M. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, OR, USA, 14–20 November 2009.
- Feng, X.; Jin, H.; Zheng, R.; Hu, K.; Zeng, J.; Shao, Z. Optimization of sparse matrix-vector multiplication with variant CSR on GPUs. In Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 7–9 December 2011; pp. 165–172.
- Kislal, O.; Ding, W.; Kandemir, M.; Demirkiran, I. Optimizing sparse matrix-vector multiplication on emerging multicores. In Proceedings of the IEEE 6th International Workshop on Multi-/Many-Core Computing Systems (MuCoCoS), Paris, France, 22–23 September 2013; pp. 1–10.
- Nickolls, J.; Buck, I.; Garland, M.; Skadron, K. Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? Queue 2008, 6, 40–53.
- Baskaran, M.; Bordawekar, R. Optimizing Sparse Matrix-Vector Multiplication on GPUs; RC24704 W0812–047; IBM Research Reports: Yorktown Heights, NY, USA, 2009.
- Giles, M. Efficient sparse matrix-vector multiplication on cache-based GPUs. In Proceedings of the IEEE 2012 Innovative Parallel Computing (InPar), Jeju Island, Republic of Korea, 9–11 July 2012.
- Greathouse, J.L.; Daga, M. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In Proceedings of SC'14: The International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA, 16–21 November 2014.
- Liu, W.; Vinter, B. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing, Newport Beach, CA, USA, 8–11 June 2015.
- Daga, M.; Greathouse, J.L. Structural agnostic SpMV: Adapting CSR-adaptive for irregular matrices. In Proceedings of the 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), Bengaluru, India, 16–19 December 2015.
- Yang, W.; Li, K.; Mo, Z.; Li, K. Performance optimization using partitioned SpMV on GPUs and multicore CPUs. IEEE Trans. Comput. 2014, 64, 2623–2636.
- Hosseinabady, M.; Nunez-Yanez, J.L. A streaming dataflow engine for sparse matrix-vector multiplication using high-level synthesis. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 39, 1272–1285.
- Liu, B.; Liu, D. Towards high-bandwidth-utilization SpMV on FPGA via partial vector duplication. In Proceedings of the 28th Asia and South Pacific Design Automation Conference, Tokyo, Japan, 16–19 January 2023.
- Kourtis, K.; Karakasis, V.; Goumas, G.; Koziris, N. CSX: An extended compression format for SpMV on shared memory systems. ACM SIGPLAN Not. 2011, 46, 247–256.
- Geng, T.; Wang, T.; Sanaullah, A.; Yang, C.; Patel, R.; Herbordt, M. A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing. In Proceedings of the IEEE 2018 28th International Conference on Field Programmable Logic and Applications (FPL), Dublin, Ireland, 3–6 September 2018.
- Du, Y.; Hu, Y.; Zhou, Z.; Zhang, Z. High-performance sparse linear algebra on HBM-equipped FPGA using HLS: A case study on SpMV. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 27 February–1 March 2022.
- Muhammed, T.; Mehmood, R.; Albeshri, A.; Katib, I. SURAA: A novel method and tool for load-balanced and coalesced SpMV computations on GPUs. Appl. Sci. 2019, 9, 947.
- Davis, T.A.; Hu, Y. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 2011, 38, 1–25.
- Pinar, A.; Heath, M.T. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, Portland, OR, USA, 14–19 November 1999; p. 30.
- Usman, S.; Mehmood, R.; Katib, I.; Albeshri, A. ZAKI+: A machine learning based process mapping tool for SpMV computations on distributed memory architectures. IEEE Access 2019, 7, 81279–81296.
- Ahmed, M.; Usman, S.; Shah, N.A.; Ashraf, M.U.; Alghamdi, A.M.; Bahadded, A.A.; Almarhabi, K.A. AAQAL: A machine learning-based tool for performance optimization of parallel SPMV computations using block CSR. Appl. Sci. 2022, 12, 7073.
- Usman, S.; Mehmood, R.; Katib, I.; Albeshri, A.; Altowaijri, S.M. ZAKI: A smart method and tool for automatic performance optimization of parallel SpMV computations on distributed memory machines. Mob. Netw. Appl. 2023, 28, 744–763.
- Xiao, G.; Zhou, T.; Chen, Y.; Hu, Y.; Li, K. Machine learning-based kernel selector for SpMV optimization in graph analysis. ACM Trans. Parallel Comput. 2024, 11, 1–25.
- Yesil, S.; Heidarshenas, A.; Morrison, A.; Torrellas, J. Wise: Predicting the performance of sparse matrix vector multiplication with machine learning. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Montreal, QC, Canada, 25 February–1 March 2023; pp. 329–341.
- Gao, J.; Ji, W.; Liu, J.; Wang, Y.; Shi, F. Revisiting thread configuration of SpMV kernels on GPU: A machine learning based approach. J. Parallel Distrib. Comput. 2024, 185, 104799.
- Shi, Y.; Dong, P.; Zhang, L.J. An irregular sparse matrix SpMV method. Comput. Eng. Sci. 2024, 46, 1175.
- Dufrechou, E.; Ezzatti, P.; Quintana-Orti, E.S. Selecting optimal SpMV realizations for GPUs via machine learning. Int. J. High-Perform. Comput. Appl. 2021, 35, 254–267.
- Ahmad, M.; Sardar, U.; Batyrshin, I.; Hasnain, M.; Sajid, K.; Sidorov, G. Elegante: A machine learning-based threads configuration tool for SpMV computations on shared memory architecture. Information 2024, 15, 685.
Matrix Name | Application Domain | Rows | Columns | NNZ |
---|---|---|---|---|
abb313 | Least-Squares Problem | 313 | 176 | 1557 |
arc130 | Materials Problem | 130 | 130 | 1037 |
bcsstk12 | Duplicate Structural Problem | 1473 | 1473 | 34,241 |
bcsstk32 | Structural Problem | 44,609 | 44,609 | 2,014,701 |
beacxc | Economic Problem | 497 | 506 | 50,409 |
bibd_18_9 | Combinatorial Problem | 153 | 48,620 | 1,750,320 |
bp_1200 | Optimization Problem Sequence | 822 | 822 | 4726 |
ccc | Undirected Graph Sequence | 1,048,576 | 1,048,576 | 4,194,298 |
circuit_1 | Circuit Simulation Problem | 2624 | 2624 | 35,823 |
CurlCurl_0 | Model Reduction Problem | 11,083 | 11,083 | 113,343 |
dielFilterV2clx | Electromagnetics Problem | 607,232 | 607,232 | 25,309,272 |
flowmeter0 | Model Reduction Problem | 9669 | 9669 | 67,391 |
fs_541_1 | 2D/3D Problem Sequence | 541 | 541 | 4282 |
fs_760_2 | Subsequent 2D/3D Problem | 760 | 760 | 5739 |
FX_March2010 | Term/Document Graph | 1319 | 9498 | 301,899 |
GD00_c | Directed Multigraph | 638 | 638 | 1041 |
gemat12 | Subsequent Power Network Problem | 4929 | 4929 | 33,044 |
gyro_k | Duplicate Model Reduction Problem | 17,361 | 17,361 | 1,021,159 |
hangGlider_3 | Optimal Control Problem | 15,561 | 15,561 | 149,532 |
Hardesty2 | Computer Graphics/Vision Problem | 929,901 | 303,645 | 4,020,731 |
impcol_a | Chemical Process Simulation Problem | 207 | 207 | 572 |
kron_g500-logn19 | Undirected Multigraph | 524,288 | 524,288 | 43,562,265 |
lp_bnl2 | Linear Programming Problem | 2324 | 4486 | 14,996 |
mawi_201512012345 | Undirected Weighted Graph | 18,571,154 | 18,571,154 | 38,040,320 |
nemeth01 | Theoretical/Quantum Chemistry Problem Sequence | 9506 | 9506 | 725,054 |
nemeth02 | Subsequent Theoretical/Quantum Chemistry Problem | 9506 | 9506 | 394,808 |
NLR | Undirected Graph | 4,163,763 | 4,163,763 | 24,975,952 |
nv2 | Semiconductor Device Problem | 1,453,908 | 1,453,908 | 37,475,646 |
onetone2 | Frequency-Domain Circuit Simulation Problem | 36,057 | 36,057 | 222,596 |
Pd | Counter-Example Problem | 8081 | 8081 | 13,036 |
preferentialAttachment | Random Undirected Graph | 100,000 | 100,000 | 999,970 |
rbsa480 | Robotics Problem | 480 | 480 | 17,088 |
S20PI_n | Eigenvalue/Model Reduction Problem | 1182 | 1182 | 2881 |
saylr4 | Computational Fluid Dynamics Problem | 3564 | 3564 | 22,316 |
shl_200 | Subsequent Optimization Problem | 663 | 663 | 1726 |
thermal2 | Thermal Problem | 1,228,045 | 1,228,045 | 8,580,313 |
young1c | Acoustics Problem | 841 | 841 | 4089 |
Set | Feature | Formula |
---|---|---|
Basic features | Number of rows | M |
Basic features | Number of columns | N |
Basic features | Number of rows plus number of columns | M + N |
Basic features | Number of non-zero elements | nnz |
High-complexity features | Minimum number of non-zero elements per row | |
High-complexity features | Maximum number of non-zero elements per row | |
High-complexity features | Average number of non-zero elements per row | |
High-complexity features | Standard deviation of non-zero elements per row | |
High-complexity features | Average column distance between the first and last non-zero element in each row | |
High-complexity features | Minimum column distance between the first and last non-zero element in each row | |
High-complexity features | Maximum column distance between the first and last non-zero element in each row | |
High-complexity features | Standard deviation of the column distances between the first and last non-zero element in each row | |
High-complexity features | Clustering | |
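The per-row features in the table can be read directly off a CSR structure. The sketch below is one plausible pure-Python implementation of these statistics; the exact definitions the paper uses (e.g. population vs. sample standard deviation, handling of empty rows) are not reproduced in the table, so treat the details as assumptions.

```python
import statistics

def csr_features(row_ptr, col_idx, n_cols):
    """Compute the basic and per-row sparsity features listed above
    for a CSR matrix (assumed interpretation of the feature table)."""
    n_rows = len(row_ptr) - 1
    nnz_per_row = [row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)]
    # Column distance between the first and last non-zero in each
    # (non-empty) row; CSR stores column indices in ascending order.
    dists = [col_idx[row_ptr[i + 1] - 1] - col_idx[row_ptr[i]]
             for i in range(n_rows) if nnz_per_row[i] > 0]
    sd = statistics.pstdev  # population std dev (an assumption)
    return {
        "rows": n_rows, "cols": n_cols, "rows_plus_cols": n_rows + n_cols,
        "nnz": row_ptr[-1],
        "nnz_min": min(nnz_per_row), "nnz_max": max(nnz_per_row),
        "nnz_avg": sum(nnz_per_row) / n_rows, "nnz_std": sd(nnz_per_row),
        "dist_min": min(dists), "dist_max": max(dists),
        "dist_avg": sum(dists) / len(dists), "dist_std": sd(dists),
    }

# 2x3 example matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
feats = csr_features([0, 2, 3], [0, 2, 1], 3)
```

The clustering feature is omitted here, as the table does not specify how it is computed.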
Category | Detail | Specification |
---|---|---|
Hardware | CPU type | AMD EPYC 7401P |
Hardware | Physical cores | 24 |
Hardware | Threads per core | 2 |
Software | Operating system | Windows 10 Pro (version 18363.904) |
Software | OpenMP | OpenMP 5.1 |
Software | Compiler | Visual Studio Professional 2019 (version 16.5) |
Software | Python | Google Colab (Python 3.6.7) |
Software | Scikit-Learn | Scikit-Learn 0.23.1 |
Features | Model | CV Score | Precision | Recall | F1-Score |
---|---|---|---|---|---|
All Features | RF | 0.792 | 0.92 | 0.92 | 0.92 |
All Features | XGB | 0.793 | 0.92 | 0.92 | 0.92 |
All Features | DT | 0.78 | 0.92 | 0.92 | 0.92 |
All Features | KNN | 0.639 | 0.77 | 0.77 | 0.76 |
Important Features | RF | 0.791 | 0.92 | 0.92 | 0.92 |
Important Features | XGB | 0.786 | 0.92 | 0.91 | 0.91 |
Important Features | KNN | 0.637 | 0.77 | 0.77 | 0.76 |
Important Features | DT | 0.783 | 0.92 | 0.92 | 0.92 |
Basic Features | RF | 0.765 | 0.90 | 0.90 | 0.90 |
Basic Features | XGB | 0.757 | 0.89 | 0.89 | 0.88 |
Basic Features | KNN | 0.634 | 0.75 | 0.75 | 0.75 |
Basic Features | DT | 0.759 | 0.90 | 0.89 | 0.89 |
Features | Model | Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|---|---|
Important Features | RF | dynamic | 0.93 | 0.96 | 0.95 | 376 |
Important Features | RF | runtime | 0.94 | 0.86 | 0.90 | 376 |
Important Features | RF | guided | 0.89 | 0.93 | 0.91 | 376 |
Important Features | RF | static | 0.92 | 0.93 | 0.93 | 376 |
All Features | XGB | dynamic | 0.94 | 0.96 | 0.95 | 376 |
All Features | XGB | runtime | 0.94 | 0.88 | 0.90 | 376 |
All Features | XGB | guided | 0.90 | 0.94 | 0.92 | 376 |
All Features | XGB | static | 0.94 | 0.94 | 0.94 | 376 |
Basic Features | RF | dynamic | 0.90 | 0.97 | 0.93 | 376 |
Basic Features | RF | runtime | 0.91 | 0.85 | 0.88 | 376 |
Basic Features | RF | guided | 0.88 | 0.89 | 0.88 | 376 |
Basic Features | RF | static | 0.91 | 0.89 | 0.90 | 376 |
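The pipeline behind these results (feature scaling per Section 3.3, then a classifier predicting one of the four scheduling classes) can be illustrated in miniature. The paper used scikit-learn models (RF, XGB, DT, KNN); the sketch below substitutes a self-contained min-max scaler and a plain k-nearest-neighbours vote purely for illustration, not as the paper's implementation.

```python
def min_max_scale(X):
    """Scale each feature column of X (list of rows) to [0, 1],
    one common form of the feature scaling in Section 3.3."""
    lo = [min(col) for col in zip(*X)]
    hi = [max(col) for col in zip(*X)]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in X]

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance); stand-in for the KNN rows above."""
    order = sorted(range(len(X_train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(X_train[i], x)))
    votes = [y_train[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

# Toy example: two scaled features -> a scheduling-policy label.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = ["static", "static", "dynamic", "dynamic"]
policy = knn_predict(X, y, [0.9, 0.9], k=3)  # "dynamic"
```

Scaling matters for distance-based models like KNN; tree ensembles such as RF and XGB are largely insensitive to it, which is consistent with KNN trailing the tree models in the tables above.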
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ahmad, M.; Usman, S.; Hamza, A.; Muzamil, M.; Batyrshin, I. Elegante+: A Machine Learning-Based Optimization Framework for Sparse Matrix–Vector Computations on the CPU Architecture. Information 2025, 16, 553. https://doi.org/10.3390/info16070553