Calculating the Singular Values of Many Small Matrices on GPUs
Abstract
1. Introduction
1.1. Application Context
1.2. Limitations of Literature
1.3. Our Contributions
- we introduce a novel, multi-GPU algorithm for computing the singular values of many small matrices via interlaced global memory storage, Householder-based bidiagonalization, tridiagonalization, and Sturm-sequence bisection, allowing direct control of the accuracy–speed trade-off via a tolerance parameter;
- we provide a CUDA implementation that achieves significant speedups over existing approaches and whose target accuracy can be set to match that of MATLAB’s state-of-the-art svd routine;
- we demonstrate the algorithm’s scalability across multiple GPUs and characterize how computational performance varies with the target accuracy.
1.4. Differences with the Literature
1.5. Paper Layout
2. The Approach
2.1. Statement of the Problem
2.2. Literature Review
2.2.1. The Golub-Reinsch Algorithm
2.2.2. The One-Sided Jacobi Method
2.2.3. Differences Between the Golub-Reinsch Algorithm and the One-Sided Jacobi Method
2.2.4. Outline of Our Approach
- Interlaced memory storage. A pre-processing step interlaces the matrices to be processed in global memory so as to enable coalesced memory accesses.
- Bidiagonalization. The first step reduces the input matrix to bidiagonal form, along the same lines as the Golub-Reinsch algorithm.
- Tridiagonalization. The second step performs a symmetric tridiagonalization of the bidiagonal matrix (the underlying algebra is sketched right after this list).
- Root finding. The third step searches for the roots of the characteristic polynomial associated with the symmetric tridiagonal matrix by bisection using Sturm sequences.
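As a sketch of the algebra behind the last two steps, assume the bidiagonalization yields an upper bidiagonal matrix $B$ with diagonal entries $d_1,\dots,d_n$ and superdiagonal entries $e_1,\dots,e_{n-1}$ (the exact tridiagonal form adopted in Section 2.4 may differ from this illustration). The matrix $T = B^{T}B$ is then symmetric tridiagonal with entries

```latex
T_{11} = d_1^{2}, \qquad
T_{ii} = d_i^{2} + e_{i-1}^{2} \;\; (i = 2,\dots,n), \qquad
T_{i,i+1} = T_{i+1,i} = d_i\, e_i \;\; (i = 1,\dots,n-1),
```

and its eigenvalues are the squared singular values $\sigma_i^{2}$ of the input matrix; the root-finding step brackets them by counting, via Sturm sequences, how many eigenvalues fall below a given abscissa.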

2.3. First Step: Bidiagonalization
2.3.1. Householder Reflection
2.3.2. Iterations
| Algorithm 1 Bidiagonalization. |
if … then …; else …; end
for i = 1, … do
    1. Compute the i-th left Householder vector relative to the i-th column.
    2. Update the i-th row of the matrix.
    3. Compute the i-th right Householder vector from the updated row.
    4. Compute the quantities in (11) and (12), respectively.
    5. Compute the quantity in (13).
    6. Update the matrix using (10).
end
for the remaining values of i do
    1. Compute the i-th left Householder vector relative to the i-th column.
    2. Update the input matrix using (8).
end
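As a concrete, hedged illustration of step 1 (our sketch, not the authors' code), a CUDA device routine that builds the i-th left Householder vector for one matrix of the pool, under the assumption that the pool uses the interlaced layout of Section 3.1, i.e., element (r, c) of matrix k stored at offset (r * n + c) * K + k; the name colHouseholder and the per-thread buffer v are ours:

```cuda
#include <math.h>

// Hedged sketch: build the i-th left Householder vector of one matrix, assuming the
// interlaced storage described in Section 3.1. The vector tail is written to the
// per-thread buffer v (length m - i); the return value is the scalar beta such that
// (I - beta * v * v^T) maps the column tail onto a multiple of the first basis vector.
__device__ double colHouseholder(const double* __restrict__ d_A, double* v,
                                 int i, int m, int n, int K, int k)
{
    double norm2 = 0.0;
    for (int r = i; r < m; ++r) {                 // copy the trailing part of column i
        v[r - i] = d_A[(r * n + i) * K + k];      // and accumulate its squared norm
        norm2   += v[r - i] * v[r - i];
    }
    double alpha = sqrt(norm2);
    if (v[0] > 0.0) alpha = -alpha;               // sign chosen to avoid cancellation
    v[0] -= alpha;                                // v = x - alpha * e_1
    double vtv = 0.0;
    for (int r = 0; r < m - i; ++r) vtv += v[r] * v[r];
    return (vtv > 0.0) ? 2.0 / vtv : 0.0;         // beta = 2 / (v^T v); 0 for a null column
}
```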
2.4. Second Step: Tridiagonalization
2.5. Third Step: Root Finding
3. GPU Implementation
- data organization in device global memory to foster coalesced memory accesses;
- parallelization of the bidiagonalization step;
- parallelization of the tridiagonalization step;
- parallelization of the root finding approach using Sturm sequences;
- multi-GPU processing.
- Use of const and __restrict__ whenever applicable on the input parameters of __global__ functions, which enables use of the read-only data cache available since the Kepler architecture [45].
- Use of template kernel parameters, which requires relevant problem parameters, such as the matrix pool size K and the matrix dimensions m and n, to be known at compile time. We have implemented two versions of the approach, one using template kernel parameters and one without them, and the performance of both is sketched in the results section; this accommodates different degrees of compile-time knowledge of the problem parameters. An illustrative comparison of the two variants is given right after this list.
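A minimal sketch of the two points above (ours, not the paper's actual kernels), using a hypothetical element-wise kernel over the interlaced pool: the templated variant fixes K, m, and n at compile time, so the loop bounds are constants and can be unrolled, while the run-time variant receives them as ordinary arguments.

```cuda
// Templated variant: pool size K and dimensions M, N known at compile time.
// const + __restrict__ on d_in allows loads through the read-only data cache.
template <int K, int M, int N>
__global__ void scaleAll(const double* __restrict__ d_in,
                         double* __restrict__ d_out, double factor)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per matrix
    if (k >= K) return;
    #pragma unroll
    for (int i = 0; i < M * N; ++i)
        d_out[i * K + k] = factor * d_in[i * K + k]; // interlaced layout, stride K
}

// Run-time variant: identical body, but K, m and n are ordinary arguments.
__global__ void scaleAllRT(const double* __restrict__ d_in,
                           double* __restrict__ d_out, double factor,
                           int K, int m, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= K) return;
    for (int i = 0; i < m * n; ++i)
        d_out[i * K + k] = factor * d_in[i * K + k];
}
```

A launch of the templated variant for, say, 100,000 matrices of size 16 × 16 would read scaleAll<100000, 16, 16><<<(100000 + 255) / 256, 256>>>(d_in, d_out, 2.0);.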
3.1. Data Organization in Device Global Memory
| Algorithm 2 Memory reorganization __global__ function. |
__global__ void …(double *…, double *…, int K, int …, int …) {
    int …;                        // index of the matrix handled by this thread
    if (…)                        // the index is within the pool
        for (int … ; … ; …++)     // loop over the rows
            for (int … ; … ; …++) // loop over the columns: copy each element to its interlaced position
                …;
}
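For completeness, a hedged sketch of such a reorganization kernel (ours, not the authors' code), under the assumption that matrix k arrives stored contiguously in row-major order and that its element (i, j) is moved to position (i * n + j) * K + k, so that the K threads, each owning one matrix, issue coalesced accesses in the subsequent processing steps:

```cuda
// Sketch of a memory-reorganization kernel under the assumed layouts described above.
// d_src : K matrices, each m x n, stored one after the other in row-major order.
// d_dst : interlaced pool, element (i, j) of matrix k at index (i * n + j) * K + k.
__global__ void interlaceMatrices(const double* __restrict__ d_src,
                                  double* __restrict__ d_dst,
                                  int K, int m, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per matrix
    if (k < K)
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                d_dst[(i * n + j) * K + k] = d_src[k * m * n + i * n + j];
}
```

The writes of consecutive threads are coalesced; the reads are not, but this transposition-like step is paid only once per pool, before the numerical kernels run.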
3.2. Thread-Level and Instruction-Level Parallelism
3.3. Parallelization of the First Step (Bidiagonalization)
| Algorithm 3 Bidiagonalization __global__ function. |
__global__ void …(double *…, int …, int K, int …, int …) {
    int …;                           // index of the matrix handled by this thread
    double …;                        // per-thread registers for the Householder vectors and scalars
    for (int … ; … ; …++) {          // sweep over the columns
        if (…) {                     // both the left and the right reflections are required
            // compute the i-th left Householder vector (steps 1–2 of Algorithm 1),
            // accumulate the dot products with the trailing columns and update them;
            // compute the i-th right Householder vector (steps 3–5 of Algorithm 1)
            // and update the trailing part of the matrix (step 6).
        }
        if (…) {                     // only the left reflection is required
            // compute the i-th left Householder vector and update the
            // input matrix using (8), as in the second loop of Algorithm 1.
        }
    }
}
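The essential pattern in this kernel (shared with Algorithms 2 and 4) is one thread per matrix, with every access to a row or column strided by K. A hedged fragment (ours) showing how the left reflection of the current column i would be applied to a trailing column j of the matrix owned by thread k, assuming d_A, v, beta, i, j, m, n, K, and k are in scope:

```cuda
// Fragment: apply H = I - beta * v * v^T to rows i..m-1 of column j of matrix k,
// i.e. A(i:m-1, j) -= beta * (v^T A(i:m-1, j)) * v, under the interlaced layout.
double dot = 0.0;
for (int r = i; r < m; ++r)                          // consecutive threads touch consecutive
    dot += v[r - i] * d_A[(r * n + j) * K + k];      // addresses: the accesses coalesce
for (int r = i; r < m; ++r)
    d_A[(r * n + j) * K + k] -= beta * dot * v[r - i];
```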
3.4. Parallelizations of the Second (Tridiagonalization) and Third (Root Finding) Steps
| Algorithm 4 Bisection __global__ function. |
__global__ void …(double *d, double *b, double *…, double *…, double *…, int K, int …, double …) {
    int …;                           // index identifying the matrix being processed
    int …;                           // index identifying the singular value being processed
    double …, …, …;
    __shared__ double …, …, …;       // quantities shared by the threads of the block
    if (… && …) {
        // compute the bounds of the eigenvalue interval of the symmetric tridiagonal
        // matrix (diagonal d, off-diagonal b) and store them in shared memory.
    }
    __syncthreads();
    if (… && …) {
        // initialize the search interval for the eigenvalue assigned to this thread;
        while (…) {                  // interval still wider than the tolerance
            // evaluate the Sturm sequence at the midpoint to count the eigenvalues
            // below it and keep the half-interval containing the sought eigenvalue.
        }
        // store the converged value.
    }
}
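A hedged, self-contained sketch of the per-eigenvalue bisection logic (ours, not the paper's kernel): it assumes the tridiagonal matrix is given by its diagonal d and off-diagonal e, that lo and hi bracket the whole spectrum (e.g., Gerschgorin-type bounds), and it uses the unguarded Sturm recurrence for brevity (the guarded version is the subject of Appendix A):

```cuda
// Find the j-th smallest eigenvalue (0-based j) of the n x n symmetric tridiagonal
// matrix with diagonal d[] and off-diagonal e[] by Sturm-sequence bisection.
// tol is the user tolerance that controls the accuracy/speed trade-off.
__device__ double bisectEigenvalue(const double* d, const double* e, int n,
                                   int j, double lo, double hi, double tol)
{
    while (hi - lo > tol) {
        double mid = 0.5 * (lo + hi);
        int count = 0;                                  // eigenvalues strictly below mid
        double q = d[0] - mid;
        if (q < 0.0) ++count;
        for (int i = 1; i < n; ++i) {
            q = d[i] - mid - e[i - 1] * e[i - 1] / q;   // no zero guard here; see Appendix A
            if (q < 0.0) ++count;
        }
        if (count > j) hi = mid;                        // at least j+1 eigenvalues below mid
        else           lo = mid;                        // the j-th eigenvalue is at or above mid
    }
    return 0.5 * (lo + hi);
}
```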
3.5. Multi-GPU Processing
3.6. Computational Complexity and Memory Requirements Per-Thread
3.6.1. Bidiagonalization
3.6.2. Tridiagonalization
3.6.3. Root-Finding by Sturm Bisection
3.7. Comparison with the Approaches in [14,22,23]
4. Numerical Results
4.1. Accuracy, Runtime, and Stability
- we quantify the effect of the tolerance parameter on both accuracy and runtime;
- we assess the robustness of the method on ill-conditioned matrices;
- we assess the robustness of the method against noisy matrices.
4.1.1. Percentage Root Mean Square Error
4.1.2. Tolerance Against Accuracy and Runtime
4.1.3. Stability Against the Condition Number
4.1.4. Noise Perturbation
4.2. Performance of the Approach for the Single GPU Case
4.3. Performance of the Approach in Comparison to cuSOLVER’s Gesvd
4.4. Performance of the Approach for the Multi-GPU Case
5. Conclusions and Future Developments
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
List of Abbreviations
| AWS | Amazon Web Services |
| CUDA | Compute Unified Device Architecture |
| GPU | Graphics Processing Unit |
| MKL | Intel Math Kernel Library |
| MIMO | Multiple-Input/Multiple-Output |
| RMS | root-mean-square |
| SVD | Singular Value Decomposition |
Appendix A
- handling division by zero;
- preserving the monotonicity of the count.
| Algorithm A1 - pivmin version. |
initialize the eigenvalue count and the first pivot;
for each subsequent row of the tridiagonal matrix do
    update the pivot through the Sturm recurrence;
    if the pivot magnitude falls below pivmin then clamp it away from zero using pivmin; end
    if the pivot is negative then increment the count; end
end
| Algorithm A2. |
initialize the eigenvalue count and the first pivot;
for each subsequent row of the tridiagonal matrix do
    update the pivot through the Sturm recurrence;
    if the pivot is negative then increment the count; end
end
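As a concrete, hedged rendering of the pivmin idea (our sketch of the standard guarded Sturm count; function and variable names are ours, and the sign convention of the clamp follows common practice rather than necessarily the listings above):

```cuda
#include <math.h>

// Sketch of a pivmin-guarded Sturm count: returns the number of eigenvalues of the
// n x n symmetric tridiagonal matrix (diagonal d, off-diagonal e) smaller than x.
// pivmin is a tiny positive threshold, typically on the order of the smallest safe
// divisor scaled by the largest e_i^2, used to prevent division by zero.
__host__ __device__ int sturmCountPivmin(const double* d, const double* e,
                                         int n, double x, double pivmin)
{
    int count = 0;
    double q = d[0] - x;                              // first pivot of T - x*I
    if (fabs(q) < pivmin) q = -pivmin;                // clamp away from zero
    if (q < 0.0) ++count;
    for (int i = 1; i < n; ++i) {
        q = d[i] - x - e[i - 1] * e[i - 1] / q;       // Sturm / pivot recurrence
        if (fabs(q) < pivmin) q = -pivmin;            // division-by-zero guard
        if (q < 0.0) ++count;
    }
    return count;
}
```

Clamping a near-zero pivot to a small negative value is the usual way of removing the division by zero while keeping the count a non-decreasing function of x, which is what the bisection relies on for correctness.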
References
- Bertero, M.; Boccacci, P. Introduction to Inverse Problems in Imaging; Institute of Physics Publishing: Bristol, UK, 1998. [Google Scholar]
- Hua, Y.; Liu, W. Generalized Karhunen-Loeve transform. IEEE Signal Proc. Lett. 1998, 5, 141–142. [Google Scholar]
- Bavirisetti, D.P.; Dhuli, R. Fusion of infrared and visible sensor images based on anisotropic diffusion and Karhunen–Loeve Transform. IEEE Sensors J. 2016, 16, 203–209. [Google Scholar] [CrossRef]
- Narwaria, M.; Lin, W. SVD-based quality metric for image and video using machine learning. IEEE Trans. Syst. Man Cybern.-Part B Cybern. 2012, 42, 347–364. [Google Scholar] [CrossRef] [PubMed]
- Swaminathan, S.; Garg, D.; Kannan, R.; Andres, F. Sparse low rank factorization for deep neural network compression. Neurocomputing 2020, 398, 185–196. [Google Scholar] [CrossRef]
- Capozzoli, A.; Curcio, C.; D’Elia, G.; Ferrara, F.; Gennarelli, C.; Guerriero, R.; Liseno, A. A probe-compensated helicoidal NF-FF transformation for aperture antennas using a prolate spheroidal expansion. Int. J. Antennas Prop. 2012, 2012, 753156. [Google Scholar] [CrossRef]
- Capozzoli, A.; Curcio, C.; Liseno, A. Singular Value Optimization in inverse electromagnetic scattering. IEEE Antennas Wirel. Prop. Lett. 2017, 16, 1094–1097. [Google Scholar] [CrossRef]
- Xiong, Y.; Liang, B.; Yu, H.; Chen, J.; Jin, Y.; Xing, M. Processing of bistatic SAR data With nonlinear trajectory using a controlled-SVD algorithm. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5750–5759. [Google Scholar] [CrossRef]
- Breglia, A.; Capozzoli, A.; Curcio, C.; Donna, S.D.; Liseno, A. GPU-based electromagnetic optimization of MIMO channels. ACES Express J. 2018, 33, 172–175. [Google Scholar]
- Capozzoli, A.; Curcio, C.; Donna, S.D.; Liseno, A. Massive computation of singular values of small matrices on GPUs. In Proceedings of the IEEE Computational Electromagnetics International Workshop (CEM), Izmir, Turkey, 1–4 July 2015; pp. 36–37. [Google Scholar]
- Messer, O.E.B.; Harris, J.A.; Parete-Koon, S.; Chertkow, M.A. Multicore and accelerator development for a leadership-class stellar astrophysics code. In Applied Parallel and Scientific Computing; Manninen, P., Ed.; Springer: Berlin/Heidelberg, Germany; pp. 92–106.
- Sedghi, H.; Gupta, V.; Long, P.M. The singular values of convolutional layers. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1–12. [Google Scholar]
- Jia, K.; Tao, D.; Gao, S.; Xu, X. Improving training of deep neural networks via Singular Value Bounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3994–4002. [Google Scholar]
- Dong, T.; Haidar, A.; Tomov, S.; Dongarra, J. Accelerating the SVD bi-diagonalization of a batch of small matrices using GPUs. J. Comput. Sci. 2018, 26, 237–245. [Google Scholar] [CrossRef]
- Drinea, E.; Drineas, P.; Huggins, P. A randomized singular value decomposition algorithm for image processing applications. In Proceedings of the 8th Panhellenic Conference in Informatics, Nicosia, Cyprus, 8–10 November 2001; pp. 1–10. [Google Scholar]
- Kirk, D.B.; Hwu, W.-M. Programming Massively Parallel Processors, A Hands-On Approach, 2nd ed.; Morgan Kaufmann: Waltham, MA, USA, 2013. [Google Scholar]
- Breglia, A.; Capozzoli, A.; Curcio, C.; Liseno, A. CUDA expression templates for electromagnetic applications on GPUs. IEEE Antennas Prop. Mag. 2013, 55, 156–166. [Google Scholar] [CrossRef]
- Capozzoli, A.; Kilic, O.; Curcio, C.; Liseno, A. The success of GPU computing in applied electromagnetics. Appl. Electromagn. Soc. J. 2018, 33, 148–151. [Google Scholar]
- Golub, G.H.; Kahan, W. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Numer. Anal. 1965, 2, 205–224. [Google Scholar] [CrossRef]
- Vandebril, R.; Barel, M.V.; Mastronardi, N. A QR–method for computing the singular values via semiseparable matrices. Numer. Math. 2004, 99, 163–195. [Google Scholar] [CrossRef][Green Version]
- Hassan, D.A. Deep neural network-based approach for computing singular values of matrices. Sci. J. Univ. Zakho 2025, 13, 1–6. [Google Scholar] [CrossRef]
- Xiao, J.; Xue, Q.; Ma, H.; Zhang, X.; Tan, G. A W-cycle algorithm for efficient batched SVD on GPUs. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’22, Seoul, Republic of Korea, 2–6 April 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 465–466. [Google Scholar]
- Boukaram, W.H.; Turkiyyah, G.; Ltaief, H.; Keyes, D.E. Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression. Parallel Comp. 2018, 74, 19–33. [Google Scholar] [CrossRef]
- Sachdev, G.S.; Vanjani, V.; Hall, M.W. Takagi factorization on GPU using CUDA. In Proceedings of the Symposium on Application Accelerators in High Performance Computing, Knoxville, Tennessee, 13–15 July 2010; pp. 1–3. [Google Scholar]
- Clarkson, E.; Palit, R.; Kupinski, M.A. SVD for imaging systems with discrete rotational symmetry. Opt. Express 2010, 18, 25306–25320. [Google Scholar] [CrossRef] [PubMed]
- Golub, G.H.; Loan, C.F.V. Matrix Computations, 3rd ed.; Johns Hopkins University Press: Baltimore, MD, USA, 1996. [Google Scholar]
- Demmel, J.W.; Dhillon, I.; Ren, H. On the correctness of some bisection-like parallel eigenvalue algorithms in floating point arithmetic. Electron. Trans. Numer. Anal. 1995, 3, 116–149. [Google Scholar]
- Hermann, E.; Raffin, B.; Faure, F.; Gautier, T.; Allard, J. Multi-GPU and multi-CPU parallelization for interactive physics simulations. In Proceedings of the European Conference on Parallel Processing, Ischia, Italy, 31 August–3 September 2010; pp. 235–246. [Google Scholar]
- NVIDIA. cuSOLVER Library. DU-06709-001_v11.6, February 2022. Available online: https://docs.nvidia.com/cuda/archive/11.6.1/pdf/CUSOLVER_Library.pdf (accessed on 5 May 2025).
- Lahabar, S.; Narayanan, P.J. Singular value decomposition on GPU using CUDA. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing, Rome, Italy, 23–29 May 2009; pp. 1–10. [Google Scholar]
- Kotas, C.; Barhen, J. Singular value decomposition utilizing parallel algorithms on graphical processors. In Proceedings of the OCEANS MTS/IEEE KONA, Waikoloa, HI, USA, 19–22 September 2011; pp. 1–7. [Google Scholar]
- Golub, G.H.; Reinsch, C. Handbook Series Linear Algebra. Singular Value Decomposition and Least Squares Solutions. Numer. Math. 1970, 14, 403–420. [Google Scholar] [CrossRef]
- Nash, J.C. A one-sided transformation method for the Singular Value Decomposition and algebraic eigenproblem. Comput. J. 1975, 18, 74–76. [Google Scholar] [CrossRef]
- Hestenes, M.R. Inversion of matrices by biorthogonalization and related results. J. Soc. Ind. Appl. Math. 1958, 6, 51–90. [Google Scholar] [CrossRef]
- Zhao, S.; Li, R.; Tian, W.; Xiao, W.; Dong, X.; Liao, D.; Khan, S.U.; Li, K. Divide-and-conquer approach for solving singular value decomposition based on MapReduce. Concurr. Comput. Pract. Exp. 2016, 38, 331–350. [Google Scholar] [CrossRef]
- Drmač, Z.; Veselić, K. New Fast and Accurate Jacobi SVD Algorithm. I. SIAM J. Matrix Anal. Appl. 2008, 29, 1322–1342. [Google Scholar] [CrossRef]
- Drmač, Z.; Veselić, K. New Fast and Accurate Jacobi SVD Algorithm. II. SIAM J. Matrix Anal. Appl. 2008, 29, 1343–1362. [Google Scholar] [CrossRef]
- Zhou, B.B.; Brent, R.P. On parallel implementation of the one-sided Jacobi algorithm for singular value decompositions. In Proceedings of the Euromicro Workshop on Parallel and Distributed Processing, San Remo, Italy, 25–27 January 1995; pp. 401–408. [Google Scholar]
- Zhou, B.B.; Brent, R.P. A Parallel ring ordering algorithm for efficient one-sided Jacobi SVD computations. J. Parallel Distrib. Comput. 1997, 42, 1–10. [Google Scholar] [CrossRef][Green Version]
- Luk, F.T.; Park, H. On parallel Jacobi orderings. SIAM J. Sci. Stat. Comput. 1989, 10, 18–26. [Google Scholar] [CrossRef]
- Demmel, J.; Veselić, K. Jacobi’s method is more accurate than QR. SIAM J. Matrix Anal. Appl. 1992, 13, 1204–1245. [Google Scholar] [CrossRef]
- Wilkinson, J.H. The Algebraic Eigenvalue Problem; Oxford University Press: New York, NY, USA, 1988. [Google Scholar]
- Barth, W.; Martin, R.S.; Wilkinson, J.H. Calculation of the eigenvalues of a symmetric tridiagonal matrix by the method of bisection. Numer. Math. 1967, 9, 386–393. [Google Scholar] [CrossRef]
- Gerschgorin, S.A. Über die abgrenzung der eigenwerte einer matrix. Izv. Akad. Nauk. SSSR Ser. Fiz.-Mat. 1931, 7, 749–754. [Google Scholar]
- Yi, X.; Stokes, D.; Yan, Y.; Liao, C. CUDAMicroBench: Microbenchmarks to assist CUDA performance programming. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, Portland, OR, USA, 17–21 June 2021; pp. 397–406. [Google Scholar]
- Choi, J.; Dongarra, J.J.; Walker, D.W. The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form. Numer. Algorith. 1995, 10, 379–399. [Google Scholar] [CrossRef]
- Ralha, R.M.S. A new algorithm for Singular Value Decompositions. In Proceedings of the 2nd Euromicro Workshop on Parallel and Distributed Processing, Malaga, Spain, 26–28 January 1994; pp. 240–244. [Google Scholar]
- Kahan, W. Accurate Eigenvalues of a Symmetric Tridiagonal Matrix; Computer Science Department Technical Report CS41; Stanford University: Stanford, CA, USA, July 1966. [Google Scholar]














Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

