An Alternate GPU-Accelerated Algorithm for Very Large Sparse LU Factorization
Abstract
1. Introduction
2. SuperLU3D_Alternate Algorithm
2.1. Background
2.1.1. Elimination Trees
2.1.2. Block Algorithm for LU Factorization
- (1) Diagonal factorization: L_kk and U_kk are obtained from the LU factorization of the diagonal block A_kk in Equation (2);
- (2) Panel updates: L_ik = A_ik U_kk^(-1) and U_kj = L_kk^(-1) A_kj;
- (3) Schur-complement update: compute the product L_ik U_kj, and update A_ij ← A_ij − L_ik U_kj.
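The three steps above have a dense analogue that can be sketched in a few lines of NumPy. The sketch below is a simplified, no-pivoting illustration of the blocked right-looking scheme, not SuperLU's sparse, supernodal implementation:

```python
import numpy as np

def block_lu(A, bs):
    """Right-looking blocked LU factorization (no pivoting), in place.

    On return, the strict lower triangle of A holds L (unit diagonal)
    and the upper triangle holds U. `bs` is the block (supernode) size.
    """
    n = A.shape[0]
    for k in range(0, n, bs):
        ke = min(k + bs, n)
        # (1) Diagonal factorization: A_kk = L_kk U_kk (unblocked LU).
        for j in range(k, ke):
            A[j+1:ke, j] /= A[j, j]
            A[j+1:ke, j+1:ke] -= np.outer(A[j+1:ke, j], A[j, j+1:ke])
        L_kk = np.tril(A[k:ke, k:ke], -1) + np.eye(ke - k)
        U_kk = np.triu(A[k:ke, k:ke])
        # (2) Panel updates: L_ik = A_ik U_kk^(-1), U_kj = L_kk^(-1) A_kj.
        A[ke:, k:ke] = np.linalg.solve(U_kk.T, A[ke:, k:ke].T).T
        A[k:ke, ke:] = np.linalg.solve(L_kk, A[k:ke, ke:])
        # (3) Schur-complement update: A_ij -= L_ik U_kj.
        A[ke:, ke:] -= A[ke:, k:ke] @ A[k:ke, ke:]
    return A
```

Pivoting is omitted here, so the sketch is only valid for matrices (e.g. diagonally dominant ones) whose leading blocks are safely invertible.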
2.1.3. The 2D SuperLU_DIST Algorithm
- (1) Diagonal factorization: the process owning the diagonal block A_kk factorizes it into L_kk U_kk;
- (2) Diagonal broadcast (blue arrows in Figure 3): the diagonal process broadcasts L_kk along its process row and U_kk along its process column;
- (3) Panel solve: each process in the k-th block column calculates L_ik = A_ik U_kk^(-1), and converts L_ik, which is stored in a sparse format, into a dense format (because the diagonal block is a dense supernode, as mentioned in Section 2.1). Each process in the k-th block row calculates U_kj = L_kk^(-1) A_kj;
- (4) Panel broadcast: each process owning a panel block broadcasts its blocks of L_ik or U_kj along its process row or column, respectively;
- (5) Schur-complement update: the process at grid coordinate (i mod P_r, j mod P_c) updates A_ij ← A_ij − L_ik U_kj.
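The 2D block-cyclic layout behind steps (1)–(5) can be sketched as follows. The helper names are illustrative, not SuperLU_DIST's API; the mapping of block (i, j) to process (i mod P_r, j mod P_c) is the standard cyclic distribution:

```python
def owner(i, j, Pr, Pc):
    """Process coordinate that owns block (i, j) on a Pr x Pc grid,
    under the standard 2D block-cyclic mapping."""
    return (i % Pr, j % Pc)

def step_participants(k, nb, Pr, Pc):
    """Processes touched by elimination step k on an nb x nb block matrix:
    the diagonal owner, plus the owners of the k-th block column (L panel)
    and the k-th block row (U panel)."""
    diag = owner(k, k, Pr, Pc)
    col_k = {owner(i, k, Pr, Pc) for i in range(k + 1, nb)}
    row_k = {owner(k, j, Pr, Pc) for j in range(k + 1, nb)}
    return diag, col_k, row_k
```

Because the mapping is cyclic, every process row and column participates in many elimination steps, which balances both the panel work and the Schur-complement updates.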
2.1.4. The SuperLU3D Algorithm
2.2. The SuperLU3D_Alternate Algorithm
- (1) Cluster computing resources: We assume that the hybrid CPU/GPU cluster is composed of m compute nodes. The i-th node (1 ≤ i ≤ m) has memory M_i and g_i GPUs with the same video memory size M_GPU. The number of compute nodes with GPUs is denoted m_g, and the nodes are sorted from the largest g_i to the smallest;
- (2) Matrix blocking, process binding, and data structures: For simplicity, it is assumed that the leaf nodes of the elimination tree E are load-balanced at each level; in other words, the amount of computation and storage on each leaf node is almost the same. Moreover, we assume that the levels in E are indexed in top-down order, so the root of the tree is at level 0, while the leaf levels run from 1 to l. At each level, each subtree has an activation state; once the leaf subtrees have been merged to form their parent trees, they are marked inactive. The processes are organized in a three-dimensional virtual processor grid with P_x rows, P_y columns, and P_z depths using the MPI Cartesian grid topology construct, so P_z is the number of 2D process grids. We map each submatrix A_i to a 2D process grid, grid-i (0 ≤ i < P_z), which is equivalent to dividing the total matrix into P_z submatrices. Each process has a 3D process coordinate (x, y, z). A 2D process grid contains those processes with the same z, and it is convenient for processes with the same (x, y) coordinates to communicate between different 2D process grids. Each process invokes a piece of GPU memory, which is less than its share of M_GPU, and processes are bound to GPUs according to their MPI rank, so that a fixed group of ranks shares each GPU. We use a data structure that consists of sparse matrices and elimination trees, the same as in the SuperLU3D algorithm. SuperLU3D uses a principal data structure, SuperMatrix, to represent a general matrix, sparse or dense. The SuperMatrix structure contains two levels of fields. The first level defines all the properties of a matrix that are independent of how it is stored in memory: the storage type, the data type, the number of rows, the number of columns, and a pointer to the actual storage of the matrix (*Store).
The second level (*Store) points to the actual storage of the matrix: a structure of four elements consisting of the number of nonzero elements in the matrix, a pointer to an array of nonzero values (*nzval), a pointer to an array of row indices of the nonzero elements, and a pointer to an array marking the beginning of each column in nzval[]. The elimination tree is a vector of parent pointers for a forest;
- (3) The per-process host memory and GPU memory required to store all the LU factors: for the LU factorization of matrices arising from 2D and 3D PDEs, closed-form estimates are given in [4] in terms of the dimensions of the supernodes at each level of the elimination tree. However, these estimates are not suitable for matrices with a more complex structure; in that case, both memory sizes can be obtained from the SuperLU3D_Alternate program output instead.
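The SuperMatrix/*Store layout described in item (2) can be sketched in Python as follows. This is a simplified stand-in for the C structs, not SuperLU's actual API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Store:
    # Second-level structure: the actual compressed-column storage.
    nnz: int            # number of nonzero elements
    nzval: np.ndarray   # nonzero values
    rowind: np.ndarray  # row index of each nonzero
    colptr: np.ndarray  # beginning of each column in nzval[]/rowind[]

@dataclass
class SuperMatrix:
    # First-level fields: properties independent of the storage layout.
    stype: str    # storage type, e.g. "CSC"
    dtype: str    # data type, e.g. "float64"
    nrow: int
    ncol: int
    store: Store  # stands in for the *Store pointer

def from_dense(A):
    """Build a compressed-column SuperMatrix from a dense array."""
    A = np.asarray(A)
    cols, rows = np.nonzero(A.T)   # traverse nonzeros in column-major order
    nzval = A[rows, cols]
    colptr = np.zeros(A.shape[1] + 1, dtype=int)
    np.add.at(colptr, cols + 1, 1)  # count nonzeros per column...
    colptr = np.cumsum(colptr)      # ...then prefix-sum into offsets
    return SuperMatrix("CSC", str(A.dtype), A.shape[0], A.shape[1],
                       Store(len(nzval), nzval, rows, colptr))
```

The elimination tree, by contrast, needs no such structure: as the text notes, it is just an integer vector where parent[i] gives the parent of node i, with a sentinel value marking each root of the forest.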
2.2.1. Case A: Sufficient Memory for GPU Nodes
- I.
- II. After the factorization of the first submatrix, we synchronize the GPU data structure into the host memory (function copyLU2D_GPU2CPU() in Figure 5b; see Algorithm 1). Then, we free the GPU data structure in the device memory;
- III. As in steps I and II, we decompose the other submatrices with GPU acceleration (②, ③, and ④ in Figure 5a) and save their LU factors into the host memory in turn. Finally, we reduce all the common ancestor nodes (function Ancestor_Reduction(); see Algorithm 1) to finish the LU factorization of the l-th level;
- IV.
Algorithm 1: Case A

Data: the input matrix A; grid3d, which has a 3D MPI communicator and lower-dimensional subcommunicators.
Result: the LU factors of A. {LU}_GPU, stored in the GPU memory, has a data structure similar to that of A; {LU}_host, stored in the host memory, is copied from the modified {LU}_GPU.

/* The 2D grids host the subtrees at the lvl-th level. The elimination tree can have multiple disjoint subtrees as a node, so the final partition of the elimination tree is a tree of forests, which we call the elimination tree-forest [4]. A subtree becomes inactive once its leaf subtrees are merged to form the parent trees. */
for lvl ← l down to 0 do
    /* Perform LU factorization of the active 2D grids with GPU acceleration.
       grid3d.comm is the 3D MPI communicator; grid3d.zscp.rank is the
       z-coordinate of a process in the 3D MPI communicator. */
    for each active grid-i at the lvl-th level do
        if grid3d.zscp.rank = i then
            if lvl = l then
                initLU2D_GPU()          /* initialize the GPU data structures */
            else
                copyLU2D_CPU2GPU()      /* restore the GPU data structures from the host memory */
            end
            /* dSparseLU2D_GPU() is a call to the modified factorization routine of
               2D SuperLU_DIST with GPU for the sub-block of a matrix [4]. */
            dSparseLU2D_GPU()
            if grid-i remains active at the next level then
                copyLU2D_GPU2CPU()      /* copy the GPU data structures to the host memory */
            end
            freeLU2D_GPU()              /* free the GPU memory of {LU}_GPU */
        end
    end
    if lvl > 0 then
        Ancestor_Reduction()            /* reduce the nodes of the ancestor matrix before factorizing the next level [4] */
    end
end
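The control flow of Algorithm 1 can be simulated in a few lines of Python. The dicts below stand in for GPU and host storage, and the comments mirror Algorithm 1's helper names; this is a schedule sketch under those assumptions, not the actual MPI/CUDA implementation:

```python
def case_a_schedule(num_levels, active_grids):
    """Simulate the Case A schedule.

    active_grids(lvl) returns the grids active at level lvl; levels are
    processed bottom-up (lvl = num_levels-1 ... 0, leaves first).
    Returns the event list plus the final GPU and host storage maps.
    """
    gpu, host, events = {}, {}, []
    for lvl in range(num_levels - 1, -1, -1):
        for g in active_grids(lvl):
            if g not in host:
                gpu[g] = ("LU", g)              # initLU2D_GPU(): first visit
            else:
                gpu[g] = host[g]                # copyLU2D_CPU2GPU(): restore
            events.append(("factor", lvl, g))   # dSparseLU2D_GPU()
            host[g] = gpu.pop(g)                # copyLU2D_GPU2CPU() + freeLU2D_GPU()
        if lvl > 0:
            events.append(("ancestor_reduction", lvl))
    return events, gpu, host
```

Popping the grid out of `gpu` after every factorization captures the key idea of Case A: only one grid's LU data structure occupies the GPU at a time, so the per-process GPU footprint stays bounded by a single subtree.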
2.2.2. Case B: Insufficient Memory for GPU Nodes
- I. The first pair of submatrices exchange their LU data (① in Figure 6a,c), so that the submatrix to be factorized next resides on the GPU nodes. We then initialize the GPUs in Nodes 1 and 2 and factorize it with GPU acceleration (① in Figure 6b,c), while the remaining pairs exchange their data independently (②→③→④ in Figure 6a,c) until the last exchange is finished;
- II. After the LU factorization of the first submatrix is finished on the GPU nodes, we synchronize its LU factors from the GPU to the host memory to prepare for the next level, and free the GPU memory occupied by the LU factors (① in Figure 6a,c);
- III. We factorize the remaining exchanged submatrices with GPU acceleration and synchronize their LU factors from the GPU to the host memory in turn (②→③→④ in Figure 6b,c);
- IV. After finishing the LU factorization, the first four submatrices and the last four submatrices exchange their data, including the LU factors, again in turn (①→②→③→④ in Figure 6b,c), so that each submatrix's LU factors return to their original nodes;
- V. After we factorize the first four submatrices following the same procedure as in Case A, we reduce all the common ancestor nodes to finish the LU factorization of the lowest level;
- VI. At the next level, l−1, according to step IV of Case A, only half of the submatrices remain active. We treat the remaining levels in the same way as Case A; the LU factors are kept in the host memory for Ancestor_Reduction() at the remaining levels.
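The pipeline in steps I–III can be sketched as an interleaved event list. The event names below are illustrative, not SuperLU3D_Alternate's API; the point is that every exchange after the first overlaps with the factorization of the previously exchanged submatrix:

```python
def case_b_schedule(num_pairs):
    """Sketch of the Case B pipeline at the leaf level.

    Each submatrix held on CPU-only nodes is swapped onto the GPU nodes,
    and every swap after the first overlaps with the factorization of the
    previously swapped submatrix.
    """
    events = []
    for k in range(num_pairs):
        events.append(("exchange", k))        # swap LU data of pair k
        if k > 0:
            events.append(("factor", k - 1))  # overlaps with the exchange above
    events.append(("factor", num_pairs - 1))  # last submatrix: nothing left to overlap
    return events
```

With four pairs, only the first exchange and the last factorization are exposed; the other three exchanges hide behind GPU factorizations, which is where Case B recovers most of the swap cost.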
2.2.3. Case C: Insufficient Memory for the Entire Cluster
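The decision among Cases A, B, and C can be sketched as a simple memory check against the per-process estimates of Section 2.2. The parameter names and thresholds here are illustrative assumptions, not the paper's exact criteria:

```python
def choose_case(lu_gpu_per_proc, lu_host_per_proc, procs_per_gpu,
                gpu_mem, procs_per_node, node_mem):
    """Pick the factorization strategy from per-process LU memory estimates.

    Case A: every process's share of the LU factors fits in its GPU.
    Case B: the GPU is too small, but host memory can hold the factors,
            so submatrices alternate through the GPU.
    Case C: even the cluster's host memory is insufficient.
    All sizes are in the same unit (e.g. GB); names are hypothetical.
    """
    if lu_gpu_per_proc * procs_per_gpu <= gpu_mem:
        return "A"
    if lu_host_per_proc * procs_per_node <= node_mem:
        return "B"
    return "C"
```

In practice the two per-process estimates would come from the SuperLU3D_Alternate program output, as noted in item (3) of Section 2.2.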
3. Time Complexity and Space Complexity
3.1. Time Complexity
3.2. Space Complexity
4. Numerical Experiments and Results
4.1. Setup
4.2. Test Matrices
4.3. Comparison of Results
4.4. Correctness of Time Complexity Analysis
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Amestoy, P.R.; Guermouche, A.; L'Excellent, J.Y.; Pralet, S. Hybrid scheduling for the parallel solution of linear systems. Parallel Comput. 2006, 32, 136–156.
- Amestoy, P.R.; Duff, I.S.; L'Excellent, J.Y.; Koster, J. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J. Matrix Anal. Appl. 2001, 23, 15–41.
- Schenk, O.; Gärtner, K. On fast factorization pivoting methods for sparse symmetric indefinite systems. Electron. Trans. Numer. Anal. 2006, 23, 158–179.
- Sao, P.; Li, X.S.; Vuduc, R. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems. J. Parallel Distrib. Comput. 2019, 131, 218–234.
- Sao, P.; Vuduc, R.; Li, X.S. A distributed CPU-GPU sparse direct solver. In European Conference on Parallel Processing; Springer: Berlin/Heidelberg, Germany, 2014; pp. 487–498.
- Li, X.S.; Demmel, J.W. SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Trans. Math. Softw. 2003, 29, 110–140.
- Demmel, J.W.; Eisenstat, S.C.; Gilbert, J.R.; Li, X.S.; Liu, J.W.H. A supernodal approach to sparse partial pivoting. SIAM J. Matrix Anal. Appl. 1999, 20, 720–755.
- Abhyankar, S.; Brown, J.; Constantinescu, E.M.; Ghosh, D.; Smith, B.F.; Zhang, H. PETSc/TS: A modern scalable ODE/DAE solver library. arXiv 2018.
- Heroux, M.A.; Bartlett, R.A.; Howle, V.E.; Hoekstra, R.J.; Hu, J.J.; Kolda, T.G.; Lehoucq, R.B.; Long, K.R.; Pawlowski, R.P.; Phipps, E.T.; et al. An overview of the Trilinos Project. ACM Trans. Math. Softw. 2005, 31, 397–423.
- Falgout, R.D.; Jones, J.E.; Yang, U.M. Conceptual interfaces in HYPRE. Future Gener. Comput. Syst. 2006, 22, 239–251.
- Chen, L.; Song, X.; Gerya, T.V.; Xu, T.; Chen, Y. Crustal melting beneath orogenic plateaus: Insights from 3-D thermo-mechanical modeling. Tectonophysics 2019, 761, 1–15.
- Chen, L.; Capitanio, F.A.; Liu, L.; Gerya, T.V. Crustal rheology controls on the Tibetan plateau formation during India-Asia convergence. Nat. Commun. 2017, 8, 15992.
- Chen, L.; Gerya, T.V. The role of lateral lithospheric strength heterogeneities in orogenic plateau growth: Insights from 3-D thermo-mechanical modeling. J. Geophys. Res. Solid Earth 2016, 121, 3118–3138.
- Li, Z.-H.; Xu, Z.; Gerya, T.; Burg, J.-P. Collision of continental corner from 3-D numerical modeling. Earth Planet. Sci. Lett. 2013, 380, 98–111.
- Gaihre, A.; Li, X.S.; Liu, H. GSOFA: Scalable sparse symbolic LU factorization on GPUs. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 1015–1026.
- Duff, I.S. The impact of high-performance computing in the solution of linear systems: Trends and problems. J. Comput. Appl. Math. 2000, 123, 515–530.
- Li, X.S.; Demmel, J.W. Making sparse Gaussian elimination scalable by static pivoting. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, Orlando, FL, USA, 7–13 November 1998; p. 34.
- Li, X.S.; Demmel, J.W.; Gilbert, J.R.; Grigori, L.; Sao, P.; Shao, M.; Yamazaki, I. SuperLU Users' Guide; University of California: Berkeley, CA, USA, 2018.
- Davis, T.A.; Hu, Y.F. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 2011, 38, 1–25.
- Gerya, T.V. Introduction to Numerical Geodynamic Modelling, 2nd ed.; Cambridge University Press: Cambridge, UK, 2019.
- Potluri, S.; Hamidouche, K.; Venkatesh, A.; Bureddy, D.; Panda, D.K. Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs. In Proceedings of the 2013 42nd International Conference on Parallel Processing, Lyon, France, 1–4 October 2013; pp. 80–89.
Symbol | Description
---|---
P | number of MPI processes
P_x × P_y × P_z | 3D process grid dimensions
 | number of processes in the xy plane
 | groups of processes assigned to the k-th supernodal row
 | i- to j-th group of processes assigned to the k-th supernodal row (i < j)
 | groups of processes assigned to the k-th supernodal column
 | i- to j-th groups of processes assigned to the k-th supernodal column (i < j)
 | process number (No.)
pid | process ID
A_i | the i-th submatrix
grid-i | the i-th 2D process grid
n | dimension of a matrix
nnz | number of nonzero elements in a matrix
E | elimination tree
l | level of the elimination tree
m | number of compute nodes in the cluster
M_i | memory size of the i-th compute node
M_GPU | GPU DRAM size
 | number of processes of the i-th node
g_i | number of GPUs of the i-th node
m_g | number of nodes with GPUs
 | number of submatrices at the lvl-th level
 | memory occupied by the LU factors per process
 | GPU memory occupied by the LU factors per process
 | host memory copied from the GPU per process
 | host memory of the j-th process copied from the j-th GPU
grid3d.comm | the 3D MPI communicator
Parameter | GPU-Server | Intel-Server | AMD-Server
---|---|---|---
CPU | 2 × Intel Xeon Gold 6226R @ 2.90 GHz | 2 × Intel Xeon Silver 4110 @ 2.10 GHz | 2 × AMD EPYC 7742 @ 2.25 GHz
Memory | 256 GB | 256 GB | 512 GB
GPUs per node | 2 | 0 | 0
Type of GPU | Tesla V100S "Volta" | -- | --
GPU DRAM | 32 GB HBM2 | -- | --
CUDA cores per GPU | 5120 | -- | --
NVMe SSD | 4 TB | 4 TB | 4 TB
InfiniBand bandwidth | 2.5 Gb/s | 2.5 Gb/s | 2.5 Gb/s
Operating system | CentOS Linux release 7.4.1708 (all servers) | |
Compiler | NVIDIA HPC-X (all servers) | |
Libraries | CUDA 10.2, Intel MKL (all servers) | |
Name | n a | nnz a | nnz/n | Application
---|---|---|---|---
CoupCons3D b | 4.2 × 10^5 | 1.7 × 10^7 | 41.5 | Structural Problem
Ldoor b | 9.5 × 10^5 | 4.2 × 10^7 | 44.6 | Structural Problem
Serena b | 1.4 × 10^6 | 6.4 × 10^7 | 46.1 | Structural Problem
audikw_1 b | 9.4 × 10^5 | 7.8 × 10^7 | 82.0 | Structural Problem
dielFilterV3real b | 1.1 × 10^6 | 8.9 × 10^7 | 81.0 | FEM/EM
G3_circuit b | 1.6 × 10^6 | 7.7 × 10^6 | 4.8 | Circuit Sim.
nlpkkt80 b | 1.1 × 10^6 | 2.8 × 10^7 | 26.5 | KKT matrices
geodynamics1 c | 3.3 × 10^6 | 4.5 × 10^7 | 14.0 | FDM 3D d
geodynamics2 c | 4.5 × 10^6 | 6.3 × 10^7 | 14.0 | FDM 3D d
geodynamics3 c | 5.1 × 10^6 | 7.1 × 10^7 | 14.0 | FDM 3D d
geodynamics4 c | 5.4 × 10^6 | 7.5 × 10^7 | 14.0 | FDM 3D d
Name | Per-Process GPU Memory (GB) | SuperLU3D (CPU) (s) | SuperLU3D (CPU+GPU) (s) | SuperLU3D_Alternate (s) | Speedup: CPU+GPU vs. CPU | Speedup: Alternate vs. CPU
---|---|---|---|---|---|---
ldoor | 4.3 | 3.54 | 4.21 | 18 | 0.840855 | 0.196667
CoupCons3D | 4.3 | 4 | 15 | 20 | 0.266667 | 0.2
G3_circuit | 4.5 | 7 | 27 | 23 | 0.259259 | 0.304348
dielFilterV3real | 5.0 | 31 | 9 | 18 | 3.444444 | 1.722222
audikw_1 | 6.7 | 45 | 37 | 31 | 1.216216 | 1.451613
Serena | 10.8 | 252 | 67 | 106 | 3.761194 | 2.377358
nlpkkt80 | 11.8 | 333 | 108 | 88 | 3.083333 | 3.784091
geodynamics1 | 18.7 | 864 | 394 | 156 | 2.192893 | 5.538462
geodynamics2 | 30.3 | 1932 | 752 | 307 | 2.569149 | 6.29316
geodynamics3 | 30.3 | 2560 | 833 | 338 | 3.073229 | 7.573964
geodynamics4 | 30.3 | -- | -- | 896 | -- | --
Name | a | b | Exchange Time (s) | SuperLU3D (CPU+GPU) (s) | SuperLU3D_Alternate (s) | Time Saved (s) | c (s)
---|---|---|---|---|---|---|---
geodynamics1_case_A | 4.63 | 2 | 0 | 394 | 156 | 238 | 136.92
geodynamics1_case_B | 4.31 | 2 | 62.11 | 394 | 225 | 169 | 54.34
geodynamics2_case_A | 4.34 | 2 | 0 | 752 | 307 | 445 | 247.83
geodynamics2_case_B | 3.28 | 2 | 118.34 | 752 | 467 | 285 | 17.35
geodynamics3_case_A | 3.64 | 2 | 0 | 833 | 338 | 495 | 275.47
geodynamics3_case_B | 3.37 | 2 | 145.76 | 833 | 586 | 247 | 53.08
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chen, J.; Zhu, P. An Alternate GPU-Accelerated Algorithm for Very Large Sparse LU Factorization. Mathematics 2023, 11, 3149. https://doi.org/10.3390/math11143149