A Method of Ultra-Large-Scale Matrix Inversion Using Block Recursion

: Ultra-large-scale matrix inversion has been applied as the fundamental operation of numerous domains, owing to the growth of big data and matrix applications. Using cryptography as an example, the solution of ultra-large-scale linear equations over ﬁnite ﬁelds is important in many cryptanalysis schemes. However, inverting matrices of extremely high order, such as in millions, is challenging; nonetheless, the need has become increasingly urgent. Hence, we propose a parallel distributed block recursive computing method that can process matrices at a signiﬁcantly increased scale, based on Strassen’s method; furthermore, we describe the related well-designed algorithm herein. Additionally, the experimental results based on comparison show the efﬁciency and the superiority of our method. Using our method, up to 140,000 dimensions can be processed in a supercomputing center.


Introduction
With the growth of computer applications, matrix inversion has become a basic operation that is widely used in various industries. For example, online video service providers (such as YouTube) use various types of the matrices to store user and item information [1]; in satellite navigation and positioning, matrix inversion is used to solve positioning equations [2]; triangular matrix inversion is used in the fast algorithm of radar pulse compression (based on reiterative minimum mean-square error); and, matrices are also used in multivariate public key encryption [3], etc. Next, the application of matrix in cryptography is described in detail. In some of existing cryptographic algorithms, a matrix is used to directly encrypt a message. When the process is duplicated, it is found that the encrypted message can be obtained by solving the inversion of the encryption matrix. Additionally, some algorithms can be transfromed into multiple linear equations, represented as Ax = b, where A is an n × n matrix; x and b are n × 1 vectors. Subsequently, the equation can be solved by computing the inverse of the matrix A, denoted by A −1 , in order to obtain x = A −1 × b. For general matrices, some common inversion algorithms exist such as Gaussian elimination, Gauss-Jordan elimination [4], Cholesky decomposition [5], QR decomposition [6], LU decomposition [6], and so on. Meanwhile, other algorithms focus on special types of matrices, such as the positive definite matrix [7], tridiagonal matrix [8], adjacent pentadiagonal matric [9], and triangular matrix [10]. However, these algorithms for general matrices are computationally intensive and they require a cubic number of operations. Hence, many studies have been performed to reduce the complexity. In 1969, Strassen [11] presented a method that reduces the complexity from O(n 3 ) to O(n 2.808 ), where n denotes the order of the matrix. In 1978, Pan et al. [12] proved that the complexity can be less than O(n 2.796 ). Coppersmith and Winograd [13] were the first to reduce the index of n less than 2.5: the obtained time complexity was O(n 2.496 ). Despite various optimizations, the index of n had never been less than 2. However, with increasing matrix dimensions and the exponential growth of data volume, it is impossible to solve the problem that is mentioned above by using only these methods proposed previously and their optimizations. With the rapid development of parallel computing, the problem of designing an efficient algorithm for large scale distributed matrix inversion has garnered the significant attention of researchers.
Message Passing Interface (MPI) is a programming model that can effectively support parallel matrix inversion. Apart from MPI, plenty of new distributed computing technologies have emerged as platforms for large data processing tasks in recent years; the most popular ones are MapReduce and Spark, which exhibit outstanding scalability and fault-tolerance capabilitys. Xiang et al. [14] proposed a scalable matrix inversion method that was implemented on MapReduce and discussed some optimizations. Liu et al. [15] described an inversion algorithm using LU decomposition on Spark. In this paper, we propose a novel distributed matrix inversion algorithm for a large-scale matrix that is based on Strassen's original serial inversion scheme and implement it on the MPI. Essentially, it is a block-recursive method to decompose the original matrix into a series of small ones that can be processed on a single server. We present a detailed derivation and the detailed steps of the algorithm and then show the experimental results on the MPI. However, it is noteworthy that the primary aim of this paper is not to compare the performance of the MPI, Spark, and MapReduce. In summary: 1. We propose and implement a large-scale matrix block-recursive inversion algorithm based on the Strassen's method while using distributed and parallel technologies. 2. We describe the theorem used in our algorithm; furthermore, we prove the theorem and describe the algorithm steps. 3. Through numerous experiments, we prove that the proposed method is superior to LU decomposition and Gauss-Jordan elimination under the same conditions.
Note that the all operations for solving matrix inversion is based on finite fields and only for a square matrix.
The remainder of this paper is structured, as follows. Section 2 provides a literature review. Our method is presented in Setion 3 and the experimental evaluation results in Secion 4. In Section 5, we conclude the paper and discuss future works.

Related Work
Owing to the numerous existed methods for matrix inversion, we focus on only two representational methods: Gauss-Jordan Elimination and LU decomposition. In the rest of this section, we introduce the key ideas of the two methods briefly and analyze their deficiency in solving the inversion of high-order matrix as compared with our presented methods. Some of the other methods are applied to special matrices, e.g., SVD decomposition is used for the pseudo inversion of non-square matrix, and Cholesky decomposition is used in symmetric positive definite linear equations. Additionally, we omit some other methods that are substantially similar to the two most representative methods mentioned above. Furthermore, we selected the two methods, because they are the most frequently used in matrix inversion parallel computing of the existing papers.
At the beginning of the section, we give the basic definition of an inverse matrix: Definition 1. Given an n-th order square matrix A, if an n-th order square matrix B exists satisfying AB = BA = I n , where I n is the n-th indentity matrix, the matrix A is invertible, and its inverse matrix is B, written as A −1 .

Gauss-Jordan Elimination
Gauss-Jordan is an efficient, classic, and well-known method for solving low order matrix inversion. It has two common forms that are based on either row transformation or column transformation. The row-based version is more widely used in daily life. As the two methods have the same principle, we will only briefly introduce the row-based one here.
In the row transformation method, the given n × n matrix A is connected with I, which is an identity matrix having the same order as A, to obtain an augmented matrix W, denoted as w = [A|I], whose dimension becomes n × 2n. Then, we apply row transformation operations on the matrix W to convert W from [A|I] to [I|B], and the matrix B is equal to A −1 . The proof is shown below. As the operations performed on the matrix W contain only the row transformations, the entire process can be viewed as multiplying W by an invertible matrix P to satisfy PW = P[A|I] = [PA|P] = [I|B]. We can obtain P = B and PA = I and infer that BA = I. Therefore, the matrix B is equal to A −1 . The process can be divided into the following four steps: Step 1: transform the left part of the matrix W into an upper triangular matrix by applying row transformations from the top to the bottom. The specific steps are as follows: (1) Multiply the first row by a constant, such that a 11 becomes 1 (initially, if this element is zero, add any other row whose first element is not zero to this row). (2) Subtract the first row multiplied by a constant from the second row such that a 21 become zero.
Perform the same operation for the remaining rows, written as r i − r 1 × j. Step 2: transform the left part of the matrix W into a diagonal triangular matrix by performing row transformations from the bottom to the top. This is similar to step 1.
Step 3: multiply each line by a coefficient to transform the left part of the augmented matrix into an identity matrix.
Step 4: the right part of the augmented matrix W is the inverse of matrix A. Apparently, the time complexity of the program is O(n 3 ) without any parallelization. For the Gauss-Jordan, the operations on rows are sequential, and the parallelization of this method is not highly effective.

LU Decomposition
LU decomposition can be viewed as another form of Gauss elimination in essence. Multiple types of decomposition have been proposed, e.g., PLE decomposition, PLS decomposition, CUP decomposition, LQUP decomposition, and LUP decomposition. Among them, PLE decomposition breaks the initial matrix A into the product of permutation matrix P , unit lower triangular matrix L (i.e., the lower triangular matrix whose diagonal elements are all 1) and row elementary transformation matrix E. Meanwhile, CUP decomposition resolves A into the product of column elementary transformation matrix C, unit lower triangular matrix U, and permutation matrix P.
These forms have been shown to be equivalent: it is sufficient to use a permutation matrix to obtain different decomposition forms that are based on LU decomposition. Their mutual conversion required no domain operation; only the permutation operation is involved, which can be scaled to matrix multiplication. Hence, we only discuss the most direct LU decomposition. LU decomposition essentially transforms the original matrix A into an upper triangular matrix through elementary row transformation, and the transformation matrix is a unit lower triangular matrix. Inverting the two matrices respectively and multiplying their inverses to get the inverse of A is known as the Doolittle algorithm. In addition, the Crout algorithm only replaces the two decomposed matrices with a lower triangular matrix and unit upper triangular matrix, which is similar to the Doolittle algorithm in essence. We only present the Doolittle as an example. Note that it has been proven in linear algebra that LU decomposition is existent and unique provided that the square matrix is non-singular.
The critical step in LU decomposition is to obtain the 'L' and 'U'. Subsequently, we use Gauss-Jordan (or other methods) to get the inversion of 'L' and 'U'. The main purpose of decomposition is to reduce the computational cost of elimination and ease the storage within computer programs. The decomposed forms are shown below: As shown in Equation (1), we can deduce the following formulas according to the rules of matrix multiplication and the equality of matrices on both sides of the equations: Matrix L and U can be calculated by the formula above. Subsequently, we calculate the inverse of L and U, respectively, and multiply them to get the inverse of original matrix. The formula to calculation every element of L and U has been provided, and it is obvious that the time complexity is O(n 3 ). The worst complexity of multiplying matrices is still O(n 3 ). In conclusion, the complexity of LU decomposition is O(n 3 ) without parallelization. However, in the actual implementation of LU decomposition, the row-permuted matrix PA is decomposed instead of the original matrix A, when considering that the LU decomposition of the original matrix may fail to materialize in some cases. The permutation matrix P can render the factorization more stable. Therefore, the inverse matrix A −1 can be obtained by calculating U −1 L −1 P. Obviously, the complexity has not changed.
It is impractical to calculate the inversion by the formula above directly when the dimension of the matrix is extremely large. Liu et al. [13] presented a method that is based on LU decomposition in detail and we will only mention their formula for inversion. Their method first used the following equation: Here, is the permutation matrix P mentioned previously that renders the factorization to be more stable. Subsequently, performing multiplication for the equation above, the result are as follows: The details of the intermediate substitution are not unrolled here, and the final equations are the following: By comparing the most optimized LU decomposition (Algorithms 5-7 in [13]) with the proposed method, it is clear that in the primary parts the number of basic operation, such as multiplication and inversion, of the two methods has a difference and the cost of LU is larger than the latter.

Theoretical Demonstration and Algorithm Design
In this section, we present the detailed theoretical derivation that is required for our proposed method and describe the algorithm. The process can be divided into two parts: matrix completion and block recursive inversion. After partitioning the original matrix, we obtained four submatrices of the same order. Next, we analyze the process, formulate the theorem, and provide the revelant proofs of each part.

Matrix Completion
For a more general conclusion, we assume that the order of the given square matrix is 2 r × s, where r is a natural number and s is the order of the block matrix that can be decomposed on a single server. The original matrix does not always satisfy this criterion, and we must fill the matrix in this case. The completion lemma is as follows: Lemma 1. Fill the original matrix H with null matrix O and identity matrix I, whose order is subject to specific circumstances. The form after filling is as follows: H O O I . Apparently, the revision of filling matrix can be obtained easily: Subsequently, we obtain the inversion of the original matrix directly.

Block Inversion Theorem
In this section, we present and prove the theorem for obtaining the inversion of the original matrix by invertible submatrices. We first introduce the following lemmas before proving the inverse theorems.
Proof. Firstly, submatrix A is invertible, based on the preconditions; therefore, we have the following equation: Multiply both sides of this equation by it, and the following is obtaied:

Based on the equation, H is invertible and its inversion is
The following theorem can be proved similarly: By the lemmas above, we prove the following theorem: According to the definition of the invertible matrix, HH −1 = H −1 H = I. Performing the matrix multiplication results in the following: This results in the following equations: Based on Equation (2b), we have By substituting Equation (3) into Equation (2d), we obtain Subsequently, we can obtain Y by substituting Equation (4) into Equation (3): According to Equation (2a), we have By substituting Equation (6) into Equation (2c), we obtain Finally, we obtain the Z by substituting Equation (7) into Equation (6): To summarize, we obtain the inversion of H, as follows: Similarly, we can prove the following theorems:

Recursive Algorithm
Algorithm 1 shows the corresponding deterministic algorithm, and the partition of original matrix is presented in the Figure 1. As shown, we stored the partition process as a tree. For a more concise and clear figure, the first and last nodes illustrate the partition results at every layer, and the partition of the other blocks is similar. Although we have only divided it by a certain number of times, the actual number is much higher. Information regarding each node is represented in a three-tuple, written as (n, i, j), where n is the dimension of the matrix contained in the node; i and j are the row and column of the first element of the matrix, respectively. Finally, we attempt to find a path, which is called the inversion chain, where the matrix contained in each node is invertible.

Algorithm 1 The recursion algorithm.
Input: H → The original matrix; s → the dimension solved by a single server output: H −1 → the inversion of H step 1: For the given matrix H, determine whether it needs to be filled according to Lemma 1 and write the augmented matrix as H * ; step 2: Divide the n-dimensional H or H * obtained in the step 1 into A B C D ; step 3: Repeat the division of step 2 for each sub-block matrix until the dimension of them is less than s; step 4: Find the inversion chain, e.g., (s, q, k), · · · , (m, i, j), · · · , (n, 0, 0); step 5: Calculate the inversion of the parent nodes from the leaf (s, q, k) along the path sequentially, finally get the matrix M, i.e., inversion of root (n, 0, 0); step 6: if the matrix is not filled then goto step 7; else take the first n rows and the first n columns of M to form the new matrix, which is H −1 ; end if; step 7: Return H −1 .
It is noteworthy that the method that we adopted to obtain the inversion of the leaf node was Gauss-Jordan elimination. Theoretically, 2 × n × (n − 1) threads exist to reduce the time complexity from O(n 3 ) to O(n) when the program is executed in a supercomputing center. However, the real complexity is slightly more than O(n) because of some bottlenecks such as memory access. Meanwhile, we use the shared memory of the GPU to reduce excessive access times due to the large matrix order.
To more clearly describe the process, we take a 16-dimensional matrix as an example.

Experimental Evaluation
We performed experiments on personal computers in order to evaluate the performance of the proposed algorithm (later written as NovelIn) and compared it with the Gauss-Jordan Elimination (later written as G-J). In this section, except the comparison with the G-J, the performance about applying the algorithm to different matrix sizes is also presented. Finally, we present the results of executing the NovelIn on GPU clusters.

Experimental Environment
Firstly, we implement G-J and NovelIn on the personal computer, which has a 2.3 GHz Intel CPU with four physical cores and NVIDIA GTX960M. On the personal computer, we measure the runtime for solving the general matrix, whose order is not large than 20,000, to compare the performance of G-J and NovelIn. Owing to the limitations of the personal computer, the operations with high order matrices mentioned before were performed on a parallel computing platform, namely GPU clusters that were composed of servers that are deployed in a data center with well designed architecture. Each server has 128 GB RAM and two 2.4 GHz Inter Xeon CPUs with 20 physical cores.
In order to investigate the performance of the algorithms, we generated different matrices of different orders, as shown in Table 1.

Comparison of Performance
In consideration of the actual situation, we evaluate the performance of the two methods, i.e., G-J and NovelIn, on GF(2 8 ), GF(2 16 ) and GF(2 32 ). For GF(2 8 ), GF(2 16 ), which contain fewer elements, its matrix elements are inversed by looking up the tables, which consist of a positive table, storing the polynomial value of the element, and a negative table, storing the opposite. It is obvious that the memory that tables required is 512 B and 64 KB for GF (2 8 ) and GF(2 16 ) individually. For GF(2 32 ), the memory required is enormous; therefore, we take the algorithm of exponentiating by squaring in order to obtain the inversion of each element, and its time complexity is O(n 2 ). The specifical experimental data on the different finite fields are shown in the following tables. Table 2 presents the data used on GF(2 8 ). Most of our experiments are based on GF(2 8 ), when considering that GF(2 8 ) has the least number of elements, as shown in Table 2. For the G-J, we set the recursive threshold, namely the order solved by personal computer, to 100. We present the comparison in Figure 2 to show the effects of number of threads on runtime; we discovered that it is not as simple as cutting it in half when the number of threads increases from four to eight. In the case of four threads, we did not provide the running time of the 20,000 dimensional matrix because it is extremely large. Hence, we then set the number of threads to eight. The number followed by the word "NovelIn" is the corresponding recursive threshold when using NovelIn.  For the NovelIn, we set four recursive thresholds and measured their runtimes to obtain the best one. The result is presented in Figure 3a. It is not difficult to find that, with the decrease of recursive threshold, the runtime gap of the process, which deals with the same order matrix, becomes gradually unclear. Therefore, in the experiments based on the GF(2 16 ) and GF(2 32 ), the recursive threshold is set to 100 directly. Figure 3b shows that the runtime of NovelIn is superior than that of the other, and the gap becomes wider with a higher dimension of the matrix when the experimental environment remains the same. To reflect the advantage more directly, the speedup-the time required by G-J divided by the time required by NovelIn-is shown in Figure 3c. Comparing the performance of the two methods, it is evident that the performance of NovelIn is more superior. Next, we execute the algorithm in a supercomputing center. Table 3 shows the relative data based on all the finite fields. Given the capability of the supercomputing center, we set the recursive threshold to 2000. The results from the supercomputing center are presented in Figure 3d, and its effect is remarkable. For the 140,000 dimensional matrix, NovelIn only requires 4700 s in order to obtain its inversion.   We mentioned that the recursive threshold is set to 100 directly in the experiments based on GF(2 16 ) and GF(2 32 ); additionally, the relevant experimental results are listed in Table 4. The results are presented in Figures 4 and 5, respectively. According to Figure 4a, it is obvious that NovelIn performed better than the other. By contrast, Figure 5a shows an interesting result, in that the performance of NovelIn is inferior to that of the G-J when the matrix order is not large enough on the GF(2 32 ). Additionally, we depict the change in the speedup with the growth of the matrix order in the Figure 4b. As expected, the speedup shows an upward trend in general, which resembles the change on GF(2 8 ). Otherwise, Figures 4c and 5b show the time of applying NovelIn to a real high order matrix on the supercomputing center, where the specific data are shown in Table 3. The performance and scalability of NovleIn are apparent.

Conclusions
In this paper, we have focused on the problem of large-scale matrix inversion, which has become increasingly fundamental, owing to the constant growth of the data in various fields. We presented an efficient inversion algorithm for large-scale matrices and its implementation.
The key idea is to breakdown the large-scale matrices into a set of sub-blocks and then calculate the inversion by the proposed method. Our experimental evaluation demonstrated that the excellent performance of our method. Specifically, it performed better when executing the program on the supercomputing center. This algorithm, which is aimed at large-scale matrix inversion, may render some existing encryption algorithms no longer secure, such as an algebraic attack. Because the essence of an algebraic attack is matrix inversion, the method that is proposed in this paper can make the algorithms that rely solely on ultra-large-scale matrix to resist algebraic attacks no longer secure.
The analysis of evaluation results on clusters with varied dimensions demonstrated the good scalability of our algorithm. For the future work, we will attempt to utilize multi-GPU collaborative computing, such that matrices can be distributed to different GPUs, in order to further extend the practicability of this algorithm for larger matrices.