Acceleration of Approximate Matrix Multiplications on GPUs

Matrix multiplication is important in various information-processing applications, including the computation of eigenvalues and eigenvectors, and in combinatorial optimization algorithms. Therefore, reducing the computation time of matrix products is essential to speed up scientific and practical calculations. Several approaches have been proposed to speed up this process, including GPUs, fast matrix multiplication libraries, custom hardware, and efficient approximate matrix multiplication (AMM) algorithms. However, research to date has yet to focus on accelerating AMMs for general matrices on GPUs, despite the potential of GPUs to perform fast and accurate matrix product calculations. In this paper, we propose a method for improving Monte Carlo AMMs. We also give an analytical solution for the optimal values of the hyperparameters in the proposed method. The proposed method improves the approximation of the matrix product without increasing the computation time compared to the conventional AMMs. It is also designed to work well with parallel operations on GPUs and can be incorporated into various algorithms. Finally, the proposed method is applied to a power method used for eigenvalue computation. We demonstrate that, on an NVIDIA A100 GPU, the computation time can be halved compared to the conventional power method using cuBLAS.


Introduction
Computing matrix products is a fundamental and essential operation in various information-processing tasks [1][2][3]. For example, one can obtain the largest eigenvalue and the corresponding eigenvector of a positive definite matrix A by repeatedly computing y ← Ax and x ← y/|y| (the power method). In the power method, the computation of the matrix product accounts for most of the computational burden. The same applies to optimization algorithms [4][5][6]. Therefore, it is crucial to reduce the computation time of matrix products to speed up various scientific and practical calculations. A well-known method for accelerating matrix multiplication is to use GPUs. Other research has focused on implementing fast matrix multiplication libraries [7][8][9], designing custom hardware to accelerate specific classes of matrix multiplication [10][11][12], accelerating distributed matrix multiplication [13][14][15], designing exact algorithms of lower computational order based on the divide-and-conquer method [16][17][18], and designing efficient approximate matrix multiplication (AMM) algorithms [19][20][21]. These areas are still active research fields. Recently, an AI-powered method that uses reinforcement learning to discover matrix multiplication algorithms that are computationally less expensive than existing ones has been reported [22], building upon the classical study by Strassen [16].
For practical purposes, an exact matrix product is not always required. In this paper, we propose a method for such cases. It is known that when only an approximate result of the matrix product is required, AMM methods can reduce the computation time dramatically compared to other matrix acceleration approaches. MADDNESS, proposed in [20], provides high-speed approximate results. That paper demonstrated that MADDNESS, when executed on a CPU, can reduce the computation time by a factor of 100 compared to the exact matrix product while retaining sufficient computational accuracy for neural network inference. Allowing approximate calculations enables MADDNESS to achieve an overwhelming speed-up over other matrix optimization approaches. However, GPU computation can typically perform matrix products two orders of magnitude faster than CPU computation. MADDNESS is therefore practical for running neural networks in environments such as the Internet of Things (IoT), where power consumption and installation-space limitations make GPUs infeasible. On the other hand, if GPUs are available and there are no such limitations, obtaining the exact result by GPU computation is preferable to obtaining an approximate result by MADDNESS on a CPU.
A significant body of research proposes custom hardware to improve matrix multiplication performance. Hardware research often focuses on reducing energy consumption [23][24][25][26][27][28] and has introduced various methods to reduce the energy consumption of artificial intelligence (AI) algorithms while maintaining acceptable accuracy. One approach is approximate computing, which trades computational accuracy for improved energy efficiency; this strategy is particularly effective for AI algorithms, which exhibit inherent error tolerance. Several papers have proposed approximate multipliers that simplify multiplication operations and reduce energy consumption. Another approach uses logarithmic multipliers, which simplify multiplication while sacrificing some accuracy. This line of research significantly advances energy-efficient computing techniques tailored to AI algorithms, a central topic within computer science. Our proposed method concerns the approximate matrix multiplication algorithm and does not reduce the power consumption of a single FMA operation. However, it can reduce the overall computation time of an application, thus contributing to energy conservation.
To the best of our knowledge, no prior research other than a method for sparse matrices [29] has attempted to accelerate approximate matrix multiplication on GPUs, despite their capability for fast and precise computation of matrix products. Computations that can be efficiently accelerated by GPUs typically involve a large number of parallel executions of simple operations with regular memory access patterns. MADDNESS replaces matrix multiplication with table look-ups driven by hashing trees, which is considered unsuitable for GPU implementation. In addition, FD-AMM [30] incorporates the frequent directions algorithm [31] into AMM. However, the previous study [30] reported that FD-AMM requires an SVD iteration to generate a sketch matrix whenever one of the multiplied matrices changes, resulting in longer computation times than existing methods. Thus, traditional AMM algorithms, which were not designed for parallel processing efficiency on GPUs, cannot be expected to accelerate simply by implementing them on GPUs. Nevertheless, given that GPUs are already employed to expedite matrix products in numerous applications, it is advantageous to leverage such devices for approximate calculations as well. Accordingly, we identified a known Monte Carlo AMM that can exploit the parallel computing capabilities of GPUs. Additionally, we devised an algorithm that enhances the approximation performance of this existing method while minimizing the increase in computational complexity. Our numerical experiments demonstrate that the proposed approach can expedite the computation of eigenvalues, which is crucial in both scientific and engineering domains.

Theory of Monte Carlo AMM
In this section, we present the theory behind the Monte Carlo method [32] used to compute the approximate matrix multiplication (AMM) of matrices A ∈ R^{m×k} and B ∈ R^{k×n}. We denote the lth column and the lth row of a real matrix M by M^(l) and M_(l), respectively. The Frobenius norm of a matrix M is denoted by ∥M∥ and defined as the square root of the sum of the squares of all elements of the matrix. Furthermore, the Euclidean norm of a vector v is denoted by |v|.
Algorithm 1 presents the pseudocode for the Monte Carlo AMM. It is a randomized algorithm that returns an approximate solution Z that is close to the exact solution AB with high probability.

Algorithm 1 Monte Carlo AMM [32]
Require: A ∈ R^{m×k}, B ∈ R^{k×n}, and c ∈ N
Ensure: Z ≈ AB
1: Set p_1, . . . , p_k such that ∑_{l=1}^k p_l = 1
2: Z ← O
3: for t = 1, . . . , c do
4:   Pick i_t with P[i_t = l] = p_l
5:   Z ← Z + A^(i_t) B_(i_t) / (c p_{i_t})
6: end for

The expected squared Frobenius error between AB and Z is evaluated using the following equation:

E[∥AB − Z∥²] = (1/c) ∑_{l=1}^k |A^(l)|² |B_(l)|² / p_l − (1/c) ∥AB∥².   (2)

It has been proven that this equation is minimized when the probability p_l of sampling the lth column/row pair is set to its optimal value, defined as

p_l = |A^(l)| |B_(l)| / ∑_{l'=1}^k |A^(l')| |B_(l')|.   (3)

In this case, the error is given by

E[∥AB − Z∥²] = (1/c) (∑_{l=1}^k |A^(l)| |B_(l)|)² − (1/c) ∥AB∥².   (4)

The identity AB = ∑_{l=1}^k A^(l) B_(l) and Equation (3) imply that Algorithm 1 with the optimal Monte Carlo sampling assigns a higher probability to columns/rows with larger norms during sampling and computes their scaled products as an approximate solution.
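As a concrete illustration, Algorithm 1 with the optimal probabilities of Equation (3) can be sketched in NumPy as follows (an illustrative sketch, not the paper's CUDA implementation; the function name and interface are ours):

```python
import numpy as np

def monte_carlo_amm(A, B, c, seed=0):
    """Approximate A @ B by sampling c column/row outer products (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    k = A.shape[1]
    # Optimal sampling probabilities of Equation (3): p_l proportional to |A^(l)| |B_(l)|.
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = weights / weights.sum()
    idx = rng.choice(k, size=c, p=p)
    Z = np.zeros((A.shape[0], B.shape[1]))
    for i in idx:
        # Each rescaled outer product is an unbiased estimate of A @ B.
        Z += np.outer(A[:, i], B[i, :]) / (c * p[i])
    return Z
```

Each of the c rank-1 updates is independent of the others, which is what makes the scheme amenable to the parallel execution on GPUs discussed later.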

AMM with Preprocessing: Proposed Methods
A straightforward method to reduce the expected squared error is to increase the number of sampling iterations c. The error decreases inversely with the number of iterations. However, increasing c results in a longer computation time for the Monte Carlo AMM. Therefore, in this paper, we propose a method that reduces the expected squared error without significantly increasing the computation time, by further exploiting the information used in the conventional AMM.
Let α_i and β_i be the means of the elements of A^(i) and B_(i), respectively. We define the following two matrices:

X := A − e_m (α + x)ᵀ,  Y := B − (β + y) e_nᵀ,   (5)

where α := [α_1 . . . α_k]ᵀ, β := [β_1 . . . β_k]ᵀ, x, y ∈ R^k, and e_l is an l-dimensional vector whose elements are all one. That is, we obtain X and Y by subtracting the constant value α_i + x_i from the ith column of A and β_i + y_i from the ith row of B. Then,

AB = XY + X(β + y)e_nᵀ + e_m(α + x)ᵀY + e_m(α + x)ᵀ(β + y)e_nᵀ.   (6)

The second, third, and fourth terms in Equation (6) can be computed using matrix-vector and inner products instead of matrix-matrix products. By leveraging this fact, these terms can be calculated exactly with few operations for given matrices A and B and vectors x and y. Consequently, we evaluate an AMM of the matrices X and Y instead of directly using A and B. The expected error of this approximation then depends solely on the error of approximating XY. To simplify the notation, we define A′ := A − e_m αᵀ and B′ := B − β e_nᵀ. Notably, the matrices A′ and B′ are constant, and the crucial observation is that all column sums of A′ and all row sums of B′ are equal to zero. Algorithm 2 presents the pseudocode of our proposed method. The only differences between the proposed and the conventional method lie in the second and third lines. The second line involves only matrix-vector multiplications and inner products, which are far less computationally intensive than a matrix-matrix multiplication. Furthermore, the third line allows parallel execution for each pair of indices i and j. Consequently, the proposed method can be regarded as integrating low-cost preprocessing into the existing Monte Carlo AMM. This algorithm computes an approximate solution to AB, as shown by the following calculation.
Equation (7) can be regarded as the case of x = y = 0 in Equation (6). Remarkably, we can prove theoretically that the expected error of Algorithm 2 is equal to or less than that of Algorithm 1. Considering the discussion on Equation (4), the expected squared error for computing AB using Equation (6) together with the exact computation of the low-cost terms is proportional to

E := (∑_{l=1}^k |A′^(l) − x_l e_m| |B′_(l) − y_l e_nᵀ|)² − ∥XY∥²,   (8)

which is the quantity of Equation (4) evaluated for X and Y. The expected squared error can be represented by E/c. Hereafter, our objective is to establish the superior accuracy of the proposed method compared to the original Monte Carlo AMM. To achieve this, we demonstrate, through several theorems, that the global minimum of E occurs at x = y = 0.
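The preprocessing of Algorithm 2 can be sketched analogously (again an illustrative NumPy sketch with our own naming, not the paper's CUDA code): the exact correction terms of Equation (7) involve only matrix-vector and inner products, and the Monte Carlo sampling runs on the centered matrices A′ and B′:

```python
import numpy as np

def preprocessed_amm(A, B, c, seed=0):
    """Sketch of Algorithm 2: exact low-cost terms plus Monte Carlo AMM on centered matrices."""
    rng = np.random.default_rng(seed)
    m, k = A.shape
    n = B.shape[1]
    alpha = A.mean(axis=0)              # column means of A
    beta = B.mean(axis=1)               # row means of B
    # Exact terms of Equation (7): matrix-vector and inner products only.
    Z = np.outer(A @ beta, np.ones(n)) \
        + np.outer(np.ones(m), alpha @ B) \
        - (alpha @ beta) * np.ones((m, n))
    Ac = A - alpha                      # A': zero column sums
    Bc = B - beta[:, None]              # B': zero row sums
    weights = np.linalg.norm(Ac, axis=0) * np.linalg.norm(Bc, axis=1)
    p = weights / weights.sum()
    for i in rng.choice(k, size=c, p=p):
        Z += np.outer(Ac[:, i], Bc[i, :]) / (c * p[i])
    return Z
```

For inputs whose columns and rows have nonzero means (e.g., elements drawn uniformly from [0, 1]), the centered norms are much smaller than the original ones, so the error bound E shrinks accordingly.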
Theorem 1. Equation (8) is minimized at x_i = 0 if all elements in A′^(i) are zero.
Proof. Suppose that all elements of A′^(i) are zero for a certain i. Let z := [z_1 . . . z_k]ᵀ be the vector such that the following holds. For simplicity, we introduce the following notations.
Since we have A′^(i) − z_i e_m = 0, the following equalities hold.
In addition, we have the following, where ⟨·, ·⟩ denotes the Frobenius inner product.
We can easily see that |v|²|w|² = ∥vwᵀ∥² holds for vectors v and w. Combining this equality, the Cauchy–Schwarz inequality, and the triangle inequality, we obtain Equation (14). From Equations (12), (13), and (14), we have Equation (15). This inequality means that Equation (8) is minimized at x_i = 0. If all the elements in A^(i) are identical, then by the definitions of α and A′, all elements in A′^(i) are zero. Hence, applying Theorem 1, we can deduce that the value of x_i that yields the minimum of E is 0, which implies A′^(i) − x_i e_m = 0. Combining this equation with Equation (8), we can obtain the values of x_l (l ≠ i) that minimize E from the submatrices obtained by removing the ith column from A′ and the ith row from B′.
Therefore, to minimize E, we can eliminate the ith column of A′ and the ith row of B′ if all of their elements are identical. Subsequently, we can minimize the error E using the resulting submatrices, as presented in Equation (8). Hereafter, we denote the standard deviations of the elements of A^(i) and B_(i) by σ_i and δ_i, respectively, and assume that σ_i > 0 and δ_i > 0 for all i.
Theorem 2. The following equations hold for any i:

∂E/∂x_i |_{x=y=0} = 0,  ∂E/∂y_i |_{x=y=0} = 0.   (16)
Proof. Let [M]_{i,j} be the element in the ith row and jth column of a matrix M. For matrices A ∈ R^{m×k} and B ∈ R^{k×n}, we have the equalities in Equation (17). Using Equation (17), we can reformulate a term in Equation (8). Since the sum of each column of A′ and each row of B′ is zero, (A′^(l))ᵀe_m = 0 and B′_(l)e_n = 0 hold for any l. Applying this to the equality above, we obtain Equation (19). Since the average of the elements of A′^(i) is zero, we also have a corresponding equality, where [v]_i denotes the ith element of a vector v; from it, we obtain Equation (20). In the same way, we have Equation (21). Substituting Equations (19), (20), and (21) into Equation (8), we obtain Equation (22). From this, Equation (24) follows, and, in the same way, Equation (25). By combining Equation (24) with Equation (25), we can derive Equation (16).
Theorem 3. The largest eigenvalue of MMᵀ is less than or equal to ∥M∥² for any real nonzero matrix M. The equality holds if the rank of M is one.
Proof. Let λ_i be the ith largest eigenvalue of MMᵀ. Since MMᵀ is positive semidefinite, all its eigenvalues are non-negative, and we have

λ_1 ≤ ∑_i λ_i = tr(MMᵀ) = ∥M∥².

Therefore, the largest eigenvalue of MMᵀ is less than or equal to ∥M∥². The equality holds if and only if all eigenvalues other than the largest one are zero, which means that the rank of MMᵀ is one. Hereafter, we assume that the rank of MMᵀ is one. Let r be the rank of M ∈ R^{m×n}. We then have the singular value decomposition (SVD) of M as M := UΣVᵀ, where U ∈ R^{m×r}, Σ ∈ R^{r×r}, and V ∈ R^{n×r}. Because V has orthonormal columns and Σ is a diagonal matrix, MMᵀ = UΣ²Uᵀ holds. This is nothing but the eigendecomposition of MMᵀ, and it implies that the rank of MMᵀ is r as well. Since the rank of MMᵀ is one, we have r = 1; thus, the rank of M is one.

Theorem 4. The following equation holds for any combination of i and j:

∂²E/∂x_i ∂y_j |_{x=y=0} = 0.   (26)
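Theorem 3 can be checked numerically (an illustrative sketch; the matrices are arbitrary random examples):

```python
import numpy as np

rng = np.random.default_rng(0)

# General case: the largest eigenvalue of M M^T is at most the squared Frobenius norm.
M = rng.standard_normal((6, 4))
lam_max = np.linalg.eigvalsh(M @ M.T).max()
assert lam_max <= np.linalg.norm(M, "fro") ** 2 + 1e-9

# Rank-1 case: equality holds, since the single nonzero eigenvalue carries the whole trace.
M1 = np.outer(rng.standard_normal(6), rng.standard_normal(4))
lam1 = np.linalg.eigvalsh(M1 @ M1.T).max()
assert abs(lam1 - np.linalg.norm(M1, "fro") ** 2) < 1e-9
```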
Proof. From Equation (24), we have the first equality. Furthermore, a corresponding equality holds for any i ≠ j. Therefore, we have Equation (26).
Theorem 5. The Hessian matrices of E with respect to x and y, ∇²_xx E and ∇²_yy E, are positive semidefinite at x = y = 0.
Proof. We abbreviate the sum ∑_{l=1}^k σ_l δ_l for simplicity. Then, the following equalities hold.
Substituting these equalities into Equation (24), we can simplify ∂²E/∂x_i². Since [B′B′ᵀ]_{i,i} is equal to the square of the Euclidean length of B′_(i), we have [B′B′ᵀ]_{i,i} = nδ_i². Substituting this into Equation (31), we obtain Equation (32), which implies Equation (33). For i ≠ j, we can simplify ∂²E/∂x_i∂x_j in the same manner, which implies Equation (35). Combining Equations (33) and (35), we can express the Hessian matrix of E with respect to x at x = y = 0 as in Equation (36). Let Z be the diagonal and regular matrix Z := diag(√(δ_1/σ_1), . . . , √(δ_k/σ_k)). Therefore, nZ(∑_{l=1}^k σ_l δ_l)Z − B′B′ᵀ ⪰ O holds, which shows that ∇²_xx E is positive semidefinite at x = y = 0. In the same way, we can prove that ∇²_yy E is positive semidefinite at x = y = 0.

Theorem 6. The function E is locally optimal at x = y = 0.
Proof. The Hessian matrix of E can be expressed as

∇²E = [ ∇²_xx E  ∇²_xy E ; ∇²_yx E  ∇²_yy E ].

Theorem 4 provides ∇²_xy E = ∇²_yx E = O at x = y = 0. Furthermore, Theorem 5 states that ∇²_xx E and ∇²_yy E are positive semidefinite at x = y = 0, which means that ∇²E is also positive semidefinite at x = y = 0. Based on this positive semidefiniteness and Theorem 2, the point x = y = 0 is a locally optimal solution for the minimization of E.
The preceding discussion demonstrates that the point x = y = 0 constitutes a local optimum of E. In what follows, we establish that in almost all cases this point is a global optimum because E has only one local optimum, and that in the remaining cases the point is also globally optimal because all local optima are connected.

Theorem 7.
Let M be a rank-1 matrix and v be a vector satisfying MMᵀ = vvᵀ. Up to scaling, the vector v is the only eigenvector of MMᵀ corresponding to a nonzero eigenvalue. The matrix ∥M∥²I − MMᵀ has an eigenvalue of 0, and only a scaled version of v is its corresponding eigenvector.
Proof. Since M is a rank-1 matrix, MMᵀ is a rank-1 matrix as well and has only one nonzero eigenvalue. Combining this with ∥M∥²v = MMᵀv, we can see that v is the unique such eigenvector of MMᵀ. Moreover, based on the equality above, the matrix ∥M∥²I − MMᵀ has an eigenvalue of 0, and only a scaled version of v becomes its corresponding eigenvector.
Theorem 8. Let M be a rank-1 matrix, and let v be a vector satisfying MMᵀ = vvᵀ. Let Z be a regular and diagonal matrix. The vectors x that satisfy xᵀ(∥Z⁻¹M∥²Z² − MMᵀ)x = 0 are scaled versions of Z⁻²v.

Proof. Since M is a rank-1 matrix and Z is regular, Z⁻¹M is a rank-1 matrix as well. Let Q := ∥Z⁻¹M∥²I − (Z⁻¹M)(Z⁻¹M)ᵀ, and let ṽ be a vector satisfying ṽṽᵀ = (Z⁻¹M)(Z⁻¹M)ᵀ.
Theorem 7 implies that Q has an eigenvalue of 0 and that only a scaled version of ṽ is its corresponding eigenvector. Because (Zṽ)(Zṽ)ᵀ = MMᵀ holds, we have v = Zṽ. Thus, a scaled version of Z⁻¹v is an eigenvector of Q corresponding to the eigenvalue 0. Therefore, the following equality holds for any real value of c:

Q(cZ⁻¹v) = 0.
We can easily see that ZQZ(cZ⁻²v) = 0 holds, which implies that the matrix ZQZ has an eigenvalue of 0 and that only a scaled version of Z⁻²v is its corresponding eigenvector. Moreover, we have the following equalities:

ZQZ = Z(∥Z⁻¹M∥²I − (Z⁻¹M)(Z⁻¹M)ᵀ)Z = ∥Z⁻¹M∥²Z² − MMᵀ.

Therefore, the matrix ∥Z⁻¹M∥²Z² − MMᵀ has an eigenvalue of 0, and a scaled version of Z⁻²v is its corresponding eigenvector. Theorem 3 claims that Q is positive semidefinite, which implies that ZQZ is positive semidefinite as well. Thus, a vector x satisfying xᵀ(∥Z⁻¹M∥²Z² − MMᵀ)x = 0 is the zero vector or an eigenvector corresponding to the eigenvalue 0. Therefore, such x are scaled versions of Z⁻²v.

Theorem 9.
The function E is globally optimal at x = y = 0.
Proof. From Equation (24), we obtain ∇_x E, which gives the form of xᵀ∇_x E. We study the conditions that x must satisfy at a local solution of E. In that case, xᵀ∇_x E must be zero because ∇_x E = 0. Let Z be a diagonal and regular matrix. From Theorem 3, the largest eigenvalue of Z⁻¹B′B′ᵀZ⁻¹ is less than or equal to ∥Z⁻¹B′∥². Thus, we have Equation (42). In particular, the strict inequality holds if rank(B′) > 1.
(i) rank(B′) > 1, or rank(B′) = 1 and y is not a scaled version of a column vector of B′. Because the rank of B′ is greater than 1, we have ∥Z⁻¹B′∥²Z² − B′B′ᵀ ≻ O. Combining this inequality and Equation (42), we see that xᵀ∇_x E = 0 holds at x = 0 only. Therefore, if x is a local minimum of E, then ∇_x E = 0 holds and we have x = 0. Since the local minimum is limited to x = 0, it is also the global minimum.
(ii) rank(B′) = 1 and y is a scaled version of a column vector of B′. All rows of B′ are linearly dependent due to rank(B′) = 1. As y is a scaled version of a column vector of B′, we can express B′ as B′ = [y_1 u . . . y_k u]ᵀ by introducing a vector u. (44) By definition, the y_i are not all zero.
Substituting Equation (44) into Equation (42), we obtain Equation (45). The right-hand side of Equation (46) must be satisfied with a certain real value of c′, where c′ := c(|u|² + n)/(|u|² + 1). Solving this equation and substituting the result into Equation (32), we obtain the corresponding diagonal terms; we also obtain the corresponding equality for i ≠ j. Summarizing Equations (48), (49), and the equality above, and noting that the assumption on B′ and the definition of δ_i yield δ_i² + y_i² = y_i²(|u|² + 1) for all i, we have Equation (50). The discussion regarding the positive semidefiniteness of Equation (36) also demonstrates that the first term on the right-hand side of Equation (50) is positive semidefinite. Therefore, from the discussion of the conditions that x must satisfy at a local solution of E when y is held constant, the following conclusions emerge. Both cases (i) and (ii) yield the conclusion that x = 0 holds when E is minimized under the condition of a constant y. Similar observations hold for the conditions on y at a local solution of E. Hence, the function E attains global optimality at x = y = 0.

Assessing the Accuracy of Approximate Matrix Multiplications
We evaluate the approximation performance of the proposed method for matrix multiplication. First, we set the size of the matrices to m = n = k = 2^15. Then, we randomly determine each element of the matrices independently from a uniform distribution between 0 and 1.
Figure 1A evaluates the speedup ratio and the approximation performance obtained by applying the conventional and proposed methods for AMM. The horizontal axis compares the computation time of the approximate calculation with that of the exact matrix product calculation. In the experiments of this section, we implemented the CUDA program for AMM using CUDA 12.0, executed on an NVIDIA A100 GPU. The exact matrix product was computed using an API of cuBLAS (NVIDIA's official matrix library). The numerical precision was set to single-precision floating point. The vertical axis represents the median relative accuracy between an approximate matrix Z and the exact solution AB over 100 executions of each method. The relative accuracy is defined as 1 − ∥AB − Z∥/∥AB∥. Figure 1A demonstrates that the proposed method achieves higher acceleration while maintaining relative accuracy. For instance, at a relative accuracy of 0.95, the acceleration rate is 87.2 times for the conventional method and 443.2 times for the proposed method. Although the required relative accuracy depends on the application, the proposed method outperforms the conventional method at all levels of relative accuracy.

Figure 1. (A) The trade-off between the speedup ratio and the Frobenius norm error when calculating AMM using the conventional and proposed methods. The matrices are of size m = n = k = 2^15, with elements randomly generated from a uniform distribution between 0 and 1. The speedup ratio is the ratio of the time taken to calculate the matrix product exactly using cuBLAS on a GPU to the time taken to execute AMM. (B) Comparison of speedup ratios for achieving a relative accuracy of 0.95. As the matrix size increases, the difference in the speedup ratio increases.
Figure 1B displays the trend of the speedup ratio achieving a relative accuracy of 0.95 as the matrix size is changed from 2^10 to 2^15. The proposed method achieves higher speedup ratios than the conventional method regardless of the matrix size. Furthermore, the difference in the speedup ratio expands as the matrix size increases. If the matrix size is small, the exact product generally does not become the bottleneck of the entire calculation; the approximate product is required mainly for large-scale computations. The property of the proposed method that its advantage becomes more pronounced as the matrix size increases is therefore desirable.

Application to a Practical Problem: Computing Eigenvalues and Eigenvectors
Next, we evaluate the performance of the proposed methodology in practical applications. We focus on the computation of eigenvalues, particularly the largest eigenvalue. Several algorithms have been proposed for computing the largest eigenvalue, including power iteration, inverse power iteration [33], accelerated stochastic power iteration [34], and the LOBPCG method [35]. This paper applies the proposed AMM to power iteration, which forms the backbone of many existing techniques.
We assume for simplicity that the matrix A ∈ R^{m×m} is symmetric; then all eigenvalues of A are real. See Algorithm 3. For any square matrix A, if λ is an eigenvalue of A, then A² has the eigenvalue λ². Therefore, we substitute A² for A in Algorithm 3.

Algorithm 3 Power method
Require: A ∈ R^{m×m}, x ≠ 0: m-dimensional randomized unit vector
Ensure: λ: the eigenvalue of A with the largest absolute value
1: while not converged do
2:   y ← Ax
3:   λ ← xᵀy
4:   x ← y/|y|
5: end while
6: Return λ

In this case, the second line is expressed as y ← A²x, and we apply the proposed AMM to this step.
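For reference, the exact power method with the A² substitution can be sketched as follows (illustrative NumPy code; the function name and the fixed iteration count are ours):

```python
import numpy as np

def power_method_squared(A, iters=100, seed=0):
    """Estimate the largest absolute eigenvalue of a symmetric A via y <- A^2 x."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    x = rng.standard_normal(m)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = A @ (A @ x)          # y <- A^2 x (two matrix-vector products here)
        lam = x @ y              # Rayleigh quotient for A^2
        x = y / np.linalg.norm(y)
    return np.sqrt(lam)          # an eigenvalue lambda of A gives lambda^2 for A^2
```

Squaring doubles the eigenvalue gap ratio per step, which is what makes the substitution attractive when the product A²x itself can be approximated cheaply.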
In this section, we assess the performance of the proposed methodology using principal component analysis (PCA) [36], one of the applications of eigenvalue computation, as a representative example. PCA is a method for examining and reducing the dimensionality of a dataset. It identifies the principal components, linear combinations of the original features that capture the most substantial variations within the data. The eigenvalues and eigenvectors of a covariance matrix play a pivotal role in PCA. Given a dataset comprising p features, the initial step of PCA is the computation of the covariance matrix, a symmetric p-by-p matrix whose elements denote the covariances between pairs of features. Finding the most dominant principal component is tantamount to identifying the eigenvector with the largest eigenvalue of the covariance matrix. Hence, in this section, we focus on image analysis, an area in which PCA finds frequent application. Specifically, we obtained the largest eigenvalue and the corresponding eigenvector of a matrix of size 11750 × 11750, acquired via the eigenface algorithm [37] on the LFW dataset [38].
Figure 2A illustrates the variation in convergence over time when CUDA programs of the power method (PM) are run on an NVIDIA A100 GPU. Both panels share a common horizontal axis representing the computation time.

Figure 2. Temporal convergence curves of power methods. The differences between the estimated largest eigenvalue λ̃max and its corresponding eigenvector ṽmax at each iteration of the power method from their respective true values, denoted as λmax and vmax, are plotted on the vertical axis of each panel.
The vertical axis in Figure 2A represents the relative error between the largest eigenvalue and the estimated eigenvalue during the iterations. The vertical axis in Figure 2B represents the deviation between the eigenvector corresponding to the largest eigenvalue and the estimated eigenvector during the iterations. Incorporating the proposed AMM into the power method requires the execution of the first line of Algorithm 4, resulting in a longer duration until the completion of the initial iteration compared to the power method incorporating the conventional AMM. However, since this computation needs to be performed only once, its impact on the overall computation time is limited. Compared to the power method with the conventional AMM, the method with the proposed AMM demonstrates favorable convergence in both indicators. Nevertheless, the power method with the proposed AMM cannot achieve the same level of convergence as the one employing exact matrix multiplication. This is because the expected variance of the AMM violates the condition revealed in previous studies on the noisy power method [39].
To solve this problem, we propose a variance reduction method that performs exact matrix multiplication when certain criteria are met. When obtaining the approximate matrix product of the matrices A and B + ΔB, if C = AB is known, we can compute the approximate solution ΔC ≈ A·ΔB and provide C + ΔC as an approximate solution. As mentioned in Section 2, the expected error decreases if the Frobenius norm of ΔB is small. The estimated eigenvector converges gradually through the power method. Therefore, computing the exact matrix product only occasionally while minimizing the expected error of the AMM is expected to be effective. Based on this motivation, we present Algorithm 5. It uses the proposed AMM while performing exact matrix multiplication only at the steps that satisfy the given conditions.
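The reuse idea can be sketched as follows (an illustrative NumPy sketch with hypothetical names, not Algorithm 5 itself): when C = AB is already known and B changes by a small ΔB, only the difference is approximated:

```python
import numpy as np

def incremental_amm(A, C_known, dB, c, seed=0):
    """Given C_known = A @ B, approximate A @ (B + dB) as C_known + AMM(A, dB)."""
    rng = np.random.default_rng(seed)
    k = A.shape[1]
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(dB, axis=1)
    p = weights / weights.sum()
    dC = np.zeros_like(C_known)
    for i in rng.choice(k, size=c, p=p):
        dC += np.outer(A[:, i], dB[i, :]) / (c * p[i])
    # The expected error scales with the norm of dB, so small updates stay accurate.
    return C_known + dC
```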
Various conditions can be considered for performing exact matrix multiplication. For instance, we can employ a static strategy in which an exact computation is performed every few steps. Alternatively, a dynamic strategy is also promising, in which convergence stagnation due to approximation errors is detected and an exact computation is then executed. In this section, we adopt a hybrid approach that combines the AMM with exact computation based on this dynamic strategy. To assess the convergence of the eigenvector calculation, we adopt the following rules:
1. If the dot product between the current and previous vectors x exceeds a threshold value θ, then we execute the exact matrix product at the next step.
2. Even if the first condition is met, the subsequent calculation employs the AMM if the exact matrix multiplication was executed in the previous step.
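These two rules can be expressed compactly (an illustrative sketch; the function name is ours):

```python
def use_exact_next_step(dot_with_prev, theta, prev_step_was_exact):
    """Decide whether the next power-method step uses the exact matrix product.

    Rule 1: near convergence (dot product above theta), schedule an exact step.
    Rule 2: never run two exact steps in a row; fall back to the AMM.
    """
    return dot_with_prev > theta and not prev_step_was_exact
```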
The experimental results are presented in Figure 3. We set the number of samplings c to 512 and executed each method 1000 times while changing random seeds. In this figure, the markers represent the median, while the shaded region represents the interquartile range from the first to the third quartile. The top row corresponds to θ = 1 − 10⁻⁴, and the bottom row to θ = 1 − 10⁻³. The left panels depict the relative error between the true largest eigenvalue λmax and the estimated largest eigenvalue λ̃max. The right panels illustrate the discrepancy between the eigenvector corresponding to the largest eigenvalue, denoted vmax, and the estimated eigenvector ṽmax. The threshold is a hyperparameter, and it is not obvious how to determine an appropriate value for it; the method should therefore exhibit stable convergence for various threshold values. For θ = 1 − 10⁻⁴, the power method with the conventional AMM fails to converge. In contrast, the power method with the proposed AMM demonstrates convergence performance comparable to the power method employing exact matrix multiplication while achieving a reduced computation time. When the threshold is changed to θ = 1 − 10⁻³, the number of steps involving exact matrix multiplication increases. Consequently, even the power method with the conventional AMM exhibits good convergence. However, the increased frequency of exact matrix multiplications requires more computation time than the power method with the proposed AMM. Thus, the power method with the proposed AMM demonstrates more robust convergence than the one utilizing the conventional AMM. Moreover, it is faster than the power method using only exact matrix multiplication. For instance, when θ = 1 − 10⁻³, the computation time until reaching 1 − (vmaxᵀṽmax)² = 10⁻¹¹ is reduced by approximately half.

Discussion
One of the challenges of the proposed method is that its effectiveness diminishes when the column means α of matrix A and the row means β of matrix B are close to zero. In practical applications and datasets, matrices often exhibit element distributions with specific tendencies or possess properties such as positive semidefiniteness, thereby deviating from the scenarios where the proposed method is less effective. However, these conditions do occur in specific applications, which then require alternative methods to improve the approximation accuracy.
In this paper, we applied the AMM to the simple power method and computed the largest eigenvalue. By incorporating QR decomposition into the iterations of the power method, we can obtain the first several eigenvectors [39]. Furthermore, by extending the application of the AMM to other variants of the power method discussed in this paper, we anticipate accelerating various eigenvalue computations. As mentioned in the previous section, some prior studies on the noisy power method analyze the convergence behavior when noise is introduced in each iteration. Although we fixed the threshold θ empirically in this paper, it is possible to adjust θ analytically or dynamically based on such prior research. This remains a topic for future investigation.
Besides the power method, least-squares minimization can also be considered among the applications of AMM. Previous research has explored accelerating the analysis of large-scale single nucleotide polymorphisms by utilizing approximate calculations to reduce computational complexity, as the use of pseudo-inverse matrices for obtaining approximate

AB = A′B′ + (A′β)e_nᵀ + e_m(αᵀB′) + (αᵀβ)e_m e_nᵀ
   = A′B′ + (Aβ)e_nᵀ + e_m(αᵀB) − (αᵀβ)e_m e_nᵀ.   (7)

Algorithm 2 Proposed method: Monte Carlo AMM with preprocessing
Require: A ∈ R^{m×k}, B ∈ R^{k×n}, and c ∈ N
Ensure: Z ≈ AB
1: α ← column means of A, β ← row means of B
2: Z ← (Aβ)e_nᵀ + e_m(αᵀB) − (αᵀβ)e_m e_nᵀ
3: A′ ← A − e_m αᵀ, B′ ← B − β e_nᵀ
4: Set p_1, . . . , p_k such that ∑_{l=1}^k p_l = 1
5: for t = 1, . . . , c do
6:   Pick i_t with P[i_t = l] = p_l
7:   Z ← Z + A′^(i_t) B′_(i_t) / (c p_{i_t})
8: end for

is zero due to xᵀ∇_x E = 0. Theorem 8 claims that the vectors x satisfying xᵀ(∥Z⁻¹B′∥²Z² − B′B′ᵀ)x = 0 are scaled versions of Z⁻²v, where vvᵀ = B′B′ᵀ. The equality B′ = y[uᵀ √n] gives us B′B′ᵀ = (|u|² + n)yyᵀ, which means v = √(|u|² + n) y. Thus,

From Equation (7), we have

A² = AAᵀ = A′A′ᵀ + (Aα)e_mᵀ + e_m(Aα)ᵀ − (αᵀα)e_m e_mᵀ.   (53)

We execute the calculation of A′A′ᵀ in Equation (53) using Algorithm 2. By incorporating this AMM into Algorithm 3, we obtain Algorithm 4, a power iteration algorithm that utilizes the approximation. The fourth to eighth lines of Algorithm 4 correspond to the Monte Carlo AMM of the proposed method.

Algorithm 4 Power method with the proposed Monte Carlo AMM
Require: A: m × m symmetric positive-semidefinite matrix, x ≠ 0: m-dimensional randomized unit vector, c ∈ N
Ensure: λ: the largest absolute eigenvalue of A
1: z ← Aα, v ← z − |α|²e_m
2: Set p_1, . . . , p_m such that ∑_{l=1}^m p_l = 1
3: while not converged do
4:   y ← e_m(xᵀv) + z(xᵀe_m)
5:   for t = 1, . . . , c do
6:     Pick i_t with P[i_t = l] = p_l
7:     y ← y + A′^(i_t)(A′^(i_t))ᵀ x / (c p_{i_t})
8:   end for
9:   λ ← xᵀy
10:  x ← y/|y|
11: end while
12: Return √λ
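A simplified NumPy sketch of this scheme follows (our own reading of Algorithm 4; the sampling probabilities, names, and fixed iteration count are illustrative assumptions):

```python
import numpy as np

def power_method_amm(A, c, iters=60, seed=0):
    """Power method on A^2 where the A'A'^T x product is approximated by sampling."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    alpha = A.mean(axis=0)                 # column means (A is symmetric)
    Ac = A - alpha                         # A': zero column sums
    z = A @ alpha
    v = z - (alpha @ alpha) * np.ones(m)
    col = np.linalg.norm(Ac, axis=0)
    p = col**2 / (col**2).sum()            # sample columns of A' by squared norm
    ones = np.ones(m)
    x = rng.standard_normal(m)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = ones * (x @ v) + z * (x @ ones)                 # exact rank-1 terms of Eq. (53)
        for i in rng.choice(m, size=c, p=p):
            y += Ac[:, i] * (Ac[:, i] @ x) / (c * p[i])     # Monte Carlo estimate of A'A'^T x
        lam = x @ y
        x = y / np.linalg.norm(y)
    return np.sqrt(lam)
```

Only the zero-mean part A′A′ᵀ is sampled; the dominant rank-1 terms are handled exactly, which is why the variance stays small for matrices with nonzero means.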

Figure 3. Temporal convergence curves of variance reduction power methods. The differences between the estimated largest eigenvalue λ̃max and its corresponding eigenvector ṽmax at each iteration of the power method from their respective true values, denoted as λmax and vmax, are plotted on the vertical axis of each panel. The top row (A,B) corresponds to θ = 1 − 10⁻⁴, while the bottom row (C,D) corresponds to θ = 1 − 10⁻³. The notation θ represents the threshold for switching between exact and approximate matrix products in the power methods; the details are described in the main text.
∇²_xx E ⪰ O holds if x is a scaled version of the vector [σ_1 . . . σ_k]ᵀ. In other words, the set of local solutions is globally minimal, and x = 0 also becomes a global minimum.
• If x is a local solution, it must be a scaled version of the vector [σ_1 . . . σ_k]ᵀ.
• All local solutions are connected and have the same value of E.