Abstract
The randomized coordinate descent (RCD) method is a simple but powerful approach to solving inconsistent linear systems. In order to accelerate this approach, the Nesterov accelerated randomized coordinate descent (NARCD) method is proposed. The randomized coordinate descent with momentum (RCDm) method was proposed by Nicolas Loizou; we provide a new convergence bound for it. The global convergence rates of the two methods are established in this paper. In addition, we show that the RCDm method has an accelerated convergence rate when a proper momentum parameter is chosen. Finally, in numerical experiments, both the RCDm and the NARCD are faster than the RCD for uniformly distributed data. Moreover, the NARCD has a better acceleration effect than both the RCDm and the Nesterov accelerated stochastic gradient descent method, and the stronger the linear correlation among the columns of A, the better the acceleration of the NARCD.
MSC:
65F10; 65F45
1. Introduction
Consider a large-scale overdetermined linear system
$$Ax = b, \qquad (1)$$
where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$. We can solve the associated least-squares problem $\min_{x \in \mathbb{R}^{n}} \|Ax - b\|_2^2$. We assume that the columns of A are normalized:
$$\|A^{(i)}\|_2 = 1, \quad i = 1, 2, \ldots, n. \qquad (2)$$
This assumption has no substantial impact on the implementation costs: we could simply normalize each column the first time the algorithm encounters it. However, we do not build assumption (2) into the algorithms, and we include the appropriate normalization factors where needed. Regardless of whether normalization is performed, our randomized algorithms yield the same sequence of iterates.
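For instance, the columns can be scaled up front (rather than lazily) as in the following sketch; the variable names are illustrative.

```matlab
% Normalize the columns of A as in (2), keeping the scaling factors so that
% a solution of the scaled system can be mapped back to the original one.
d  = sqrt(sum(A.^2, 1));     % column norms (assumed nonzero)
An = A ./ d;                 % column-normalized matrix (implicit expansion)
% If y solves min ||An*y - b||_2, then x = y ./ d(:) solves the original problem.
```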
The coordinate descent (CD) technique [1], which can also be obtained by applying the conventional Gauss–Seidel iteration method to the following normal equation [2], is one of the iteration methods that may be used to solve problem (1) cheaply and effectively:
$$A^T A x = A^T b,$$
and it is also equivalent to the unconstrained quadratic programming problem
$$\min_{x \in \mathbb{R}^{n}} f(x) := \tfrac{1}{2}\|Ax - b\|_2^2.$$
From [1], we can obtain the following RCD update: at iteration k, a column index $i_k$ is chosen at random and f is minimized exactly along that coordinate, which gives
$$x_{k+1} = x_k + \frac{(A^{(i_k)})^T (b - A x_k)}{\|A^{(i_k)}\|_2^2}\, e_{i_k}.$$
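For concreteness, a minimal MATLAB sketch of this update is given below; the function name, interface, and fixed iteration count are illustrative assumptions, not the exact code used in our experiments.

```matlab
function x = rcd(A, b, maxit)
% Minimal sketch of the randomized coordinate descent (RCD) update above.
% The residual r = b - A*x is maintained incrementally, so each step touches
% only one column of A.
[~, n] = size(A);
x = zeros(n, 1);
r = b;                               % residual for the starting point x = 0
for k = 1:maxit
    i  = randi(n);                   % column index, uniform on {1,...,n}
    ai = A(:, i);
    step = (ai'*r) / (ai'*ai);       % exact minimization along coordinate i
    x(i) = x(i) + step;
    r = r - step*ai;                 % keep r = b - A*x without a full product
end
end
```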
For solving problem (1), the coordinate descent approach has a long history in optimization and has found applications in a wide range of fields, such as biological feature selection [3], machine learning [4], protein structure [5], tomography [6,7], and so on. Inspired by the randomized coordinate descent (RCD) method, many related works have been presented, such as greedy versions of randomized coordinate descent [8,9] and block versions of randomized coordinate descent [10,11,12]. The coordinate descent method is a column projection method, while the Kaczmarz [13] method is a row projection method. The RCD method is inspired by the randomized Kaczmarz (RK) [14] method. For Kaczmarz-type methods, a lot of relevant work has also been conducted; readers can refer to [15,16,17,18,19,20].
In this paper, we use two approaches to accelerate the RCD method for solving large systems of linear equations. First, we obtain an accelerated RCD method by adding Nesterov's acceleration mechanism to the traditional RCD algorithm, called the Nesterov accelerated randomized coordinate descent (NARCD) method. It is well known that, by using an appropriate multi-step technique [21], the traditional gradient method can be turned into a faster scheme. Nesterov later adapted this accelerated scheme to unconstrained minimization problems with strongly convex objectives [22]. Second, we apply the heavy ball method (momentum method) to accelerate the RCD. Polyak introduced the heavy ball method [23], which is a common approach for speeding up the convergence rate of gradient-type algorithms, and many researchers have investigated its variants; see [24]. Applying these two acceleration techniques to the RCD yields the Nesterov accelerated randomized coordinate descent (NARCD) method and the randomized coordinate descent with momentum (RCDm) method.
In this paper, for a given positive semidefinite matrix M, $\|x\|_M$ is defined as $\sqrt{x^T M x}$, and $\langle \cdot, \cdot \rangle$ and $\|\cdot\|_2$ stand for the scalar product and the spectral norm, respectively; $e_i$ denotes the column vector with 1 at the ith position and 0 elsewhere. In addition, for a given matrix A, $A^{(i)}$, $\|A\|_F$, $\sigma_{\min}(A)$, and $A^T$ are used to denote its ith column, its Frobenius norm, its smallest nonzero singular value, and its transpose, respectively, and $A^{\dagger}$ is the Moore–Penrose pseudoinverse of A. Let us denote by $i_k$ the index randomly generated at iteration k, and let $\mathcal{F}_k$ denote all random indices generated at or before iteration k, so that
the sequences $x_k$, $y_k$, $v_k$ are determined by $\mathcal{F}_{k-1}$. In the remainder of the paper, we use $\mathbb{E}_k[\cdot]$ to denote the conditional expectation of a random variable with respect to the index $i_k$, given $\mathcal{F}_{k-1}$, so that $\mathbb{E}[\cdot] = \mathbb{E}\big[\mathbb{E}_k[\cdot]\big]$.
The organization of this paper is as follows. In Section 2, we propose the NARCD method naturally and prove the convergence of the method. In Section 3, we propose the RCDm method and prove its convergence. In Section 4, to demonstrate the efficacy of our new methods, several numerical examples are offered. Finally, we present some brief concluding remarks in Section 5.
2. Nesterov’s Accelerated Randomized Coordinate Descent
The NARCD algorithm applies the Nesterov accelerated procedure [22], which is better known in the context of the gradient descent algorithm. The Nesterov acceleration scheme creates three sequences, $\{x_k\}$, $\{y_k\}$, and $\{v_k\}$. When applied to the minimization of f, gradient descent sets $x_{k+1} = x_k - \omega \nabla f(x_k)$, where $\nabla f$ is the objective gradient and $\omega$ is the step-size. We define the following iterative scheme:
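A sketch of this scheme, following the standard formulation in [22,25,26], is given below; the precise parameter recursions used by NARCD are those of Algorithm 1, and the placement of the step-size is our reading of those references rather than a verbatim reproduction.

```latex
\begin{aligned}
y_k     &= \alpha_k v_k + (1-\alpha_k)\, x_k,\\
x_{k+1} &= y_k - \omega \nabla f(y_k),\\
v_{k+1} &= \beta_k v_k + (1-\beta_k)\, y_k - \gamma_k \nabla f(y_k).
\end{aligned}
```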
The key point of the above scheme is that it employs suitable values for the parameters $\alpha_k$, $\beta_k$, and $\gamma_k$, resulting in improved convergence over traditional gradient descent. In [25], the Nesterov-accelerated procedure is applied to the Kaczmarz method, which is a row-action method; the RCD is a column-action method, and the Nesterov-accelerated procedure can be applied in the same way. The relationship between the parameters $\alpha_k$, $\beta_k$, and $\gamma_k$ is given in [22,25]. Now, using the general setup of Nesterov's scheme, we obtain the NARCD algorithm (Algorithm 1).
The framework of the NARCD method is given as follows.
Algorithm 1 Nesterov’s accelerated randomized coordinate descent method (NARCD)
Input: , , , , .
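As a complement to Algorithm 1, the following MATLAB sketch shows one way the NARCD iteration can be implemented. It is a minimal sketch: the parameter recursion follows the accelerated schemes of [25,26] adapted to coordinate (column) updates, and the function name, interface, and handling of unnormalized columns are illustrative assumptions rather than the exact code used in our experiments.

```matlab
function x = narcd(A, b, lambda, maxit)
% Minimal sketch of Nesterov-accelerated randomized coordinate descent.
% The recursion for gamma, alpha, and beta follows [25,26], adapted from
% rows to columns; it is an illustrative reading of Algorithm 1.
[~, n] = size(A);
x = zeros(n, 1);  v = x;  gamma = 0;
for k = 1:maxit
    % gamma_k is the larger root of g^2 - g/n = (1 - g*lambda/n)*gamma_{k-1}^2
    gamma = max(roots([1, (lambda*gamma^2 - 1)/n, -gamma^2]));
    alpha = (n - gamma*lambda) / (gamma*(n^2 - lambda));
    beta  = 1 - gamma*lambda/n;
    y  = alpha*v + (1 - alpha)*x;
    i  = randi(n);                     % column index, uniform on {1,...,n}
    ai = A(:, i);
    nrm2 = ai'*ai;
    if nrm2 > 0
        g = (ai'*(A*y - b)) / nrm2;    % coordinate i of the step at y
    else
        g = 0;                         % skip empty columns (possible for sparse A)
    end
    x = y;  x(i) = x(i) - g;           % x_{k+1} = y_k - g*e_i
    v = beta*v + (1 - beta)*y;
    v(i) = v(i) - gamma*g;
end
end
```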
Remark 1.
In order to avoid computing matrix-vector products (in steps 7 and 8), we adopt the following updates
and , . At the same time, we can use this quantity to estimate the residual.
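One possible realization of this bookkeeping, written against the narcd sketch above, is the following; the maintained quantities and their names are our reading of the remark rather than the paper's verbatim updates.

```matlab
% Inside the NARCD loop, maintain Ax = A*x and Av = A*v incrementally
% (both initialized as A*x0) instead of forming A*y from scratch each step.
Ay = alpha*Av + (1 - alpha)*Ax;                 % since y = alpha*v + (1-alpha)*x
g  = (A(:, i)'*(Ay - b)) / (A(:, i)'*A(:, i));  % coordinate step at y
Ax = Ay - g*A(:, i);                            % because x_{k+1} = y - g*e_i
Av = beta*Av + (1 - beta)*Ay - gamma*g*A(:, i); % same update applied to A*v
% norm(Ay - b) also gives a cheap estimate of the residual.
```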
Lemma 1.
For any solution to , and , . We can obtain
where the random variable i is uniformly distributed on the set {1, 2, ..., n}.
Proof.
Using to denote the expectation with respect to the index i, we have
where the last equality uses . □
Lemma 2.
For any , we have
where the random variable i is uniformly distributed on the set {1, 2, ..., n}.
Proof.
We write the compact singular value decomposition of A as $A = U \Sigma V^T$, where , r is the rank of A, and Σ is diagonal with positive diagonal entries, and we can obtain .
The sixth equality is a consequence of trace(ABC) = trace(BCA), and , , so the last inequality holds. □
Lemma 3.
As all , the following definition:
for both sequences , lie in the interval if and only if satisfies the following property:
and , if , then .
Proof.
The first part of the lemma clearly holds. For the second part, recall from (5) that is the larger root of the following convex quadratic function:
we note the following:
which together imply that . □
Lemma 4.
Let a, b, and c be arbitrary vectors in ; then the following identity holds:
Theorem 1.
Consider the coordinate descent method with Nesterov's acceleration for solving the linear system (1), and let be the least-squares solution. Define and ; then, for all , we have the following:
and
Proof.
We follow the standard notation and steps shown in [22,25]; by (5) and (6), the following relation holds:
From (5) and (7), we have
Now, let us define . Then we have
Now, we divide (15) into three parts and simplify them separately. From the convexity of and in Lemma 3, we know . So the first part of (15) is as follows:
where the last inequality makes use of . Using Lemmas 1 and 2, the second part of (15) is as follows:
We use identity (8) in the last part of our proof. Taking expectations in the last part of (15), we obtain
where the fifth and sixth equalities make use of (13) and (10), respectively. Substituting the three parts (16)–(18) into (15), we have
where the last equality is the consequence of (14). Let us define two sequences , as follows:
we know that , and , . We have . Because of , we have . Moreover, by Lemma 3, , so that is also an increasing sequence. Now, multiplying both sides of (19) by and using (20), we have
and then
So, by (22), we can obtain
We now need to analyze the growth of the two sequences and . Following the proofs in [22,26] for the Nesterov accelerated scheme and the accelerated sampling Kaczmarz Motzkin algorithm [25], we have
This implies that
then
Moreover, because the and the are increasing sequences, we can simplify them and obtain
Similarly, we have
where the second equality uses (14) and the third equality uses (20). Using the above relationship, we have
Therefore,
By combining the two expressions of (25) and (24), we have
The Jordan decomposition of the matrix in the above expression is
Here, and . Because of , we have
The above relationship gives us the growth bound for the sequences and . Substituting these above bounds in (23), we have
and we have completed the proof. □
Remark 2.
From the relationship between , , and , , we know that
and
where is the smallest nonzero singular value of A. By the above inequality and Theorem 1, we have
3. Randomized Coordinate Descent with Momentum Method
The iterative formula of the gradient descent (GD) method is as follows:
$$x_{k+1} = x_k - \omega \nabla f(x_k),$$
where $\omega$ is a positive step-size parameter. Polyak [23] proposed the gradient descent method with momentum (GDm) by introducing a momentum term $\delta (x_k - x_{k-1})$, which is also known as the heavy ball method:
$$x_{k+1} = x_k - \omega \nabla f(x_k) + \delta (x_k - x_{k-1}),$$
where $\delta \geq 0$ is a momentum parameter. Letting $g(x_k)$ be an unbiased estimator of the true gradient $\nabla f(x_k)$, we obtain the stochastic gradient descent with momentum (mSGD) method.
The randomized coordinate descent with momentum (RCDm) method was proposed in [27]. We will give a new convergence bound for it.
The RCDm method takes the explicit iterative form
$$x_{k+1} = x_k + \omega \, \frac{(A^{(i_k)})^T (b - A x_k)}{\|A^{(i_k)}\|_2^2}\, e_{i_k} + \delta (x_k - x_{k-1}),$$
where the index $i_k$ is chosen uniformly at random from {1, 2, ..., n}.
The framework of the RCDm method is given as follows (Algorithm 2).
Algorithm 2 Randomized coordinate descent with momentum method (RCDm)
Input: , , , , .
Remark 3.
In order to avoid computing matrix-vector products (in step 5), we adopt the following method
and , .
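A minimal MATLAB sketch of the RCDm iteration is given below. The unit coordinate step, the incremental residual update (our reading of Remark 3), and the function interface are illustrative assumptions rather than the exact code used in our experiments.

```matlab
function x = rcdm(A, b, delta, maxit)
% Minimal sketch of randomized coordinate descent with momentum (heavy ball).
% delta is the momentum parameter; delta = 0 recovers the plain RCD sketch.
[~, n] = size(A);
x = zeros(n, 1);   x_old = x;
r = b - A*x;       r_old = r;            % residuals at x_k and x_{k-1}
for k = 1:maxit
    i  = randi(n);
    ai = A(:, i);
    step = (ai'*r) / (ai'*ai);           % exact minimization along coordinate i
    x_new = x;  x_new(i) = x_new(i) + step;
    x_new = x_new + delta*(x - x_old);   % momentum term delta*(x_k - x_{k-1})
    r_new = r - step*ai + delta*(r - r_old);  % residual of x_new, no full product
    x_old = x;  x = x_new;
    r_old = r;  r = r_new;
end
end
```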
Lemma 5
([27]). Fix and let be a sequence of nonnegative real numbers satisfying the relation
where , , and at least one of the coefficients , is positive. Then the sequence satisfies the relation for all , where and . Moreover,
with equality if and only if (in that case, and ).
Theorem 2.
Assume , and that the expressions and satisfy , where is the smallest nonzero singular value of A. Let be the iteration sequence generated by the RCDm method starting from the initial guess . Then, it holds that
where , , is the least-squares solution. Moreover, , , q obeys .
Proof.
From the algorithm of RCDm, we have
We consider the three terms in (28) in turn. For the first term, we have
where the last inequality is a consequence of the singular value inequality , and . For the second term, we have
For the third term, we have
where the second equality uses (10), the first inequality uses a singular value inequality and (10), and the last inequality is a consequence of . Using (29)–(31), we obtain
Moreover,
By Lemma 5, letting , we have the following relation
and we have
where , , , and we have completed the proof. □
Remark 4.
, . When , we can obtain , satisfy . When δ takes a small value, this relation is satisfied. In addition, the RCDm method degenerates to the RCD method when . The RCDm method converges faster than the RCD method if we choose a proper δ. Numerical experiments will show the effectiveness of the RCDm method.
When , we can conclude that . For the above theorem, we have to satisfy , so . We set . It can be concluded that . When , the RCDm method converges. However, in the later experiments, the choice of will exceed this range, because the range was obtained through a sequence of rather loose estimates.
4. Numerical Experiments
In this section, we compare the influence of different on the RCDm algorithm and the effectiveness of the RCD, RCDm, and NARCD methods for solving the large linear system . All experiments were performed in MATLAB [28] (version R2018a), on a personal laptop with a 1.60 GHz central processing unit (Intel(R) Core(TM) i5-10210U CPU), 8.00 GB memory, and a Windows operating system (64 bits, Windows 10).
In all implementations, the starting point was chosen to be x0 = zeros(n, 1), and the right-hand side vector was taken as , where and . The relative residual error (RRE) at the kth iteration is defined as follows:
The iterations are terminated once the relative solution error satisfies or the number of iteration steps exceeds 5,000,000; if the number of iteration steps exceeds 5,000,000, this is denoted by “-”. IT and CPU denote the number of iteration steps and the CPU time (in seconds), respectively. The reported CPU and IT are the arithmetic means of the elapsed running times and the required iteration steps over 50 repeated runs of the corresponding method. The speed-up of the RCD method against the RCDm method is defined as
speed-up(RCDm) = (CPU of RCD) / (CPU of RCDm),
and the speed-up of the RCD method against the NARCD method is defined as
speed-up(NARCD) = (CPU of RCD) / (CPU of NARCD).
4.1. Experiments for Different on the RCDm
The matrix A is randomly generated by using the MATLAB function unifrnd(0,1,m,n). We observe that RCDm, with an appropriately chosen momentum parameter , always converges faster than its no-momentum variant. In this subsection, we let to compare their performances. Numerical results are reported in Table 1, Table 2 and Table 3 and Figure 1. We can draw the following observation: when , the acceleration effect is good.
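As an illustration, a possible driver for this comparison is sketched below, using the rcdm sketch from Section 3; the construction of the exact solution and the set of momentum parameters swept here are assumptions for illustration only.

```matlab
% Illustrative setup: a uniformly distributed test matrix and a sweep over the
% momentum parameter delta (the listed values are examples, not the paper's).
m = 800;  n = 300;
A = unifrnd(0, 1, m, n);              % entries uniform on (0,1)
x_star = randn(n, 1);                 % assumed exact solution, for illustration
b = A*x_star;
for delta = [0.1, 0.3, 0.5]
    tic;  x = rcdm(A, b, delta, 5e6);  t = toc;   % cap of 5,000,000 iterations
    fprintf('delta = %.1f, CPU = %.2f s\n', delta, t);
    % in practice the loop would also stop once the tolerance above is met
end
```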
Table 1.
For different , IT, and CPU of RCDm for matrices with m = 8000 and different n.
Table 2.
For different , IT, and CPU of RCDm for matrices with m = 800 and different n.
Table 3.
For different , IT, and CPU of RCDm for matrices with m = 300 and different n.
Figure 1.
(a,b): m = 300 rows and n = 150, 100 columns for different . (c,d): m = 800 rows and n = 300, 200 columns for different . (e,f): m = 8000 rows and n = 3000, 2000 columns for different .
4.2. Experiments for NARCD, RCDm, RCD, NASGD
Matrix A is randomly generated by using the MATLAB function unifrnd(0,1,m,n). For the RCDm method, we take the momentum parameter . For the NARCD method, we take the Nesterov acceleration parameter . For the Nesterov accelerated stochastic gradient descent (NASGD) method, the step size is . We observe the performance of the RCD, RCDm, and NARCD methods on matrices A of different sizes. From Figure 2 and Table 4, Table 5, Table 6 and Table 7, we find that both the NARCD and the RCDm with appropriate momentum parameters accelerate the RCD: the NARCD and the RCDm always converge faster than the RCD. Moreover, the NARCD has a better acceleration effect than the RCDm. From Table 7, the NARCD method attains its best speed-up for the matrix A , where the speed-up is . From Figure 3, the speed-ups of the NARCD and RCDm methods change only mildly as the matrix becomes larger, so the two methods still achieve good speed-ups when the matrix is very large. From Figure 4, we find that the NARCD converges faster than the NASGD.
Figure 2.
(a,b): m = 4000 rows and n = 800, 1000 columns for RCD, RCDm, and NARCD. (c,d): m = 8000 rows and n = 2000, 3000 columns for RCD, RCDm, and NARCD. (e,f): m = 12,000 rows and n = 2000, 4000 columns for RCD, RCDm, and NARCD.
Table 4.
IT and CPU of RCD, RCDm, NARCD for matrices with m = 4000 and different n.
Table 5.
IT and CPU of RCD, RCDm, NARCD for matrices with m = 8000 and different n.
Table 6.
IT and CPU of RCD, RCDm, NARCD for matrices with m = 12,000 and different n.
Table 7.
The speed-up of the RCD method against the NARCD and RCDm.
Figure 3.
The speed-up of the RCD method against the NARCD and RCDm for matrices with and .
Figure 4.
m = 800 and n = 300 for NARCD and NASGD.
4.3. Experiment with Different Correlations of Matrix A
Matrix A is randomly generated by using the MATLAB function unifrnd(c,1,m,n), . We let . For the RCDm method, we take the momentum parameter . For the NARCD method, we take the Nesterov acceleration parameter . As the value of c increases, the correlation among the columns of A becomes stronger. From Table 8, Table 9, Table 10, Table 11 and Table 12, we see that as c increases, the condition number of the matrix increases; the larger the condition number, the more ill-conditioned the matrix is, and the more time it takes to solve the system. From Table 10 and Table 12, the acceleration effect of RCDm does not change much as c increases, whereas the acceleration effect of NARCD becomes better.
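The following short sketch illustrates how the conditioning grows with c; the listed c values are examples, and Table 8 reports the numbers actually used.

```matlab
% How the conditioning of A changes as the lower endpoint c increases.
m = 800;  n = 300;
for c = [0, 0.2, 0.4, 0.6, 0.8]       % example values only
    A = unifrnd(c, 1, m, n);
    fprintf('c = %.1f, cond(A) = %.3e\n', c, cond(A));
end
```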
Table 8.
The condition number of different matrices.
Table 9.
IT and CPU of RCD, RCDm, NARCD for matrices m = 800, n = 300.
Table 10.
The speed-up of the RCD method against the NARCD and RCDm, m = 800, n = 300.
Table 11.
IT and CPU of RCD, RCDm, NARCD for matrices m = 1000, n = 800.
Table 12.
The speed-up of the RCD method against the NARCD and RCDm, m = 1000, n = 800.
4.4. The Two-Dimensional Tomography Test Problems
In this section, we use the existing and newly proposed methods to reconstruct a 2D seismic travel-time tomography image. The 2D seismic travel-time tomography test problem is implemented in the function seismictomo(N, s, p) in the MATLAB package AIR TOOLS [29], which generates a sparse matrix A, an exact solution (shown in Figure 5a), and the right-hand side vector , where . We set N = 20, s = 30, and p = 100 in the function seismictomo(N, s, p). We utilize the RCD, RCDm (=0.3), and NARCD () methods to solve the linear least-squares problem (1). Each experiment ran for 90,000 iterations. From Figure 5, we see that the results of the NARCD method are better than those of the RCD and RCDm methods for the same number of iteration steps.
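A sketch of how this test problem can be set up and solved with the narcd sketch from Section 2 is given below; the output order of seismictomo follows the AIR Tools documentation, and the acceleration parameter passed to narcd is an illustrative placeholder rather than the value used for Figure 5.

```matlab
% Seismic travel-time tomography test problem from AIR Tools (Section 4.4).
N = 20;  s = 30;  p = 100;
[A, b, x_exact] = seismictomo(N, s, p);   % sparse A, data b, exact image x_exact
x = narcd(A, b, 0, 90000);                % 90,000 iterations, as in the text
figure;
subplot(1, 2, 1);  imagesc(reshape(x_exact, N, N));  axis image;  title('Exact');
subplot(1, 2, 2);  imagesc(reshape(x, N, N));        axis image;  title('NARCD');
```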
Figure 5.
Performance of RCD, RCDm, and NARCD methods for the seismictomo test problem. (a) Exact seismic. (b) RCD. (c) RCDm. (d) NARCD.
5. Conclusions
To solve large systems of linear equations, two new acceleration methods for the RCD method are proposed, called the NARCD method and the RCDm method. Their convergence is proved, and estimates of the convergence rates of the NARCD method and the RCDm method are given, respectively. Both methods are shown to be effective in numerical experiments. For uniformly distributed data, with an appropriately chosen momentum parameter, the RCDm outperforms the RCD in terms of IT and CPU. The NARCD and the RCDm are faster than the RCD, and the NARCD has a better acceleration effect than the RCDm. In the case of an overdetermined linear system, for the NARCD method, the fatter the matrix, the better the acceleration. The acceleration effect of the NARCD becomes better as c in the MATLAB function unifrnd(c, 1, m, n) increases. The block coordinate descent method is a very efficient method for solving large linear systems; in future work, it would be interesting to apply the two acceleration schemes to the block coordinate descent method.
Author Contributions
Software, W.B.; Validation, F.Z.; Investigation, Q.W.; Writing—original draft, Q.W.; Writing—review and editing, W.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research is supported by the National Key Research and Development program of China (2019YFC1408400).
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Leventhal, D.; Lewis, A.S. Randomized methods for linear constraints: Convergence rates and conditioning. Math. Oper. Res. 2010, 35, 641–654. [Google Scholar] [CrossRef]
- Ruhe, A. Numerical aspects of Gram-Schmidt orthogonalization of vectors. Linear Algebra Its Appl. 1983, 52, 591–601. [Google Scholar] [CrossRef]
- Breheny, P.; Huang, J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 2011, 5, 232. [Google Scholar] [CrossRef]
- Chang, K.W.; Hsieh, C.J.; Lin, C.J. Coordinate descent method for large-scale l2-loss linear support vector machines. J. Mach. Learn. Res. 2008, 9, 1369–1398. [Google Scholar]
- Canutescu, A.A.; Dunbrack Jr, R.L. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 2003, 12, 963–972. [Google Scholar] [CrossRef] [PubMed]
- Bouman, C.A.; Sauer, K. A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Process. 1996, 5, 480–492. [Google Scholar] [CrossRef] [PubMed]
- Ye, J.C.; Webb, K.J.; Bouman, C.A.; Millane, R.P. Optical diffusion tomography by iterative-coordinate-descent optimization in a Bayesian framework. JOSA A 1999, 16, 2400–2412. [Google Scholar] [CrossRef]
- Bai, Z.Z.; Wu, W.T. On greedy randomized coordinate descent methods for solving large linear least-squares problems. Numer. Linear Algebra Appl. 2019, 26, e2237. [Google Scholar] [CrossRef]
- Zhang, J.; Guo, J. On relaxed greedy randomized coordinate descent methods for solving large linear least-squares problems. Appl. Numer. Math. 2020, 157, 372–384. [Google Scholar] [CrossRef]
- Lu, Z.; Xiao, L. On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 2015, 152, 615–642. [Google Scholar] [CrossRef]
- Necoara, I.; Nesterov, Y.; Glineur, F. Random block coordinate descent methods for linearly constrained optimization over networks. J. Optim. Theory Appl. 2017, 173, 227–254. [Google Scholar] [CrossRef]
- Richtárik, P.; Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 2014, 144, 1–38. [Google Scholar] [CrossRef]
- Karczmarz, S. Angenaherte auflosung von systemen linearer glei-chungen. Bull. Int. Acad. Pol. Sic. Let. Cl. Sci. Math. Nat. 1937, 35, 355–357. [Google Scholar]
- Strohmer, T.; Vershynin, R. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 2009, 15, 262–278. [Google Scholar] [CrossRef]
- Bai, Z.Z.; Wu, W.T. On greedy randomized Kaczmarz method for solving large sparse linear systems. SIAM J. Sci. Comput. 2018, 40, A592–A606. [Google Scholar] [CrossRef]
- Bai, Z.Z.; Wu, W.T. On relaxed greedy randomized Kaczmarz methods for solving large sparse linear systems. Appl. Math. Lett. 2018, 83, 21–26. [Google Scholar] [CrossRef]
- Liu, Y.; Gu, C.Q. Variant of greedy randomized Kaczmarz for ridge regression. Appl. Numer. Math. 2019, 143, 223–246. [Google Scholar] [CrossRef]
- Guan, Y.J.; Li, W.G.; Xing, L.L.; Qiao, T.T. A note on convergence rate of randomized Kaczmarz method. Calcolo 2020, 57, 1–11. [Google Scholar] [CrossRef]
- Du, K.; Gao, H. A new theoretical estimate for the convergence rate of the maximal weighted residual Kaczmarz algorithm. Numer. Math. Theory Methods Appl. 2019, 12, 627–639. [Google Scholar]
- Yang, X. A geometric probability randomized Kaczmarz method for large scale linear systems. Appl. Numer. Math. 2021, 164, 139–160. [Google Scholar] [CrossRef]
- Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O (1/k2). Dokl. Akad. Nauk Sssr 1983, 269, 543–547. [Google Scholar]
- Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 2012, 22, 341–362. [Google Scholar] [CrossRef]
- Polyak, B.T. Some methods of speeding up the convergence of iteration methods. Ussr Comput. Math. Math. Phys. 1964, 4, 1–17. [Google Scholar] [CrossRef]
- Sun, T.; Li, D.; Quan, Z.; Jiang, H.; Li, S.; Dou, Y. Heavy-ball algorithms always escape saddle points. arXiv 2019, arXiv:1907.09697. [Google Scholar]
- Sarowar Morshed, M.; Saiful Islam, M. Accelerated Sampling Kaczmarz Motzkin Algorithm for The Linear Feasibility Problem. J. Glob. Optim. 2019, 77, 361–382. [Google Scholar] [CrossRef]
- Liu, J.; Wright, S. An accelerated randomized Kaczmarz algorithm. Math. Comput. 2016, 85, 153–178. [Google Scholar] [CrossRef]
- Loizou, N.; Richtárik, P. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Comput. Optim. Appl. 2020, 77, 653–710. [Google Scholar] [CrossRef]
- Higham, D.J.; Higham, N.J. MATLAB Guide; SIAM: Philadelphia, PA, USA, 2016. [Google Scholar]
- Hansen, P.C.; Jørgensen, J.S. AIR Tools II: Algebraic iterative reconstruction methods, improved implementation. Numer. Algorithms 2018, 79, 107–137. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).