Abstract
This paper aims to improve the response speed of SPDC (stochastic primal–dual coordinate ascent) in large-scale machine learning, as the per-iteration complexity of SPDC is not satisfactory. We propose an accelerated stochastic primal–dual coordinate ascent method called ASPDC and its further accelerated variant, ASPDC-i. Our proposed ASPDC methods achieve a good balance between low per-iteration computation complexity and fast convergence speed, even when the condition number becomes very large. A large condition number causes ill-conditioned problems, which usually require many more iterations before convergence and longer per-iteration times when training machine learning models. We performed experiments on various machine learning problems. The experimental results demonstrate that ASPDC and ASPDC-i converge faster than their counterparts and enjoy low per-iteration complexity as well.
1. Introduction
In this paper, we consider a composite convex optimization problem, Regularized Empirical Risk Minimization (RERM), that can be solved by SPDC [1]. Our goal is to use our proposed ASPDC to find an approximate solution of the following optimization problem:

$$\min_{w \in \mathbb{R}^d} \Big\{ P(w) := \frac{1}{n}\sum_{i=1}^{n} \phi_i(a_i^\top w) + g(w) \Big\}, \qquad (1)$$

where $a_i \in \mathbb{R}^d$ is a feature vector, $b_i$ is the corresponding label in a machine learning task, $(a_1, b_1), \ldots, (a_n, b_n)$ are the n samples in the dataset, $\phi_i$ is a proper convex function of the linear predictor $a_i^\top w$, and $g$ is a simple convex regularization function.
RERM is one of the central problems in machine learning and is prevalent throughout the data mining and machine learning domains. More background information on RERM can be found in [2]. The following are four examples of RERM; a minimal code sketch of two of these objectives follows the list:
- Linear SVM, where $\phi_i(z) = \max\{0,\, 1 - b_i z\}$ and $g(w) = \frac{\lambda}{2}\|w\|_2^2$,
- Ridge Regression, where $\phi_i(z) = (z - b_i)^2$ and $g(w) = \frac{\lambda}{2}\|w\|_2^2$,
- Lasso, where $\phi_i(z) = (z - b_i)^2$ and $g(w) = \lambda\|w\|_1$,
- Logistic Regression, where $\phi_i(z) = \log(1 + \exp(-b_i z))$ and $g(w) = \frac{\lambda}{2}\|w\|_2^2$,
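To make two of these objectives concrete, the following minimal sketch (ours, not the authors' C++ implementation; `A`, `b`, and `lam` are assumed placeholders for the data matrix, label vector, and λ) evaluates the linear SVM and Lasso objectives:

```python
import numpy as np

def svm_primal(w, A, b, lam):
    # Linear SVM: phi_i(z) = max(0, 1 - b_i*z), g(w) = (lam/2)*||w||^2
    z = A @ w
    return np.mean(np.maximum(0.0, 1.0 - b * z)) + 0.5 * lam * (w @ w)

def lasso_primal(w, A, b, lam):
    # Lasso: phi_i(z) = (z - b_i)^2, g(w) = lam*||w||_1
    z = A @ w
    return np.mean((z - b) ** 2) + lam * np.sum(np.abs(w))
```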
Here, we focus on the scenario in which the number of samples n is very large, as the per-iteration complexity of SPDC is intolerable in this scenario. Computing a full gradient becomes extremely expensive in terms of time and space costs. Therefore, RERM algorithms with a lower per-iteration complexity are more attractive in large-scale machine learning applications.
General optimization methods for the RERM problem using gradients fall into two categories, namely, first-order and second-order methods. Second-order methods such as Newton's method employ a Hessian matrix at each iteration to decrease the objective value. The disadvantage of these second-order methods is that both obtaining and using the Hessian matrix are computationally expensive. On the other hand, while first-order optimization schemes are lightweight in gradient computation, they may converge slowly [3,4].
Among the algorithms for solving the RERM problem, we are most interested in dual algorithms such as stochastic dual coordinate ascent (SDCA), as the dual-gap is a clearer stopping criterion than the gradient norm. In addition, dual algorithms can handle non-differentiable primal objectives more easily [5]. SDCA is a first-order optimization method and is widely used in the current machine learning domain. Dual coordinate methods have been implemented in open machine learning libraries [4].
The dual methods do not solve the primal problem directly. Instead, they solve the dual or saddle point problem of the primal problem. The corresponding dual problem of the primal problem in Equation (1) is formulated as follows:

$$\max_{\alpha \in \mathbb{R}^n} \Big\{ D(\alpha) := -\frac{1}{n}\sum_{i=1}^{n} \phi_i^*(\alpha_i) - g^*\Big(-\frac{1}{n}\sum_{i=1}^{n} \alpha_i a_i\Big) \Big\}, \qquad (2)$$

where $g^*$ and $\phi_i^*$ are the convex conjugate functions of $g$ and $\phi_i$, respectively. Due to the structure of this dual problem, coordinate ascent methods can be more efficient than full gradient methods [4,6,7].
In the stochastic dual coordinate ascent method (SDCA) [5], a dual coordinate is picked randomly at each iteration and then updated to increase the dual objective value. This gives SDCA a low per-iteration computational complexity. Nevertheless, the convergence of SDCA becomes much slower as the condition number grows. A large condition number leads to an ill-conditioned problem, that is, a case in which a small change in one of the values of the coefficient matrix causes a large change in the solution vector [8,9,10,11]. Hence, SDCA is not applicable to large-scale data processing in ill-conditioned scenarios. Unfortunately, many training tasks involving large-scale data are ill-conditioned. Ill-conditioned problems are particularly common in mathematics and the geosciences [12].
Paper Organization. The rest of this paper is organized as follows. In Section 2, we describe related works.
In Section 3, we describe the relevant assumptions and preliminaries.
In Section 4, we discuss the accelerated stochastic primal–dual coordinate method, presenting ASPDC in Algorithm 1 together with its convergence analysis for the saddle point problem in Equation (3).
In Section 5, we extend ASPDC to ill-conditioned problems, in particular, those in which the condition number exceeds the number of samples n. Our proposed extension is called ASPDC-i, where the suffix i means “for ill-conditioned problems”.
In Section 6, we evaluate the performance of our proposed ASPDC algorithms against several state-of-the-art algorithms for solving machine learning problems, then discuss the experimental results.
In Section 7, we conclude the paper and discuss potential avenues for future work.
2. Related Work
Shalev-Shwartz and Zhang [13] developed an accelerated proximal stochastic dual coordinate ascent method (ASDCA), which converges faster than traditional methods when the condition number is large (Table 1). ASDCA can be regarded as a variant of a proximal point algorithm equipped with Nesterov's acceleration technique [14,15,16]. ASDCA uses an inner–outer iteration procedure, where the outer loop is a minimization of an auxiliary problem with a regularized quadratic term. The proximal SDCA then solves the auxiliary problem to a customized precision. At the end of each outer loop, Nesterov's accelerated update is performed on the primal variable w. Nonetheless, ASDCA requires $\lambda$ to be limited to a range of small values, for example, $\lambda \le 1/(\gamma n)$, where $\gamma$ is the smoothness parameter of $\phi_i$ (see Assumption 1) and $n$ is the number of samples.
Table 1.
Abbreviations used in this study.
Other studies have extended the inner–outer iteration method to derive more general accelerated proximal-point algorithms, e.g., Catalyst [17,18]. Theoretically, one can replace the inner-loop proximal SDCA with other algorithms, such as SVRG [19] and Prox-SVRG [20], and obtain the same overall complexity in terms of the number of outer loops.
More recently, Zhang and Xiao [1,21] proposed a stochastic primal–dual coordinate (SPDC) method to solve the RERM problem defined in Equation (1). SPDC achieves a faster convergence rate in reducing the dual-gap than ASDCA and other dual methods on general optimization problems whose condition numbers are not very large. However, the per-iteration computation complexity of SPDC is much higher than that of ASDCA and SDCA. Theoretically, the per-iteration complexity of SPDC is $O(d)$; in practice, due to the auxiliary variable update and the momentum term, SPDC requires much more time to process one pass of a dataset, as verified in our experiments. When the condition number is large, the per-iteration computation complexity of SPDC is intolerable, which makes SPDC inapplicable to large-scale data processing. Our experiments verified that SPDC is more time-consuming than ASDCA and other low per-iteration complexity methods. Moreover, the dual-gap of SPDC is much larger when the data are sparse and high-dimensional.
The above issue leads to the following key question: “Can we design an algorithm with both a low per-iteration complexity and a fast convergence rate, especially for ill-conditioned scenarios in large-scale data processing?” We propose the ASPDC and ASPDC-i algorithms as the answer to this question. ASPDC methods have the following three advantages:
- Simple structure at each iteration. In comparison with SPDC and other accelerated variants, ASPDC does not need to keep track of any auxiliary variables; it only maintains the primal and dual variables. Each iteration involves only a dual update and a primal update. This design makes its per-iteration complexity much lower than that of SPDC and other variants, and it also makes the method easy to implement.
- Short running time. Our experiments show that our methods need far less time and fewer epochs (passes through the entire dataset) to reach the same precision and satisfy the stopping condition.
- Theoretical guarantee. ASPDC adopts Nesterov's estimation technique [22,23]. We present a new proof of the convergence of the proposed methods.
3. Assumptions and Preliminaries
Throughout this paper, $\|\cdot\|$ denotes the standard Euclidean norm. We use E to denote the expectation taken with respect to the randomness of the sampled coordinate index. For the sake of convenience, we write $\bar{u} = \frac{1}{n}\sum_{i=1}^{n} \alpha_i a_i$. Without loss of generality, we assume that the data are normalized (cf. Assumption 3). We then make the following assumptions to clearly specify the problem in Equation (1):
Assumption 1.
Each $\phi_i$ is lower semi-continuous and convex, and its derivative is $\frac{1}{\gamma}$-Lipschitz continuous (or equivalently, $\phi_i$ is $\frac{1}{\gamma}$-smooth); i.e., there exists $\gamma > 0$ such that

$$|\phi_i'(a) - \phi_i'(b)| \le \frac{1}{\gamma}\,|a - b| \quad \forall\, a, b \in \mathbb{R}.$$
It is widely known that Assumption 1 implies that $\phi_i^*$ is $\gamma$-strongly convex (see Theorem 4.2.2 in the book on the fundamentals of convex analysis [24]).
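For reference, this duality between smoothness and strong convexity can be stated as follows (a textbook fact, cf. [24], written in our notation):

```latex
\phi_i \ \text{is } \tfrac{1}{\gamma}\text{-smooth}
\;\Longleftrightarrow\;
\phi_i^{*} \ \text{is } \gamma\text{-strongly convex, i.e., } \forall\, u, v:\quad
\phi_i^{*}(v) \;\ge\; \phi_i^{*}(u) + s\,(v - u) + \tfrac{\gamma}{2}\,(v - u)^{2},
\qquad s \in \partial \phi_i^{*}(u).
```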
Assumption 2.
The primal function $P$ is λ-strongly convex: there exists $\lambda > 0$ such that, for all $w, w' \in \mathbb{R}^d$ and any subgradient $\xi \in \partial P(w)$,

$$P(w') \ge P(w) + \langle \xi,\, w' - w\rangle + \frac{\lambda}{2}\|w' - w\|^2.$$
The strong convexity of $P$ may come from $\phi_i$, from $g$, or from both. For instance, if $g(w) = \frac{\lambda}{2}\|w\|_2^2$, Assumption 2 holds.
Assumption 3.
The data are normalized so that $\|a_i\| \le 1$ for all $i = 1, \ldots, n$.
Assumption 3 is not a strict one, as it holds whenever the data are normalized.
Under the three assumptions above, the RERM problem defined in Equation (1) can be rewritten as the following convex–concave saddle point problem [1]:

$$\min_{w \in \mathbb{R}^d} \max_{\alpha \in \mathbb{R}^n} \Big\{ f(w, \alpha) := \frac{1}{n}\sum_{i=1}^{n} \alpha_i \langle a_i, w\rangle - \frac{1}{n}\sum_{i=1}^{n} \phi_i^*(\alpha_i) + g(w) \Big\}, \qquad (3)$$

where $\phi_i^*$ is the convex conjugate function of $\phi_i$. Lemma 1 demonstrates the relationship between the primal problem in Equation (1) and the saddle point problem in Equation (3).
Lemma 1.
Let $P$, $D$, and $f$ be defined as in Equations (1)–(3); then, we have:
- (1) $P(w) = \max_{\alpha} f(w, \alpha)$ for all $w$;
- (2) $D(\alpha) = \min_{w} f(w, \alpha)$ for all $\alpha$;
- (3) $D(\alpha) \le P(w)$ for all $w$ and $\alpha$ (weak duality);
- There exists a unique saddle point $(w^\star, \alpha^\star)$ such that $P(w^\star) = f(w^\star, \alpha^\star) = D(\alpha^\star)$.
Proof.
Presented in Appendix A. □
4. Accelerated Stochastic Primal–Dual Coordinate Method
In this section, we present ASPDC in Algorithm 1 and its convergence analysis for the saddle point problem in Equation (3).
Each iteration of ASPDC can be divided into two steps: the dual update step and the primal update step. The dual update step is executed first. As shown in lines 4–6 of Algorithm 1, a dual coordinate $\alpha_i$ is picked randomly and updated to increase the objective value of $f(w, \alpha)$ while keeping the primal variable w and the other coordinates $\alpha_j$ ($j \neq i$) fixed. The primal update step is then executed. As shown in line 7 of Algorithm 1, the primal variable w is updated to decrease the objective value of $f(w, \alpha)$ while keeping $\alpha$ fixed.
The update of the dual variable is extremely simple: it reduces to a univariate optimization problem, which makes the per-iteration complexity much lower than that of traditional SPDC algorithms. Specifically, the local update of the dual variable is

$$\alpha^{t+1} = \alpha^t + \Big(\arg\max_{\beta}\big\{\beta\,\langle a_i, w^t\rangle - \phi_i^*(\beta)\big\} - \alpha_i^t\Big)\, e_i, \qquad (4)$$

where $e_i$ is a unit vector with the $i$-th element being one.
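Reading Equation (4) as stated, the coordinate update amounts to evaluating the loss derivative at the current margin, since the univariate maximizer of $\beta\langle a_i, w\rangle - \phi_i^*(\beta)$ is $\beta = \phi_i'(\langle a_i, w\rangle)$. The sketch below is our illustration (not the authors' implementation); `phi_prime` is an assumed callable for $\phi_i'$:

```python
def dual_coordinate_update(alpha, w, A, i, phi_prime):
    # Univariate maximizer of beta*<a_i, w> - phi_i^*(beta) is beta = phi_i'(<a_i, w>)
    u = A[i] @ w             # inner product <a_i, w>: O(d), or O(r) for sparse rows
    old = alpha[i]
    alpha[i] = phi_prime(u)  # closed-form solution of the univariate problem
    return old, alpha[i]     # returning the change lets the caller update w cheaply
```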
The update of the primal variable w is shown in Equation (5) as follows:

$$w^{t+1} = \arg\min_{w}\Big\{\Big\langle \frac{1}{n}\sum_{i=1}^{n} \alpha_i^{t+1} a_i,\, w\Big\rangle + g(w)\Big\} = \nabla g^*\Big(-\frac{1}{n}\sum_{i=1}^{n} \alpha_i^{t+1} a_i\Big), \qquad (5)$$

where the last equality is derived from the conjugate subgradient theorem in [25]. In this way, we turn the optimization process into a derivative operation on $g^*$. For instance, if $g(w) = \frac{\lambda}{2}\|w\|_2^2$, the update of the primal variable can be written as $w^{t+1} = -\frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i^{t+1} a_i$.
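For $g(w) = \frac{\lambda}{2}\|w\|_2^2$, Equation (5) thus reduces to rescaling the running average $\bar{u} = \frac{1}{n}\sum_i \alpha_i a_i$, which can be maintained incrementally after each dual coordinate change. A minimal sketch under that assumption (function names are ours):

```python
def primal_update(u_bar, lam):
    # w = grad g*(-u_bar) = -u_bar / lam when g(w) = (lam/2)*||w||^2
    return -u_bar / lam

def refresh_u_bar(u_bar, A, i, old_ai, new_ai, n):
    # After one dual coordinate changes, u_bar moves along a_i only: O(d), O(r) if sparse
    return u_bar + (new_ai - old_ai) * A[i] / n
```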
We compare the complexity of SPD1, SPD1-VR, and SVRG [19] with that of our methods in Table 2. In Table 2, r is the maximum number of non-zero elements in each sample, S is the number of non-zero elements in the whole dataset, d is the dimension of the dataset, and n is the number of data samples. Usually, S is much smaller than $nd$ when the data are sparse and high-dimensional. In most large-scale data applications, the datasets are indeed sparse and high-dimensional, i.e., most of the attributes are zeros. At each iteration, SPD1 and SPD1-VR choose a single entry $a_{ij}$ (the j-th value of sample $a_i$) to update the primal and dual variables regardless of whether $a_{ij}$ is 0 or not. This reduces the per-iteration complexity of SPD1 and SPD1-VR to $O(1)$; however, their complexity for one pass through the data is $O(nd)$, the same as SVRG. In contrast, ASPDC does not execute the update when $a_{ij} = 0$. Thus, its complexity for one pass through the data is $O(S)$, which is much lower than that of SPD1 and SVRG when the data are sparse and high-dimensional (see the sketch after Table 2).
Table 2.
Complexity comparison of per-iteration cost and one pass through the data.
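The sparsity argument can be made concrete with compressed sparse row (CSR) storage: a coordinate update touches only the non-zero entries of the sampled row, so one pass through the data costs O(S) rather than O(nd). A minimal sketch assuming SciPy's CSR layout:

```python
from scipy.sparse import csr_matrix  # A_csr below is assumed to be a csr_matrix

def sparse_row_axpy(u_bar, A_csr, i, delta, n):
    # u_bar += (delta/n) * a_i, skipping every zero entry of the sparse row a_i
    start, end = A_csr.indptr[i], A_csr.indptr[i + 1]
    cols = A_csr.indices[start:end]   # indices of the non-zero features of sample i
    u_bar[cols] += delta * A_csr.data[start:end] / n
    return u_bar
```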
There are two major differences between SDCA and ASPDC. First, SDCA solves the dual problem, while ASPDC solves a saddle point problem. Second, the dual update of ASPDC is significantly simpler than that of SDCA. The dual update of SDCA is shown in (9); in comparison with that of ASPDC in Equation (4), it involves the additional computation of the quadratic term in $a_i$:

$$\Delta\alpha_i = \arg\max_{\Delta}\Big\{-\phi_i^*\big(-(\alpha_i^t + \Delta)\big) - \frac{\lambda n}{2}\Big\|w^t + \frac{\Delta\, a_i}{\lambda n}\Big\|^2\Big\}. \qquad (9)$$
We use the dual-gap metric as the stopping criterion, as shown in line 9 of Algorithm 1. The dual-gap is calculated as $P(w^t) - D(\alpha^t)$, and it is sufficient to stop when $P(w^t) - D(\alpha^t) \le \epsilon$, as $P(w^t) - P(w^\star) \le P(w^t) - D(\alpha^t)$. This stopping criterion is easier to implement than other criteria, e.g., $\|w^t - w^\star\| \le \epsilon$, for the reason that $w^\star$ is not known in advance in real-world machine learning applications.
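Because $D(\alpha) \le P(w)$ always holds, the gap $P(w) - D(\alpha)$ certifies primal suboptimality without knowing $w^\star$. The following sketch computes the gap for the $\ell_2$-regularized case following Equations (1) and (2) as reconstructed above; `phi` and `phi_conj` are assumed vectorized callables, and the sign convention for $\phi_i^*$ is our assumption:

```python
import numpy as np

def duality_gap(w, alpha, A, b, lam, phi, phi_conj):
    # P(w)     = (1/n) * sum_i phi(a_i^T w, b_i) + (lam/2)*||w||^2
    # D(alpha) = -(1/n) * sum_i phi^*(alpha_i, b_i) - (1/(2*lam))*||(1/n) A^T alpha||^2
    n = A.shape[0]
    primal = np.mean(phi(A @ w, b)) + 0.5 * lam * (w @ w)
    u_bar = A.T @ alpha / n
    dual = -np.mean(phi_conj(alpha, b)) - (u_bar @ u_bar) / (2.0 * lam)
    return primal - dual  # stop once this certificate drops below eps
```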
| Algorithm 1 ASPDC |
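Putting the two steps together, the main loop of Algorithm 1 can be sketched end-to-end for the smooth hinge loss used in Section 6. This is our illustrative reading of updates (4) and (5), not the authors' C++ implementation:

```python
import numpy as np

def smooth_hinge_prime(z, b, gamma=1.0):
    # derivative of the smooth hinge loss phi_i at margin z (phi_i is (1/gamma)-smooth)
    t = b * z
    if t >= 1.0:
        return 0.0
    if t <= 1.0 - gamma:
        return -b
    return -b * (1.0 - t) / gamma

def aspdc(A, b, lam, n_iters, seed=0):
    n, d = A.shape
    alpha = np.zeros(n)
    u_bar = np.zeros(d)                 # u_bar = (1/n) * sum_i alpha_i * a_i
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        i = rng.integers(n)             # lines 4-6: pick a dual coordinate at random
        new_ai = smooth_hinge_prime(A[i] @ w, b[i])   # dual update, Equation (4)
        u_bar += (new_ai - alpha[i]) * A[i] / n       # incremental bookkeeping
        alpha[i] = new_ai
        w = -u_bar / lam                # line 7: primal update (5) for the l2 regularizer
    return w, alpha
```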
In the rest of this section, we prove the convergence of ASPDC. We first present the following lemma.
Lemma 2.
On the basis of Assumptions 1–3, let $(w^t, \alpha^t)$ be the sequence produced by ASPDC and let $\Delta^t = P(w^t) - D(\alpha^t)$ denote the duality gap; then, the expected duality gap $E[\Delta^t]$ decreases geometrically in t (the precise inequality is established in Appendix A).
Proof.
The detailed proof can be found in Appendix A. In the proof, we assume that $g(w) = \frac{\lambda}{2}\|w\|_2^2$ for convenience; therefore, the theory only covers $\ell_2$ regularization, and the extension to $\ell_1$ regularization is a topic for future work.
The skeleton of the proof in Appendix A can be described in the following three steps:
First, we bound the expected increase of the dual objective produced by the dual update (4).
Second, we bound the decrease of the primal objective produced by the primal update (5).
Finally, using weak duality, we combine the two bounds to obtain the claimed geometric decrease of the expected duality gap.
□
Theorem 1.
The total number of iterations needed to achieve an expected duality gap of $\epsilon$ is linear in $\log(1/\epsilon)$; the precise bound is derived in the proof below.
Proof.
Using Lemma 2 and iterating the resulting inequality over t, we can bound the expected duality gap after t iterations; here we use the fact that the per-iteration contraction factor is strictly less than one. Setting the resulting bound equal to $\epsilon$ and solving for t yields the claimed iteration count. □
As shown in Theorem 1, ASPDC converges linearly. For comparison, the complexity of SVRG is $O\big((n + \frac{1}{\lambda\gamma})\log\frac{1}{\epsilon}\big)$ and the complexity of SPDC is $O\big((n + \sqrt{\frac{n}{\lambda\gamma}})\log\frac{1}{\epsilon}\big)$.
5. ASPDC for Ill-Conditioned Problems
According to convex optimization theory [16], the value $Q = L/\lambda$ is called the condition number of a function f if f is L-smooth and λ-strongly convex. Under Assumptions 1–3, the condition number of the primal function in Equation (1) is $\frac{1}{\lambda\gamma}$. If λ becomes smaller, then the condition number $\frac{1}{\lambda\gamma}$ becomes larger. When $\frac{1}{\lambda\gamma} > n$, the problem f is called ill-conditioned.
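As a worked example of this definition (the values below are illustrative, not taken from the paper's experiments), the following snippet checks the regime for two choices of λ:

```python
gamma, n = 1.0, 100_000          # smooth hinge (gamma = 1) and sample size
for lam in (1e-2, 1e-6):
    kappa = 1.0 / (lam * gamma)  # condition number of the primal problem
    print(lam, kappa, "ill-conditioned" if kappa > n else "well-conditioned")
# lam = 1e-2 -> kappa = 1e2 <= n : well-conditioned
# lam = 1e-6 -> kappa = 1e6 >  n : ill-conditioned
```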
In this section, we extend ASPDC to ill-conditioned problems, especially those with $\frac{1}{\lambda\gamma} > n$. The extended method is called ASPDC-i, in which the suffix i means “for ill-conditioned problems”.
As shown in Algorithm 2, the procedure of ASPDC-i is divided into epochs, indexed by $s = 1, 2, 3, \ldots$. Each epoch uses ASPDC to solve the following perturbed problem with a decreasing precision parameter $\epsilon_s$:

$$\min_{w \in \mathbb{R}^d}\Big\{\tilde{P}(w) := P(w) + \frac{\delta}{2}\|w\|_2^2\Big\}, \qquad (12)$$

where $\epsilon_s$ decreases from one epoch to the next, δ is a constant throughout the procedure, and $\tilde{P}$ is P plus an additional perturbation term. This additional term is employed to ensure that the strong convexity parameter $\lambda + \delta$ of $\tilde{P}$ is large enough for the perturbed problem to be well-conditioned. Note that a smaller δ is preferable, as a larger δ leads to a severe bias between the minimizer of $\tilde{P}$ and the minimizer of P. Therefore, in the implementation of our ASPDC algorithms, we simply use the smallest feasible δ, namely, the value for which $\frac{1}{(\lambda + \delta)\gamma} = n$.
These calls of ASPDC produce a sequence $w_1, w_2, \ldots$, which are the approximate solutions of the corresponding perturbed problems in Equation (12). Here, we need to prove that each call of ASPDC stops after a finite number of iterations, and that the final output $\hat{w}$ satisfies $P(\hat{w}) - P(w^\star) \le \epsilon$, where $w^\star$ is the theoretical optimal solution of Equation (1). These facts are established in the following Theorem 2.
Theorem 2.
Algorithm 2 needs $O(\log\frac{1}{\epsilon})$ epochs to reach an approximate solution $\hat{w}$ satisfying $P(\hat{w}) - P(w^\star) \le \epsilon$.
The proof can be found in Appendix A. The settings of the hyperparameters of Algorithm 2 are presented in the proof.
| Algorithm 2 ASPDC-i |
| 1: Parameters: $\gamma$, $\delta$, initial precision $\epsilon_1$ |
| 2: Initialize: $w_0$, $\alpha_0$ |
| 3: for s = 1,2,3,... do |
| 4: $w_s$ = ASPDC($\tilde{P}$, $w_{s-1}$, $\epsilon_s$) |
| 5: decrease the precision parameter: $\epsilon_{s+1} = \epsilon_s/2$ |
| 6: end for |
| 7: stop condition: duality gap $\le \epsilon$ |
| Output: $w_s$ |
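The outer loop of Algorithm 2 can be sketched as follows. The perturbation form $\frac{\delta}{2}\|w\|_2^2$, the halving schedule for $\epsilon_s$, and the `solver` interface are our assumptions for illustration; the authors' exact hyperparameter settings are those given in the proof in Appendix A:

```python
import numpy as np

def aspdc_i(solver, A, b, lam, gamma, eps_target, eps0=1.0):
    # solver(A, b, lam_eff, w0, eps) -> w: a routine that runs Algorithm 1 on the
    # perturbed problem with strong convexity lam_eff, warm-started at w0, until
    # the duality gap falls below eps (e.g., the aspdc sketch above plus a gap check).
    n, d = A.shape
    # smallest perturbation putting the problem in the well-conditioned regime,
    # i.e., making 1/((lam + delta) * gamma) <= n (our reading of Section 5)
    delta = max(0.0, 1.0 / (gamma * n) - lam)
    w = np.zeros(d)
    eps_s = eps0
    while eps_s > eps_target:
        w = solver(A, b, lam + delta, w, eps_s)  # epoch s: solve (12) to precision eps_s
        eps_s *= 0.5                             # decrease the precision parameter
    return w
```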
To make a fair comparison with other algorithms, we provide a realistic implementation of Algorithm 2, shown in Algorithm 3. Here, the number of inner iterations in Algorithm 3 is set to a constant m. As demonstrated in the experiments section, this approach works well.
| Algorithm 3 Implemented version of ASPDC-i |
6. Experiments
In this section, we evaluate the performance of our ASPDC algorithms along with several state-of-the-art algorithms for solving machine learning problems such as SVM. All the algorithms were implemented in C++ and executed through a Matlab interface. The experiments were performed on a PC with an Intel i5-4690 CPU and 16.0 GB RAM. The source code and the detailed proofs can be downloaded from GitHub (https://github.com/lianghb6/ASPDC, accessed on 28 June 2022), and the datasets can be obtained from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, accessed on 28 June 2022).
As the computation processes of the listed problems are similar, in these experiments we mainly evaluated the practical performance of ASPDC for solving the following SVM optimization problem:

$$\min_{w \in \mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} \phi_i(a_i^\top w) + \frac{\lambda}{2}\|w\|_2^2,$$

where $\phi_i$ is a smooth hinge loss, which is used in [1,5] as well.
The corresponding convex–concave saddle point problem is as follows:

$$\min_{w \in \mathbb{R}^d} \max_{\alpha \in \mathbb{R}^n}\Big\{\frac{1}{n}\sum_{i=1}^{n} \alpha_i \langle a_i, w\rangle - \frac{1}{n}\sum_{i=1}^{n} \phi_i^*(\alpha_i) + \frac{\lambda}{2}\|w\|_2^2\Big\},$$

where $\phi_i^*$ is the convex conjugate of the smooth hinge loss.
Under Assumption 3, the smoothness parameter of $\phi_i$ is 1 (i.e., $\gamma = 1$). The strong convexity parameter of the problem is λ, which comes from the regularization function $g(w) = \frac{\lambda}{2}\|w\|_2^2$.
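For completeness, the closed forms of the smooth hinge loss ($\gamma = 1$, as in [1,5]) and its convex conjugate used in the sketches above are as follows; the derivation is standard, and the domain restriction on the conjugate is what keeps the dual variable bounded:

```python
import numpy as np

def smooth_hinge(z, b, gamma=1.0):
    # phi_i(z) = 0 if b*z >= 1; 1 - b*z - gamma/2 if b*z <= 1 - gamma;
    # (1 - b*z)^2 / (2*gamma) otherwise
    t = b * z
    if t >= 1.0:
        return 0.0
    if t <= 1.0 - gamma:
        return 1.0 - t - 0.5 * gamma
    return (1.0 - t) ** 2 / (2.0 * gamma)

def smooth_hinge_conj(u, b, gamma=1.0):
    # phi_i^*(u) = b*u + (gamma/2)*u^2 on b*u in [-1, 0]; +inf outside the domain
    return b * u + 0.5 * gamma * u * u if -1.0 <= b * u <= 0.0 else np.inf
```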
In Figure 1 and Table 3, we show the cases when λ is relatively large (e.g., $\lambda = 0.01$). We compare ASPDC (Algorithm 1) with state-of-the-art dual methods: the stochastic dual coordinate ascent method (SDCA) [5] and the stochastic primal–dual coordinate method (SPDC) [1]. Note that accelerated stochastic dual coordinate ascent (ASDCA) [13] cannot be applied to this scenario, as ASDCA requires λ to be extremely small. We omit the comparison between ASPDC and the stochastic gradient descent method and its variants (e.g., SVRG [19] and Katyusha [26]), as extensive experiments comparing SPDC with these methods in this situation have already been performed in the literature.
Figure 1.
Dual-gap (y-axis) vs. the number of epochs (x-axis), comparing ASPDC with other methods for smooth hinge SVM on real-world datasets with regularization coefficient $\lambda = 0.01$. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic dual-gap.
Table 3.
The running time for the dual-gap to reach the given precision when λ is relatively large ($\lambda = 0.01$).
It can be seen from Figure 1 that ASPDC and SDCA have comparable performance when λ is relatively large. After several epochs, the dual-gap of ASPDC is lower than that of SPDC by two orders of magnitude at the same epoch.
Figure 1 shows that both SDCA and ASPDC are faster than SPDC. This is because λ in Figure 1 is relatively large (e.g., 0.01), and in this case the condition number of the problem is relatively small. When the condition number is large, ASPDC and SPDC perform better than SDCA. Overall, ASPDC is faster and is well suited to ill-conditioned problems.
Table 3 lists the running time needed for the dual-gap of each algorithm to decrease to the given precision on different datasets. Table 3 demonstrates that ASPDC and SDCA need less time to reach the given precision, verifying that the convergence of ASPDC and SDCA is faster than that of SPDC. Table 4 presents the total running time for the algorithms to go through the entire dataset once, which measures the per-iteration computation complexity; an algorithm with a shorter running time has a lower per-iteration computation complexity. Table 4 shows that ASPDC and SDCA have lower per-iteration complexity than SPDC. Across all of the running time results, ASPDC demonstrates both fast convergence and low per-iteration complexity when λ is large.
Table 4.
The average running time for the algorithms to pass through the entire dataset once when λ is relatively large ($\lambda = 0.01$).
We then tested the case when λ is relatively small and compared ASPDC-i with SDCA, SPDC, and ASDCA. Figure 2 plots the convergence results. It shows that the convergence of SDCA, ASDCA, and SPDC is significantly slower than that of the same algorithms in Figure 1. The reason is that the condition number of the problem in this test case is larger than that in Figure 1. As can be seen from Figure 2, ASPDC-i performs much better in this experiment: it needs far fewer epochs than the other algorithms to approach the same level of dual-gap, and it approaches a significantly lower dual-gap than the others within the same number of epochs.
Figure 2.
Dual-gap (y-axis) vs. the number of epochs (x-axis), comparing ASPDC-i with other methods for smooth hinge SVM on real-world datasets with a relatively small regularization coefficient λ. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic dual-gap.
In addition, we compared ASPDC-i to a widely used non-dual-based algorithm, SVRG [19]. As SVRG is not dual-based, we directly compared its reduction speed of the primal value with ASPDC-i. Figure 3 shows that the convergence speed of ASPDC-i is faster than SVRG.
Figure 3.
Optimal primal value (y-axis) vs. the number of epochs (x-axis), comparing ASPDC-i with SVRG for smooth hinge SVM on real-world datasets with a relatively small regularization coefficient λ. The x-axis is the number of passes through the entire dataset, and the y-axis is the logarithmic primal objective value.
Note that ASDCA cannot be applied in certain cases, such as the covtype dataset with the λ used here, as ASDCA needs an extra condition on λ. Table 5 lists the running time that the different algorithms spend decreasing the dual-gap to the given precision. Table 6 shows the total running time for the algorithms to go through the entire dataset once; it shows that ASPDC and ASDCA have lower per-iteration complexity than SPDC. Although SDCA has low per-iteration complexity, its convergence is the slowest among these methods when λ is relatively small; for this reason, we do not list the corresponding results of SDCA in Table 5 and Table 6. In summary, the above experiments show that our proposed methods achieve both fast convergence and low per-iteration complexity.
Table 5.
The running time for the dual-gap to reach the given precision when λ is relatively small.
Table 6.
The average running time for the algorithms to pass through the entire dataset once when λ is relatively small.
7. Conclusions and Future Work
In this paper, we propose two stochastic primal–dual coordinate methods: ASPDC and its further accelerated variant, ASPDC-i. These two algorithms are designed for the regularized empirical risk minimization problem. We proved theoretical convergence guarantees for the algorithms and performed a series of experiments. The results illustrate that our methods achieve a good balance between low per-iteration computation complexity and fast convergence. The new convergence proof presented here uses Nesterov's estimation sequence technique and assumes $g(w) = \frac{\lambda}{2}\|w\|_2^2$. We believe that it is possible to extend this proof to more general regularization functions g; however, we leave this as a possibility for future work.
Author Contributions
Writing—original draft, H.L.; Data curation, F.S. and X.L.; Writing—review & editing, H.C., H.W. and J.C. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Science and Technology Program of Guangzhou, China (No. 202002020045) and by the Meizhou Major Scientific and Technological Innovation Platforms and Projects of Guangdong Provincial Science & Technology Plan Projects under Grant No. 2019A0102005.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Appendix A.1. Proof of Lemma 1
We prove the three claims of Lemma 1: $P(w) = \max_{\alpha} f(w, \alpha)$, $D(\alpha) = \min_{w} f(w, \alpha)$, and the existence of a unique saddle point. We first prove $P(w) = \max_{\alpha} f(w, \alpha)$.
Proof.
In the last equation, we use the conjugate theorem from Convex Optimization Theory [25]. Then, we prove that $D(\alpha) = \min_{w} f(w, \alpha)$.
The proof of the remaining claim can be found in [1]. □
Appendix A.2. Proof of Lemma 2
Proof.
When $g(w) = \frac{\lambda}{2}\|w\|_2^2$, the primal objective can be written as follows:
The corresponding dual objective is
Note that throughout the algorithm we can set
Thus, the duality gap can be written as
Suppose that coordinate i is chosen at iteration t:
The variables in the algorithm are as follows:
where in the last inequality we use the definitions introduced above.
where inequality ① follows from the dual update rule, while inequality ② uses the fact that if $\phi_i$ is $\frac{1}{\gamma}$-smooth, then $\phi_i^*$ is $\gamma$-strongly convex.
On the one hand, according to (A7), we obtain
This implies that
On the other hand, by the definition of the convex conjugate function and the Fenchel conjugate subgradient theorem, we have
where in ③ we apply the Fenchel duality theorem.
Combining with (A6) and (A12), we have
where in the last inequality we use Assumption 3 ($\|a_i\| \le 1$). Recall that coordinate i is chosen uniformly at random; taking the expectation of (A13) with respect to i, we obtain
Recall that
where in ④ we use the definition of the dual update.
Combining the relations above, we have
Applying a well-known inequality (see [16]), we obtain
Combined with (A17), we obtain
This further implies that
Until now, the expectation has been taken with respect to the random coordinate i at a single iteration, conditioned on the history; taking the expectation over the entire history below, we obtain
In addition, it follows from (A17) that
This implies the claimed bound. □
References
- Zhang, Y.; Xiao, L. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 353–361.
- Ruppert, D. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Publ. Am. Stat. Assoc. 2010, 99, 567.
- Chiang, W.; Lee, M.; Lin, C. Parallel Dual Coordinate Descent Method for Large-scale Linear Classification in Multi-core Environments. In KDD ’16, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1485–1494.
- Hsieh, C.; Chang, K.; Lin, C.; Keerthi, S.S.; Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 408–415.
- Shalev-Shwartz, S.; Zhang, T. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. J. Mach. Learn. Res. 2013, 14, 567–599.
- Chang, K.W.; Hsieh, C.J.; Lin, C.J. Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines. J. Mach. Learn. Res. 2008, 9, 1369–1398.
- Platt, J.C. Fast Training of Support Vector Machines Using Sequential Minimal Optimization; MIT Press: Cambridge, MA, USA, 1999; pp. 185–208.
- Naskovska, K.; Lau, S.; Korobkov, A.A.; Haueisen, J.; Haardt, M. Coupled CP decomposition of simultaneous MEG-EEG signals for differentiating oscillators during photic driving. Front. Neurosci. 2020, 14, 261.
- Lee, S.; Kim, E.; Kim, C.; Kim, K. Localization with a mobile beacon based on geometric constraints in wireless sensor networks. IEEE Trans. Wirel. Commun. 2009, 8, 5801–5805.
- Wang, J.; Dong, P.; Jing, Z.; Cheng, J. Consensus-based filter for distributed sensor networks with colored measurement noise. Sensors 2018, 18, 3678.
- Anastassiu, H.T.; Vougioukas, S.; Fronimos, T.; Regen, C.; Petrou, L.; Zude, M.; Käthner, J. A computational model for path loss in wireless sensor networks in orchard environments. Sensors 2014, 14, 5118–5135.
- Deng, X.; Yin, L.; Peng, S.; Ding, M. An iterative algorithm for solving ill-conditioned linear least squares problems. Geod. Geodyn. 2015, 6, 453–459.
- Shalev-Shwartz, S.; Zhang, T. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014.
- Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces; Springer: New York, NY, USA, 2011; Volume 408.
- Güler, O. New proximal point algorithms for convex minimization. SIAM J. Optim. 1992, 2, 649–664.
- Nesterov, Y. Introductory Lectures on Convex Optimization; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2004; pp. xviii, 236.
- Frostig, R.; Ge, R.; Kakade, S.; Sidford, A. Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2540–2548.
- Lin, H.; Mairal, J.; Harchaoui, Z. A Universal Catalyst for First-Order Optimization. Available online: https://proceedings.neurips.cc/paper/2015/hash/c164bbc9d6c72a52c599bbb43d8db8e1-Abstract.html (accessed on 29 June 2022).
- Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the Advances in Neural Information Processing Systems, Tahoe, CA, USA, 5–10 December 2013; pp. 315–323.
- Xiao, L.; Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 2014, 24, 2057–2075.
- Zhang, Y.; Xiao, L. Stochastic primal-dual coordinate method for regularized empirical risk minimization. J. Mach. Learn. Res. 2017, 18, 2939–2980.
- Devolder, O.; Glineur, F.; Nesterov, Y. First-order methods of smooth convex optimization with inexact oracle. Math. Program. 2014, 146, 37–75.
- Schmidt, M.; Roux, N.L.; Bach, F.R. Convergence rates of inexact proximal-gradient methods for convex optimization. In Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; pp. 1458–1466.
- Hiriart-Urruty, J.B.; Lemaréchal, C. Fundamentals of Convex Analysis; Springer Science & Business Media: New York, NY, USA, 2012.
- Bertsekas, D.P. Convex Optimization Theory; Athena Scientific: Belmont, MA, USA, 2009.
- Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, Montreal, QC, Canada, 19–23 June 2017; pp. 1200–1205.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).