An Extended Gradient Method for Smooth and Strongly Convex Functions

In this work, we introduce an extended gradient method that employs the gradients of the preceding two iterates to construct the search direction for minimizing smooth and strongly convex functions in both centralized and decentralized settings. We establish linear convergence of the iterate sequences in both settings. Furthermore, numerical experiments demonstrate that the centralized extended gradient method achieves faster acceleration than the compared algorithms, and that the proposed search direction can also improve the convergence of existing algorithms in both settings.


Introduction
In this work, we consider the unconstrained optimization problem:

min_{x ∈ R^p} f(x). (1)

The cost function f : R^p → R is differentiable and strongly convex, and its gradient ∇f(x) is L-Lipschitz continuous. Starting from an initial point x_0 ∈ R^p with the convention ∇f(x_{−1}) = 0, the extended gradient method updates the iterates x_k for k ≥ 0 as

x_{k+1} = x_k − α(∇f(x_k) + ∇f(x_{k−1})), (2)

where the step size α > 0 is sufficiently small; its admissible range is analyzed below to ensure convergence of the extended gradient algorithm.

We first briefly discuss the accelerated scheme of gradient methods, in which the iterate x_{k+1} is updated using information from the preceding two iterates. A representative example is the gradient method with extrapolation,

x_{k+1} = x_k − α_k ∇f(x_k) + β_k (x_k − x_{k−1}),

where α_k is the step size and β_k ∈ [0, 1) is a scalar. When α_k and β_k are set to constants α and β, respectively, the method reduces to the heavy ball method [1]. In [2], a linear convergence rate was established for the heavy ball method applied to smooth, twice continuously differentiable, strongly convex functions. Further, the authors of [3] constructed an example showing that even for strongly convex objectives the heavy ball method can fail to converge when the function is not twice continuously differentiable; twice continuous differentiability is thus a necessary condition in the convergence guarantees of the heavy ball method. Based on this fact, for strictly convex quadratic functions, the recent work [4] obtained the iteration complexity of the heavy ball method under the assumption that appropriate upper and lower bounds on the eigenvalues of the Hessian matrix are known. In addition, several accelerated gradient descent methods with a structure similar to the heavy ball method were proposed in [5]; they converge linearly to the solution of smooth and strongly convex problems with the optimal iteration complexity. Like the heavy ball method, these accelerated methods fully exploit the difference between the preceding two iterates. Their main practical advantage for convex problems is that they reach the optimum more quickly [6].
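To make the update rule concrete, here is a minimal sketch of the extended gradient step on a toy quadratic, assuming the sum-of-two-gradients form of (2); the matrix Q, vector b, and step size below are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

# Toy strongly convex quadratic: f(x) = 0.5 x^T Q x - b^T x, grad f(x) = Q x - b
Q = np.array([[3.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(Q, b)          # the unique minimizer

def grad(x):
    return Q @ x - b

L = np.linalg.eigvalsh(Q).max()         # smoothness constant of f
alpha = 0.4 / L                         # a "sufficiently small" step size

x = np.zeros(2)                         # initial point x_0
g_prev = np.zeros(2)                    # convention: grad f(x_{-1}) = 0
for _ in range(500):
    g = grad(x)
    # extended gradient step: sum of the gradients of the preceding two iterates
    x = x - alpha * (g + g_prev)
    g_prev = g

print(np.linalg.norm(x - x_star))       # distance to the minimizer (small)
```

The admissible range of α is dictated by the analysis below; 0.4/L is simply a conservative choice that converges for this example.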
On the other hand, accelerated schemes can also update the variables using the gradients of several preceding iterates. As mentioned in [7,8], the optimistic gradient descent-ascent (OGDA) method and the inertial gradient algorithm with Hessian damping (IGAHD) update the variable via the difference of the gradients of the preceding two iterates. Besides, for the least mean squares (LMS) estimation of graph signals, the extended LMS algorithm in [9] and the proportionate-type graph LMS algorithm in [10] update the variable via a weighted sum of the gradients of several preceding iterates. Further, the Anderson acceleration gradient descent method (AA-PGA) in [11] updates the variable via a convex combination of the gradients of several preceding iterates; however, AA-PGA converges linearly to the optimal solution x* only under the assumption ∇²f(x*) ⪰ υI. Inspired by the above works, we introduce an extended gradient method in which the variables are updated along the direction of the sum of the gradients of the preceding two iterates. The main purpose of this work is to analyze the convergence of the extended gradient method for finding the optimal solution of problem (1). We show that when the step size is below a given upper bound, linear convergence of the extended gradient method is guaranteed.
Next, we consider the class of smooth and strongly convex functions on the real vector space equipped with the Euclidean norm ‖·‖ and inner product ⟨·,·⟩.

Analysis of the Extended Gradient Method
In this section, to establish the convergence of the extended gradient method, we first obtain an important inequality in Theorem 1, which bounds ‖x_{k+1} − x*‖ by a linear combination of quantities involving the preceding two iterates; this inequality is vital and is used throughout the subsequent analysis. Let f be µ-strongly convex with L-Lipschitz gradient, let the step size α satisfy the stated upper bound, and let {x_k}_{k∈N} be generated by the extended gradient method (2). Then the following hold.
(i) The iterates {x_k} satisfy the linear inequality given in the proof below, where x* denotes the optimal solution.
(ii) min_{2≤k≤K} f(x_k) converges to the optimal value f(x*) at the sub-linear rate O(1/K).
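The proof repeatedly uses standard consequences of µ-strong convexity and L-smoothness; for the reader's convenience, these are:

```latex
% mu-strong convexity (lower bound) and L-smoothness (upper bound):
f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{\mu}{2}\|y - x\|^2,
\qquad
f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\|y - x\|^2,
\qquad
\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|.
```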
Proof. (i) According to the update equation (2), for any x ∈ R^p we have the decomposition (3) into the terms Γ_1, Γ_2, and Γ_3. Substituting x = x*, and noting that ∇f(x*) = 0, we first bound the term Γ_1 in (4). Secondly, we lower-bound Γ_2 in (5) using the strong convexity of f(x). Finally, we bound the term Γ_3 by employing Γ_1 and Γ_2. Based on the smoothness of f, we obtain (6); adding and subtracting the inner product ∇f(x_{k−1})^T x* on the right-hand side of (6) and rearranging the terms yields (7). Then, by the bound (5) on Γ_2 (with k replaced by k − 1), (7) can be rewritten as (8), wherein, in terms of the update equation (2) and the bound (4) on Γ_1, the term ‖x_k − x_{k−1}‖² in (8) can be further expanded as (9). The bound (10) on Γ_3 follows. Substituting (4), (5), and (10) (the bounds on Γ_1, Γ_2, and Γ_3) into (3), we obtain (11). Since f(x_k) ≥ f(x*) in (11), the proof of (i) is complete.
(ii) Rearranging the terms in (11), and noting that the step size α lies within the stated upper bound, summing over k = 2, …, K yields a telescoping bound, which implies that min_{2≤k≤K} f(x_k) converges to the optimal value f(x*) at the sub-linear rate O(1/K).

The result in (ii) of Theorem 1 shows that the function values generated by the extended gradient method converge to the optimal value of problem (1) at a sublinear rate. Further, in order to derive the linear convergence of the extended gradient method, we give two important lemmas as follows.

Lemma 2 ([14], Lemma 1 in Section 2.1). The spectral radius ρ(A) gives the asymptotic growth rate of the powers A^k: for every ε > 0 there exists a constant c = c(ε) > 0 such that ‖A^k‖ ≤ c(ρ(A) + ε)^k for all k ≥ 0.

Together with (i) of Theorem 1, Lemma 1, and Lemma 2, we obtain the following Theorem 2. For notational convenience, we define the vector z_k collecting the error quantities of the preceding two iterates.

Theorem 2. Let the assumptions of Theorem 1 hold, let the sequence {x_k} be generated by the extended gradient method, and let x* be the optimal solution. Then for every ε > 0 there exists c = c(ε) such that ‖z_{k+1}‖ ≤ c(ρ(A) + ε)^k ‖z_0‖ for all k, where ρ(A) < 1 is a constant.
Proof. By the result in (i) of Theorem 1, we readily find that z_{k+1} ≤ A z_k componentwise, where A is the coefficient matrix from (i). If the spectral radius of A is less than 1 (i.e., ρ(A) < 1), the convergence statement follows from Lemma 2. Thus, it remains to analyze the spectral radius of A and show that ρ(A) < 1. To this end, we seek a positive vector c = [c_1, c_2, c_3]^T satisfying Ac < c, which is equivalent to the inequalities in (16). For the first inequality in (16) to hold, it is necessary that 2α²L² − αµ < 0, i.e., α < µ/(2L²). Then, according to the sign of 2α²L² + 2α³L³ − αµ, we further deduce the bounds on α by considering the following two cases.
(i) If 2α²L² + 2α³L³ − αµ > 0, the first inequality of (16) can be rearranged as (17). Since the left-hand side of (17) is less than 0 and 0 < c_1 < c_3, we obtain (18); it then follows from (18) that the resulting coefficient is negative, which contradicts the assumption of this case. (ii) If 2α²L² + 2α³L³ − αµ ≤ 0, the first inequality of (16) can be written as (19), using 0 < c_2 < c_3. Then, since the left-hand side of (19) is less than 0 and 0 < c_1 < c_3, we obtain (20). Rearranging the terms of (20) gives 2(2α²L² + 2α³L³ − αµ)c_3 < 0, which is consistent with the assumption of this case. Therefore, the upper bound on the step size α can be reduced accordingly. Finally, we have demonstrated that there exists a positive vector c with Ac < c, and hence ρ(A) < 1.

By induction and by the selection of the initial points, it can be ensured that ‖∇f(x_{k−1})‖ ≤ 2‖∇f(x_k)‖ for all k ∈ N. This is similar to the inexact condition considered in [14] (Theorem 2 in Section 4.2.3) and in [15,16]. Note that [14] considers methods of the corresponding form with η ∈ (0, 1) for convex functions with Lipschitz continuous gradients, while [15,16] consider methods of the same type with η ∈ (0, 2/3) for smooth functions without convexity. The constant 2 in the extended gradient method thus relaxes the inexact condition.
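A quick numerical illustration of Lemma 2: for any matrix with ρ(A) < 1, the powers A^k drive a vector to zero, with per-step contraction approaching ρ(A) asymptotically. The matrix below is an arbitrary example, not the coefficient matrix of Theorem 2.

```python
import numpy as np

# Illustration of Lemma 2: if rho(A) < 1, then ||A^k z0|| decays geometrically,
# at an asymptotic per-step rate approaching rho(A).
A = np.array([[0.5, 0.2, 0.0],
              [0.1, 0.6, 0.1],
              [0.0, 0.2, 0.4]])
rho = max(abs(np.linalg.eigvals(A)))    # spectral radius, here < 1

z = np.ones(3)
norms = []
for _ in range(60):
    z = A @ z
    norms.append(np.linalg.norm(z))

# geometric-mean contraction over the last 10 steps, close to rho(A)
ratio = (norms[-1] / norms[-11]) ** 0.1
print(rho, ratio)
```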

Analysis of the Decentralized Extended Gradient Method
Traditional (centralized) optimization methods are founded on the strong assumption that all data samples are stored and processed at a single computing unit [17]. However, in networked multi-agent systems, the data and function information related to the optimization problem are distributed across the agents, and distributed schemes are introduced into the optimization framework [18]. In the distributed fashion, all data are transmitted to a fusion center that performs the optimization and scatters the updated information back to the agents. Nevertheless, this fusion-center architecture is gradually being replaced by the decentralized manner for reasons of privacy and communication cost. Decentralized optimization is designed for networked systems; it fully utilizes the resources of multiple computing units and improves the reliability of the network.
In this section, we focus on distributed gradient methods. The earlier distributed gradient methods [19-21] converge slowly to the optimal solution because they use a diminishing step size; with a constant step size, these methods only converge to a neighborhood of the optimal solution. Recently, distributed gradient methods [22-24] combined with additional techniques have been shown to converge linearly to the optimal solution. These techniques include multi-step consensus, the difference between two consecutive iterates, and gradient tracking, all of which incur extra computation and communication burden. Thus, we investigate the behavior of the decentralized extended gradient method, which works with a simple update mechanism.

Decentralized Extended Gradient Method
This section develops a decentralized version of the extended gradient method for the following problem:

min_{x∈R^p} f(x) = (1/n) Σ_{i=1}^n f_i(x), (22)

where each local cost function f_i is L-smooth and µ-strongly convex, and the problem is defined over a network of n agents. To apply the extended gradient method to problem (22) in the decentralized setting, all agents carry out the extended gradient method in parallel. A set of local variables {x_i ∈ R^p}_{i=1}^n is introduced, and x_i is assigned to agent i. Denote the aggregate variable X and the aggregate gradient ∇F(X) by X = (x_1, x_2, …, x_n)^T and ∇F(X) = (∇f_1(x_1), ∇f_2(x_2), …, ∇f_n(x_n))^T. The decentralized extended gradient method updates the iterates as

X_{k+1} = W X_k − α(∇F(X_k) + ∇F(X_{k−1})), (23)

where the weight matrix W is symmetric and doubly stochastic, i.e., W^T = W, W1_n = 1_n, and 1_n^T W = 1_n^T, with 1_n = (1, …, 1)^T ∈ R^n. The weight matrix W is generated based on the network: W_ij > 0 when agents i and j are neighbors or i = j, and W_ij = 0 otherwise. In the decentralized setting, the local function f_i is accessed only by agent i, i = 1, …, n. All agents cooperate to optimize the global function f via local intra-node computation and inter-node communication.
The local computation is performed on each individual agent: the local variable of agent i is updated using the gradient of the local function f_i. The communication is performed over the network: each local variable is combined with a weighted average of its neighbors' local variables, which enforces consensus among all local variables.
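As a sketch of how the decentralized update might be implemented, consider local quadratics f_i(x) = ½‖x − c_i‖² on a small ring network; the costs, ring weights, step size, and the W X_k − α(∇F(X_k) + ∇F(X_{k−1})) form are illustrative assumptions, not the paper's experimental setup. Consistent with the discussion above, with a constant step size the individual iterates settle only in a neighborhood of the optimum, although for this symmetric example the network average still matches it.

```python
import numpy as np

# Sketch (assumed form) of the decentralized extended gradient update
#   X_{k+1} = W X_k - alpha * (grad F(X_k) + grad F(X_{k-1}))
# with local quadratics f_i(x) = 0.5 * ||x - c_i||^2 on a 4-agent ring.
n, p = 4, 2
c = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
x_star = c.mean(axis=0)                  # minimizer of (1/n) sum_i f_i

# Symmetric, doubly stochastic ring weights: self 1/2, each neighbor 1/4
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

def grad_F(X):
    return X - c                         # row i is grad f_i(x_i)

alpha = 0.1
X = np.zeros((n, p))
G_prev = np.zeros((n, p))                # convention: gradients at k = -1 are 0
for _ in range(2000):
    G = grad_F(X)
    X = W @ X - alpha * (G + G_prev)
    G_prev = G

mean_err = np.abs(X.mean(axis=0) - x_star).max()   # network average: exact here
spread = np.abs(X - x_star).max()                  # per-agent: O(alpha) bias
print(mean_err, spread)
```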

Convergence Result
We first define the average vector x̄_k = (1/n) Σ_{i=1}^n x_{i,k}. We then bound ‖x̄_k − x*‖ and ‖X_k − 1_n x̄_k^T‖ using linear combinations of the corresponding quantities at the preceding two iterates.

Lemma 3. Let {X_k}_{k∈N} be the iterates generated by the updates in (23). If ξ ∈ (α − (1 − (n − 1)α), α + (1 − (n − 1)α)), then the stated relation holds.

Proof. First, we establish a linear system by deriving upper bounds on ‖X_{k+1} − 1_n x̄_{k+1}^T‖ and ‖x̄_{k+1} − x*‖. Next, we find the range of α such that ρ(A) < 1. From the update in (23), we derive the following.

Step 1. Bound ‖X_{k+1} − 1_n x̄_{k+1}^T‖.

Numerical Experiments
In this section, we perform computational experiments on both the centralized and the decentralized problems to demonstrate the efficiency of the extended gradient method.We compare our proposed method with many well-known algorithms.All algorithms are programmed in Matlab (R2018b, MathWorks, Natick, MA, USA) and run on a PC (Lenovo-M430, Beijing, China) with a 2.90 GHz Intel Core i5-10400 CPU and 8 GB of memory.

Centralized Problem
We consider the performance of the extended gradient method applied to the LASSO problem:

min_{x∈R^p} f(x) = (1/2)‖Mx − b‖² + ν‖x‖_1,

where the random matrix M ∈ R^{m×p} is generated from the uniform distribution U(0,1), and b = Mx̃, where x̃ also follows the uniform distribution U(0,1) with sparsity r. In particular, we set the parameters m = 512, p = 1024, r = 0.1, and the initial point x_0 ∼ U(0,1). The algorithm terminates when the maximum number of iterations is reached or a prescribed tolerance is satisfied. When the regularization parameter ν is zero, the LASSO problem reduces to the least squares problem; otherwise, the objective function f(x) is nonsmooth.
Since the nonsmooth term ‖x‖_1 is the sum of the absolute values of the components of x, we employ the proximal gradient method and an approximation method based on the Huber function, respectively, to solve the LASSO problem. The Huber function approaches the absolute value function as its parameter θ approaches 0.
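For reference, a minimal sketch of the Huber smoothing of the absolute value, assuming the standard parameterization (the paper's exact form is not reproduced in the text):

```python
import numpy as np

# Huber smoothing of |t| with parameter theta > 0 (assumed standard form):
#   h_theta(t) = t^2 / (2 theta)   if |t| <= theta,   |t| - theta/2 otherwise.
# As theta -> 0, h_theta approaches |t|, and h_theta is differentiable everywhere.
def huber(t, theta):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= theta, t ** 2 / (2 * theta), np.abs(t) - theta / 2)

def huber_grad(t, theta):
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= theta, t / theta, np.sign(t))

theta = 0.02                              # the value used in the experiments
x = np.array([-1.0, 0.01, 1.0])
print(huber(x, theta))
print(huber_grad(x, theta))
```

Applied componentwise to x, the smoothed term Σ_j h_θ(x_j) replaces ‖x‖_1 in the objective, making plain gradient methods applicable.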
To study the influence of the previous gradients on the search direction, we construct search directions from the current gradient (denoted g_1), the sum of the gradients of the preceding two iterates (denoted g_2), and the sum of the gradients of the preceding three iterates (denoted g_3), respectively. Figure 1 shows the performance of the gradient descent method (GD) for the least squares problem, and of the proximal gradient method (PG) and the Huber-based approximation method (AH) for the LASSO problem (ν = 0.001) under the three search directions. We set the parameter θ to 0.02. For the three search directions, the step sizes are 1/L, 0.8/L, and 0.4/L, respectively. We observe that when the sum of the gradients of the preceding two iterates (g_2) is used as the search direction, the algorithms achieve better acceleration than with the other two search directions (g_1 and g_3). On the other hand, we compare the extended gradient method (our algorithm) with the gradient descent method, the heavy ball method (iPiasco) [2], and the Nesterov accelerated gradient descent method [5]. Specifically, for the gradient descent method, the step size α is 1/L; for the extended gradient method, α is 0.8/L; for the heavy ball method, α is 4/(√µ + √L)² and the extrapolation parameter β is ((√L − √µ)/(√L + √µ))²; for the Nesterov accelerated gradient descent method, the extrapolation parameter is (k − 2)/(k + 1) and the step size α is 1/L.
Figure 2 depicts the comparison of all methods on the least squares problem. We notice that the extended gradient method outperforms the other algorithms in terms of the number of iterations. Moreover, to show that the sum of two gradients (g_2) can accelerate the convergence of other algorithms, we use it to construct a new search direction for the Nesterov accelerated gradient descent method and iPiasco. We set the step size α to 0.8/L and ν to 0.001. From Figure 3, we observe that with the new search direction, both the proximal Nesterov accelerated gradient descent method and the proximal iPiasco achieve better acceleration performance.

Decentralized Problem
We illustrate the performance of the decentralized extended gradient method (DGD_g2) on the decentralized LASSO problem, where the random matrix M_i ∈ R^{m×p} is generated from the uniform distribution U(0,1) and each column of M_i is normalized to unit norm. We set m = 512, p = 1024, r = 0.1, ν = 0.0001, and b_i = M_i x̃, where x̃ also follows the uniform distribution U(0,1) with sparsity r. The undirected graph is generated by the Erdős-Rényi model [25] with connection probability p = 0.8 for each pair of nodes, and the weight matrix is W = (I + M)/2, where M is generated with the Metropolis constant edge rule [26]. To demonstrate the effect of the previous gradient on the search direction of decentralized algorithms, we use the sum of the gradients of the preceding two iterates (g_2) to construct the search direction in the accelerated penalty method with consensus (APM_C) [27], the decentralized gradient descent (DGD) [21], and the decentralized stochastic gradient descent (DSGD) [28]. For DGD, DSGD, DGD_g2, and DSGD_g2, we hand-tune the step sizes to 0.09, 0.9, 0.058, and 0.58, respectively. For APM_C and APM_Cg2, the step sizes are 1/L and 0.1/L. The initial condition X_0 is 0_{n×p}. Compared with the original APM_C, DGD, and DSGD, Figure 4 shows that the algorithms using the gradients of the preceding two iterates (g_2) as the search direction outperform those using the current negative gradient.
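A sketch of one plausible construction of the experiment's weight matrix W = (I + M)/2, with M built by the common Metropolis rule on an Erdős-Rényi graph; the 1/(1 + max degree) rule shown here is the textbook form, and the sizes and seed are illustrative.

```python
import numpy as np

# Build W = (I + M)/2 with Metropolis weights on an Erdos-Renyi graph.
rng = np.random.default_rng(0)
n, prob = 10, 0.8

# symmetric adjacency without self-loops
A = np.triu((rng.random((n, n)) < prob).astype(float), 1)
A = A + A.T
deg = A.sum(axis=1)

M = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if A[i, j] > 0:
            M[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))  # Metropolis weight
    M[i, i] = 1.0 - M[i].sum()           # self-weight keeps rows summing to 1

W = (np.eye(n) + M) / 2
print(np.allclose(W, W.T), np.allclose(W.sum(axis=1), 1.0))
```

By construction M is symmetric and row stochastic, hence doubly stochastic, and averaging with I keeps W's diagonal strictly positive.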
Finally, we apply the proposed method to a classification problem. We first introduce the radial basis function network (RBFN), which uses a radial basis function ψ(s; c) as the activation function, where s is a sample and c is a center. The output of the RBFN is a linear combination of the radial basis functions ψ(s; c) with neuron weight parameters x. The training error that evaluates the performance of the RBFN is given in (44). The training samples are {(s_1, 1), …, (s_{N/2}, 1), (s_{N/2+1}, −1), …, (s_N, −1)}. For the decentralized extended gradient method, the step size is set to 1/(L + µ). Figure 5 shows the classification result; the classification accuracy is 1, which further indicates the efficiency of our method.
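As an illustration of the RBFN output described above, the following sketch uses a Gaussian radial basis function; the kernel form and bandwidth σ are assumptions, since the text does not specify them.

```python
import numpy as np

# Sketch of the RBFN output: a linear combination of radial basis activations
# psi(s; c), here assumed Gaussian: psi(s; c) = exp(-||s - c||^2 / (2 sigma^2)).
def rbf_features(S, centers, sigma=1.0):
    # S: (N, d) samples; centers: (m, d); returns the (N, m) feature matrix
    d2 = ((S[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

S = np.array([[0.0, 0.0], [1.0, 1.0]])   # two toy samples
C = np.array([[0.0, 0.0], [2.0, 0.0]])   # two centers
x = np.array([1.0, -1.0])                # neuron weight parameters

out = rbf_features(S, C) @ x             # RBFN outputs for the two samples
print(out)
```

Training the weights x against the labeled samples then reduces to a smooth least-squares problem, to which the (decentralized) extended gradient method applies.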

Conclusions
Gradient methods have been widely employed to solve optimization problems. In this paper, we first introduced an extended gradient method for centralized and decentralized smooth and strongly convex problems. Secondly, we established the linear convergence of the extended gradient method in the centralized and decentralized settings, respectively. Finally, numerical examples demonstrated the convergence and validated the efficiency of the method compared with current classical methods. In the future, the acceleration technique used in this work can be applied to more optimization problems.

Theorem 1. Assume that f is µ-strongly convex and the gradient ∇f(x) is L-Lipschitz continuous, and let the step size α satisfy the condition 0 ≤ α ≤ …

Figure 1. Performance of optimization algorithms with three search directions.

Figure 2. Comparison of different algorithms for the least squares problem.
Figure 3. Performance of the proximal Nesterov accelerated gradient descent method and the proximal iPiasco with the g_2 search direction.

Figure 4. Comparison of different algorithms for the LASSO problem.

Figure 5. Classification result of the extended gradient method.