Adaptive Stochastic Gradient Descent Method for Convex and Non-Convex Optimization

Chen, Ruijuan; Tang, Xiaoquan; Li, Xiuting

doi:10.3390/fractalfract6120709

Open AccessArticle

Adaptive Stochastic Gradient Descent Method for Convex and Non-Convex Optimization

by

Ruijuan Chen

¹

,

Xiaoquan Tang

² and

Xiuting Li

^3,*

¹

Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, China

²

School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China

³

College of Science, Huazhong Agricultural University, Wuhan 430070, China

^*

Author to whom correspondence should be addressed.

Fractal Fract. 2022, 6(12), 709; https://doi.org/10.3390/fractalfract6120709

Submission received: 10 August 2022 / Revised: 19 November 2022 / Accepted: 23 November 2022 / Published: 29 November 2022

(This article belongs to the Special Issue Recent Advances in Fractional Differential Equations and Their Applications)

Download

Browse Figures

Versions Notes

Abstract

:

Stochastic gradient descent is the method of choice for solving large-scale optimization problems in machine learning. However, the question of how to effectively select the step-sizes in stochastic gradient descent methods is challenging, and can greatly influence the performance of stochastic gradient descent algorithms. In this paper, we propose a class of faster adaptive gradient descent methods, named AdaSGD, for solving both the convex and non-convex optimization problems. The novelty of this method is that it uses a new adaptive step size that depends on the expectation of the past stochastic gradient and its second moment, which makes it efficient and scalable for big data and high parameter dimensions. We show theoretically that the proposed AdaSGD algorithm has a convergence rate of

O (1 / T)

in both convex and non-convex settings, where T is the maximum number of iterations. In addition, we extend the proposed AdaSGD to the case of momentum and obtain the same convergence rate for AdaSGD with momentum. To illustrate our theoretical results, several numerical experiments for solving problems arising in machine learning are made to verify the promise of the proposed method.

Keywords:

stochastic gradient descent; adaptive step size; convex optimization; non-convex optimization

1. Introduction

Optimization based on stochastic gradients is of central practical significance in many scientific and engineering fields. Many problems in these areas can be reduced to optimization problems of some scalar parameterized objective function for which parameters need to be maximized or minimized. Recent years have witnessed the great success of machine learning, especially deep learning, in many fields, including computer vision, speech processing, and natural language processing. For many machine learning tasks, a critical and challenging problem is to design optimization algorithms to train neural network models. If the objective function is differentiable, stochastic gradient descent (SGD) is an efficient and effective optimization method that plays a central role in many machine learning successes. The SGD algorithm can be traced back to Robbins and Monro [1], who stated that the classical convergence analysis depends on the decreasing positive learning rate condition. Stochastic approximation methods have been widely studied in various areas of the literature [2,3,4], mainly focusing on the convergence of algorithms in different environments.

In recent years, the convergence speed of standard SGD has been greatly improved, and a number of methods to reduce variance have been adopted, such as vanilla SGD in the non-convex case [5]. However, vanilla SGD is too sensitive to the learning rate, making it difficult to adjust the appropriate learning rate, and its convergence performance is poor. There have been many attempts to achieve easily tunable learning rates and improve SGD performance, For example, in the case of the smooth and strongly convex objective function, the variance of stochastic gradient decrease [6,7,8,9], adaptive learning rate [10,11,12,13,14,15,16], averaging [17], momentum acceleration mechanism [18,19,20,21], and the Powerball method [22] are used, and a better self optimization control method is proposed using fractional-order Gaussian noise [23]. The most promising variance reduction technique is the stochastic variance reduction gradient (SVRG) [8,9]. In fact, these stochastic methods need to store and use the full batch of past gradients in order to progressively reduce the variance of the stochastic gradient estimator. For stochastic optimization problems, the number of training samples is usually large; consequently, the algorithm can be difficult to implement if the storage space is limited. Therefore, adaptive learning rate and momentum mechanisms are more suitable for stochastic optimization problems than variance reduction.

In addition to the classical optimization algorithms, several other popular stochastic optimization algorithms can be found in the current literature, for example, genetic algorithms, which are inspired by biological evolution [24], particle swarm optimization derived from the natural behavior of clusters [25,26], and the most recent dynamic stochastic fractal search optimization algorithm based on the adaptive strategy of fuzzy logic for diffusion parameters [27]. However, because heuristic algorithms are proposed based on experience without a theoretical basis, they lack a unified and complete theoretical framework. In addition, due to the use of non-deterministic polynomial theory, global optimality cannot be guaranteed when using the heuristic polynomial approach.

Adaptive step sizes have a long history in convex settings. They were first proposed in the online learning literature [28] and later applied to the random learning literature [12]. In a recent study, an adaptive projection gradient algorithm has been proposed for a special nonlinear fractional optimization problem with an objective function that is smooth convex in the numerator and smooth concave in the denominator [29]. In [30], a very weak condition is proposed for the non-convex function to converge to the global optimum almost everywhere, and in [31], a new convergence analysis method for SGD under a decreasing learning rate regime is proposed. In [16,32,33], the authors studied several classes of stochastic optimization algorithms enriched with heavy ball momentum, showing a linear rate for the stochastic heavy ball method (i.e., stochastic gradient descent method with momentum (SGDM)). This does not require large memory, merely requiring slightly more computation in each iteration compared with the vanilla SGD method. Therefore, both techniques have been widely used and demonstrated to be effective for training deep neural networks [10,13]. On the one hand, common SGD variants have been designed and analyzed under convex settings [12], and the results may not provide a relevant guarantee of convergence [13]. On the other hand, it is well known that linear convergence can be achieved even with constant step-size gradient descent under certain conditions. However, while most of the advanced SGD variants can achieve faster convergence rates by applying adaptive step size, the convergence rate is not yet ideal.

We summarize the main contributions of the present paper to the existing results in the literature as follows:

For smooth and convex functions, a novel adaptive step-size stochastic gradient descent (AdaSGD) method is proposed, and a momentum acceleration variant (AdaSGDM) is studied as well. It is proven that both have a convergence rate of $O (1 / T)$ , where T is the maximum number of iterations.
For smooth but non-convex functions, we show that both AdaSGD and AdaSGDM achieve global optimization with a convergence rate of $O (1 / T)$ .

The rest of this paper is organized as follows. In Section 2, we describe the optimization problem and present the AdaSGD and AdaSGDM method along with details of the adaptive step sizes. In Section 3, we prove the convergence rates of the proposed AdaSGD and AdaSGDM theoretically. Section 4 presents a practical implementation and discusses the experimental results on problems arising from machine learning. Finally, a brief conclusion and discussion of possible future work is presented in Section 5.

2. Problem Statement

Consider the following unconstrained minimization problem:

\begin{matrix} min_{x \in R^{d}} f (x), \end{matrix}

(1)

where

f : R^{d} \to R

is a differentiable function (though not necessarily convex). More concretely, we assume that

f (x)

has a Lipschitz gradient.

Assumption 1.

The continuously differentiable function

f : R^{d} \to R

is bounded below by

f^{*} : = {inf}_{x \in R^{d}} f (x) \in R

, and its gradient

\nabla f (x)

is L-Lipschitz; i.e., there exists a constant

L > 0

such that, for all

x, y \in R^{d}

,

\begin{matrix} ∥ \nabla f (x) - \nabla f (y) ∥ \leq L ∥ x - y ∥, \forall x, y \in R^{d}, \end{matrix}

where

∥ \cdot ∥

denotes the Euclidean norm.

Notice that the inequality does not imply the convexity of f. However, the assumption that f is L-smooth for any

x, y \in R^{d}

implies that ([34], Lemma 1.2.3)

\begin{matrix} | f (y) - f (x) - 〈 \nabla f (x), y - x 〉 | \leq \frac{L}{2} {∥ y - x ∥}^{2} . \end{matrix}

Because we are interested in solving (1) using stochastic gradient methods, we assume that at each

x \in R^{d}

we have access to an unbiased estimator of the true gradient

\nabla f (x)

, denoted by

g (x, ξ)

, where

ξ

is a source of randomness. Thus, we need the following assumptions, which analyze SGD under the assumptions that

f (x)

is lower bounded and that the stochastic gradients

g (x, ξ)

are unbiased and have bounded variance [5].

Assumption 2.

For any

k \geq 1

, the stochastic gradient oracle provides us an independent unbiased estimate

g (x_{k}, ξ_{k})

of

\nabla f (x_{k})

upon receiving query

x_{k} \in R^{d}

:

\begin{matrix} E [g (x_{k}, ξ_{k})] = \nabla f (x_{k}), \end{matrix}

where ξ is a random variable satisfying certain specific distributions and the variance of the random variable is bounded as follows:

\begin{matrix} E [∥ g (x_{k}, ξ_{k}) - \nabla f (x_{k}) ∥^{2}] \leq σ^{2}, \end{matrix}

for some parameter

0 \leq σ < \infty

.

It is worth noting that in the standard setting for SGD, the random vectors

ξ, k = 1, 2, \dots,

are independent of each other (and of

x_{k}

; see, e.g., [17]). Note that due to unbiasedness, Assumption 2 is the standard stochastic gradient oracle assumption used for SGD analysis and the standard variance bound is equivalent to

E [{∥ g (x, ξ) ∥}^{2}] \leq {∥ \nabla f (x) ∥}^{2} + σ^{2}

. Classic convergence analysis of the SGD algorithm relies on placing conditions on the positive step size

η_{k}

[1]. In particular, sufficient conditions are that

\begin{matrix} \sum_{k = 1}^{\infty} η_{k} = \infty, and \sum_{k = 1}^{\infty} η_{k}^{2} < \infty . \end{matrix}

The first condition is both necessary and intuitive, as it is necessary for the algorithm to be able to travel an arbitrary distance in order to reach the stationary point from the initial point. However, the second condition is actually unnecessary. Many popular step size choices, such as that of Adagrad [12], find it difficult to satisfy this condition, even when the step size can guarantee the convergence of Adagrad on convex sets.

2.1. Adaptive Step Size Stochastic Gradient Descent

More specifically, Adagrad [11] can be used to solve problem (1), as follows:

\begin{matrix} x_{k + 1} = x_{k} - \frac{η}{\sqrt{G (x_{k}, ξ_{k}) + ϵ}} ⊙ g (x_{k}, ξ_{k}), \end{matrix}

where the element-wise matrix-vector multiplication ⊙ between

G (x_{k}, ξ_{k})

and

g (x_{k}, ξ_{k})

, which here is

G_{k} \in R^{d \times d}

, is a diagonal matrix in which each diagonal element i,

i = 1, 2, \dots, d

is the sum of the squares of the gradients with respect to

x_{k}

up to time step k, while

ϵ

is a smoothing term that avoids division by zero (usually on the order of

1 \times 10^{- 8}

). Interestingly, without the square root operation, the algorithm performs much more poorly. In this work, we focus on SGD with adaptive step size promotion, which iteratively updates the solution via

\begin{matrix} x_{k + 1} = x_{k} - η_{k} g (x_{k}, ξ_{k}), \end{matrix}

with an arbitrary initial point

x_{0}

and adaptive step-size

η_{k}

;

ξ_{k}

is a random variable obeying some distribution. In the sequelae, we let

g_{k} : = g (x_{k}, ξ_{k})

denote a stochastic gradient and assume that we have access to a stochastic first-order black-box oracle that returns a noisy estimate of the gradient of f at any point

x \in R^{d}

. Unlike [11], in this paper we use the expectation of the stochastic gradient

g_{k}

and its second moment to design a new adaptive step size, then obtain a new kind of adaptive stochastic gradient descent method (AdaSGD).

The pseudo-code of our proposed AdaSGD algorithm is presented in Algorithm 1.

Algorithm 1 Adaptive Stochastic Gradient Descent (AdaSGD) Method

1:: Initialization: initialize $x_{0}$ and the maximum number of iterations T
2:: Iterate:
3:: for $k = 0, 1, 2, \dots, T$ do
4:: Compute the step size (i.e., learning rate) $η_{k} > 0$ .
5:: Generate a random variable $ξ_{k}$ .
6:: Compute a stochastic gradient $g (x_{k}, ξ_{k})$ .
7:: Update the new iterate $x_{k + 1} = x_{k} - η_{k} g (x_{k}, ξ_{k})$ .
8:: End for

2.2. Adaptive Step Size Stochastic Gradient Descent with Momentum

In addition, we consider the momentum acceleration variant of the proposed AdaSGD for application of the algorithm. Similarly, the difference from stochastic heavy-ball in [35] is the different selection of adaptive step size. The updates to the AdaSGDM are as follows:

\begin{matrix} x_{k + 1} = x_{k} - η_{k} g_{k} + β (x_{k} - x_{k - 1}), \end{matrix}

(2)

with

x_{- 1} = x_{0}

, where

β \in [0, 1)

is the momentum constant. Equivalently, denoting by

y_{k + 1} : = x_{k + 1} - x_{k}

, AdaSGDM can be implemented in two steps for

k = 1, 2, \dots

\begin{matrix} y_{k + 1} & = β y_{k} - η_{k} g_{k}, \\ x_{k + 1} & = x_{k} + y_{k + 1}, \end{matrix}

(3)

where

η_{k} > 0

,

β \in [0, 1)

. It is notable that during updating of

x_{k + 1}

, a momentum term is constructed based on the auxiliary sequence

{y_{k}}

. When

β = 0

, the method returns to AdaSGD. The pseudo-code of the AdaSGDM algorithm is presented in Algorithm 2.

Algorithm 2 Adaptive Stochastic Gradient Descent Momentum (AdaSGDM) Method

1:: Initialization: $β \neq 0$ , initialize $x_{- 1}, x_{0}$ and the maximum number of iterations T
2:: Iterate:
3:: for $k = 0, 1, 2, \dots, T$ do
4:: Compute the step size (i.e., learning rate) $η_{k} > 0$ .
5:: Generate a random variable $ξ_{k}$ .
6:: Compute a stochastic gradient $g (x_{k}, ξ_{k})$ .
7:: Update the new iterate:
8:: $y_{k + 1} = β y_{k} - η_{k} g_{k}$ ,
9:: $x_{k + 1} = x_{k} + y_{k + 1}$ .
10:: End for

To facilitate analysis of the stochastic momentum methods, we note that (3) implies the following recursions, which are straightforward to verify:

\begin{matrix} x_{k + 1} + p_{k + 1} = x_{k} + p_{k} - \frac{η_{k}}{1 - β} g_{k}, \end{matrix}

(4)

where

p_{k}

is provided by

\begin{matrix} p_{k} = \frac{β}{1 - β} (x_{k} - x_{k - 1}), k \geq 1, \end{matrix}

(5)

and

p_{0} = 0

. Let

v_{k} = \frac{1 - β}{β} p_{k}

; then,

\begin{matrix} v_{k + 1} = β v_{k} - η_{k} g_{k} . \end{matrix}

(6)

3. Convergence Analysis

In this section, without knowledge of the noise, we state the convergence results of AdaSGD and AdaSGDM under the convex settings in Section 3.1. Similarly, the convergence of the two methods under non-convex settings is analyzed in Section 3.2.

3.1. Adaptive Convergence Rates for Convex Functions

In this section, the convergence of AdaSGD and AdaSGDM under convex settings is discussed using the classical convergence analysis method under the specific adaptive step size iteration. Before stating the theorem for the convergence conclusion, we first provide the following technical Lemma for proving the theorem.

Lemma 1

([15]). When f is L-smooth, then

{∥ \nabla f (x) ∥}^{2} \leq 2 L (f (x) - f (x^{*})), \forall x \in R^{n}

, where

x^{*} = arg {min}_{x} f (x)

.

Next, we first provide the convergence results of AdaSGD and AdaSGDM in the case of convex functions.

Theorem 1.

Assumptions 1 and 2 hold if f is convex by designing an appropriate adaptive step size, as follows:

\begin{matrix} η_{k} = δ_{k} \cdot \frac{∥ E [g_{k}] ∥}{{(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥}, \end{matrix}

(7)

where

δ_{k} > 0

is a parameter. Then, the iterates of AdaSGD (

β = 0

) and AdaSGDM (

β \neq 0

) satisfy the following bound:

\begin{matrix} f ({\bar{x}}_{T}) - f (x^{*}) \leq \frac{1}{T + 1} \frac{1 - β}{2 C} {∥ x_{0} - x^{*} ∥}^{2}, \end{matrix}

(8)

where

{\bar{x}}_{T} = \frac{1}{T + 1} \sum_{k = 1}^{T} x_{k}

,

x^{*} = arg {min}_{x} f (x)

,

x_{- 1} = x_{0}

are random initial points, C is a positive constant, and T is the maximum number of iterations.

Proof.

From the iterative format (4), we can obtain

\begin{matrix} ∥ x_{k + 1} + p_{k + 1} - x^{*} ∥^{2} - ∥ x_{k} + p_{k} - x^{*} ∥^{2} = - \frac{2 η_{k}}{1 - β} 〈 g_{k}, x_{k} + p_{k} - x^{*} 〉 + \frac{η_{k}^{2}}{{(1 - β)}^{2}} {∥ g_{k} ∥}^{2} . \end{matrix}

(9)

The adaptive step-size we analyze here is a generalization of ones widely used in the online and stochastic optimization literature. As such, their good performance has already been validated using numerous empirical results. In particular, we consider in the following parts that the step size satisfies (7). In addition, for

∥ E [g_{k}] ∥

and

{(E [∥ g_{k} ∥^{2}])}^{1 / 2}

, there always exists

C_{1 k} \in (0, 1)

and

C_{2 k} > 1

such that

\begin{matrix} \frac{∥ E [g_{k}] ∥}{{(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥} \geq C_{1 k}, \end{matrix}

(10)

and

\begin{matrix} 1 < \frac{{(E [∥ g_{k} ∥^{2}])}^{1 / 2}}{{(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥} \leq C_{2 k} . \end{matrix}

(11)

Taking the conditional expectation with respect to

ξ_{1}, \dots, ξ_{k - 1}

, we can find that

\begin{matrix} E [η_{k} 〈 g_{k}, x_{k} + p_{k} - x^{*} 〉] \\ = & 〈 E [g_{k}], x_{k} + p_{k} - x^{*} ∥ 〉 \cdot \frac{∥ E [g_{k}] ∥}{{(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥} \cdot δ_{k} \\ \geq & 〈 \nabla f (x_{k}), x_{k} + p_{k} - x^{*} 〉 \cdot C_{1 k} δ_{k} \\ = & δ_{k} C_{1 k} \cdot 〈\nabla f (x_{k}), x_{k} - x^{*}〉 + δ_{k} C_{1 k} \cdot \frac{β}{1 - β} \cdot 〈\nabla f (x_{k}), x_{k} - x_{k - 1}〉 \\ \geq & δ_{k} C_{1 k} (f (x_{k}) - f (x^{*})) + δ_{k} C_{1 k} \cdot \frac{β}{1 - β} (f (x_{k}) - f (x_{k - 1})) \\ \geq & δ_{k} C_{1 k} (f (x_{k}) - f (x^{*})) + \bar{δ} C_{0} \cdot \frac{β}{1 - β} (f (x_{k}) - f (x_{k - 1})) . \end{matrix}

(12)

The first inequality is provided by (10), and the second inequality by the convexity of the function. The last inequality is provided by defining where

C_{0} : = min_{k = 0, \dots, T} C_{1 k}

and

\bar{δ} : = min_{k = 0, \dots, T} δ_{k}

. Hence, by summing (9) over

k = 0

to T and incorporating (12), we have

\begin{matrix} \frac{2}{1 - β} [\sum_{k = 0}^{T} δ_{k} C_{1 k} (f (x_{k}) - f (x^{*}))] \\ \leq & \frac{- 2 β}{{(1 - β)}^{2}} \bar{δ} C_{0} (f (x_{T}) - f (x_{- 1})) + \frac{1}{{(1 - β)}^{2}} E [\sum_{k = 0}^{T} η_{k}^{2} {∥ g_{k} ∥}^{2}] \\ + (∥ x_{0} + p_{0} - x^{*} ∥^{2} - ∥ x_{T + 1} + p_{T + 1} - x^{*} ∥^{2}) . \end{matrix}

Notice the initial conditions

x_{- 1} = x_{0}

,

p_{0} = 0

; then,

\begin{matrix} \frac{2}{1 - β} [\sum_{k = 0}^{T} δ_{k} C_{1 k} (f (x_{k}) - f (x^{*}))] \\ \leq & \frac{- 2 β}{{(1 - β)}^{2}} \bar{δ} C_{0} (f (x_{T}) - f (x_{0})) + \frac{1}{{(1 - β)}^{2}} E [\sum_{k = 0}^{T} η_{k}^{2} {∥ g_{k} ∥}^{2}] + {∥ x_{0} - x^{*} ∥}^{2} . \end{matrix}

(13)

Next, we consider the boundedness of the second term on the right side of (13):

\begin{matrix} E [\sum_{k = 0}^{T} η_{k}^{2} {∥ g_{k} ∥}^{2}] & = & \sum_{k = 0}^{T} E [η_{k}^{2} {∥ g_{k} ∥}^{2}] \\ = & \sum_{k = 0}^{T} E [∥ g_{k} ∥^{2}] \cdot δ_{k}^{2} \cdot \frac{∥ E [g_{k}] ∥^{2}}{{({(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥)}^{2}} \\ \leq & \sum_{k = 0}^{T} C_{2 k}^{2} \cdot δ_{k}^{2} \cdot {∥ \nabla f (x_{k}) ∥}^{2} \\ \leq & \sum_{k = 0}^{T} C_{2 k}^{2} \cdot δ_{k}^{2} \cdot 2 L (f (x_{k}) - f (x^{*})), \end{matrix}

(14)

the first inequality is provided by (11) and the second by Lemma 1. Substituting (14) into (13), we have

\begin{matrix} \frac{2}{1 - β} \sum_{k = 0}^{T} δ_{k} C_{1 k} (f (x_{k}) - f (x^{*})) & \leq & \frac{1}{{(1 - β)}^{2}} \sum_{k = 0}^{T} C_{2 k}^{2} \cdot δ_{k}^{2} \cdot 2 L (f (x_{k}) - f (x^{*})) \\ - \frac{2 β \bar{δ} C_{0}}{{(1 - β)}^{2}} (f (x_{T}) - f (x_{0})) + {∥ x_{0} - x^{*} ∥}^{2} . \end{matrix}

(15)

By recombining (15) and the definition of

C_{0}

,

\begin{matrix} \frac{2}{{(1 - β)}^{2}} \sum_{k = 0}^{T} ((1 - β) δ_{k} C_{1 k} - L C_{2 k}^{2} δ_{k}^{2}) (f (x_{k}) - f (x^{*})) \\ \leq & - \frac{2 β \bar{δ} C_{0}}{{(1 - β)}^{2}} (f (x_{T}) - f (x_{0})) + {∥ x_{0} - x^{*} ∥}^{2} \\ \leq & \frac{2 β \bar{δ} C_{0}}{{(1 - β)}^{2}} (f (x_{0}) - f (x^{*)}) + {∥ x_{0} - x^{*} ∥}^{2}, \end{matrix}

and we choose

δ_{k} < \frac{1 - β}{L} \frac{C_{1 k}}{C_{2 k}^{2}}

such that

(1 - β) δ_{k} C_{1 k} - L C_{2 k}^{2} \cdot δ_{k}^{2} > 0

.

Note that

0 < C_{1 k} < 1

and

0 < C_{1 k} + 1 < C_{2 k}

can be obtained from (10) and (11). Per the definition of

C_{0}

and

\bar{δ}

, and without loss of generality, we can assume that

\bar{δ} = δ_{k_{0}}

. Then,

\begin{matrix} \bar{δ} C_{0} = δ_{k_{0}} C_{0} \leq \frac{1 - β}{L} \frac{2 C_{1 k_{0}}}{C_{2 k_{0}}^{2}} C_{0} \leq \frac{1 - β}{L} \frac{2 C_{1 k_{0}}^{2}}{C_{2 k_{0}}^{2}} \leq \frac{1 - β}{L} . \end{matrix}

Let

C : = min_{k = 0, \dots, T} \{(1 - β) δ_{k} C_{1 k} - L C_{2 k}^{2} δ_{k}^{2}\}

; then,

\begin{matrix} \frac{2 C}{{(1 - β)}^{2}} \sum_{k = 0}^{T} (f (x_{k}) - f (x^{*})) \leq {∥ x_{0} - x^{*} ∥}^{2} + \frac{2 β}{1 - β} \frac{1}{L} (f (x_{0}) - f (x^{*)}), \end{matrix}

which means that

\begin{matrix} \sum_{k = 0}^{T} (f (x_{k}) - f (x^{*})) & \leq & \frac{{(1 - β)}^{2}}{2 C} {∥ x_{0} - x^{*} ∥}^{2} + \frac{1}{L} \frac{β (1 - β)}{C} (f (x_{0}) - f (x^{*)}) \\ \leq & \frac{{(1 - β)}^{2}}{2 C} ∥ x_{0} - x^{*} ∥^{2} + \frac{1}{L} \frac{β (1 - β)}{C} \frac{L}{2} {∥ x_{0} - x^{*} ∥}^{2} \\ \leq & \frac{1 - β}{2 C} {∥ x_{0} - x^{*} ∥}^{2} . \end{matrix}

Now, from Jensen’s inequality, we have

\begin{matrix} f ({\bar{x}}_{T}) - f (x^{*}) \leq \frac{1}{T + 1} \sum_{k = 0}^{T} (f (x_{k}) - f (x^{*})) \leq \frac{1}{T + 1} \frac{1 - β}{2 C} {∥ x_{0} - x^{*} ∥}^{2}, \end{matrix}

where

{\bar{x}}_{T} = \frac{1}{T + 1} \sum_{k = 0}^{T} x_{k}

. □

3.2. Adaptive Convergence for Non-Convex Optimization

We now turn to the case where f is non-convex. In practice, most loss functions are non-convex. Because the convexity of a function plays an important role in convergence analysis, the convergence conclusion is not valid in the case of non-convexity. However, there are few theoretical results about stochastic optimization convergence in non-convex environments. In this section, we analyze the convergence of AdaSGD and AdaSGDM under non-convex settings by applying the expectation of the stochastic gradient and the second moment to the design of the adaptive step size.

Theorem 2.

Let Assumptions 1 and 2 hold if f is non-convex. We choose the step size as in (7). Then, the iterates of AdaSGD satisfy the following bound:

\begin{matrix} min_{k = 1, \dots, T} {∥ \nabla f (x_{k}) ∥}^{2} \leq \frac{1}{T} \cdot \frac{1}{\hat{C}} (f (x_{0}) - f (x^{*})), \end{matrix}

where

x^{*}

is one of the minimum point of the function

f (x)

over

R^{d}

,

x_{0}

is a random initial point,

\hat{C}

is a positive constant, and T is the maximum number of iterations.

Proof.

Because

f (x)

is an L-smooth function, we have

\begin{matrix} f (y) - f (x) - 〈 \nabla f (x), y - x 〉 \leq \frac{L}{2} {∥ y - x ∥}^{2} . \end{matrix}

Then,

\begin{matrix} f (x_{k + 1}) & \leq & f (x_{k}) + 〈 \nabla f (x_{k}), x_{k + 1} - x_{k} 〉 + \frac{L}{2} {∥ x_{k + 1} - x_{k} ∥}^{2} \\ = & f (x_{k}) - 〈 \nabla f (x_{k}), η_{k} g_{k} 〉 + \frac{L}{2} η_{k}^{2} {∥ g_{k} ∥}^{2} . \end{matrix}

(16)

Using the expectation on both sides of (16),

\begin{matrix} E [f (x_{k + 1}) - f (x_{k})] & \leq & - E [〈 \nabla f (x_{k}), η_{k} g_{k} 〉] + \frac{L}{2} η_{k}^{2} E [∥ g_{k} ∥^{2}] \\ = & - η_{k} {∥ \nabla f (x_{k}) ∥}^{2} + \frac{L}{2} η_{k}^{2} E [∥ g_{k} ∥^{2}] . \end{matrix}

Now, by taking the adaptive step size as (7), we have

\begin{matrix} E [f (x_{k + 1}) - f (x_{k})] & \leq & - δ_{k} \cdot \frac{∥ E [g_{k}] ∥}{{(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥} \cdot {∥ \nabla f (x_{k}) ∥}^{2} \\ + \frac{L}{2} δ_{k}^{2} \cdot {(\frac{∥ E [g_{k}] ∥}{{(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥})}^{2} \cdot E [∥ g_{k} ∥^{2}] \\ = & - δ_{k} \cdot \frac{∥ E [g_{k}] ∥}{{(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥} \cdot {∥ \nabla f (x_{k}) ∥}^{2} \\ + \frac{L}{2} δ_{k}^{2} \cdot {(\frac{{(E [∥ g_{k} ∥^{2}])}^{1 / 2}}{{(E [∥ g_{k} ∥^{2}])}^{1 / 2} - ∥ E [g_{k}] ∥})}^{2} \cdot ∥ E [g_{k}] ∥^{2} . \end{matrix}

From (10) and (11),

\begin{matrix} E [f (x_{k + 1}) - f (x_{k})] \leq - δ_{k} C_{1 k} {∥ \nabla f (x_{k}) ∥}^{2} + \frac{L}{2} δ_{k}^{2} C_{2 k}^{2} \cdot ∥ E [g_{k}] ∥^{2}, \end{matrix}

we choose

δ_{k} < \frac{2}{L} \frac{C_{1 k}}{C_{2 k}^{2}}

such that

δ_{k} C_{1 k} - \frac{L}{2} δ_{k}^{2} C_{2 k}^{2} > 0

. Let

\hat{C} : = min_{k = 0, \dots, T} \{δ_{k} C_{1 k} - δ_{k}^{2} C_{2 k}^{2}\}

; then,

\begin{matrix} f (x_{k + 1}) - f (x_{k}) \leq - \hat{C} {∥ \nabla f (x_{k}) ∥}^{2} . \end{matrix}

(17)

By summing (17) for

k = 0, \dots, T

and averaging it,

\begin{matrix} \frac{\hat{C}}{T} \sum_{k = 0}^{T} {∥ \nabla f (x_{k}) ∥}^{2} \leq \frac{1}{T + 1} (f (x_{0}) - f (x_{T + 1})) \leq \frac{1}{T + 1} (f (x_{0}) - f (x^{*})), \end{matrix}

then,

\begin{matrix} min_{k = 0, \dots, T} {∥ \nabla f (x_{k}) ∥}^{2} \leq \frac{1}{T + 1} \cdot \frac{1}{\hat{C}} (f (x_{0}) - f (x^{*})), \end{matrix}

where

x^{*}

is one of the minimum point of the function

f (x)

over

R^{d}

. □

In order to prove the convergence of AdaSGDM under a non-convex setting, we first analyze the relationship between the local error bound of the function and the local variation and gradient. Second, the relationship between the local variation of gradient and gradient is further analyzed. Finally, the boundary of the gradient is obtained, that is, the convergence of AdaSGDM under a non-convex setting. Before we state the adaptive convergence of AdaSGDM for non-convex optimization, we first present the following two Lemmas.

Lemma 2.

Let

z_{k} = x_{k} + p_{k}

. For AdaSGDM, we have the following for any

k \geq 0

:

\begin{matrix} E [f (z_{k + 1}) - f (z_{k})] & \leq & \frac{1}{2 L} E [∥ \nabla f (z_{k}) - \nabla f (x_{k}) ∥^{2}] \\ - (\frac{1}{1 - β} C_{1 k} δ_{k} - \frac{L}{{(1 - β)}^{2}} C_{2 k}^{2} δ_{k}^{2}) E [∥ \nabla f (x_{k}) ∥^{2}], \end{matrix}

where L is the Lipschitz constant of f,

β \in [0, 1)

is the momentum constant as mentioned in (2),

C_{1 k}

and

C_{2 k}

are parameters in (10) and (11), and

δ_{k}

is the parameter in (7).

Proof.

Because

f (x)

is a smooth function, we have

\begin{matrix} f (y) - f (x) - 〈 \nabla f (x), y - x 〉 \leq \frac{L}{2} {∥ y - x ∥}^{2} . \end{matrix}

We define

ω_{k} = g_{k} - \nabla f (x_{k})

; then, from Assumption 2,

E [ω_{k}] = 0

can be obtained. Then,

\begin{matrix} f (z_{k + 1}) & \leq & f (z_{k}) + 〈 \nabla f (z_{k}), z_{k + 1} - z_{k} 〉 + \frac{L}{2} {∥ z_{k + 1} - z_{k} ∥}^{2} \\ = & f (z_{k}) - \frac{1}{1 - β} η_{k} \nabla f {(z_{k})}^{T} g_{k} + \frac{L}{2} \frac{1}{{(1 - β)}^{2}} η_{k}^{2} {∥ g_{k} ∥}^{2} \\ = & f (z_{k}) - \frac{1}{1 - β} η_{k} \nabla f {(z_{k})}^{T} (ω_{k} + \nabla f (x_{k})) + \frac{L}{2} \frac{1}{{(1 - β)}^{2}} η_{k}^{2} {∥ g_{k} ∥}^{2} \\ = & f (z_{k}) - \frac{η_{k}}{1 - β} \nabla f {(z_{k})}^{T} ω_{k} - \frac{η_{k}}{1 - β} \nabla f {(x_{k})}^{T} (\nabla f (z_{k}) - \nabla f (x_{k})) \\ - \frac{η_{k}}{1 - β} ∥ \nabla f (x_{k}) ∥^{2} + \frac{L}{2} \frac{1}{{(1 - β)}^{2}} η_{k}^{2} {∥ g_{k} ∥}^{2} . \end{matrix}

(18)

Recombining (18) and using the expectation of both sides,

\begin{matrix} E [f (z_{k + 1}) - f (z_{k})] & \leq & - \frac{1}{1 - β} η_{k} E [\nabla f {(x_{k})}^{T} (\nabla f (z_{k}) - \nabla f (x_{k}))] - \frac{1}{1 - β} η_{k} E [∥ \nabla f (x_{k}) ∥^{2}] \\ + \frac{L}{2} \frac{1}{{(1 - β)}^{2}} η_{k}^{2} E [∥ g_{k} ∥^{2}] \\ \leq & \frac{1}{2} E [\frac{1}{L} ∥ \nabla f (z_{k}) - \nabla f (x_{k}) ∥^{2} + \frac{L}{{(1 - β)}^{2}} η_{k}^{2} {∥ \nabla f (x_{k}) ∥}^{2}] \\ - \frac{1}{1 - β} η_{k} E [∥ \nabla f (x_{k}) ∥^{2}] + \frac{L}{2} \frac{1}{{(1 - β)}^{2}} η_{k}^{2} E [∥ g_{k} ∥^{2}], \end{matrix}

(19)

where the second inequality uses the inequality of the arithmetical and geometric means

(a b \leq \frac{1}{2} (a^{2} + b^{2}))

. By taking the adaptive step-size as (7) and substituting it into (19), we have

\begin{matrix} E [f (z_{k + 1}) - f (z_{k})] & \leq & \frac{1}{2 L} E [∥ \nabla f (z_{k}) - \nabla f (x_{k}) ∥^{2}] - \frac{1}{1 - β} C_{1 k} δ_{k} E [∥ \nabla f (x_{k}) ∥^{2}] \\ + \frac{L}{{(1 - β)}^{2}} C_{2 k}^{2} δ_{k}^{2} E [∥ \nabla f (x_{k}) ∥^{2}], \end{matrix}

using (10) and (11). □

Lemma 3.

For AdaSGDM, for any

k \geq 1

, we have

\begin{matrix} E [∥ \nabla f (z_{k}) - \nabla f (x_{k}) ∥^{2}] \leq \frac{L^{2} β^{2}}{{(1 - β)}^{2}} \cdot Γ_{k - 1} \sum_{i = 0}^{k - 1} β^{i} δ_{k - 1 - i}^{2} C_{1 k}^{2} {∥ \nabla f (x_{k - 1 - i}) ∥}^{2}, \end{matrix}

(20)

where

Γ_{k - 1} : = \sum_{i = 0}^{k - 1} β^{i} = \frac{1 - β^{k}}{1 - β}

, L is the Lipschitz constant of f, and

β \in [0, 1)

is the momentum constant as mentioned in (2). For

k \geq 1

,

C_{1 k}

and

C_{2 k}

are parameters in (10) and (11), respectively, and

δ_{k}

is the parameter in (7).

Proof.

Because f is L-smooth,

z_{k} = x_{k} + p_{k}

, and (5), we have

\begin{matrix} ∥ \nabla f (z_{k}) - \nabla f (x_{k}) ∥^{2} \leq L^{2} ∥ z_{k} - x_{k} ∥^{2} = L^{2} ∥ p_{k} ∥^{2} = \frac{L^{2} β^{2}}{{(1 - β)}^{2}} {∥ x_{k} - x_{k - 1} ∥}^{2} . \end{matrix}

(21)

Recall the recursion in (6), that is,

v_{k + 1} = β v_{k} - η_{k} g_{k}

. Note that

v_{0} = 0

. By induction, for

k \geq 1

,

\begin{matrix} v_{k} = - \sum_{i = 0}^{k - 1} β^{i} η_{k - 1 - i} g_{k - 1 - i} . \end{matrix}

(22)

Let

Γ_{k - 1} = \sum_{i = 0}^{k - 1} β^{i} = \frac{1 - β^{k}}{1 - β}

; then,

\begin{matrix} ∥ v_{k} ∥^{2} & = & ‖ \sum_{i = 0}^{k - 1} \frac{β^{i}}{Γ_{k - 1}} η_{k - 1 - i} g_{k - 1 - i} ‖^{2} \cdot Γ_{k - 1}^{2} \\ \leq & Γ_{k - 1}^{2} \sum_{i = 0}^{k - 1} \frac{β^{i}}{Γ_{k - 1}} η_{k - 1 - i}^{2} {∥ g_{k - 1 - i} ∥}^{2} \\ = & Γ_{k - 1} \sum_{i = 0}^{k - 1} β^{i} η_{k - 1 - i}^{2} {∥ g_{k - 1 - i} ∥}^{2} . \end{matrix}

(23)

Taking the expectation over both sides of (23) and noting the step size (7), we have

\begin{matrix} E [∥ v_{k} ∥^{2}] & \leq & Γ_{k - 1} \sum_{i = 0}^{k - 1} β^{i} η_{k - 1 - i}^{2} E [∥ g_{k - 1 - i} ∥^{2}] \\ \leq & Γ_{k - 1} \sum_{i = 0}^{k - 1} β^{i} δ_{k - 1 - i}^{2} C_{2, k - 1 - i}^{2} {∥ \nabla f (x_{k - 1 - i}) ∥}^{2} . \end{matrix}

Then, taking the expectation of both sides of (21) and substituting the above inequality into it, we have

\begin{matrix} E [∥ \nabla f (z_{k}) - \nabla f (x_{k}) ∥^{2}] & \leq & \frac{L^{2} β^{2}}{{(1 - β)}^{2}} E [∥ x_{k} - x_{k - 1} ∥^{2}] \\ = & \frac{L^{2} β^{2}}{{(1 - β)}^{2}} E [∥ v_{k} ∥^{2}] \\ \leq & \frac{L^{2} β^{2}}{{(1 - β)}^{2}} Γ_{k - 1} \sum_{i = 0}^{k - 1} β^{i} δ_{k - 1 - i}^{2} C_{2, k - 1 - i}^{2} {∥ \nabla f (x_{k - 1 - i}) ∥}^{2}, \end{matrix}

which means that (20) is established. □

Based on the previous Lemmas 2 and 3, we can now state the convergence analysis of AdaSGDM under non-convex settings.

Theorem 3.

Let Assumptions 1 and 2 hold, and let f be a non-convex and L-smooth function. Choosing the step size as in (7), the iteration sequence

x_{k}

obtained by AdaSGDM satisfies the following bound:

\begin{matrix} min_{k = 0, \dots, T - 1} E [∥ \nabla f (x_{k}) ∥^{2}] \leq \frac{1}{\bar{c} - \bar{d}} \frac{1}{T} (f (x_{0}) - f (x^{*})), \end{matrix}

where

x^{*} = arg {min}_{x} f (x)

,

\bar{c}, \bar{d} \geq 0

are constants,

\bar{c} > \bar{d}

, and T is the maximum number of iterations.

Proof.

From the initial conditions, it follows that

z_{0} = x_{0}

; thus, Lemmas 2 and 3 imply the following inequality:

\begin{matrix} E [f (z_{k + 1}) - f (z_{k})] & \leq & \frac{1}{2 L} \frac{L^{2} β^{2}}{{(1 - β)}^{2}} Γ_{k - 1} \sum_{i = 0}^{k - 1} β^{i} δ_{k - 1 - i}^{2} C_{2 k}^{2} E [∥ \nabla f (x_{k - 1 - i}) ∥^{2}] \\ - (\frac{1}{1 - β} C_{1 k} δ_{k} - \frac{L}{{(1 - β)}^{2}} C_{2 k}^{2} δ_{k}^{2}) E [∥ \nabla (x_{k}) ∥^{2}] . \end{matrix}

(24)

By summing (24) for

k = 0, \dots, T

,

\begin{matrix} E [f (z_{T + 1}) - f (z_{0})] \\ \leq & E [∥ \nabla f (z_{0}) - \nabla f (x_{0}) ∥^{2}] - \sum_{k = 0}^{T} (\frac{1}{1 - β} C_{1 k} δ_{k} - \frac{L}{{(1 - β)}^{2}} C_{2 k}^{2} δ_{k}^{2}) E [∥ \nabla f (x_{k}) ∥^{2}] \\ + \frac{1}{2 L} \frac{L^{2} β^{2}}{{(1 - β)}^{2}} \sum_{k = 1}^{T} Γ_{k - 1} \sum_{i = 0}^{k - 1} β^{i} δ_{k - 1 - i}^{2} C_{2 k}^{2} E [∥ \nabla f (x_{k - 1 - i}) ∥^{2}] \\ = & - \sum_{k = 0}^{T - 1} (c_{k} - d_{k}) E [∥ \nabla f (x_{k}) ∥^{2}] - c_{T} E [∥ \nabla f (x_{T}) ∥^{2}], \end{matrix}

where

c_{k} : = \frac{1}{1 - β} C_{1 k} δ_{k} - \frac{L}{{(1 - β)}^{2}} C_{2 k}^{2} δ_{k}^{2}

and

d_{k} : = \frac{1}{2} \frac{L β^{2}}{{(1 - β)}^{2}} δ_{k}^{2} C_{2 k}^{2} \sum_{i = k}^{T - 1} Γ_{i} β^{i - k}

. For

k = 0, \dots, T

, we choose

δ_{k} < \frac{1 - β}{L} \frac{C_{1 k}}{C_{2 k}^{2}} \frac{1}{2 + β^{2} \sum_{i = k}^{T - 1} Γ_{i} β^{i - k}}

. Thus, it is true that

c_{k} > d_{k}

for

k = 0, \dots, T - 1

as well as that

c_{T} > 0

. Then,

\begin{matrix} \sum_{k = 0}^{T - 1} (c_{k} - d_{k}) E [∥ \nabla f (x_{k}) ∥^{2}] & \leq & E [f (z_{0}) - f (z_{T + 1})] - c_{T} E [∥ \nabla f (x_{T}) ∥^{2}] \\ \leq & f (z_{0}) - f (z_{T + 1}) . \end{matrix}

Furthermore, because

z_{0} = x_{0}

and

x^{*} = arg {min}_{x} f (x)

, we have

\begin{matrix} min_{k = 0, \dots, T - 1} E [∥ \nabla f (x_{k}) ∥^{2}] \leq \frac{1}{\bar{c} - \bar{d}} \frac{1}{T} (f (z_{0}) - f (z_{T + 1})) \leq \frac{1}{\bar{c} - \bar{d}} \frac{1}{T} (f (x_{0}) - f (x^{*})), \end{matrix}

where

\bar{c} - \bar{d} = {min}_{k = 0, \dots, T - 1} {c_{k} - d_{k}}

. □

4. Experiments

In this section, we present experimental results of applying our adaptive schemes to several test problems. Section 4.1 focuses on regularized linear regression problem and regularized logistic regression problem, which are widely used in the machine learning community, while Section 4.2 considers the non-convex support vector machine (SVM) problem and non-convex quadratic problem. In both, we report the performance of AdaSGD and AdaSCDM and compare them with SGDM, Adam and Adagrad. In each instance, we set the step size for AdaSGD and AdaSGDM using the procedure in (7). To make the comparison equitable, the default parameter values for Adam are selected according to [9], especially

η = 0.001, β_{1} = 0.9, β_{2} = 0.999

and

ρ = 10^{- 8}

. For Adagrad, the initial step size is

η_{0} = 0.1

. Using random datasets, we prove that the proposed adaptive SGD can effectively solve practical deep learning problems.

The parameters in SGDM are set to a step size of

η = 0.001

and momentum coefficient of

β = 0.8

in the following applications. We repeated the experiment ten times and report the average results. All methods use the same random initialization; all figures in this section are in log–log scale, and the maximum number of iterations T = 10,000. Finally, all the algorithms involved in the experiment were implemented using MATLAB R2017a (9.2.0.538062) 64 bit software in Windows 10.

4.1. Convex Functions

Consider the following two convex optimization problems: an

l_{2}

-regularized quadratic function with

f_{1} (x) = {∥ A x - b ∥}^{2} + λ {∥ x ∥}_{2}^{2}

, and an

l_{2}

-regularized logistic regression for binary classification with

f_{2} (x) = \sum_{i = 1}^{m} log (1 + e^{- b_{i} A_{i} x}) + λ {∥ x ∥}_{2}^{2}

with the penalty parameter

λ = 0.1

, where

A \in R^{m \times n}

and

b \in R^{m}

. The entries of b are randomly −1 or 1. Rows

A_{i}

in A are generated by an i.i.d multivariate Gaussian distribution conditioned on

b_{i}

. We use a mini-batch of size n to compute a stochastic gradient at each iteration. Note that the gradients of functions

f_{1} (x)

and

f_{2} (x)

are continuous; we assume that random sampling of small batches from the datasets satisfies Assumption 2.

When

m = 60

and

n = 10

, the convergence paths of the procedure for minimizing different convex functions in SGDM, Adam, Adagrad, and the proposed AdaSGD and AdaSGDM is demonstrated in Figure 1, where the left subfigure in Figure 1, corresponding to function

f_{1} (x) = {∥ A x - b ∥}^{2} + λ {∥ x ∥}_{2}^{2}

, takes 10.337139 s, and the right subfigure in Figure 1, corresponding to function

f_{2} (x) = \sum_{i = 1}^{m} log (1 + e^{- b_{i} A_{i} x}) + λ {∥ x ∥}_{2}^{2}

, takes 4.947887 s. When

m = 10, 000

and

n = 200

. The results are shown in Figure 2, where the left and right subgraphs in Figure 2, corresponding to

f_{1} (x)

and

f_{2} (x)

, take 2130.277402 s and 442.714215 s, respectively.

From the left and right figures in Figure 1 and Figure 2, it is not difficult to see that AdaSGD and AdaSGDM show better convergence than existing stochastic optimization methods when considering the convex optimization problems of different models. Observe that SGDM displays local acceleration close to the optimal point and attains convergence rate of

O (1 / \sqrt{T})

, as shown in [36]. Adagrad shows a convergence rate of

O (1 / \sqrt{T})

, as mentioned in [11]. Adam eventually attains a rate of convergence of

O (1 / \sqrt{T})

, as shown in [10]. The proposed methods, AdaSGD and AdaSGDM, tend to converge faster than the SGDM, Adam, or Adagrad, showing a convergence of

1 / T

, which is consistent with our theory results in this paper.

4.2. Non-Convex Functions

Consider the following non-convex support vector machine (SVM) problem with a sigmoid loss function, which has previously been considered in [5] (the data points are generated in the same way as in Section 4.1):

{min}_{x \in R^{n}} f_{3} (x) : = \sum_{i = 1}^{m} [1 - tanh (b_{i} 〈 x, a_{i} 〉)] + λ {∥ x ∥}_{2}^{2}

, where

λ = 0.1

is a regularization parameter. In addition, consider the following non-convex optimization problem corresponding to the elastic regression network model [37]:

{min}_{x \in R^{n}} f_{4} (x) : = ∥ A x - b ∥ - λ_{1} {∥ x ∥}_{2}^{2} + λ_{2} {∥ x ∥}_{1}

, where

λ_{1} = 0.001

and

λ_{2} = 0.01

. Here, we use a mini-batch of size n to compute a stochastic gradient at each iteration. For minimizing the two non-convex functions

f_{3} (x)

and

f_{4} (x)

, the gradient of

f_{3} (x)

is obviously continuous. For

f_{4} (x)

it is easy to know that the derivative of

f_{4} (x)

at point

x = 0

does not exist; however, we can use the subgradient at this point. For example, one of the subgradients of

f_{4} (x)

here is

\partial f_{4} (0) = 0

. Although this gradient is discontinuous, it satisfies the Lipschitz condition, meaning that the conclusion in Theorem 3 holds. The convergence paths of the algorithms SGDM, Adam, Adagrad, AdaSGD, and AdaSGDM when (

m = 60

,

n = 10

) and (

m = 10, 000

,

n = 200

) are shown in Figure 3 and Figure 4. The CPU times of the left and right subfigures corresponding to

f_{3} (x)

and

f_{4} (x)

in Figure 3 are 10.808409 s and 5.824761 s, respectively, and those of

f_{3} (x)

and

f_{4} (x)

in Figure 4 are 2369.346180 s and 455.041079 s, respectively.

From the figures in Figure 3 and Figure 4, it can be seen that AdaSGD and AdaSGDM maintain good convergence when considering non-convex optimization problems of different models. For different non-convex objective functions with Lipschitz continuous gradients, it can be observed that the gradient converges in expectation at the order of

O (1 / \sqrt{T})

of SGDM, as shown in [36]. As described in [10], the convergence analysis for Adam is not applicable to non-convex problems, and it is only through experience that Adam is likely to perform better than other methods. The Adagrad algorithm displays a convergence rate of

O (log T / \sqrt{T})

under non-convex setting, as showed in [38]. For the proposed methods in this paper, AdaSGD and AdaSGDM, tend to converge faster than SGDM, Adam, and Adagrad under non-convex settings, showing a convergence of

1 / T

, which is consistent with our theory result in this paper.

5. Conclusions and Future Work

In this paper, two shortcomings of the adaptive stochastic gradient descent method for stochastic optimization problems are studied. The first is the assumption of a convex setting, which is often harsh in many practical optimization problems of machine learning. The second is slow convergence, which is a result of using the adaptive step size of past stochastic gradients, and is generally up to

O (1 / \sqrt{T})

. As a consequence, in this paper we first propose a new adaptive SGD in which the new step size is a function of the expectation of the past stochastic gradient and its second moment. In both convex and non-convex settings, the adaptive SGD with the new designed step size converges at the rate of

O (1 / T)

. Second, the new adaptive SGD is extended to the case with momentum, and again achieves a convergence rate of

O (1 / T)

, irrespective of convex or non-convex settings. To sum up, our results indicate that the designed adaptive step size is able to alleviate the problem of slow convergence caused by inherent variance to a certain extent. The proposed approach achieves accelerated convergence in convex setting, and works in non-convex settings as well. Experimental results show that the proposed adaptive stochastic gradient descent methods, both with and without momentum, have better convergence performance than existing methods. In the future, we hope to apply this method to large datasets or to actual data collection in order to better analyze its effectiveness.

Author Contributions

Writing—original draft, R.C.; Writing—review & editing, X.T.; Supervision, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the Fundamental Research Funds for the Central Universities No. 2662021LXQD001 and the National Natural Science Foundation of China with Grant No. 61903148.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank Ye Yuan for his helpful discussions.

Conflicts of Interest

The authors declare no conflict of interest.

References

Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
Chung, K.L. On a stochastic approximation method. Ann. Math. Stat. 1954, 25, 463–483. [Google Scholar] [CrossRef]
Polyak, B.T.; Juditsky, A.B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 1992, 30, 838–855. [Google Scholar] [CrossRef] [Green Version]
Ruszczyński, A.; Syski, W. A method of aggregate stochastic subgradients with on-line stepsize rules for convex stochastic programming problems. Math. Program. Stud. 1986, 28, 113–131. [Google Scholar]
Ghadimi, S.; Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 2013, 23, 2341–2368. [Google Scholar] [CrossRef] [Green Version]
Bach, F. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. 2014, 15, 595–627. [Google Scholar]
Xiao, L.; Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 2014, 24, 2057–2075. [Google Scholar] [CrossRef] [Green Version]
Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 315–323. [Google Scholar]
Cutkosky, A.; Busa-Fekete, R. Distributed stochastic optimization via adaptive stochastic gradient descent. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1910–1919. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Mustapha, A.; Mohamed, L.; Ali, K.; Hamlich, M.; Bellatreche, L.; Mondal, A. An Overview of Gradient Descent Algorithm Optimization in Machine Learning: Application in the Ophthalmology Field. In Proceedings of the Smart Applications and Data Analysis. SADASC 2020, Marrakesh, Morocco, 25–26 June 2020; pp. 349–359. [Google Scholar]
Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. 2011, 12, 257–269. [Google Scholar]
Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of Adam and beyond. arXiv 2019, arXiv:1904.09237. [Google Scholar]
Li, X.; Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv 2018, arXiv:1805.08114. [Google Scholar]
Yousefian, F.; Nedi, A.; Shanbhag, U.V. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica 2012, 48, 56–67. [Google Scholar] [CrossRef] [Green Version]
Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 2013, 19, 1574–1609. [Google Scholar] [CrossRef] [Green Version]
Qian, N. On the momentum term in gradient descent learning algorithms. Neural Netw. 1999, 12, 145–151. [Google Scholar] [CrossRef] [PubMed]
Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17. [Google Scholar] [CrossRef]
Nesterov, Y.E. A method for solving the convex programming problem with convergence rate O(1/k²). Sov. Math. Dokl. 1983, 269, 543–547. [Google Scholar]
Klein, S.; Pluim, J.P.W.; Staring, M.; Viergever, M.A. Adaptive stochastic gradient descent optimisation for image registration. Int. J. Comput. Vis. 2009, 81, 227. [Google Scholar] [CrossRef]
Yuan, Y.; Li, M.; Liu, J.; Tomlin, C.J. On the Powerball method for optimization. arXiv 2016, arXiv:1603.07421. [Google Scholar]
Viola, J.; Chen, Y.Q. A Fractional-Order On-Line Self Optimizing Control Framework and a Benchmark Control System Accelerated Using Fractional-Order Stochasticity. Fractal Fract. 2022, 6, 549. [Google Scholar] [CrossRef]
Holland, J.H. Genetic Algorithms understand Genetic Algorithms. Sci. Am. 1992, 267, 66–73. [Google Scholar] [CrossRef]
Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar]
Xu, K.; Cheng, T.L.; Lope, A.M.; Chen, L.P.; Zhu, X.X.; Wang, M.W. Fuzzy Fractional-Order PD Vibration Control of Uncertain Building Structures. Fractal Fract. 2022, 6, 473. [Google Scholar] [CrossRef]
Lagunes, M.L.; Castillo, O.; Valdez, F.; Soria, J.; Melin, P. A New Approach for Dynamic Stochastic Fractal Search with Fuzzy Logic for Parameter Adaptation. Fractal Fract. 2021, 5, 33. [Google Scholar] [CrossRef]
Auer, P.; Cesa-Bianchi, N.; Gentile, C. Adaptive and self-confident on-line learning algorithms. J. Comput. Syst. Sci. 2002, 64, 48–75. [Google Scholar] [CrossRef] [Green Version]
Prangprakhon, M.; Feesantia, T.; Nimana, N. An Adaptive Projection Gradient Method for Solving Nonlinear Fractional Programming. Fractal Fract. 2022, 6, 566. [Google Scholar] [CrossRef]
Bottou, L. Online learning and stochastic approximations. Online Learn. Neural Netw. 1998, 17, 142. [Google Scholar]
Nguyen, L.M.; Nguyen, P.H.; Richtárik, P.; Scheinberg, K.; Takáč, M.; van Dijk, M. New convergence aspects of stochastic gradient algorithms. J. Mach. Learn. Res. 2019, 20, 1–49. [Google Scholar]
Yan, Y.; Yang, T.; Li, Z.; Lin, Q.; Yang, Y. A unified analysis of stochastic momentum methods for deep learning. arXiv 2018, arXiv:1808.10396. [Google Scholar]
Xu, P.; Wang, T.; Gu, Q. Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions. In Proceedings of the International Conference on Machine Learning, Macau, China, 26–28 February 2018; pp. 5488–5497. [Google Scholar]
Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course; Springer Science & Business Media: Berlin, Germany, 2013; Volume 87. [Google Scholar]
Zou, F.; Shen, L. On the convergence of adagrad with momentum for training deep neural networks. arXiv 2018, arXiv:1808.03408. [Google Scholar]
Yang, T.; Lin, Q.; Li, Z. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv 2016, arXiv:1604.03257. [Google Scholar]
Facchinei, F.; Scutari, G.; Sagratella, S. Parallel selective algorithms for nonconvex big data optimization. IEEE Trans. Signal Process. 2015, 63, 1874–1889. [Google Scholar] [CrossRef]
Ward, R.; Wu, X.; Bottou, L. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6677–6686. [Google Scholar]

$Fractalfract 06 00709 g001 550$

Figure 1. The convergence paths of the algorithms SGDM, Adam, and Adagrad and the proposed methods AdaSGD and AdaSGDM when

m = 60

,

n = 10

, and the objective functions are smooth convex functions. Left: the objective function

f_{1} (x)

; Right: the objective function

f_{2} (x)

.

Figure 1. The convergence paths of the algorithms SGDM, Adam, and Adagrad and the proposed methods AdaSGD and AdaSGDM when

m = 60

,

n = 10

, and the objective functions are smooth convex functions. Left: the objective function

f_{1} (x)

; Right: the objective function

f_{2} (x)

.

$Fractalfract 06 00709 g001$

$Fractalfract 06 00709 g002 550$

Figure 2. The convergence paths of the algorithms SGDM, Adam, and Adagrad and the proposed methods AdaSGD and AdaSGDM when

m = 10, 000

,

n = 200

, and the objective functions are smooth convex functions. Left: the objective function

f_{1} (x)

; Right: the objective function

f_{2} (x)

.

Figure 2. The convergence paths of the algorithms SGDM, Adam, and Adagrad and the proposed methods AdaSGD and AdaSGDM when

m = 10, 000

,

n = 200

, and the objective functions are smooth convex functions. Left: the objective function

f_{1} (x)

; Right: the objective function

f_{2} (x)

.

$Fractalfract 06 00709 g002$

$Fractalfract 06 00709 g003 550$

Figure 3. The convergence paths of the algorithms SGDM, Adam, and Adagrad and the proposed methods AdaSGD and AdaSGDM when

m = 60

,

n = 10

, and the objective functions are smooth non-convex functions. Left: the objective function

f_{3}

(x); Right: the objective function

f_{4} (x)

.

Figure 3. The convergence paths of the algorithms SGDM, Adam, and Adagrad and the proposed methods AdaSGD and AdaSGDM when

m = 60

,

n = 10

, and the objective functions are smooth non-convex functions. Left: the objective function

f_{3}

(x); Right: the objective function

f_{4} (x)

.

$Fractalfract 06 00709 g003$

$Fractalfract 06 00709 g004 550$

Figure 4. The convergence paths of the algorithms SGDM, Adam, and Adagrad and the proposed methods AdaSGD and AdaSGDM when

m = 10, 000

,

n = 200

, and the objective functions are smooth non-convex functions. Left: the objective function

f_{3} (x)

; Right: the objective function

f_{4} (x)

.

Figure 4. The convergence paths of the algorithms SGDM, Adam, and Adagrad and the proposed methods AdaSGD and AdaSGDM when

m = 10, 000

,

n = 200

, and the objective functions are smooth non-convex functions. Left: the objective function

f_{3} (x)

; Right: the objective function

f_{4} (x)

.

$Fractalfract 06 00709 g004$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, R.; Tang, X.; Li, X. Adaptive Stochastic Gradient Descent Method for Convex and Non-Convex Optimization. Fractal Fract. 2022, 6, 709. https://doi.org/10.3390/fractalfract6120709

AMA Style

Chen R, Tang X, Li X. Adaptive Stochastic Gradient Descent Method for Convex and Non-Convex Optimization. Fractal and Fractional. 2022; 6(12):709. https://doi.org/10.3390/fractalfract6120709

Chicago/Turabian Style

Chen, Ruijuan, Xiaoquan Tang, and Xiuting Li. 2022. "Adaptive Stochastic Gradient Descent Method for Convex and Non-Convex Optimization" Fractal and Fractional 6, no. 12: 709. https://doi.org/10.3390/fractalfract6120709

APA Style

Chen, R., Tang, X., & Li, X. (2022). Adaptive Stochastic Gradient Descent Method for Convex and Non-Convex Optimization. Fractal and Fractional, 6(12), 709. https://doi.org/10.3390/fractalfract6120709

Article Menu

Adaptive Stochastic Gradient Descent Method for Convex and Non-Convex Optimization

Abstract

1. Introduction

2. Problem Statement

2.1. Adaptive Step Size Stochastic Gradient Descent

2.2. Adaptive Step Size Stochastic Gradient Descent with Momentum

3. Convergence Analysis

3.1. Adaptive Convergence Rates for Convex Functions

3.2. Adaptive Convergence for Non-Convex Optimization

4. Experiments

4.1. Convex Functions

4.2. Non-Convex Functions

5. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI