Abstract
In this study, we analyze the convergence rate of Adagrad with momentum for non-convex optimization problems. We establish the first dimension-independent convergence rate under the $(L_0, L_1)$-smoothness assumption, which is a generalization of the standard L-smoothness. We establish this rate under bounded noise in the stochastic gradients, where the bound can scale with the current optimality gap and gradient norm.
MSC:
46N10
1. Introduction
For a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ with $\inf_{x \in \mathbb{R}^d} f(x) > -\infty$, we consider the following minimization problem:
$$\min_{x \in \mathbb{R}^d} f(x). \tag{1}$$
Stochastic iterative algorithms that leverage first-order derivative information, such as stochastic gradient descent (SGD), are popular tools for solving problem (1). The step size heavily influences the convergence properties of these methods; however, tuning and adjusting this parameter during model training can be time-consuming and computationally expensive. To mitigate these challenges, various adaptive step size algorithms have been proposed.
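As a concrete point of comparison (an illustrative sketch of ours, not an algorithm from this paper), the following Python fragment contrasts a fixed-step SGD update with an Adagrad-style adaptive update; `stochastic_grad` is a hypothetical oracle returning a noisy gradient.

```python
import numpy as np

def sgd_step(x, eta, stochastic_grad):
    # Plain SGD: the fixed step size eta must be tuned by hand.
    return x - eta * stochastic_grad(x)

def adagrad_step(x, v, eta, stochastic_grad, eps=1e-8):
    # Adagrad: accumulated squared gradients v rescale the step
    # coordinate-wise, so the effective step size adapts automatically.
    g = stochastic_grad(x)
    v = v + g**2
    x = x - eta * g / (np.sqrt(v) + eps)
    return x, v
```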
One such adaptive algorithm, Adagrad (and its variants), has demonstrated strong empirical performance. To understand this performance theoretically, the convergence rates of Adagrad have been studied; however, most existing analyses focus on the in-expectation convergence rate, which cannot explain the success of a single run (or at most a few runs) of Adagrad in practice. Understanding the convergence of these few runs requires a high-probability guarantee, in which the convergence rate depends only logarithmically on the failure probability; hence, recent research has increasingly focused on high-probability bounds.
In addition, the existing convergence rates for Adagrad are dimension-dependent. For instance, Hong and Lin [1] and Liu et al. [2] report rates that scale polynomially with the input dimension d. However, when optimizing with high-dimensional inputs, as in deep learning, Adagrad exhibits a faster (or at least comparable) convergence speed compared to SGD, which has a dimension-independent convergence rate both in expectation and with high probability. This discrepancy between theory and practice implies that the theoretical convergence rates of Adagrad can be significantly improved, especially with regard to dimension independence and high-probability guarantees.
In this work, we derive the first high-probability and dimension-independent convergence rates for Adagrad with momentum. Specifically, under $(L_0, L_1)$-smoothness, which generalizes traditional L-smoothness, and bounded gradient noise whose bound can scale with the current gradient norm and optimality gap, we establish the convergence rate of Adagrad with momentum for the non-convex stochastic optimization problem (1), where T denotes the number of iterations.
2. Related Works
2.1. Convergence Analysis of Adagrad
Several works have analyzed Adagrad and its variants in the context of convex optimization, with extensions to variational inequality problems—for example, Kavis et al. [3], Bach and Levy [4], and Ene et al. [5]. For non-convex optimization, Li and Orabona [6] first analyzed the convergence rate of a modified version of Adagrad that did not use the latest gradient to compute the step size, deviating from the original algorithm. Subsequent works (e.g., Hong and Lin [1], Liu et al. [2], Défossez et al. [7], Kavis et al. [8], Wang et al. [9], Faw et al. [10], and Attia and Koren [11]) also studied the convergence rate of Adagrad (or Adagrad-Norm) and provided error bounds that scale with the input dimension d.
In contrast to Adagrad and Adagrad-Norm, the convergence behavior of Adagrad with momentum has received relatively little attention (e.g., Hong and Lin [1], Li and Orabona [6], and Défossez et al. [7]). Empirically, for SGD with heavy-ball momentum, it has been observed that smaller momentum factors (see β in Algorithm 1) often lead to better training results. As a result, practitioners commonly set β to a small constant (e.g., 0.01) or allow it to decrease over time. However, the theoretical understanding of this empirical observation is quite limited. For instance, Hong and Lin [1] derived a convergence rate that is inversely proportional to a power of β. Défossez et al. [7] improved this rate; however, their rate is still inversely proportional to a power of β. In contrast, our convergence rates for Adagrad with momentum are not inversely proportional to β. A summary of these results is presented in Table 1.
Table 1.
Summary of related works. ADM refers to Adagrad with momentum, and ADS refers to Adagrad. L-smoothness denotes the standard smoothness assumption, and $(L_0, L_1)$-smoothness is its relaxed version (see Assumption 2). For each convergence rate, we only reveal the total number of Adagrad iterations T and the input dimension d.
2.2. Stopping Time in Optimization
In the literature on stochastic approximation, stopping times have been widely employed either as analytical tools (Faw et al. [10], Patel et al. [12], and Patel [13]) or as components of algorithm design (Ene et al. [5]). In most of these works, the stopping time is used to test for proximity to a stationary point or to ensure a sufficient decrease in the objective function. The notable exceptions of Li et al. [14] and Li et al. [15] utilize the stopping time to bound the sub-optimality gap $f(x_t) - f^*$. By leveraging the reverse direction of the Polyak–Łojasiewicz inequality, the gradient norm can then also be bounded. Following this idea, we use a stopping-time analysis to explore scenarios in which the sub-optimality gap and gradient norm remain bounded. This approach integrates stopping times both as a practical mechanism for algorithm control and as a theoretical framework for bounding key quantities during optimization.
3. Preliminaries
3.1. Notations
For $x, y \in \mathbb{R}^d$, $x^2$, $\sqrt{x}$, and $x \odot y$ denote the coordinate-wise square, square root, and coordinate-wise multiplication, respectively. The Euclidean norm and the standard inner product are denoted by $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$, respectively. For a positive semi-definite matrix $A \in \mathbb{R}^{d \times d}$ and $x \in \mathbb{R}^d$, $\|x\|_A^2$ denotes the quadratic form $x^\top A x$. For a vector $v \in \mathbb{R}^d$ and a scalar value $c \in \mathbb{R}$, we write $v \geq c$ if $v_i \geq c$ for all $i \in \{1, \dots, d\}$, and scalar operations such as $v + c$ are understood coordinate-wise. For symmetric matrices $A, B \in \mathbb{R}^{d \times d}$, we say $A \succeq B$ if $A - B$ is positive semi-definite. For a matrix $A$, let $\|A\|_2$ be the spectral norm of A.
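For readers who prefer code, the coordinate-wise notation above corresponds to the following NumPy operations (an illustrative aside of ours; the variable names are placeholders):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
y = np.array([0.5, 4.0, 2.0])

x_sq   = x**2                 # coordinate-wise square
y_sqrt = np.sqrt(y)           # coordinate-wise square root
prod   = x * y                # coordinate-wise (Hadamard) multiplication
norm   = np.linalg.norm(x)    # Euclidean norm
inner  = float(x @ y)         # standard inner product
A      = np.diag(y)           # a positive semi-definite matrix (since y > 0)
quad   = float(x @ A @ x)     # quadratic form x^T A x
```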
Finally, for $x \in \mathbb{R}$, $\lfloor x \rfloor$ denotes the floor function, which maps x to the greatest integer less than or equal to x.
3.2. Problem Setup and Assumptions
Throughout this paper, we consider an algorithm for Adagrad with (heavy-ball) momentum (Algorithm 1) applied to a non-convex objective function (1).
Algorithm 1 Adagrad with heavy-ball momentum
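The following Python sketch gives one standard formulation of Adagrad with heavy-ball momentum; the exact parameterization of the momentum update in Algorithm 1 (e.g., whether the fresh gradient is weighted by 1 or by 1 − β) is an assumption on our part.

```python
import numpy as np

def adagrad_with_momentum(grad_oracle, x1, eta, beta, T, eps=1e-8):
    """A standard form of Adagrad with heavy-ball momentum (a sketch).

    grad_oracle(x) is assumed to return a stochastic gradient at x;
    eta is the base step size and beta is the momentum parameter.
    """
    x = np.asarray(x1, dtype=float).copy()
    m = np.zeros_like(x)  # heavy-ball momentum buffer
    v = np.zeros_like(x)  # accumulated coordinate-wise squared gradients
    for _ in range(T):
        g = grad_oracle(x)
        v += g**2                              # Adagrad accumulator
        m = beta * m + g                       # heavy-ball momentum
        x = x - eta * m / (np.sqrt(v) + eps)   # coordinate-wise adaptive step
    return x
```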
Throughout this paper, we assume that the objective function f in (1) is bounded below.
Assumption A1.
f is bounded below by its finite infimum $f^* := \inf_{x \in \mathbb{R}^d} f(x) > -\infty$.
We also assume that f is $(L_0, L_1)$-smooth for some $L_0, L_1 > 0$.
Assumption A2.
f is differentiable, and for any $x, y \in \mathbb{R}^d$ satisfying $\|x - y\| \leq \frac{1}{L_1}$,
$$\|\nabla f(x) - \nabla f(y)\| \leq \bigl(L_0 + L_1 \|\nabla f(x)\|\bigr)\, \|x - y\|. \tag{2}$$
For a twice-differentiable function f, Assumption 2 is strictly weaker than standard L-smoothness: the L-smoothness condition is equivalent to the $(L, 0)$-smoothness condition, and there are functions that are $(L_0, L_1)$-smooth for some $L_0, L_1 > 0$ but not L-smooth for any L (see Lemma A1 and Zhang et al. [16]). Empirical evidence shows that many practical objective functions satisfy (2) while they do not satisfy the L-smoothness assumption (e.g., language models [17]). For a more detailed discussion, see Appendix A.
We consider the following assumption on the noise in the stochastic gradients.
Assumption A3.
Nonnegative constants exist such that, for each iteration t, the noise in the stochastic gradient is bounded by a quantity that scales with the squared gradient norm $\|\nabla f(x_t)\|^2$ and the optimality gap $f(x_t) - f^*$.
Assumption 3 relaxes the standard bounded noise assumption by allowing the bound on the stochastic gradient noise to grow with the squared gradient norm and the optimality gap. For further details on stochastic noise assumptions, please refer to Khaled and Richtárik [18].
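One representative instantiation of Assumption 3 consistent with this description (the constant names $\sigma_0$, $\sigma_1$, $\sigma_2$ are placeholders of ours rather than the paper's notation) is
$$\|g_t - \nabla f(x_t)\|^2 \;\leq\; \sigma_0^2 \;+\; \sigma_1^2\, \|\nabla f(x_t)\|^2 \;+\; \sigma_2^2\, \bigl( f(x_t) - f^* \bigr) \qquad \text{almost surely},$$
where $g_t$ denotes the stochastic gradient queried at $x_t$.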
4. The High-Probability and Dimension-Independent Convergence Rate of Adagrad with Momentum
We are now ready to present our main results. In this section, we present the convergence rate (Theorem 1) and the iteration complexity (Corollary 1) of Adagrad with momentum for finding an $\epsilon$-stationary point. We now introduce our high-probability convergence analysis of Algorithm 1 under Assumptions 1–3.
Theorem 1.
Let $\{x_t\}$ be generated by Adagrad with heavy-ball momentum under Assumptions 1–3, and suppose that the step size η, the momentum parameter β, and the constants G and σ satisfy the prescribed conditions.
Then, for any natural number T satisfying the stated condition and for any $\delta \in (0, 1)$, it holds with probability at least $1 - \delta$ that
Compared to prior works analyzing the convergence rate of Adagrad with momentum [1,6,7], Theorem 1 offers several key improvements. Notably, the error bounds in prior works scale with the input dimension d and inversely with β, whereas our bound does not. Furthermore, our bound does not deteriorate for smaller β values, which aligns well with empirical observations: a smaller β often yields better training results. Moreover, compared to [1], which derives its results under the same assumptions as ours, our convergence rate avoids an additional multiplicative factor present in their bound. Using Theorem 1, we can also compute the iteration complexity of Adagrad with momentum for obtaining an $\epsilon$-stationary point.
Corollary 1.
Let $\{x_t\}$ be generated by Adagrad with heavy-ball momentum under Assumptions 1–3, and suppose that the step size η, the momentum parameter β, and the constants G and σ satisfy the prescribed conditions.
Then, for any natural number T and for any $\delta \in (0, 1)$, it holds with probability at least $1 - \delta$ that
Under a constant failure probability δ, Corollary 1 implies that suitable choices of the step size η and the iteration count T suffice to find an $\epsilon$-stationary point x, i.e., a point such that $\|\nabla f(x)\| \leq \epsilon$. Here, we hide all terms other than $\epsilon$ in the big-O and big-Ω notations.
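To illustrate how a bound of this type converts into an iteration complexity (our illustration; the exponent assumes the typical $1/\sqrt{T}$ decay of adaptive methods rather than the exact statement of Theorem 1): if the right-hand side of Theorem 1 behaves as $\widetilde{\mathcal{O}}(1/\sqrt{T})$ for the squared gradient norm, then
$$\min_{1 \leq t \leq T} \|\nabla f(x_t)\|^2 \;\leq\; \widetilde{\mathcal{O}}\!\left(\frac{1}{\sqrt{T}}\right) \;\leq\; \epsilon^2 \quad \text{whenever} \quad T \;=\; \widetilde{\Omega}\!\left(\epsilon^{-4}\right).$$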
5. Proof of Theorem 1
To prove Theorem 1, we first introduce the stopping time that is used to bound the function value and the gradient norm as Adagrad iterates. We then present key lemmas (Lemmas 1–3) and derive Theorem 1 in Section 5.1. We describe the high-level idea behind our proof in Section 5.2. We present all of the technical lemmas in Section 5.3 and the proofs of Lemmas 1–3 in Section 5.4, Section 5.5 and Section 5.6, respectively.
5.1. Proof of Theorem 1
We define the stopping time τ as the first iteration at which the optimality gap $f(x_t) - f^*$ exceeds the threshold G. In other words, the optimality gap is bounded by G until time τ. We note that this also implies bounded gradients under $(L_0, L_1)$-smoothness (see Lemma 6 in Section 5.3).
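One representative form of this definition, written in our notation since the displayed equation is not reproduced above, is
$$\tau \;:=\; \min\bigl\{\, t \;:\; f(x_t) - f^* > G \,\bigr\} \;\wedge\; (T+1),$$
so that $f(x_t) - f^* \leq G$ holds for all $t < \tau$.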
Using the definition of the stopping time τ, we introduce the following key lemmas to prove Theorem 1.
Lemma 1.
Suppose that Assumptions 1–3 hold. Furthermore, the definitions of G and σ, as well as the conditions on η, β, and T, are identical to those in Theorem 1. Then, the iterates generated by Algorithm 1 satisfy the following:
and
where the remaining quantity is defined in the proof below.
Lemma 2.
Suppose that Assumptions 1–3 hold. Furthermore, the definitions of G and σ, as well as the conditions on η, β, and T, are identical to those in Theorem 1. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
where the remaining quantity is defined in the proof below.
Lemma 3.
Suppose that Assumptions 1–3 hold. Furthermore, the definitions of G and σ, as well as the conditions on η, β, and T, are identical to those in Theorem 1. Suppose in addition that the high-probability events of Lemmas 1 and 2 hold.
Then, the stopping time satisfies $\tau > T$; i.e., the optimality gap remains bounded by G throughout all T iterations.
We now prove Theorem 1 using Lemmas 1–3.
Proof of Theorem 1.
According to Lemmas 1–3, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
where the second inequality follows from the stated conditions on η and β.
Applying the parameter choices to the RHS, we obtain the bound in Theorem 1. □
5.2. The High-Level Idea Behind the Proof of Theorem 1
In this section, we illustrate the main idea behind our proof and how the technical lemmas in Section 5.3 are used. Our proof relies mainly on the stopping time result (Lemma 3), which enables us to bound the optimality gap throughout the run of Adagrad with momentum. This in turn allows us to bound the gradient norm (Lemma 6) and to treat $(L_0, L_1)$-smoothness as the standard smoothness assumption (Lemma 5). Furthermore, under the bounded gradient norm (and the bounded number of iterations T), we can derive upper and lower bounds on the adaptive step size (Lemma 7). Combining these observations with an analysis of the bias between the update vector and the true gradient (Lemma 2), we derive our dimension-independent convergence rate.
5.3. Technical Lemmas
We introduce the following technical lemmas.
Lemma 4.
The iterates generated by Algorithm 1 satisfy, for all $t \geq 1$,
Moreover, for any function satisfying Assumption 2, it holds that
Proof.
According to the definition of Algorithm 1, we have
Here, the second inequality follows from the Cauchy–Schwarz inequality, and the last equality uses the sum formula for a geometric sequence. □
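For completeness, the geometric-sum step is the standard bound (assuming the momentum weights form a geometric sequence in β):
$$\sum_{k=0}^{t-1} \beta^{k} \;=\; \frac{1 - \beta^{t}}{1 - \beta} \;\leq\; \frac{1}{1 - \beta} \qquad \text{for all } \beta \in [0, 1).$$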
Lemma 5
(Lemma from Faw et al. [10] and Zhang et al. [16]). For any function satisfying Assumption 2, the iterates generated by Algorithm 1 satisfy, for all $t \geq 1$,
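The standard form of this descent lemma under $(L_0, L_1)$-smoothness, which we take to be the intended statement here, reads: whenever $\|x_{t+1} - x_t\| \leq \frac{1}{L_1}$,
$$f(x_{t+1}) \;\leq\; f(x_t) + \bigl\langle \nabla f(x_t),\, x_{t+1} - x_t \bigr\rangle + \frac{L_0 + L_1 \|\nabla f(x_t)\|}{2}\, \|x_{t+1} - x_t\|^2 .$$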
Lemma 6.
For any function satisfying Assumptions 1–2, the following inequality holds:
Proof.
Let . Then, we have
This implies that
where the second inequality is from Lemma 5. □
Lemma 7.
Suppose that Assumptions 1–3 hold. Furthermore, the definitions of G and σ, as well as the conditions on η, β, and T, are identical to those in Theorem 1. Suppose that $f(x_t) - f^* \leq G$ for all $t < \tau$ for a given τ. Then, it holds that
where $I$ denotes the identity matrix and $\mathrm{diag}(v)$ for $v \in \mathbb{R}^d$ denotes the diagonal matrix whose diagonal entries are given by the components of the vector v.
Proof.
According to Lemma 6, we have the following inequalities:
Define the function as above on its domain. It is straightforward to verify that its derivative is positive on this domain, which implies that the function is increasing and therefore invertible. Consequently, its inverse is also an increasing function and is defined as follows:
If the argument lies in the admissible range, then
This implies that, for all $t < \tau$, the gradient norm and the adaptive step size remain within the stated bounds. Using these bounds, we can apply the descent lemma (Lemma 5) as the starting point for the proof. Specifically, for all $t < \tau$, whenever the step condition holds, the descent lemma gives
Next, according to Assumption 3, for all $t < \tau$, it holds that
□
Lemma 8
(Young’s inequality with ). For any and , we have
Proof.
Observe that
$$0 \;\leq\; \left(\sqrt{\lambda}\, a - \frac{b}{\sqrt{\lambda}}\right)^{2} \;=\; \lambda a^{2} - 2ab + \frac{b^{2}}{\lambda}.$$
By rearranging the terms, we obtain the desired result. □
Lemma 9.
For any and , we have
Proof.
We want to show
which is equivalent to
Since
we obtain the desired result. □
Lemma 10
(The Azuma–Hoeffding inequality). Let $\{X_t\}_{t \geq 1}$ be a martingale difference sequence with respect to a filtration $\{\mathcal{F}_t\}_{t \geq 0}$. Suppose that there is a constant b such that, for any t, $|X_t| \leq b$ almost surely.
Then, for any positive integer T and for any $\delta \in (0, 1)$, it holds with probability at least $1 - \delta$ that
$$\sum_{t=1}^{T} X_t \;\leq\; b \sqrt{2 T \log(1/\delta)}.$$
5.4. Proof of Lemma 1
For all $t < \tau$, under the stated condition, the following holds according to Lemma 7:
where the first labeled inequality applies Lemma 8 together with the Cauchy–Schwarz inequality, and the second uses the step size condition on η.
Then, according to the lower and upper bounds on the adaptive step size in Lemma 7, we have
By rearranging the above inequality, we have
This implies that, for all $t < \tau$, it holds that
Taking the summation over t, we obtain
From the above inequality, the bounds in Lemma 1 follow:
and
5.5. Proof of Lemma 2
The update vector can be represented as follows:
This representation highlights how it is recursively defined in terms of the momentum term, the gradient differences, and the stochastic gradient noise. First, for all $t < \tau$, by using this representation, Assumption 2, and Lemma 7, we have
Using this, we can bound the term as follows:
In the first labeled step, we use Lemma 9; the next step follows from the following inequalities:
The first of these is due to (4), and the remaining steps are derived from the stated step size conditions. Hence,
Then,
Multiplying by the appropriate factor and taking the summation over t, we obtain
where we use Lemma 7 for the last inequality. To handle the remaining term, we apply the Azuma–Hoeffding inequality (Lemma 10). First, we observe that, by definition,
Let the relevant sequence be defined for $t \leq T$. We now show that it is uniformly bounded for all t through mathematical induction on t. For the base case, according to Lemma 7, it holds that
Furthermore, according to Lemma 4, we have
where the last inequality comes from the stated conditions. Recall that
Now, suppose that the claim holds for t. Then, we have
where we use the induction hypothesis and the step size condition for the second inequality, and the stated condition for the last inequality. Hence, it holds that
where we use the stated conditions for the last inequality. Note that the summands have zero conditional mean by definition. This implies that they form a martingale difference sequence. Now, we apply the Azuma–Hoeffding inequality (Lemma 10) to obtain the following: with probability at least $1 - \delta$,
Therefore, with probability at least $1 - \delta$,
Combining this with the earlier bounds, we obtain
5.6. Proof of Lemma 3
According to Lemmas 1 and 2, with probability at least $1 - \delta$, we have
where the last inequality follows from the stated conditions on the hyper-parameters. If the stopping time were at most T, then, according to its definition, we could immediately derive a lower bound on the final optimality gap. We now show that this lower bound exceeds the upper bound established above, which results in a contradiction. Specifically, we show the following:
For the second inequality, we use the stated conditions on η and β. For the last equality, we use the definitions of G and σ. Since the lower bound exceeds the upper bound, contradicting our assumption that the stopping time is at most T, it holds that $\tau > T$.
6. Proof of Corollary 1
First, we restate the hyper-parameter conditions in Corollary 1 for convenience as follows:
Note that these choices satisfy the conditions on the hyper-parameters required in Lemmas 1–3. According to Lemmas 1–3, we derive the following inequality, as in (3): for any $\delta \in (0, 1)$,
with probability at least $1 - \delta$, where the second inequality follows from the elementary inequality used earlier. First, we bound the following terms:
Since the stated condition holds, we have
This implies that
where the last inequality holds by the hyper-parameter choices. By using this and (6), we obtain
where the last inequality is due to the following condition:
Next, we bound the following term:
Using the stated condition and (6), we obtain
Then, according to the following condition on the hyper-parameters,
we obtain
Lastly, we bound the following term:
Using (6), we obtain
Then, according to the following condition on the hyper-parameters,
it holds that
Therefore, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
7. Conclusions
In this paper, we proved dimension-independent high-probability convergence rates for Adagrad with momentum under the $(L_0, L_1)$-smoothness assumption. We demonstrated that Adagrad with momentum converges to a stationary point at a rate that scales with neither the input dimension nor the inverse momentum parameter. We believe that our results can improve the theoretical understanding of adaptive gradient methods with momentum.
Author Contributions
Conceptualization, K.N. and S.P.; methodology, K.N.; validation, S.P.; investigation, K.N.; writing—original draft preparation, K.N.; writing—review and editing, S.P.; supervision, S.P.; project administration, S.P.; funding acquisition, S.P. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University); supported by the Culture, Sports and Tourism R&D Program through a Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2024 (Project Name: International Collaborative Research and Global Talent Development for the Development of Copyright Management and Protection Technologies for Generative AI; Project Number: RS-2024-00345025); and partially supported by the Culture, Sports, and Tourism R&D Program through another Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2024 (Project Name: Research on neural watermark technology for copyright protection of generative AI 3D content, RS-2024-00348469, 25%).
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Discussion on (L0, L1)-Smoothness
A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is L-smooth if a constant $L > 0$ exists such that
$$\|\nabla f(x) - \nabla f(y)\| \;\leq\; L \|x - y\| \qquad \text{for all } x, y \in \mathbb{R}^d.$$
For twice-differentiable functions, this is equivalent to $\|\nabla^2 f(x)\|_2 \leq L$ for all $x \in \mathbb{R}^d$. Carmon et al. [19] demonstrated that the gradient descent algorithm with a learning rate of $1/L$ is optimal for optimizing L-smooth, non-convex functions. However, the assumption that the Hessian norm is globally bounded by a constant L may exclude a wide range of functions. To address this limitation, Zhang et al. [17] conducted experiments and observed the following:
The smoothness of a function is positively correlated with the gradient norm.
This observation led to the proposal of a more flexible smoothness condition in which local smoothness grows with the gradient norm. Specifically, a twice-differentiable function f is $(L_0, L_1)$-smooth if
$$\|\nabla^2 f(x)\|_2 \;\leq\; L_0 + L_1 \|\nabla f(x)\| \qquad \text{for all } x \in \mathbb{R}^d.$$
In particular, any L-smooth function is $(L, 0)$-smooth, and hence $(L_0, L_1)$-smooth for all $L_0 \geq L$ and $L_1 \geq 0$. Furthermore, $(L_0, L_1)$-smoothness is strictly weaker than L-smoothness.
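As a concrete illustration (a standard example rather than one stated in the text), consider $f(x) = e^x$:
$$|f''(x)| \;=\; e^{x} \;=\; |f'(x)|, \qquad \text{so } f \text{ is } (L_0, L_1)\text{-smooth with } L_0 = 0,\; L_1 = 1,$$
yet $f''$ is unbounded, so no finite L makes f L-smooth.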
Lemma A1
(Lemma from Zhang et al. [17]). Both $x^k$, where $k \geq 3$, and $a^x$, where $a > 1$, are $(L_0, L_1)$-smooth for some $L_0, L_1 > 0$ but not L-smooth for any L.
In a subsequent study, Zhang et al. [16] provided an equivalent definition of $(L_0, L_1)$-smoothness for differentiable functions. According to this definition, constants $L_0, L_1 > 0$ exist such that if $\|x - y\| \leq \frac{1}{L_1}$, then
$$\|\nabla f(x) - \nabla f(y)\| \;\leq\; \bigl(L_0 + L_1 \|\nabla f(x)\|\bigr)\, \|x - y\|.$$
Since then, many studies have analyzed the convergence rates of algorithms under the $(L_0, L_1)$-smoothness assumption (e.g., Hong and Lin [1], Faw et al. [10], and Zhang et al. [16]). Following these studies, we conducted our analysis under the $(L_0, L_1)$-smoothness assumption, which reflects the loss landscape of neural networks more accurately than L-smoothness.
Appendix B. Experimental Results
In this section, we present three experiments demonstrating that Adagrad with momentum converges to a stationary point and that its convergence rate does not degrade as the input dimension increases. We run experiments on a logistic regression problem with the following objective function:
where each data point is sampled from the d-dimensional unit sphere. For each iteration of Adagrad with momentum, we randomly sample a data point and choose the stochastic gradient as the gradient of the corresponding summand. One can then verify that the objective function is $(L_0, L_1)$-smooth and that our choice of the stochastic gradient satisfies Assumption 3. In the experiments, we use zero initialization (i.e., the zero vector), T = 60,000, and fixed values of η and β for Adagrad with momentum, and we vary the input dimension d from 5000 to 500,000. Figure A1 summarizes the experimental results; the convergence speed of Adagrad with momentum does not decrease as d increases.
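A minimal Python sketch of this experiment follows (our reconstruction: the exact loss form, labels, and the values of η and β are assumptions, since the displayed objective is not reproduced above; it tracks the squared norm of the sampled stochastic gradient as a proxy for Figure A1):

```python
import numpy as np

def run_experiment(d, n=100, T=60_000, eta=0.1, beta=0.9, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    # Data points sampled uniformly from the d-dimensional unit sphere.
    A = rng.standard_normal((n, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    b = rng.choice([-1.0, 1.0], size=n)   # assumed binary labels

    def stoch_grad(x):
        i = rng.integers(n)               # sample one data point
        z = -b[i] * (A[i] @ x)
        # Gradient of log(1 + exp(-b_i <a_i, x>)) with respect to x.
        return -b[i] * A[i] / (1.0 + np.exp(-z))

    x = np.zeros(d)                       # zero initialization
    m = np.zeros(d)                       # momentum buffer
    v = np.zeros(d)                       # Adagrad accumulator
    sq_grad_norms = []
    for _ in range(T):
        g = stoch_grad(x)
        v += g**2
        m = beta * m + g
        x -= eta * m / (np.sqrt(v) + eps)
        sq_grad_norms.append(float(g @ g))
    return sq_grad_norms

# Vary the dimension as in Figure A1, e.g.:
# for d in (5_000, 50_000, 500_000): run_experiment(d)
```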
Figure A1.
Squared gradient norm per iteration.
References
- Hong, Y.; Lin, J. Revisiting Convergence of AdaGrad with Relaxed Assumptions. arXiv 2024, arXiv:2402.13794.
- Liu, Z.; Nguyen, T.D.; Nguyen, T.H.; Ene, A.; Nguyen, H. High probability convergence of stochastic gradient methods. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 21884–21914.
- Kavis, A.; Levy, K.Y.; Bach, F.; Cevher, V. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Adv. Neural Inf. Process. Syst. 2019, 32.
- Bach, F.; Levy, K.Y. A universal algorithm for variational inequalities adaptive to smoothness and noise. In Proceedings of the Annual Conference on Learning Theory, PMLR, Phoenix, AZ, USA, 25–28 June 2019; pp. 164–194.
- Ene, A.; Nguyen, H.L.; Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. Proc. AAAI Conf. Artif. Intell. 2021, 35, 7314–7321.
- Li, X.; Orabona, F. A high probability analysis of adaptive SGD with momentum. arXiv 2020, arXiv:2007.14294.
- Défossez, A.; Bottou, L.; Bach, F.; Usunier, N. A simple convergence proof of Adam and Adagrad. arXiv 2020, arXiv:2003.02395.
- Kavis, A.; Levy, K.Y.; Cevher, V. High probability bounds for a class of nonconvex algorithms with AdaGrad stepsize. arXiv 2022, arXiv:2204.02833.
- Wang, B.; Zhang, H.; Ma, Z.; Chen, W. Convergence of AdaGrad for non-convex objectives: Simple proofs and relaxed assumptions. In Proceedings of the Annual Conference on Learning Theory, PMLR, Bangalore, India, 12–15 July 2023; pp. 161–190.
- Faw, M.; Rout, L.; Caramanis, C.; Shakkottai, S. Beyond uniform smoothness: A stopped analysis of adaptive SGD. In Proceedings of the Annual Conference on Learning Theory, PMLR, Bangalore, India, 12–15 July 2023; pp. 89–160.
- Attia, A.; Koren, T. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. arXiv 2023, arXiv:2302.08783.
- Patel, V.; Zhang, S.; Tian, B. Global convergence and stability of stochastic gradient descent. Adv. Neural Inf. Process. Syst. 2022, 35, 36014–36025.
- Patel, V. Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou–Curtis–Nocedal functions. Math. Program. 2022, 195, 693–734.
- Li, H.; Qian, J.; Tian, Y.; Rakhlin, A.; Jadbabaie, A. Convex and non-convex optimization under generalized smoothness. Adv. Neural Inf. Process. Syst. 2024, 36.
- Li, H.; Rakhlin, A.; Jadbabaie, A. Convergence of Adam under relaxed assumptions. Adv. Neural Inf. Process. Syst. 2024, 36.
- Zhang, B.; Jin, J.; Fang, C.; Wang, L. Improved analysis of clipping algorithms for non-convex optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 15511–15521.
- Zhang, J.; He, T.; Sra, S.; Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv 2019, arXiv:1905.11881.
- Khaled, A.; Richtárik, P. Better theory for SGD in the nonconvex world. arXiv 2020, arXiv:2002.03329.
- Carmon, Y.; Duchi, J.C.; Hinder, O.; Sidford, A. Lower bounds for finding stationary points I. Math. Program. 2020, 184, 71–120.