Article

Dimension-Independent Convergence Rate for Adagrad with Heavy-Ball Momentum

Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(4), 681; https://doi.org/10.3390/math13040681
Submission received: 11 January 2025 / Revised: 13 February 2025 / Accepted: 18 February 2025 / Published: 19 February 2025
(This article belongs to the Special Issue Advanced Optimization Methods and Applications, 3rd Edition)

Abstract

In this study, we analyze the convergence rate of Adagrad with momentum for non-convex optimization problems. We establish the first dimension-independent convergence rate under the $(L_0, L_1)$-smoothness assumption, which is a generalization of the standard $L$-smoothness. We show an $\mathcal{O}(1/\sqrt{T})$ convergence rate under bounded noise in stochastic gradients, where the bound can scale with the current optimality gap and gradient norm.

1. Introduction

For a differentiable function $f:\mathbb{R}^d\to\mathbb{R}$ from the $d$-dimensional Euclidean space to the real line with $d\ge 1$, we consider the following minimization problem:
$$\min_{x\in\mathbb{R}^d} f(x). \tag{1}$$
Stochastic iterative algorithms that leverage first-order derivative information, such as stochastic gradient descent (SGD), are popular tools for solving problem (1). The step size heavily influences the convergence properties of these methods. However, tuning and adjusting this parameter during model training can be time-consuming and computationally expensive. To mitigate these challenges, various adaptive step size algorithms have been proposed.
One such adaptive algorithm, Adagrad (and its variants), has demonstrated strong empirical performance. To understand this performance theoretically, the convergence rates of Adagrad have been studied; however, most of the existing analyses focus on the in-expectation convergence rate, which cannot explain the success of a single run (or at most a few runs) of Adagrad in practice. Understanding the convergence of these few runs requires a high-probability guarantee, in which the convergence rate depends only logarithmically on the failure probability; hence, recent research has increasingly focused on achieving high-probability bounds.
In addition, the existing convergence rates for Adagrad are dimension-dependent. For instance, Hong and Lin [1] and Liu et al. [2] report rates that scale with the input dimension $d$ (up to $d^2$). However, when optimizing with a high-dimensional input, as in deep learning, Adagrad shows a faster (or at least comparable) convergence speed compared to that of SGD, which has a dimension-independent convergence rate both in expectation and with high probability. This discrepancy between theory and practice implies that the theoretical convergence rates of Adagrad can be significantly improved, especially toward dimension independence and high-probability guarantees.
In this work, we derive the first high-probability and dimension-independent convergence rates for Adagrad with momentum. Specifically, under $(L_0, L_1)$-smoothness, which is a generalization of traditional $L$-smoothness, and bounded gradient noise, where the bound can scale with the current gradient norm and optimality gap, we show an $\mathcal{O}(1/\sqrt{T})$ convergence rate of Adagrad with momentum for non-convex stochastic optimization problems (1), where $T$ denotes the number of iterations.

2. Related Works

2.1. Convergence Analysis of Adagrad

Several works have analyzed Adagrad and its variants in the context of convex optimization, with extensions to variational inequality problems—for example, Kavis et al. [3], Bach and Levy [4], and Ene et al. [5]. For non-convex optimization, Li and Orabona [6] first analyzed the convergence rate of a modified version of Adagrad that did not use the latest gradient to compute the step size, deviating from the original algorithm. Subsequent works (e.g., Hong and Lin [1], Liu et al. [2], Défossez et al. [7], Kavis et al. [8], Wang et al. [9], Faw et al. [10], and Attia and Koren [11]) also studied the convergence rate of Adagrad (or Adagrad-Norm) and provided error bounds on $\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|_2^2$ that scale with the input dimension $d$.
In contrast to Adagrad or Adagrad-Norm, the convergence behavior of Adagrad with momentum has received relatively little attention (e.g., Hong and Lin [1], Li and Orabona [6], and Défossez et al. [7]). Empirically, in SGD with heavy-ball momentum, it is observed that smaller momentum factors $\beta$ (see $m_t$ in Algorithm 1) often lead to better training results. As a result, practitioners commonly set $\beta$ to a small constant (e.g., 0.01) or allow it to decrease over time. However, the theoretical understanding of this empirical observation is quite limited. For instance, Hong and Lin [1] derived a convergence rate inversely proportional to $\beta^2$. Défossez et al. [7] improved this rate; however, their rate is still inversely proportional to $\beta$. In contrast, our convergence rates for Adagrad with momentum are not inversely proportional to $\beta$. A summary of these results is presented in Table 1.

2.2. Stopping Time in Optimization

In the literature on stochastic approximation, stopping times have been widely employed either as analytical tools (Faw et al. [10], Patel et al. [12], and Patel [13]) or as components of the algorithm design (Ene et al. [5]). In most of these works, the stopping time is used to test for proximity to a stationary point or to ensure a sufficient decrease in the objective function. The notable exceptions of Li et al. [14] and Li et al. [15] utilize the stopping time to bound the sub-optimality gap $f(x)-\inf_x f(x)$. By leveraging the reverse direction of the Polyak–Łojasiewicz inequality, it follows that the squared gradient norm $\|\nabla f(x)\|^2$ can also be bounded. Following this idea, we use a stopping time analysis to explore scenarios in which the sub-optimality gap and gradient norm remain bounded. This approach integrates stopping times both as a practical mechanism for algorithm control and as a theoretical framework for bounding the key metrics during optimization.

3. Preliminaries

3.1. Notations

For $x, y \in \mathbb{R}^d$, $x^2$, $\sqrt{x}$, and $x \odot y$ denote the coordinate-wise square, square root, and multiplication, respectively. The Euclidean norm and the standard inner product are denoted by $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$, respectively. For a positive semi-definite matrix $A\in\mathbb{R}^{n\times n}$ and $x\in\mathbb{R}^n$, $\|x\|_A^2$ denotes the quadratic form $x^TAx$. For a vector $x=(x_1,\ldots,x_d)\in\mathbb{R}^d$ and a scalar $z\in\mathbb{R}$, we use $z/x$ to denote $(z/x_1,\ldots,z/x_d)$ if $x_i\neq 0$ for all $i\in\{1,\ldots,d\}$, and we use $z+x$ to denote $(z+x_1,\ldots,z+x_d)$. For symmetric matrices $A,B\in\mathbb{R}^{n\times n}$, we write $A\preceq B$ if $B-A$ is positive semi-definite. For a matrix $A\in\mathbb{R}^{n\times n}$, $\|A\|$ denotes the spectral norm of $A$.
Finally, for $x\in\mathbb{R}$, $\lfloor x\rfloor$ denotes the floor function, which maps $x$ to the greatest integer less than or equal to $x$.
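Since the analysis is stated coordinate-wise, it may help to see the conventions above in code. The following NumPy sketch (ours, not part of the paper) mirrors the notation:

```python
import numpy as np

# Illustration of the coordinate-wise notation of Section 3.1.
x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, 3.0, 4.0])
z = 2.0

x_sq = x ** 2              # x^2: coordinate-wise square
x_sqrt = np.sqrt(x)        # sqrt(x): coordinate-wise square root
xy = x * y                 # x ⊙ y: coordinate-wise multiplication
z_div = z / x              # z/x = (z/x_1, ..., z/x_d), defined when all x_i != 0
z_add = z + x              # z + x = (z + x_1, ..., z + x_d)
norm = np.linalg.norm(x)   # Euclidean norm ||x||
inner = float(np.dot(x, y))  # standard inner product <x, y>
print(x_sq, x_sqrt, xy, z_div, z_add, norm, inner)
```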

3.2. Problem Setup and Assumptions

Throughout this paper, we consider Adagrad with (heavy-ball) momentum (Algorithm 1) applied to a non-convex objective function (1).
Algorithm 1 Adagrad with heavy-ball momentum
Require: A learning rate $\eta>0$, a momentum coefficient $0<\beta<1$, a small constant $\lambda>0$ to prevent division by zero, and an initial parameter $x_1\in\mathbb{R}^d$.
1: for $t=1$ to $T$ do
2:     Draw a stochastic gradient $g_t$ such that $\mathbb{E}[g_t \mid g_1,\ldots,g_{t-1}] = \nabla f(x_t)$.
3:     Update the momentum vector $m_t \leftarrow \begin{cases} g_t, & \text{if } t=1,\\ (1-\beta)m_{t-1}+\beta g_t, & \text{if } t>1.\end{cases}$
4:     Compute the next iterate $x_{t+1} \leftarrow x_t - \frac{\eta}{\sqrt{\sum_{k=1}^{t} g_k^2+\lambda}} \odot m_t$.
5: end for
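For concreteness, the following Python sketch (ours, not the authors' code) implements Algorithm 1 exactly as stated, assuming a user-supplied stochastic gradient oracle:

```python
import numpy as np

def adagrad_momentum(grad_oracle, x1, T, eta, beta, lam):
    """Minimal sketch of Algorithm 1 (Adagrad with heavy-ball momentum).

    grad_oracle(x) is assumed to return an unbiased stochastic gradient
    of f at x; the name and signature are ours, not the paper's.
    """
    x = np.array(x1, dtype=float)
    m = np.zeros_like(x)   # momentum vector m_t
    v = np.zeros_like(x)   # running sum of coordinate-wise squared gradients
    for t in range(1, T + 1):
        g = grad_oracle(x)
        m = g if t == 1 else (1.0 - beta) * m + beta * g  # step 3
        v += g ** 2                                       # Adagrad accumulator
        x = x - eta / np.sqrt(v + lam) * m                # step 4 (coordinate-wise)
    return x
```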
Throughout this paper, we assume that the objective function f in (1) is bounded below.
Assumption A1. 
$f$ is bounded below by its finite infimum $f^{\star} := f(x^{\star}) > -\infty$.
We also assume that $f$ is $(L_0, L_1)$-smooth for some $L_0, L_1 > 0$.
Assumption A2. 
$f$ is $(L_0, L_1)$-smooth, i.e., $f$ is differentiable, and for any $x, y \in \mathbb{R}^d$ satisfying $\|x-y\| \le 1/L_1$,
$$\|\nabla f(x)-\nabla f(y)\| \le \left(L_0+L_1\|\nabla f(y)\|\right)\|x-y\|. \tag{2}$$
For a twice-differentiable function $f$, Assumption 2 is strictly weaker than the standard $L$-smoothness, as the $L$-smoothness condition is equivalent to the $(L,0)$-smoothness condition and there are functions that are $(L_0,L_1)$-smooth for some $L_0,L_1$ but not $L$-smooth for any $L$ (see Lemma A1 and Zhang et al. [16]). Empirical evidence shows that many practical objective functions satisfy (2) while they do not satisfy the $L$-smoothness assumption (e.g., large language models [17]). For a more detailed discussion, see Appendix A.
We consider the following assumption on the noise in the stochastic gradients.
Assumption A3. 
There exist $\sigma_0, \sigma_1, \sigma_2 > 0$ such that for each $t\in\mathbb{N}$,
$$\|g_t-\nabla f(x_t)\| \le \sigma_0+\sigma_1\|\nabla f(x_t)\|^2+\sigma_2\left(f(x_t)-f^{\star}\right).$$
Assumption 3 relaxes the standard bounded noise assumption by allowing the error bound on the stochastic gradient noise to increase with the squared gradient norm and the optimality gap. For further details on stochastic noise assumptions, please refer to Khaled and Richtárik [18].

4. The High-Probability and Dimension-Independent Convergence Rate of Adagrad with Momentum

We are now ready to present our main results. In this section, we present the convergence rate (Theorem 1) and iteration complexity (Corollary 1) of Adagrad with momentum to find an $\epsilon$-stationary point. We now introduce our high-probability convergence analysis of Algorithm 1 under Assumptions 1–3.
Theorem 1. 
Let $x_t$ be generated by Adagrad with heavy-ball momentum under Assumptions 1–3. Let $\Delta=f(x_1)-f^{\star}$, $\iota=\log(1/\delta)$, $\sigma=\sigma_0+\sigma_1M^2+\sigma_2G$, and
$$G=29\max\left\{\Delta,\frac{1}{L_1\sqrt{\lambda}}\right\},\quad M=\frac{4L_1G+\sqrt{16L_1^2G^2+24L_0G}}{3},\quad L=L_0+L_1M,$$
$$0<\beta\le\min\left\{1,\frac{1}{M^2},\frac{1}{\sigma^2\sqrt{\iota}},\frac{1}{\sigma^2}\right\},\quad \eta\le\min\left\{\frac{1}{L_1},\frac{\sqrt{\lambda}\beta^2}{6L},\frac{\beta\sigma}{L}\right\}.$$
Then, for any $T\in\mathbb{N}$ such that $T\le\lfloor1/\beta^2\rfloor$ and for any $\delta\in(0,1)$, it holds with probability at least $1-\delta$ that
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|^2 \le \frac{2\Delta\beta^2(M+\sigma)}{\eta\sqrt{T}}+\frac{2\Delta\beta^2\sqrt{\lambda}}{\eta T}+\left(\frac{2\beta^2(M+\sigma)}{\sqrt{\lambda}\sqrt{T}}+\frac{2\beta^2(M+\sigma)}{T}\right)\left(1+4\beta\sigma^2+4\beta^2\sigma^2+18(1-\beta)\beta\sigma^2\sqrt{\iota}\right).$$
Compared to prior works analyzing the convergence rate of Adagrad with momentum [1,6,7], Theorem 1 offers several key improvements. Notably, the error bounds in prior works scale with $1/\beta$ and $d$, whereas our bound does not. Furthermore, our bound does not deteriorate for smaller values of $\beta$, which aligns well with empirical observations: a smaller $\beta$ often yields better training results. Moreover, when compared to [1], which derives its results from the same assumptions as ours, our convergence rate has no $\log T$ factor. By using Theorem 1, we can also compute the iteration complexity of Adagrad with momentum to obtain an $\epsilon$-stationary point.
Corollary 1. 
Let $x_t$ be generated by Adagrad with heavy-ball momentum under Assumptions 1–3. Let $\epsilon\in(0,1)$, $\Delta=f(x_1)-f^{\star}$, $\iota=\log(1/\delta)$, and
$$G=29\max\left\{\Delta,\frac{1}{L_1\sqrt{\lambda}}\right\},\quad M=\frac{4L_1G+\sqrt{16L_1^2G^2+24L_0G}}{3},\quad \sigma=\sigma_0+\sigma_1M^2+\sigma_2G,\quad L=L_0+L_1M,$$
$$0<\beta\le\min\left\{\frac{1}{2},\sqrt{\frac{6L}{L_1\sqrt{\lambda}}},\frac{6\sigma}{\sqrt{\lambda}},\frac{1}{M^2},\frac{1}{\sigma^2\sqrt{\iota}},\frac{1}{\sigma^2},\frac{1}{M},\frac{1}{\sigma},\frac{1}{\sigma\sqrt[4]{\iota}},\frac{\sqrt{\lambda}\,\epsilon^2}{144\Delta L(M+\sigma)},\frac{\epsilon}{12\sqrt{L\Delta}},\frac{\sqrt[4]{\lambda}\,\epsilon}{12\sqrt{M+\sigma}},\frac{\epsilon^2}{144}\right\},$$
$$\eta=\frac{\sqrt{\lambda}\beta^2}{6L},\quad T=\left\lfloor\frac{1}{\beta^2}\right\rfloor.$$
Then, for any $\delta\in(0,1)$, it holds with probability at least $1-\delta$ that
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|^2 \le \epsilon^2.$$
Under a constant failure probability $\delta$, Corollary 1 implies that choosing $\beta=\Theta(\epsilon^2)$ and $T=\mathcal{O}(\epsilon^{-4})$ suffices to find an $\epsilon$-stationary point $x$ such that $\|\nabla f(x)\|\le\epsilon$. Here, we hide all terms other than $\epsilon$ in the big-$\mathcal{O}$ and big-$\Theta$ notations.
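To make the hyper-parameter choices concrete, the sketch below transcribes our reading of the constants in Corollary 1 into code; all names are ours, and the result should be viewed as a literal transcription of the corollary rather than a practical tuning recipe:

```python
import math

def corollary1_hyperparams(delta_gap, L0, L1, lam, s0, s1, s2, eps, fail_prob):
    """Compute (beta, eta, T) following Corollary 1 (our transcription)."""
    iota = math.log(1.0 / fail_prob)
    G = 29.0 * max(delta_gap, 1.0 / (L1 * math.sqrt(lam)))
    M = (4.0 * L1 * G + math.sqrt(16.0 * L1**2 * G**2 + 24.0 * L0 * G)) / 3.0
    sigma = s0 + s1 * M**2 + s2 * G
    L = L0 + L1 * M
    beta = min(
        0.5,
        math.sqrt(6.0 * L / (L1 * math.sqrt(lam))),
        6.0 * sigma / math.sqrt(lam),
        1.0 / M**2, 1.0 / (sigma**2 * math.sqrt(iota)), 1.0 / sigma**2,
        1.0 / M, 1.0 / sigma, 1.0 / (sigma * iota**0.25),
        math.sqrt(lam) * eps**2 / (144.0 * delta_gap * L * (M + sigma)),
        eps / (12.0 * math.sqrt(L * delta_gap)),
        lam**0.25 * eps / (12.0 * math.sqrt(M + sigma)),
        eps**2 / 144.0,
    )
    eta = math.sqrt(lam) * beta**2 / (6.0 * L)
    T = math.floor(1.0 / beta**2)
    return beta, eta, T
```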

5. Proof of Theorem 1

To prove Theorem 1, we first introduce the stopping time that is used to bound the function value and the gradient norm as Adagrad iterates. We then present key lemmas (Lemmas 1–3) and derive Theorem 1 in Section 5.1. We describe the high-level idea behind our proof in Section 5.2. We present all of the technical lemmas in Section 5.3 and the proofs of Lemmas 1–3 in Section 5.4, Section 5.5 and Section 5.6, respectively.

5.1. Proof of Theorem 1

We define the stopping time as
$$\tau=\min\{t \mid f(x_t)-f^{\star}>G\}\wedge(T+1),$$
where $a\wedge b$ denotes $\min(a,b)$. In other words, the optimality gap is bounded by $G$ until time $\tau$. We note that this also implies bounded gradients under the $(L_0,L_1)$-smoothness (see Lemma 6 in Section 5.3).
Using the definition of the stopping time τ , we introduce the following key lemmas to prove Theorem 1.
Lemma 1. 
Suppose that Assumptions 1–3 hold. Furthermore, the definitions of $G, M, L$, and $\sigma$, as well as the conditions on $\eta$, $\beta$, and $T$, are identical to those in Theorem 1. Then, the iterates $\{x_t\}_{t<\tau}$ generated by Algorithm 1 satisfy the following:
$$f(x_\tau)-f^{\star} \le f(x_1)-f^{\star}+\frac{\eta\beta^2}{\sqrt{\lambda}}\sum_{t=1}^{\tau-1}\|\epsilon_t\|^2,$$
and
$$\sum_{t=1}^{\tau-1}\|\nabla f(x_t)\|^2 \le \frac{2\beta^2\Delta\sqrt{(M+\sigma)^2(\tau-1)+\lambda}}{\eta}+\frac{2\beta^4\sqrt{(M+\sigma)^2(\tau-1)+\lambda}}{\sqrt{\lambda}}\sum_{t=1}^{\tau-1}\|\epsilon_t\|^2,$$
where $\epsilon_t=m_t-\nabla f(x_t)$.
Lemma 2. 
Suppose that Assumptions 1–3 hold. Furthermore, the definitions of $G, M, L$, and $\sigma$, as well as the conditions on $\eta$, $\beta$, and $T$, are identical to those in Theorem 1. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sum_{t=1}^{\tau-1}\|\epsilon_t\|^2 \le \frac{2\sigma^2}{\beta}+\beta M^2(\tau-2)+2\beta\sigma^2(\tau-2)+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2,$$
where $\epsilon_t=m_t-\nabla f(x_t)$.
Lemma 3. 
Suppose that Assumptions 1–3 hold. Furthermore, the definitions of $G, M, L$, and $\sigma$, as well as the conditions on $\eta$, $\beta$, and $T$, are identical to those in Theorem 1. Suppose that
$$f(x_\tau)-f^{\star} \le f(x_1)-f^{\star}+\frac{\eta\beta^2}{\sqrt{\lambda}}\left[\frac{2\sigma^2}{\beta}+\beta M^2(\tau-2)+2\beta\sigma^2(\tau-2)+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right].$$
Then, it holds that
$$\tau=T+1.$$
We now prove Theorem 1 using Lemmas 1–3.
Proof of Theorem 1. 
According to Lemmas 1–3, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have
$$\begin{aligned}\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|^2 &\le \frac{2\Delta\beta^2\sqrt{(M+\sigma)^2T+\lambda}}{\eta T}+\frac{2\beta^4\sqrt{(M+\sigma)^2T+\lambda}}{\sqrt{\lambda}\,T}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right]\\&\le \frac{2\Delta\beta^2\sqrt{(M+\sigma)^2T+\lambda}}{\eta T}+\frac{2\beta^2\sqrt{(M+\sigma)^2T+\lambda}}{\sqrt{\lambda}\,T}\left[2\beta\sigma^2+1+2\beta\sigma^2+18(1-\beta)\beta\sigma^2\sqrt{\log\frac{1}{\delta}}+4\beta^2\sigma^2\right],\end{aligned} \tag{3}$$
where the second inequality comes from $\beta\le1/M^2$ and $T\le1/\beta^2$.
By applying $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$ to the RHS, we obtain the bound in Theorem 1. □

5.2. The High-Level Idea Behind the Proof of Theorem 1

In this section, we illustrate the main idea behind our proof and how we use the technical lemmas in Section 5.3. Our proof mainly relies on the stopping time result (Lemma 3), which enables us to bound the optimality gap until the end of the run of Adagrad with momentum. This, in turn, allows us to bound the gradient norm (Lemma 7) and to treat the $(L_0,L_1)$-smoothness as the standard smoothness assumption (Lemma 5). Furthermore, under the bounded gradient norm (and the bounded number of iterations $T$), we can derive upper and lower bounds on the adaptive step size (Lemma 7). Using these observations and analyzing the bias between the update vector $m_t$ and the true gradient (Lemma 2), we derive our dimension-independent convergence rate.

5.3. Technical Lemmas

We introduce the following technical lemmas.
Lemma 4. 
The iterates $\{x_t\}_{t\le T}$ generated by Algorithm 1 satisfy, for all $t\ge1$,
$$\|x_{t+1}-x_t\| \le \eta\sqrt{\beta} \le \eta.$$
Moreover, for any function $f:\mathbb{R}^d\to\mathbb{R}$ satisfying Assumption 2, it holds that
$$\|\nabla f(x_{t+1})-\nabla f(x_t)\| \le \left(L_0+L_1\|\nabla f(x_t)\|\right)\eta\sqrt{\beta} \le \left(L_0+L_1\|\nabla f(x_t)\|\right)\eta.$$
Proof. 
According to the definition of Algorithm 1, we have
$$\|x_{t+1}-x_t\| = \eta\beta\left\|\frac{\sum_{k=1}^{t}(1-\beta)^{t-k}g_k}{\sqrt{\sum_{k=1}^{t}g_k^2+\lambda}}\right\| \le \eta\beta\left\|\frac{\sum_{k=1}^{t}(1-\beta)^{t-k}g_k}{\sqrt{\sum_{k=1}^{t}(1-\beta)^{t-k}g_k^2}}\right\| \le \eta\beta\sqrt{\sum_{k=1}^{t}(1-\beta)^{t-k}} = \eta\beta\sqrt{\frac{1-(1-\beta)^{t}}{\beta}} \le \eta\sqrt{\beta},$$
where all operations inside the norms are coordinate-wise. Here, the second inequality follows from Cauchy's inequality applied coordinate-wise, and for the last equality, we use the sum formula of the geometric sequence. □
Lemma 5 
(Lemma from Faw et al. [10] and Zhang et al. [16]). For any function $f:\mathbb{R}^d\to\mathbb{R}$ satisfying Assumption 2, the sequence of iterates $\{x_t\}_{t\le T}$ generated by Algorithm 1 with $\eta\le1/L_1$ satisfies, for all $t\ge1$,
$$f(x_{t+1}) \le f(x_t)+\langle\nabla f(x_t),x_{t+1}-x_t\rangle+\frac{L_0+L_1\|\nabla f(x_t)\|}{2}\|x_{t+1}-x_t\|^2.$$
Lemma 6. 
For any function $f:\mathbb{R}^d\to\mathbb{R}$ satisfying Assumptions 1 and 2, the following inequality holds:
$$\|\nabla f(x)\|^2 \le \frac{8\left(L_0+L_1\|\nabla f(x)\|\right)}{3}\left(f(x)-f^{\star}\right).$$
Proof. 
Let $y=x-\frac{1}{2(L_0+L_1\|\nabla f(x)\|)}\nabla f(x)$. Then, we have
$$\|y-x\| = \frac{\|\nabla f(x)\|}{2L_0+2L_1\|\nabla f(x)\|} \le \frac{\|\nabla f(x)\|}{2L_1\|\nabla f(x)\|} = \frac{1}{2L_1} \le \frac{1}{L_1}.$$
This implies that
$$f^{\star}-f(x) \le f(y)-f(x) \le \langle\nabla f(x),y-x\rangle+\frac{L_0+L_1\|\nabla f(x)\|}{2}\|y-x\|^2 = -\frac{3}{8\left(L_0+L_1\|\nabla f(x)\|\right)}\|\nabla f(x)\|^2,$$
where the second inequality is from Lemma 5. □
Lemma 7. 
Suppose that Assumptions 1–3 hold. Furthermore, the definitions of $G, M, L$, and $\sigma$, as well as the conditions on $\eta$, $\beta$, and $T$, are identical to those in Theorem 1. Suppose that $f(x_t)-f^{\star}\le G$ for all $t<\tau$ for a given $\tau$. Then, it holds that
$$\|\nabla f(x_t)\| \le M,\qquad f(x_{t+1}) \le f(x_t)+\langle\nabla f(x_t),x_{t+1}-x_t\rangle+\frac{L}{2}\|x_{t+1}-x_t\|^2,\qquad \|g_t\| \le M+\sigma,$$
$$\frac{\eta}{\sqrt{(M+\sigma)^2t+\lambda}}\,I \;\preceq\; H_t := \mathrm{diag}\left(\frac{\eta}{\sqrt{\sum_{k=1}^{t}g_k^2+\lambda}}\right) \;\preceq\; \frac{\eta}{\sqrt{\lambda}}\,I,$$
where $I$ denotes the $d\times d$ identity matrix and $\mathrm{diag}(v)$ for $v\in\mathbb{R}^n$ denotes the diagonal matrix whose diagonal entries are given by the components of the vector $v$.
Proof. 
According to Lemma 6, we have the following inequalities:
$$\|\nabla f(x_t)\|^2 \le \frac{8\left(L_0+L_1\|\nabla f(x_t)\|\right)}{3}\left(f(x_t)-f^{\star}\right) \iff \frac{3\|\nabla f(x_t)\|^2}{8\left(L_0+L_1\|\nabla f(x_t)\|\right)} \le f(x_t)-f^{\star}.$$
Define the function $g(k):=\frac{3k^2}{8(L_0+L_1k)}$ over $k\ge0$. It is straightforward to verify that $g'(k)\ge0$ for all $k\ge0$, which implies that $g$ is an increasing function and is therefore invertible. Consequently, $g^{-1}$ is also an increasing function and is given as follows:
$$g^{-1}(y)=\frac{4L_1y+\sqrt{16L_1^2y^2+24L_0y}}{3}.$$
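As a quick numeric sanity check (ours, not part of the paper), one can verify that this expression indeed inverts $g$:

```python
import numpy as np

# Check that g^{-1}(g(k)) = k for g(k) = 3k^2 / (8(L0 + L1*k)) on k >= 0,
# with arbitrary positive constants L0, L1 (our choice).
L0, L1 = 0.5, 2.0
k = np.linspace(0.0, 10.0, 1001)
y = 3.0 * k**2 / (8.0 * (L0 + L1 * k))
k_rec = (4.0 * L1 * y + np.sqrt(16.0 * L1**2 * y**2 + 24.0 * L0 * y)) / 3.0
assert np.allclose(k_rec, k)
```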
If $f(x)-f^{\star}\le G$, then
$$\|\nabla f(x)\| \le g^{-1}\left(f(x)-f^{\star}\right) \le g^{-1}(G) = M.$$
This implies that for all $t<\tau$, we have $f(x_t)-f^{\star}\le G$ and $\|\nabla f(x_t)\|\le M$. Then, $L_0+L_1\|\nabla f(x_t)\|\le L_0+L_1M=L$. Using this bound, we can apply the descent lemma (refer to Lemma 5) as the starting point for the proof. Specifically, for all $t<\tau$, if $\eta\le1/L_1$, the descent lemma gives
$$f(x_{t+1}) \le f(x_t)+\langle\nabla f(x_t),x_{t+1}-x_t\rangle+\frac{L_0+L_1\|\nabla f(x_t)\|}{2}\|x_{t+1}-x_t\|^2 \le f(x_t)+\langle\nabla f(x_t),x_{t+1}-x_t\rangle+\frac{L}{2}\|x_{t+1}-x_t\|^2.$$
Next, according to Assumption 3, for all $t<\tau$, it holds that
$$\|g_t\| \le \|g_t-\nabla f(x_t)\|+\|\nabla f(x_t)\| \le M+\sigma.$$
Finally, since $\lambda \le \sum_{k=1}^{t}g_{k,i}^2+\lambda \le (M+\sigma)^2t+\lambda$ for each coordinate $i$, the claimed bounds on $H_t$ follow. □
Lemma 8 
(Young’s inequality with ϵ ). For any ϵ > 0 and a , b R , we have
a b ϵ a 2 + b 2 4 ϵ .
Proof. 
Observe that
$$\left(\sqrt{\epsilon}\,a-\frac{b}{2\sqrt{\epsilon}}\right)^2 = \epsilon a^2-ab+\frac{b^2}{4\epsilon} \ge 0.$$
By rearranging the terms, we obtain the desired result. □
Lemma 9. 
For any $k>0$ and $a,b\in\mathbb{R}^n$, we have
$$\|a+b\|^2 \le (1+k)\|a\|^2+\left(1+\frac{1}{k}\right)\|b\|^2.$$
Proof. 
We want to show
$$(1+k)\|a\|^2+\left(1+\frac{1}{k}\right)\|b\|^2-\left(\|a\|^2+\|b\|^2\right)-2\langle a,b\rangle \ge 0,$$
which is equivalent to
$$k\|a\|^2+\frac{1}{k}\|b\|^2-2\langle a,b\rangle \ge 0.$$
Since
$$\left\|\sqrt{k}\,a-\frac{1}{\sqrt{k}}\,b\right\|^2 = k\|a\|^2+\frac{1}{k}\|b\|^2-2\langle a,b\rangle \ge 0,$$
we obtain the desired result. □
Lemma 10 
(The Azuma–Hoeffding inequality). Let $Z_1,Z_2,\ldots,Z_T$ be a martingale difference sequence with respect to a filtration $\mathcal{F}_{t-1}$. Suppose that there is a constant $b$ such that for any $t$,
$$\mathbb{P}\left(|Z_t|\le b\right)=1.$$
Then, for any positive integer $T$ and for any $\delta>0$, it holds with probability at least $1-\delta$ that
$$\frac{1}{T}\sum_{t=1}^{T}Z_t \le b\sqrt{\frac{2\log(1/\delta)}{T}}.$$
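A short Monte Carlo illustration (ours) of Lemma 10 with bounded Rademacher increments, for which $b=1$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, delta, runs = 5_000, 0.01, 1_000
Z = rng.choice([-1.0, 1.0], size=(runs, T))   # martingale differences, |Z_t| <= 1
averages = Z.mean(axis=1)
bound = np.sqrt(2.0 * np.log(1.0 / delta) / T)
# The empirical failure rate should not exceed delta.
print(f"P(avg > bound) ~ {(averages > bound).mean():.4f} (delta = {delta})")
```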

5.4. Proof of Lemma 1

For all $t<\tau$, under the condition $\eta\le1/L_1$, the following holds according to Lemma 7:
$$\begin{aligned}
f(x_{t+1}) &\le f(x_t)+\langle\nabla f(x_t),x_{t+1}-x_t\rangle+\frac{L}{2}\|x_{t+1}-x_t\|^2\\
&= f(x_t)-\nabla f(x_t)^TH_tm_t+\frac{L}{2}m_t^TH_t^2m_t\\
&\le f(x_t)-\nabla f(x_t)^TH_t\left(m_t-\nabla f(x_t)\right)-\|\nabla f(x_t)\|_{H_t}^2+\frac{\eta L}{2\sqrt{\lambda}}\|m_t\|_{H_t}^2\\
&= f(x_t)-\|\nabla f(x_t)\|_{H_t}^2-\nabla f(x_t)^TH_t\epsilon_t+\frac{\eta L}{2\sqrt{\lambda}}\|\nabla f(x_t)+\epsilon_t\|_{H_t}^2\\
&\overset{(a)}{\le} f(x_t)-\frac{2}{3\beta^2}\|\nabla f(x_t)\|_{H_t}^2+\frac{3\beta^2}{4}\|\epsilon_t\|_{H_t}^2+\frac{\eta L}{\sqrt{\lambda}}\left(\|\nabla f(x_t)\|_{H_t}^2+\|\epsilon_t\|_{H_t}^2\right)\\
&\overset{(b)}{\le} f(x_t)-\frac{1}{2\beta^2}\|\nabla f(x_t)\|_{H_t}^2+\beta^2\|\epsilon_t\|_{H_t}^2,
\end{aligned}$$
where $\epsilon_t:=m_t-\nabla f(x_t)$. For $(a)$, we apply Lemma 8 with $\epsilon=\frac{1}{3\beta^2}$ and Cauchy's inequality. For $(b)$, we utilize the step size condition $\eta\le\frac{\sqrt{\lambda}\beta^2}{6L}\iff\frac{\eta L}{\sqrt{\lambda}}\le\frac{\beta^2}{6}$ for $0<\beta<1$.
Then, according to the lower and upper bounds of $H_t$ in Lemma 7, we have
$$f(x_{t+1}) \le f(x_t)-\frac{\eta}{2\beta^2\sqrt{(M+\sigma)^2t+\lambda}}\|\nabla f(x_t)\|^2+\frac{\eta\beta^2}{\sqrt{\lambda}}\|\epsilon_t\|^2.$$
By rearranging the above inequality, we have
$$\frac{\eta}{2\beta^2\sqrt{(M+\sigma)^2t+\lambda}}\|\nabla f(x_t)\|^2 \le f(x_t)-f(x_{t+1})+\frac{\eta\beta^2}{\sqrt{\lambda}}\|\epsilon_t\|^2.$$
This implies that for all $t<\tau$, it holds that
$$\frac{\eta}{2\beta^2\sqrt{(M+\sigma)^2(\tau-1)+\lambda}}\|\nabla f(x_t)\|^2 \le \frac{\eta}{2\beta^2\sqrt{(M+\sigma)^2t+\lambda}}\|\nabla f(x_t)\|^2 \le f(x_t)-f(x_{t+1})+\frac{\eta\beta^2}{\sqrt{\lambda}}\|\epsilon_t\|^2.$$
Taking the summation from $t=1$ to $\tau-1$, we obtain
$$\frac{\eta}{2\beta^2\sqrt{(M+\sigma)^2(\tau-1)+\lambda}}\sum_{t=1}^{\tau-1}\|\nabla f(x_t)\|^2 \le f(x_1)-f^{\star}-\left(f(x_\tau)-f^{\star}\right)+\frac{\eta\beta^2}{\sqrt{\lambda}}\sum_{t=1}^{\tau-1}\|\epsilon_t\|^2.$$
From the above inequality, the bounds in Lemma 1 follow:
$$f(x_\tau)-f^{\star} \le f(x_1)-f^{\star}+\frac{\eta\beta^2}{\sqrt{\lambda}}\sum_{t=1}^{\tau-1}\|\epsilon_t\|^2$$
and
$$\sum_{t=1}^{\tau-1}\|\nabla f(x_t)\|^2 \le \frac{2\beta^2\sqrt{(M+\sigma)^2(\tau-1)+\lambda}}{\eta}\left(f(x_1)-f^{\star}\right)+\frac{2\beta^4\sqrt{(M+\sigma)^2(\tau-1)+\lambda}}{\sqrt{\lambda}}\sum_{t=1}^{\tau-1}\|\epsilon_t\|^2. \;\square$$

5.5. Proof of Lemma 2

$\epsilon_t$ can be represented as follows:
$$\epsilon_t = m_t-\nabla f(x_t) = \underbrace{(1-\beta)\left(\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t)\right)}_{(\star)}+\beta\left(g_t-\nabla f(x_t)\right).$$
This representation highlights how $\epsilon_t$ is recursively defined in terms of the momentum term $m_t$, the gradient differences, and the stochastic gradient noise. First, for all $t<\tau$, by using $\eta\le1/L_1$, Assumption 2, and Lemma 7, we have
$$\|\nabla f(x_{t-1})-\nabla f(x_t)\|^2 \le \left(L_0+L_1\|\nabla f(x_t)\|\right)^2\|x_t-x_{t-1}\|^2 \le L^2\|x_t-x_{t-1}\|^2 \le \frac{\eta^2L^2}{\lambda}\|m_{t-1}\|^2 \le \frac{2\eta^2L^2}{\lambda}\left(\|\nabla f(x_{t-1})\|^2+\|\epsilon_{t-1}\|^2\right). \tag{4}$$
Using this, we can bound the term $(\star)$ as follows:
$$\begin{aligned}
\left\|(1-\beta)\epsilon_{t-1}+(1-\beta)\left(\nabla f(x_{t-1})-\nabla f(x_t)\right)\right\|^2 &\overset{(a)}{\le} (1-\beta)^2(1+\beta)\|\epsilon_{t-1}\|^2+(1-\beta)^2\left(1+\frac{1}{\beta}\right)\|\nabla f(x_{t-1})-\nabla f(x_t)\|^2\\
&\overset{(b)}{\le} (1-\beta)\|\epsilon_{t-1}\|^2+\frac{1}{\beta}\|\nabla f(x_{t-1})-\nabla f(x_t)\|^2\\
&\overset{(c)}{\le} (1-\beta)\|\epsilon_{t-1}\|^2+\frac{2\eta^2L^2}{\beta\lambda}\left(\|\nabla f(x_{t-1})\|^2+\|\epsilon_{t-1}\|^2\right)\\
&\overset{(d)}{\le} \left(1-\frac{\beta}{2}\right)\|\epsilon_{t-1}\|^2+\frac{2\eta^2L^2}{\beta\lambda}\|\nabla f(x_{t-1})\|^2\\
&\overset{(e)}{\le} \left(1-\frac{\beta}{2}\right)\|\epsilon_{t-1}\|^2+\frac{\beta^2}{2}\|\nabla f(x_{t-1})\|^2.
\end{aligned}$$
In $(a)$, we use Lemma 9 with $k=\beta$. $(b)$ follows from the following inequalities:
$$(1-\beta)^2(1+\beta) = (1-\beta)(1-\beta^2) \le 1-\beta,\qquad (1-\beta)^2\left(1+\frac{1}{\beta}\right) = \frac{1}{\beta}(1-\beta)^2(1+\beta) \le \frac{1}{\beta}(1-\beta) \le \frac{1}{\beta}.$$
$(c)$ is due to (4). $(d)$ is derived from the step size condition $\eta\le\frac{\sqrt{\lambda}\beta^2}{6L}\le\frac{\sqrt{\lambda}\beta}{2L}$. Finally, $(e)$ is due to the step size condition $\eta\le\frac{\sqrt{\lambda}\beta^2}{6L}\le\frac{\sqrt{\lambda}\beta^{3/2}}{2L}$. Hence,
$$\begin{aligned}
\|\epsilon_t\|^2 &= \left\|(1-\beta)\left(\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t)\right)\right\|^2+\beta^2\|g_t-\nabla f(x_t)\|^2+2(1-\beta)\beta\left\langle\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t),\,g_t-\nabla f(x_t)\right\rangle\\
&\le \left(1-\frac{\beta}{2}\right)\|\epsilon_{t-1}\|^2+\frac{\beta^2}{2}\|\nabla f(x_{t-1})\|^2+\beta^2\|g_t-\nabla f(x_t)\|^2+2(1-\beta)\beta\left\langle\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t),\,g_t-\nabla f(x_t)\right\rangle.
\end{aligned}$$
Then,
$$\frac{\beta}{2}\|\epsilon_{t-1}\|^2 \le \|\epsilon_{t-1}\|^2-\|\epsilon_t\|^2+\frac{\beta^2}{2}\|\nabla f(x_{t-1})\|^2+\beta^2\|g_t-\nabla f(x_t)\|^2+2(1-\beta)\beta\left\langle\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t),\,g_t-\nabla f(x_t)\right\rangle.$$
Multiplying by $2/\beta$ and taking the summation from $t=2$ to $\tau-1$, we obtain
$$\begin{aligned}
\sum_{t=2}^{\tau-1}\|\epsilon_{t-1}\|^2 = \sum_{t=1}^{\tau-2}\|\epsilon_t\|^2 &\le \frac{2}{\beta}\left(\|\epsilon_1\|^2-\|\epsilon_{\tau-1}\|^2\right)+\beta\sum_{t=2}^{\tau-1}\|\nabla f(x_{t-1})\|^2+2\beta\sum_{t=2}^{\tau-1}\|g_t-\nabla f(x_t)\|^2+4(1-\beta)\sum_{t=2}^{\tau-1}\left\langle\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t),\,g_t-\nabla f(x_t)\right\rangle\\
&\le \frac{2\sigma^2}{\beta}+\beta M^2(\tau-2)+2\beta\sigma^2(\tau-2)+4(1-\beta)\sum_{t=2}^{\tau-1}\left\langle\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t),\,g_t-\nabla f(x_t)\right\rangle,
\end{aligned}$$
where we use Lemma 7 for the last inequality. To handle the term $\sum_{t=2}^{\tau-1}\langle\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t),g_t-\nabla f(x_t)\rangle$, we apply the Azuma–Hoeffding inequality (Lemma 10). First, we observe that since $\tau-1\le T$ by its definition,
$$\sum_{t=2}^{\tau-1}\left\langle\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t),\,g_t-\nabla f(x_t)\right\rangle = \sum_{t=2}^{T}\left\langle\left(\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t)\right)\mathbb{1}_{\tau>t},\,g_t-\nabla f(x_t)\right\rangle.$$
Let $X_t=\langle(\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t))\mathbb{1}_{\tau>t},g_t-\nabla f(x_t)\rangle$ for $t\in\{2,\ldots,T\}$. We now show $\|\epsilon_t\|\le2\sigma$ for all $t<\tau$ through mathematical induction on $t$. When $t=1$, according to Lemma 7, it holds that
$$\|\epsilon_1\| = \|g_1-\nabla f(x_1)\| \le \sigma \le 2\sigma.$$
Furthermore, according to Lemma 4, we have
$$\|\nabla f(x_{t-1})-\nabla f(x_t)\| \le L\eta\sqrt{\beta} \le \beta\sigma,$$
where the last inequality comes from $\eta\le\beta\sigma/L$ and $\beta\le1$. Recall that
$$\epsilon_t = (1-\beta)\left(\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t)\right)+\beta\left(g_t-\nabla f(x_t)\right).$$
Now, suppose that $\|\epsilon_{t-1}\|\le2\sigma$. Then, we have
$$\|\epsilon_t\| \le (1-\beta)\|\epsilon_{t-1}\|+(1-\beta)\|\nabla f(x_{t-1})-\nabla f(x_t)\|+\beta\|g_t-\nabla f(x_t)\| \le (2-\beta)\sigma+\|\nabla f(x_{t-1})-\nabla f(x_t)\| \le (2-\beta)\sigma+\beta\sigma = 2\sigma,$$
where we use the induction hypothesis and $\|g_t-\nabla f(x_t)\|\le\sigma$ for the second inequality and $\|\nabla f(x_{t-1})-\nabla f(x_t)\|\le\beta\sigma$ for the last inequality. Hence, it holds that
$$|X_t| \le \left\|\epsilon_{t-1}+\nabla f(x_{t-1})-\nabla f(x_t)\right\|\cdot\left\|g_t-\nabla f(x_t)\right\| \le \sigma\left(\|\epsilon_{t-1}\|+\|\nabla f(x_{t-1})-\nabla f(x_t)\|\right) \le \sigma(2\sigma+\beta\sigma) \le 3\sigma^2,$$
where we use $\|\nabla f(x_{t-1})-\nabla f(x_t)\|\le\beta\sigma$ and $\beta\le1$ for the last inequality. Note that $\mathbb{E}[X_t\mid X_1,\ldots,X_{t-1}]=0$ by its definition. This implies that $\{X_t\}_{t\le T}$ is a martingale difference sequence. Now, we apply the Azuma–Hoeffding inequality (Lemma 10) to obtain the following: with probability at least $1-\delta$,
$$\sum_{t=2}^{T}X_t \le 3\sigma^2\sqrt{2T\log\frac{1}{\delta}} \le \frac{9}{2}\sigma^2\sqrt{T\log\frac{1}{\delta}}.$$
Therefore, with probability at least $1-\delta$,
$$\sum_{t=1}^{\tau-2}\|\epsilon_t\|^2 \le \frac{2\sigma^2}{\beta}+\beta M^2(\tau-2)+2\beta\sigma^2(\tau-2)+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}.$$
With $\|\epsilon_{\tau-1}\|^2\le4\sigma^2$, which follows from $\|\epsilon_{\tau-1}\|\le2\sigma$, we obtain
$$\sum_{t=1}^{\tau-1}\|\epsilon_t\|^2 \le \frac{2\sigma^2}{\beta}+\beta M^2(\tau-2)+2\beta\sigma^2(\tau-2)+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2. \;\square$$

5.6. Proof of Lemma 3

According to Lemmas 1 and 2, with probability at least $1-\delta$, we have
$$\begin{aligned}
f(x_\tau)-f^{\star} &\le f(x_1)-f^{\star}+\frac{\eta\beta^2}{\sqrt{\lambda}}\left[\frac{2\sigma^2}{\beta}+\beta M^2(\tau-2)+2\beta\sigma^2(\tau-2)+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right]\\
&\le f(x_1)-f^{\star}+\frac{\beta^2}{L_1\sqrt{\lambda}}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right] =: I_1,
\end{aligned}$$
where the last inequality is from $\eta\le1/L_1$ and $\tau\le T+1$. If $\tau\le T$, then according to the definition of $\tau$, we can immediately derive the following lower bound on $f(x_\tau)$: $f(x_\tau)-f^{\star}>G$. We now show $\tau=T+1$ by showing that this lower bound exceeds the upper bound $I_1$, which results in a contradiction. Specifically, we show that $I_1/G<1$ as follows:
$$\frac{I_1}{G} \le \frac{\Delta}{G}+\frac{\beta^2}{L_1\sqrt{\lambda}\,G}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right] \le \frac{\Delta}{G}+\frac{2}{L_1\sqrt{\lambda}\,G}+\frac{1}{L_1\sqrt{\lambda}\,G}+\frac{2}{L_1\sqrt{\lambda}\,G}+\frac{18}{L_1\sqrt{\lambda}\,G}+\frac{4}{L_1\sqrt{\lambda}\,G} \le \frac{28}{29} < 1.$$
For the second inequality, we use the conditions $T\le1/\beta^2$ and $\beta\le\min\{1,\frac{1}{\sigma^2},\frac{1}{M^2},\frac{1}{\sigma^2\sqrt{\iota}}\}$. For the last inequality, we use $G\ge\frac{29}{L_1\sqrt{\lambda}}$ and $G\ge29\Delta$. Since $I_1/G<1$ (i.e., $I_1<G$), which contradicts our assumption that $\tau\le T$, it holds that $\tau=T+1$. □

6. Proof of Corollary 1

First, we restate the hyper-parameter conditions in Corollary 1 for convenience:
$$\eta=\frac{\sqrt{\lambda}\beta^2}{6L},\quad T=\left\lfloor\frac{1}{\beta^2}\right\rfloor,\quad 0<\beta\le\min\left\{\frac{1}{2},\sqrt{\frac{6L}{L_1\sqrt{\lambda}}},\frac{6\sigma}{\sqrt{\lambda}},\frac{1}{M^2},\frac{1}{\sigma^2\sqrt{\iota}},\frac{1}{\sigma^2},\frac{1}{M},\frac{1}{\sigma},\frac{1}{\sigma\sqrt[4]{\iota}},\frac{\sqrt{\lambda}\,\epsilon^2}{144\Delta L(M+\sigma)},\frac{\epsilon}{12\sqrt{L\Delta}},\frac{\sqrt[4]{\lambda}\,\epsilon}{12\sqrt{M+\sigma}},\frac{\epsilon^2}{144}\right\}. \tag{5}$$
Note that since $\beta\le\min\left\{\sqrt{6L/(L_1\sqrt{\lambda})},\,6\sigma/\sqrt{\lambda}\right\}$, we have $\eta=\frac{\sqrt{\lambda}\beta^2}{6L}\le\min\left\{\frac{1}{L_1},\frac{\beta\sigma}{L}\right\}$; i.e., the condition on $\eta$ in Lemmas 1–3 is satisfied. According to Lemmas 1–3, we derive the following inequality as in (3): for any $\delta\in(0,1)$,
$$\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|^2 &\le \frac{2\Delta\beta^2\sqrt{(M+\sigma)^2T+\lambda}}{\eta T}+\frac{2\beta^4\sqrt{(M+\sigma)^2T+\lambda}}{\sqrt{\lambda}\,T}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right]\\
&\le \frac{2\Delta\beta^2(M+\sigma)}{\eta\sqrt{T}}+\frac{2\Delta\beta^2\sqrt{\lambda}}{\eta T}+\left(\frac{2\beta^4(M+\sigma)}{\sqrt{\lambda}\sqrt{T}}+\frac{2\beta^4}{T}\right)\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right]
\end{aligned}$$
with probability at least $1-\delta$, where the second inequality is from $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$.
First, we bound the following terms:
$$\frac{2\Delta\beta^2(M+\sigma)}{\eta\sqrt{T}}+\frac{2\Delta\beta^2\sqrt{\lambda}}{\eta T}.$$
Since $T=\lfloor1/\beta^2\rfloor$, we have
$$\frac{1}{\beta^2}-1 \le T \le \frac{1}{\beta^2}.$$
This implies that
$$\frac{1}{T} \le \frac{\beta^2}{1-\beta^2} = \frac{\beta^2}{(1-\beta)(1+\beta)} \le \frac{\beta^2}{1-\beta} \le 2\beta^2, \tag{6}$$
where the last inequality holds because $1-\beta\ge1/2$. By using $\eta=\sqrt{\lambda}\beta^2/(6L)$ and (6), we obtain
$$\frac{2\Delta\beta^2(M+\sigma)}{\eta\sqrt{T}}+\frac{2\Delta\beta^2\sqrt{\lambda}}{\eta T} \le \frac{18L\Delta\beta(M+\sigma)}{\sqrt{\lambda}}+24L\Delta\beta^2 \le \frac{42}{144}\epsilon^2,$$
where the last inequality is due to the following condition on $\beta$:
$$\beta \le \min\left\{\frac{\sqrt{\lambda}\,\epsilon^2}{144\Delta L(M+\sigma)},\frac{\epsilon}{12\sqrt{L\Delta}}\right\}.$$
Next, we bound the following term:
$$\frac{2\beta^4(M+\sigma)}{\sqrt{\lambda}\sqrt{T}}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right].$$
Using the condition $T=\lfloor1/\beta^2\rfloor\le1/\beta^2$ and (6), we obtain
$$\frac{2\beta^4(M+\sigma)}{\sqrt{\lambda}\sqrt{T}}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right] \le \frac{3\beta^5(M+\sigma)}{\sqrt{\lambda}}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right] \le \frac{3\beta^2(M+\sigma)}{\sqrt{\lambda}}\left[2\beta^2\sigma^2+\beta^2M^2+2\beta^2\sigma^2+18\beta^2\sigma^2\sqrt{\iota}+4\beta^3\sigma^2\right].$$
Then, according to the following condition on $\beta$:
$$\beta \le \min\left\{\frac{1}{2},\frac{1}{\sigma},\frac{1}{M},\frac{1}{\sigma\sqrt[4]{\iota}},\frac{\sqrt[4]{\lambda}\,\epsilon}{12\sqrt{M+\sigma}}\right\},$$
we obtain
$$\frac{3\beta^2(M+\sigma)}{\sqrt{\lambda}}\left[2\beta^2\sigma^2+\beta^2M^2+2\beta^2\sigma^2+18\beta^2\sigma^2\sqrt{\iota}+4\beta^3\sigma^2\right] \le \frac{75\beta^2(M+\sigma)}{\sqrt{\lambda}} \le \frac{75}{144}\epsilon^2.$$
Lastly, we bound the following term:
$$\frac{2\beta^4}{T}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right].$$
Using (6), we obtain
$$\frac{2\beta^4}{T}\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right] \le 4\beta^6\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right].$$
Then, according to the following condition on $\beta$:
$$\beta \le \min\left\{\frac{1}{2},\frac{1}{\sigma},\frac{1}{M},\frac{1}{\sigma\sqrt[4]{\iota}},\frac{\epsilon^2}{144}\right\},$$
it holds that
$$4\beta^6\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right] \le \beta^4\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right] \le 25\beta \le \frac{25}{144}\epsilon^2.$$
Therefore, for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have
$$\frac{1}{T}\sum_{t=1}^{T}\|\nabla f(x_t)\|^2 \le \frac{2\Delta\beta^2(M+\sigma)}{\eta\sqrt{T}}+\frac{2\Delta\beta^2\sqrt{\lambda}}{\eta T}+\left(\frac{2\beta^4(M+\sigma)}{\sqrt{\lambda}\sqrt{T}}+\frac{2\beta^4}{T}\right)\left[\frac{2\sigma^2}{\beta}+\beta M^2T+2\beta\sigma^2T+18(1-\beta)\sigma^2\sqrt{T\log\frac{1}{\delta}}+4\sigma^2\right] \le \frac{142}{144}\epsilon^2 \le \epsilon^2. \;\square$$

7. Conclusions

In this paper, we proved dimension-independent convergence rates for Adagrad with momentum under the $(L_0,L_1)$-smoothness assumption. We demonstrated that Adagrad with momentum converges to a stationary point at a rate of $\mathcal{O}(1/\sqrt{T})$ that scales with neither the dimension $d$ nor $1/\beta$. We believe that our results can improve the theoretical understanding of adaptive gradient methods with momentum.

Author Contributions

Conceptualization, K.N. and S.P.; methodology, K.N.; validation, S.P.; investigation, K.N.; writing—original draft preparation, K.N.; writing—review and editing, S.P.; supervision, S.P.; project administration, S.P.; funding acquisition, S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2019-0-00079, Artificial Intelligence Graduate School Program, Korea University); supported by the Culture, Sports and Tourism R&D Program through a Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2024 (Project Name: International Collaborative Research and Global Talent Development for the Development of Copyright Management and Protection Technologies for Generative AI; Project Number: RS-2024-00345025); and partially supported by the Culture, Sports, and Tourism R&D Program through another Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2024 (Project Name: Research on neural watermark technology for copyright protection of generative AI 3D content, RS-2024-00348469, 25%).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Discussion on $(L_0, L_1)$-Smoothness

A differentiable function $f:\mathbb{R}^d\to\mathbb{R}$ is $L$-smooth if a constant $L>0$ exists such that
$$\|\nabla f(x)-\nabla f(y)\| \le L\|x-y\|\quad\text{for all } x,y\in\mathbb{R}^d.$$
For twice-differentiable functions, this is equivalent to $\|\nabla^2f(x)\|\le L$ for all $x\in\mathbb{R}^d$. Carmon et al. [19] demonstrated that the gradient descent algorithm with a learning rate of $\eta=1/L$ is optimal for optimizing $L$-smooth, non-convex functions. However, the assumption that the Hessian norm is globally bounded by a constant $L$ may exclude a wide range of functions. To address this limitation, Zhang et al. [17] conducted experiments and observed the following:
The smoothness of a function is positively correlated with the gradient norm.
This observation led to the proposal of a more flexible smoothness condition, where local smoothness grows with the gradient norm. Specifically, a twice-differentiable function $f$ is $(L_0,L_1)$-smooth if
$$\|\nabla^2f(x)\| \le L_0+L_1\|\nabla f(x)\|.$$
That is, any $L$-smooth function is $(L_0,L_1)$-smooth for any $L_0\ge L$. Furthermore, $(L_0,L_1)$-smoothness is strictly weaker than $L$-smoothness.
Lemma A1 
(Lemma from Zhang et al. [17]). Both $f(x)=\sum_{i=1}^{n}a_ix^i$ with $n\ge3$ and $g(x)=\exp(x)$ are $(L_0,L_1)$-smooth for some $L_0$ and $L_1$ but not $L$-smooth for any $L$.
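For instance, for $g(x)=\exp(x)$ we have $g''(x)=\exp(x)=|g'(x)|$, so the Hessian bound holds with $(L_0,L_1)=(0,1)$, while $g''$ is unbounded and hence no global $L$ works. The following check (ours) illustrates this numerically:

```python
import numpy as np

x = np.linspace(-5.0, 20.0, 1001)
g1 = np.exp(x)   # g'(x)
g2 = np.exp(x)   # g''(x)
assert np.all(g2 <= 0.0 + 1.0 * np.abs(g1))   # (L0, L1)-smooth with (0, 1)
print("max g'' on the grid:", g2.max())       # grows without bound in x
```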
In a subsequent study, Zhang et al. [16] provided an equivalent definition of $(L_0,L_1)$-smoothness for differentiable functions. According to this definition, constants $L_0,L_1>0$ exist such that if $\|x-y\|\le1/L_1$, then
$$\|\nabla f(x)-\nabla f(y)\| \le \left(L_0+L_1\|\nabla f(y)\|\right)\|x-y\|.$$
Since then, many studies have analyzed the convergence rate of algorithms under $(L_0,L_1)$-smoothness assumptions (e.g., Hong and Lin [1], Faw et al. [10], and Zhang et al. [16]). Based on these previous studies, we conducted our analysis under the $(L_0,L_1)$-smoothness assumption, which more accurately reflects the loss landscape of neural networks than $L$-smoothness.

Appendix B. Experimental Results

In this section, we present three experiments that demonstrate that Adagrad with momentum converges to a stationary point and that its convergence rate does not degrade as the dimensionality of the input increases. We run experiments for the logistic regression problem with the following objective function:
$$\min_{x\in\mathbb{R}^d}\frac{1}{100}\sum_{i=1}^{100}\log\left(1+\exp(a_i^{\top}x)\right),$$
where each $a_i$ is sampled from the $d$-dimensional unit sphere. For each iteration of Adagrad with momentum, we randomly sample $i_t\in\{1,\ldots,100\}$ and choose the stochastic gradient $g_t$ as the gradient of $\log(1+\exp(a_{i_t}^{\top}x))$. Then, one can observe that the objective function is $(1/4,1)$-smooth and our choice of the stochastic gradient satisfies Assumption 3 with $\sigma_0=2$ and $\sigma_1=\sigma_2=1$. In the experiments, we use the zero initialization (i.e., $x_1=(0,\ldots,0)$), $T=60{,}000$, $\eta=0.004$, and $\beta=0.004$ for Adagrad with momentum and vary the input dimension $d$ from 5000 to 500,000. Figure A1 summarizes the experimental results; the convergence speed of Adagrad with momentum does not decrease as $d$ increases.
Figure A1. Squared gradient norm per iteration.
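A minimal sketch of this experiment is given below, assuming our own value for the division guard $\lambda$ (the paper does not report it) and logging the full squared gradient norm every 100 iterations to keep the cost manageable:

```python
import numpy as np

def run_experiment(d, T=60_000, eta=0.004, beta=0.004, lam=1e-8, seed=0):
    """Sketch (ours) of the Appendix B logistic-regression experiment."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((100, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # a_i on the unit sphere
    x, m, v = np.zeros(d), np.zeros(d), np.zeros(d)
    history = []
    for t in range(1, T + 1):
        a = A[rng.integers(100)]                    # sample i_t uniformly
        g = a / (1.0 + np.exp(-(a @ x)))            # grad of log(1 + exp(a^T x))
        m = g if t == 1 else (1.0 - beta) * m + beta * g
        v += g ** 2
        x -= eta / np.sqrt(v + lam) * m             # Algorithm 1 update
        if t % 100 == 0:                            # log ||grad f(x_t)||^2
            full_grad = A.T @ (1.0 / (1.0 + np.exp(-(A @ x)))) / 100.0
            history.append(float(full_grad @ full_grad))
    return history
```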

References

1. Hong, Y.; Lin, J. Revisiting Convergence of AdaGrad with Relaxed Assumptions. arXiv 2024, arXiv:2402.13794.
2. Liu, Z.; Nguyen, T.D.; Nguyen, T.H.; Ene, A.; Nguyen, H. High probability convergence of stochastic gradient methods. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 21884–21914.
3. Kavis, A.; Levy, K.Y.; Bach, F.; Cevher, V. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Adv. Neural Inf. Process. Syst. 2019, 32.
4. Bach, F.; Levy, K.Y. A universal algorithm for variational inequalities adaptive to smoothness and noise. In Proceedings of the Annual Conference on Learning Theory, PMLR, Phoenix, AZ, USA, 25–28 June 2019; pp. 164–194.
5. Ene, A.; Nguyen, H.L.; Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. Proc. AAAI Conf. Artif. Intell. 2021, 35, 7314–7321.
6. Li, X.; Orabona, F. A high probability analysis of adaptive sgd with momentum. arXiv 2020, arXiv:2007.14294.
7. Défossez, A.; Bottou, L.; Bach, F.; Usunier, N. A simple convergence proof of adam and adagrad. arXiv 2020, arXiv:2003.02395.
8. Kavis, A.; Levy, K.Y.; Cevher, V. High probability bounds for a class of nonconvex algorithms with adagrad stepsize. arXiv 2022, arXiv:2204.02833.
9. Wang, B.; Zhang, H.; Ma, Z.; Chen, W. Convergence of adagrad for non-convex objectives: Simple proofs and relaxed assumptions. In Proceedings of the Annual Conference on Learning Theory, PMLR, Bangalore, India, 12–15 July 2023; pp. 161–190.
10. Faw, M.; Rout, L.; Caramanis, C.; Shakkottai, S. Beyond uniform smoothness: A stopped analysis of adaptive sgd. In Proceedings of the Annual Conference on Learning Theory, PMLR, Bangalore, India, 12–15 July 2023; pp. 89–160.
11. Attia, A.; Koren, T. SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance. arXiv 2023, arXiv:2302.08783.
12. Patel, V.; Zhang, S.; Tian, B. Global convergence and stability of stochastic gradient descent. Adv. Neural Inf. Process. Syst. 2022, 35, 36014–36025.
13. Patel, V. Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions. Math. Program. 2022, 195, 693–734.
14. Li, H.; Qian, J.; Tian, Y.; Rakhlin, A.; Jadbabaie, A. Convex and non-convex optimization under generalized smoothness. Adv. Neural Inf. Process. Syst. 2024, 36.
15. Li, H.; Rakhlin, A.; Jadbabaie, A. Convergence of adam under relaxed assumptions. Adv. Neural Inf. Process. Syst. 2024, 36.
16. Zhang, B.; Jin, J.; Fang, C.; Wang, L. Improved analysis of clipping algorithms for non-convex optimization. Adv. Neural Inf. Process. Syst. 2020, 33, 15511–15521.
17. Zhang, J.; He, T.; Sra, S.; Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv 2019, arXiv:1905.11881.
18. Khaled, A.; Richtárik, P. Better theory for SGD in the nonconvex world. arXiv 2020, arXiv:2002.03329.
19. Carmon, Y.; Duchi, J.C.; Hinder, O.; Sidford, A. Lower bounds for finding stationary points I. Math. Program. 2020, 184, 71–120.
Table 1. Summary of related works. ADM refers to Adagrad with momentum; ADS refers to Adagrad. $(L_0,0)$-smoothness denotes the standard smoothness assumption, and $(L_0,L_1)$-smoothness is its relaxed version (see Assumption 2). We only reveal the total number of Adagrad iterations $T$ and the input dimension $d$ in $\mathcal{O}(\cdot)$ for the convergence rate.

| Reference | Alg. | Smoothness | Convergence Rate |
|---|---|---|---|
| Hong and Lin [1] | ADM | $(L_0,L_1)$ | $\mathcal{O}\left(\frac{d^2\log T}{T}+\frac{d\log T}{\sqrt{T}}\right)$ |
| Liu et al. [2] | ADS | $(L_0,0)$ | $\mathcal{O}\left(\frac{d\log T}{\sqrt{T}}\right)$ |
| Li and Orabona [6] | ADM | $(L_0,0)$ | $\mathcal{O}\left(\frac{d\log T}{\sqrt{T}}\right)$ |
| Défossez et al. [7] | ADM | $(L_0,0)$ | $\mathcal{O}\left(\frac{d\log T}{\sqrt{T}}\right)$ |
| Wang et al. [9] | ADS | $(L_0,0)$ | $\mathcal{O}\left(\frac{d\log T}{\sqrt{T}}\right)$ |
| Ours (Theorem 1) | ADM | $(L_0,L_1)$ | $\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$ |