Analysis of a Two-Step Gradient Method with Two Momentum Parameters for Strongly Convex Unconstrained Optimization

Krivovichev, Gerasim V.; Sergeeva, Valentina Yu.

doi:10.3390/a17030126

by

Gerasim V. Krivovichev

^*

and

Valentina Yu. Sergeeva

Faculty of Applied Mathematics and Control Processes, Saint Petersburg State University, 7/9 Universitetskaya nab., Saint Petersburg 199034, Russia

^*

Author to whom correspondence should be addressed.

Algorithms 2024, 17(3), 126; https://doi.org/10.3390/a17030126

Submission received: 24 February 2024 / Revised: 14 March 2024 / Accepted: 15 March 2024 / Published: 18 March 2024

(This article belongs to the Special Issue Numerical Optimization in Honor of the 60th Birthday of Marko M. Mäkelä)

Abstract

The paper is devoted to the theoretical and numerical analysis of the two-step method, constructed as a modification of Polyak’s heavy ball method with the inclusion of an additional momentum parameter. For the quadratic case, the convergence conditions are obtained with the use of the first Lyapunov method. For non-quadratic, sufficiently smooth, strongly convex functions, these conditions guarantee local convergence. An approach to finding optimal parameter values based on the solution of a constrained optimization problem is proposed. The effect of the additional parameter on the convergence rate is analyzed. With the use of an ordinary differential equation equivalent to the method, the damping effect of this parameter on the oscillations, which are typical for the non-monotonic convergence of the heavy ball method, is demonstrated. In different numerical examples for non-quadratic convex and non-convex test functions and machine learning problems (regularized smoothed elastic net regression, logistic regression, and recurrent neural network training), the positive influence of the additional parameter on the convergence process is demonstrated.

Keywords:

convex optimization; gradient descent; heavy ball method

1. Introduction

Nowadays, many problems in machine learning [1], optimal control [2], applied linear algebra [3], system identification [4], and other applications lead to problems of unconstrained convex optimization. The theory of convex optimization is well-developed [5,6,7], but there remain methods that merit further analysis or improvement. A typical example of an improvement of the standard gradient descent method is the heavy ball method (HBM), proposed by B.T. Polyak in [7,8], which is based on the inclusion of a momentum term. The local convergence of this method for functions from

F_{l, L}^{2, 1}

(twice continuously differentiable, l-strongly convex functions with Lipschitz gradient) was proved in [7]. Recently, Ghadimi et al. [9] formulated the conditions of global linear convergence. Aujol et al. [10] analyzed the dynamical system associated with the HBM in order to obtain optimal convergence rates for convex functions with some additional properties, such as quasi-strong and strong convexity.

In the last few decades, extended modifications of the HBM have been developed, and interesting results on their behavior have been obtained. Bhaya and Kaszkurewicz [11] demonstrated that the HBM for minimization of quadratic functions can be considered a stationary version of the conjugate gradient method. Recently, Goujaud et al. [12] proposed an adaptive modification of the HBM with Polyak stepsizes and demonstrated that this method can be considered a variant of the conjugate gradient method for quadratic problems, having many advantages, such as finite-time convergence and instant optimality. Danilova et al. [13] demonstrated the non-monotonic convergence of the HBM and analyzed the peak effect for ill-conditioned problems. To damp this effect, an averaged HBM was constructed in [14]. The global and local convergence of the momentum method for semialgebraic functions with locally Lipschitz gradients was demonstrated in [15]. Wang et al. [16] used the theory of PID controllers for the construction of momentum methods for deep neural network training. A quasi-hyperbolic momentum method with two parameters, the momentum and an additional parameter that performs a sort of interpolation between gradient descent and the HBM, was presented in [17]. A complete analysis of such algorithms for deterministic and stochastic cases was performed in [18], where the influence of parameters on stability and convergence rate was analyzed. Sutskever et al. [19] proposed a stochastic version of Nesterov’s method, where the momentum was included bilinearly with the step. An improved accelerated momentum method for stochastic optimization was presented in [20].

In [21], the authors investigated the ravine method and momentum methods from a dynamical systems perspective. A high-resolution differential equation describing these methods was proposed, and the damping effect of the additional term driven by the Hessian was demonstrated. Similar results for Hessian damping were obtained in [22] for the proximal methods. A continuous system with damping for primal-dual convex problems was constructed in [23]. Alecsa et al. [24] investigated a perturbed heavy ball system with a vanishing damping term that contained a Tikhonov regularization term. It was demonstrated that the presence of a regularization term led to a strong convergence of the descent trajectories in the case of smooth functions. An analysis of momentum methods from the standpoint of Hamiltonian dynamical systems was presented in [25].

Yan et al. [26] proposed a modification of the HBM with an additional parameter and an additional internal stage. In [27], a method with three momentum parameters (the so-called triple momentum method) was presented. This method has been classified as the fastest known globally convergent first-order method. In [28], the integral quadratic constraint method used in robust control theory was applied to the construction of first-order methods. A method with two momentum parameters was introduced. In [29], this scheme was analyzed for a strongly convex function with a Lipschitz gradient, and the range of the possible convergence rate was presented.

Despite the results obtained for the different momentum methods mentioned above, the roles of the parameters in computational schemes with momentum are still not fully understood. As mentioned by investigators, understanding the role of momentum remains important for practical problems. For example, in [19], the authors demonstrated that momentum is critical for good performance in deep learning problems. However, for another modification of the HBM, Ma and Yarats [17] demonstrated that momentum in practice can have a minor effect, which is insufficient for acceleration of convergence. Therefore, additional theoretical analysis of methods with momentum remains relevant.

The presented paper is devoted to the analysis of a method with two momentum parameters, as proposed in [28]. For the functions from

F_{l, L}^{1, 1}

(l-strongly convex L-smooth functions), this method was analyzed in [29], where global convergence for the special choice of parameters was proven. In the presented paper, we try to focus our attention on the case of quadratic functions from

F_{l, L}^{1, 1}

, in order to obtain the inequalities for parameters that guarantee global convergence, to obtain optimal values of the parameters, and to understand the effect of an additional momentum parameter on the convergence rate. Convergence conditions are obtained, and corresponding theorems are formulated. The constrained optimization problem for obtaining optimal parameters is stated. As demonstrated in numerical experiments, in the quadratic case, the inclusion of an additional parameter does not improve the convergence rate. The role of this parameter is demonstrated with the use of the ordinary differential equation (ODE), which is equivalent to the method. This parameter provides an additional damping effect on the oscillations, typical for the HBM, according to its non-monotonic convergence, and can be useful in practice. In the numerical experiments for non-quadratic functions, it is demonstrated that this parameter also provides damping of the oscillations and leads to faster convergence to the minimum in comparison with the standard HBM. Additionally, the effect of this parameter is demonstrated for the non-convex function that arises in recurrent neural network training.

The paper has the following structure: Section 2 is devoted to the theoretical analysis of the method as applied to strongly convex quadratic functions. The effect of the additional momentum parameter is analyzed. The results of the numerical experiments for non-quadratic, strongly convex, and non-convex functions are presented in Section 3. Some concluding remarks are made in Section 4.

2. Analysis of Two-Step Method

Let the scalar function

f : R^{d} \to R

from

F_{l, L}^{1, 1}

be considered. We try to find its minimizer

x^{*}

. So the unconstrained minimization problem is stated:

f (x) \to min_{x \in R^{d}} .

(1)

The gradient descent method (GD) for numerical solution of (1) is written as

x^{k + 1} = x^{k} - h \nabla f (x^{k}),

(2)

where

h > 0

is a step. If we additionally assume that

f (x) \in F_{l, L}^{2, 1}

, the optimal step and convergence rate for (2) are presented as in [7]

h_{o p t} = \frac{2}{l + L}, ρ_{o p t} = \frac{κ - 1}{κ + 1},

where

κ = L / l

is the condition number and

l, L

can be associated with the minimum and maximum eigenvalues, respectively, of the Hessian of

f (x)

.

Polyak’s heavy ball method is presented as in [7,8]

x^{k + 1} = x^{k} - h \nabla f (x^{k}) + β (x^{k} - x^{k - 1}),

(3)

where

β \in [0, 1)

is the momentum. The optimal values in the case of strongly convex quadratic function are written as in [7]

h_{o p t} = \frac{4}{{(\sqrt{L} + \sqrt{l})}^{2}}, β_{o p t} = {(\frac{\sqrt{κ} - 1}{\sqrt{κ} + 1})}^{2}, ρ_{o p t} = \frac{\sqrt{κ} - 1}{\sqrt{κ} + 1} .

Lessard et al. [28] proposed the following method with an additional momentum parameter:

x^{k + 1} = x^{k} - h \nabla f (y^{k}) + β_{1} (x^{k} - x^{k - 1}), y^{k} = x^{k} + β_{2} (x^{k} - x^{k - 1}) .

(4)

As can be seen, for the case of

β_{2} = 0

, method (4) leads to (3). In [29], the global convergence of this method for

f (x) \in F_{l, L}^{1, 1}

with the convergence rate

ρ \in [1 - \frac{1}{\sqrt{κ}}, 1 - \frac{1}{κ}]

is demonstrated for the following specific choice of parameters:

h = \frac{κ {(1 - ρ)}^{2} (1 + ρ)}{L}, β_{1} = \frac{κ ρ^{3}}{κ - 1}, β_{2} = \frac{ρ^{3}}{(κ - 1) {(1 - ρ)}^{2} (1 + ρ)} .
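Iteration (4) can be illustrated by a minimal sketch in plain Python. The function name `two_momentum`, the start-up convention x^{-1} = x^{0}, and the small test quadratic are assumptions made for this illustration only; they are not taken from [28] or [29]. Setting `beta2 = 0` gives the heavy ball method (3).

```python
def two_momentum(grad, x0, h, beta1, beta2, iters):
    """Iterate method (4); grad maps a list of floats to a list of floats.

    Assumption: the iteration is started with x^{-1} = x^{0}.
    """
    x_prev = list(x0)
    x = list(x0)
    for _ in range(iters):
        # y^k = x^k + beta2 * (x^k - x^{k-1})
        y = [xi + beta2 * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad(y)
        # x^{k+1} = x^k - h * grad f(y^k) + beta1 * (x^k - x^{k-1})
        x_next = [xi - h * gi + beta1 * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


# Illustrative quadratic f(x) = 0.5 * (x1^2 + 10 * x2^2), so l = 1 and L = 10.
def quad_grad(x):
    return [x[0], 10.0 * x[1]]


# beta2 = 0 is the heavy ball method (3); beta2 > 0 is method (4).
x_hb = two_momentum(quad_grad, [1.0, 1.0], h=0.1, beta1=0.5, beta2=0.0, iters=200)
x_tm = two_momentum(quad_grad, [1.0, 1.0], h=0.1, beta1=0.5, beta2=0.3, iters=200)
print(x_hb, x_tm)
```

Both runs approach the minimizer x^{*} = 0 of the test quadratic; the parameter values are illustrative and are not the tuned values of [29].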

In the theoretical part of the presented paper, we try to analyze the influence of parameter

β_{2}

on the convergence of method (4) for the case of a quadratic function, written as

f (x) = \frac{1}{2} (x, A x) - (b, x),

(5)

where

b \in R^{d}

, A is a positive definite and symmetric matrix with eigenvalues

0 < l = λ_{1} \leq λ_{2} \leq \dots \leq λ_{d} = L

. The gradient of this function is computed as

\nabla f (x) = A x - b

, and

x^{*}

is treated as the solution of the linear system

A x = b

. The obtained results can be considered as the results of the local convergence of method (4), applied to

f (x) \in F_{l, L}^{2, 1}

, because in the neighborhood of

x^{*}

f (x)

from this class can be presented as (5) with

A = \nabla^{2} f (x^{*})

. This approach for obtaining local convergence conditions and optimal parameters values is widely used in literature [7,18].

Method (4), when applied to (5), leads to the following difference system:

x^{k + 1} = (E - h A) x^{k} + (β_{1} E - β_{2} h A) (x^{k} - x^{k - 1}) + h b,

(6)

where E is the identity matrix.

2.1. Convergence Conditions

The following theorem on the convergence of the iterative method (6) can be formulated:

Theorem 1.

Let

h > 0

,

β_{1} \in [0, 1)

and

β_{2} \geq 0

, and let the following inequality hold:

h < \frac{2 (1 + β_{1})}{(1 + 2 β_{2}) L} .

(7)

Then, method (6) converges to

x^{*}

for any

x^{0}

.

Proof of Theorem 1.

(1): Let the new variable $z^{k} = {(x^{k} - x^{*}, x^{k - 1} - x^{*})}^{T}$ be introduced. Then, method (6) can be rewritten as a single-step method:

$z^{k + 1} = T z^{k},$

where matrix T is written as

$T = (\begin{matrix} (1 + β_{1}) E - h (1 + β_{2}) A & h β_{2} A - β_{1} E \\ E & 0_{d \times d} \end{matrix}) .$

This method converges if, and only if, $r (T)$ (spectral radius of matrix T) is strictly less than unity [3].
Matrix A can be represented by the spectral decomposition $A = S Λ S^{T}$ , where $Λ$ is the diagonal matrix of eigenvalues of A, S is a matrix of eigenvectors, and $S S^{T} = S^{T} S = E$ . The following transformation of T can be introduced: $\bar{T} = Σ^{T} T Σ$ , where

$Σ = (\begin{matrix} S & 0_{d \times d} \\ 0_{d \times d} & S \end{matrix}), \bar{T} = (\begin{matrix} (1 + β_{1}) E - h (1 + β_{2}) Λ & h β_{2} Λ - β_{1} E \\ E & 0_{d \times d} \end{matrix}) .$

Matrix $\bar{T}$ has the same eigenvalues as matrix T, because $Σ$ is orthogonal and the transformation is therefore a similarity.
Let us demonstrate that $\bar{T}$ has the same spectrum as the following matrix:

$\tilde{T} = (\begin{matrix} T_{1} & 0_{2 \times 2} & \dots & 0_{2 \times 2} \\ 0_{2 \times 2} & T_{2} & \dots & 0_{2 \times 2} \\ \dots & \dots & \dots & \dots \\ 0_{2 \times 2} & 0_{2 \times 2} & \dots & T_{d} \end{matrix}) .$

where $T_{i}$ are $2 \times 2$ matrices, which are presented as

$T_{i} = (\begin{matrix} 1 + β_{1} - h (1 + β_{2}) λ_{i} & h β_{2} λ_{i} - β_{1} \\ 1 & 0 \end{matrix}) .$

Matrix $\bar{T} - ζ E$ is presented as

$\bar{T} - ζ E = (\begin{matrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{matrix}),$

where $T_{11} = (1 + β_{1}) E - h (1 + β_{2}) Λ - ζ E$ , $T_{12} = h β_{2} Λ - β_{1} E$ , $T_{21} = E$ , $T_{22} = - ζ E$ . The determinant of this matrix is computed by the following rule [30]:

$det (\bar{T} - ζ E) = det (T_{11}) det (T_{22} - T_{21} T_{11}^{- 1} T_{12}) =$

$det (T_{11}) det (\begin{matrix} - ζ + \frac{β_{1} - h β_{2} λ_{1}}{1 + β_{1} - h (1 + β_{2}) λ_{1} - ζ} & 0 & \dots & 0 \\ 0 & - ζ + \frac{β_{1} - h β_{2} λ_{2}}{1 + β_{1} - h (1 + β_{2}) λ_{2} - ζ} & \dots & 0 \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & - ζ + \frac{β_{1} - h β_{2} λ_{d}}{1 + β_{1} - h (1 + β_{2}) λ_{d} - ζ} \end{matrix}) =$

$(β_{1} - h β_{2} λ_{1} - ζ χ_{1}) \dots (β_{1} - h β_{2} λ_{d} - ζ χ_{d}),$

where $χ_{i} = 1 + β_{1} - h (1 + β_{2}) λ_{i} - ζ$ , $i = \overline{1, d}$ .
The determinant of the block-diagonal matrix $\tilde{T} - ζ E$ is written as

$det (\tilde{T} - ζ E) = det (T_{1} - ζ E_{2 \times 2}) det (T_{2} - ζ E_{2 \times 2}) \dots det (T_{d} - ζ E_{2 \times 2}),$

and, as can be seen, it is equal to $det (\bar{T} - ζ E)$ . So, both matrices have the same eigenvalues $ζ_{k}$ , $k = \overline{1, 2 d}$ , and these eigenvalues are computed as the eigenvalues of matrices $T_{i}$ .
(2): According to the result presented above, the analysis of eigenvalues of T leads to the analysis of roots of an algebraic equation:

$ζ^{2} - (1 + β_{1} - h (1 + β_{2}) λ) ζ + β_{1} - h β_{2} λ = 0 .$

(8)

In order to guarantee convergence, the parameters should be chosen so that $| ζ_{1, 2} | < 1$ . To obtain these conditions, we perform a conformal mapping of the interior of the unit circle $\{ζ : | ζ | < 1\}$ to the set $Q = \{q : Re (q) < 0\}$ with the use of the following function:

$ζ = \frac{q + 1}{q - 1} .$

(9)

After substitution of (9) into (8), the following equation is obtained:

$h λ q^{2} + 2 (1 - β_{1} + β_{2} λ h) q + 2 (1 + β_{1} - β_{2} λ h) - h λ = 0 .$

(10)

The conditions on the coefficients of (10) that guarantee $q_{i} \in Q$ are provided by the Routh–Hurwitz criterion [30,31]. The Hurwitz matrix for (10) is written as

$(\begin{matrix} 2 (1 - β_{1} + β_{2} λ h) & h λ \\ 0 & 2 (1 + β_{1} - β_{2} λ h) - h λ \end{matrix}) .$

The conditions of the Routh–Hurwitz criterion lead to two inequalities:

$1 - β_{1} + β_{2} λ h > 0,$

(11)

$2 (1 + β_{1} - β_{2} λ h) - h λ > 0 .$

(12)

Inequality (11) is valid $\forall λ \in [l, L]$ , $\forall h > 0$ according to the ranges of values of $β_{i}$ stated in the conditions of the theorem. Inequality (12) is rewritten as

$h < \frac{2 (1 + β_{1})}{λ (1 + 2 β_{2})},$

and it is valid for all $λ \in [l, L]$ when h is chosen according to Inequality (7), because the right-hand side is smallest at $λ = L$ . Thus, condition (7) guarantees that $q_{i} \in Q$ and, as a consequence, that $| ζ_{i} | < 1$ under the stated conditions. This leads to the convergence of (6) for any $x^{0} \in R^{d}$ .

□
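The statement of Theorem 1 can be checked numerically by computing the roots of (8) directly. The sketch below is an illustration under assumed values: the spectrum `eigs`, the momentum values, and all function names are hypothetical and chosen only for this check.

```python
import cmath


def roots_eq8(h, beta1, beta2, lam):
    """Roots of (8): z^2 - A1 * z + A2 = 0."""
    a1 = 1.0 + beta1 - h * (1.0 + beta2) * lam
    a2 = beta1 - h * beta2 * lam
    sq = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return (a1 + sq) / 2.0, (a1 - sq) / 2.0


def step_bound(beta1, beta2, L):
    """Right-hand side of condition (7)."""
    return 2.0 * (1.0 + beta1) / ((1.0 + 2.0 * beta2) * L)


def is_stable(h, beta1, beta2, eigenvalues):
    """True if all roots of (8) lie strictly inside the unit circle."""
    return all(abs(z) < 1.0
               for lam in eigenvalues
               for z in roots_eq8(h, beta1, beta2, lam))


eigs = [1.0, 2.5, 10.0]          # illustrative spectrum with l = 1, L = 10
b1, b2 = 0.6, 0.4                # illustrative momentum values
h_max = step_bound(b1, b2, max(eigs))

inside = is_stable(0.9 * h_max, b1, b2, eigs)    # step satisfying (7)
outside = is_stable(1.1 * h_max, b1, b2, eigs)   # step violating (7)
print(inside, outside)
```

For a step below the bound (7) all roots stay inside the unit circle, while for a step above it Inequality (12) fails at λ = L and one root leaves the unit circle.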

2.2. Analysis of Convergence Rate

Let us analyze the convergence rate of method (6). At first, let us obtain the expression for the spectral radius of matrix T. Let

s = (h, β_{1}, β_{2})

and the spectral radius be presented as the function

r (s, λ)

, where

λ \in [l, L]

. The expression for r can be obtained with the use of an expression for the roots of (8):

ζ_{1, 2} = \frac{1}{2} (A_{1} \pm \sqrt{D}),

where

D = A_{1}^{2} - 4 A_{2}

,

A_{1} = 1 + β_{1} - h (1 + β_{2}) λ

,

A_{2} = β_{1} - h λ β_{2}

and is written as

r (s, λ) = max (| ζ_{1} |, | ζ_{2} |) .

(13)

If

r (s, λ)

is considered a function of

λ

, the following theorem on its extremal property can be formulated:

Theorem 2.

The maximum value of

r (s, λ)

as a function of

λ \in [l, L]

takes place for

λ = l

or

λ = L

.

Proof of Theorem 2.

(1): Let us obtain the expression for $r (s, λ)$ . For $D > 0$ in the case of $A_{1} > 0$ , the following inequality takes place: $A_{1} + \sqrt{D} > 0$ and $| A_{1} - \sqrt{D} | = A_{1} - \sqrt{D}$ if $A_{1} > \sqrt{D}$ , and in this case $A_{1} + \sqrt{D} > A_{1} - \sqrt{D} > 0$ . If $A_{1} - \sqrt{D} < 0$ , we obtain that $| A_{1} - \sqrt{D} | = \sqrt{D} - A_{1}$ and $A_{1} + \sqrt{D} > \sqrt{D} - A_{1}$ . So, if $D > 0$ and $A_{1} > 0$ , we have that $r (s, λ) = \frac{1}{2} (A_{1} + \sqrt{D})$ .
If $D > 0$ and $A_{1} < 0$ , we have that $| A_{1} - \sqrt{D} | = \sqrt{D} - A_{1}$ , and for $| A_{1} + \sqrt{D} |$ we have that if $A_{1} + \sqrt{D} > 0$ , then $\sqrt{D} - A_{1} > A_{1} + \sqrt{D}$ . If $A_{1} + \sqrt{D} < 0$ , then $| A_{1} + \sqrt{D} | = - A_{1} - \sqrt{D}$ and $\sqrt{D} - A_{1} > - A_{1} - \sqrt{D}$ . So if $A_{1} < 0$ and $D > 0$ , we obtain that $r (s, λ) = \frac{1}{2} (\sqrt{D} - A_{1})$ .
For $D < 0$ , it is easy to see that $r (s, λ) = \sqrt{A_{2}}$ . The case of $D = 0$ and case $A_{1} = 0$ are trivial to analyze. Thus, it is demonstrated that

$r (s, λ) = \{\begin{matrix} \frac{1}{2} (A_{1} + \sqrt{D}), A_{1} \geq 0, D \geq 0, \\ \frac{1}{2} (\sqrt{D} - A_{1}), A_{1} < 0, D \geq 0, \\ \sqrt{A_{2}}, D < 0 . \end{matrix}$
(2): Let us analyze the behavior of $r (s, λ)$ for $λ \in [l, L]$ . The expression for D is written as

$D = {(1 + β_{2})}^{2} λ^{2} h^{2} - 2 ((1 + β_{1}) (1 + β_{2}) - 2 β_{2}) λ h + {(1 - β_{1})}^{2} .$

So, the non-negative values of D are associated with the solutions of the following inequality:

${(1 + β_{2})}^{2} t^{2} - 2 ((1 + β_{1}) (1 + β_{2}) - 2 β_{2}) t + {(1 - β_{1})}^{2} \geq 0 .$

The corresponding discriminant is equal to $16 (β_{1} + β_{2} (β_{1} - 1))$ . As can be seen, solutions to this inequality exist, when

$β_{2} \leq \frac{β_{1}}{1 - β_{1}} .$

(14)

The opposite inequality guarantees that it is valid for all $λ > 0$ . For analysis of the general situation of the sign of D this restriction is too strict, so we consider the case of condition (14).
The case of $D < 0$ leads to the investigation of function $ψ (λ) = \sqrt{A_{2} (λ)} = \sqrt{β_{1} - λ h β_{2}}$ . Condition $A_{2} (λ) > 0$ leads to the restriction $λ < \frac{β_{1}}{h β_{2}}$ , which is valid for $h < \frac{β_{1}}{L β_{2}}$ . It should be noted that this condition correlates with (7) for values of $β_{1} \in [0, 1)$ , $β_{2} \geq 0$ under condition $β_{2} > \frac{β_{1}}{2}$ . The derivative of $ψ (λ)$ is written as

$ψ^{'} (λ) = \frac{- h β_{2}}{2 \sqrt{A_{2} (λ)}}$

and for $β_{2} > 0$ , it is strictly negative, so $ψ$ decreases on the considered interval and its maximum is equal to $ψ (l) < ψ (0) = \sqrt{β_{1}} < 1$ .
For $D = 0$ , we obtain that $r = \frac{1}{2} | A_{1} (λ) | = \frac{1}{2} | 1 + β_{1} - h (1 + β_{2}) λ |$ . The case $A_{1} > 0$ corresponds to the interval $λ \in (0, \frac{1 + β_{1}}{h (1 + β_{2})}]$ , where r decreases, and the case $A_{1} < 0$ corresponds to $λ > \frac{1 + β_{1}}{h (1 + β_{2})}$ , where r increases. So, the maximum of r in this situation is attained at the point $λ = l$ or $λ = L$ .
For $D > 0$ , two situations should be considered. For $A_{1} \geq 0$ , the behavior of function $φ_{1} (λ) = \frac{1}{2} (A_{1} (λ) + \sqrt{A_{1}^{2} (λ) - 4 A_{2} (λ)})$ should be analyzed. Its derivative is written as

$φ_{1}^{'} (λ) = \frac{1}{2} (A_{1}^{'} (λ) + \frac{A_{1}^{'} (λ) A_{1} (λ) - 2 A_{2}^{'} (λ)}{\sqrt{A_{1}^{2} (λ) - 4 A_{2} (λ)}})$

and according to $A_{1}^{'} (λ) = - h (1 + β_{2}) < 0$ , we can see that if $A_{1}^{'} A_{1} - 2 A_{2}^{'} \leq 0$ , $φ_{1}^{'} (λ)$ will be negative, so $φ_{1}$ decreases. Let us determine where this inequality is valid:

$A_{1}^{'} A_{1} - 2 A_{2}^{'} \leq 0 \Leftrightarrow - (1 + β_{1}) (1 + β_{2}) + h λ {(1 + β_{2})}^{2} + 2 β_{2} \leq 0,$

so

$λ \leq η (β_{1}, β_{2}) = \frac{(1 + β_{1}) (1 + β_{2}) - 2 β_{2}}{h {(1 + β_{2})}^{2}} .$

Function $η$ is strictly positive, when

$β_{2} < \frac{1 + β_{1}}{1 - β_{1}} .$

(15)

As can be seen, this inequality is valid when condition (14) is realized on values of $β_{2}$ . So, in the interval $(0, η]$ , function $φ_{1} (λ)$ decreases.
Positive values of $φ_{1}^{'} (λ)$ can be realized when the following inequality is valid:

$A_{1}^{'} \sqrt{A_{1}^{2} - 4 A_{2}} + A_{1}^{'} A_{1} - 2 A_{2}^{'} > 0,$

(16)

which leads to $A_{1}^{'} A_{1} - 2 A_{2}^{'} > - A_{1}^{'} \sqrt{A_{1}^{2} - 4 A_{2}}$ . According to $A_{1}^{'} = - h (1 + β_{2}) < 0$ , this leads to the evident inequality $A_{1}^{'} A_{1} - 2 A_{2}^{'} > 0$ , which takes place for $λ > η (β_{1}, β_{2})$ under condition (15).
Let us demonstrate that (16) correlates with (14): after squaring both sides, Inequality (16) leads to $- A_{1}^{'} A_{2}^{'} A_{1} + A_{2}^{' 2} > - A_{2} A_{1}^{' 2}$ , which leads to the following inequality:

$- (1 + β_{2} + β_{1} + β_{1} β_{2}) β_{2} + β_{2}^{2} > - β_{1} - 2 β_{1} β_{2} - β_{1} β_{2}^{2},$

which is equivalent to

$β_{2} + β_{1} β_{2} < β_{1} + 2 β_{1} β_{2},$

which is equivalent to (14).
It is easy to see that

$\frac{(1 + β_{1}) (1 + β_{2}) - 2 β_{2}}{{(1 + β_{2})}^{2}} < \frac{1 + β_{1}}{1 + β_{2}},$

(17)

so, when $λ \in (0, \frac{(1 + β_{1})}{h (1 + β_{2})}]$ (corresponds to $A_{1} \geq 0$ ), r decreases when $λ \in (0, η (β_{1}, β_{2})]$ and increases when $λ > η (β_{1}, β_{2})$ and its maximum is realized in the boundary point.
The case $A_{1} < 0$ leads to the analysis of function $φ_{2} (λ) = \frac{1}{2} (\sqrt{A_{1}^{2} (λ) - 4 A_{2} (λ)} - A_{1} (λ))$ on the interval, defined by inequality (see case $D = 0$ )

$λ > \frac{1 + β_{1}}{h (1 + β_{2})} .$

(18)

The first derivative of $φ_{2} (λ)$ is written as

$φ_{2}^{'} (λ) = \frac{1}{2} (\frac{A_{1}^{'} (λ) A_{1} (λ) - 2 A_{2}^{'} (λ)}{\sqrt{A_{1}^{2} (λ) - 4 A_{2} (λ)}} - A_{1}^{'} (λ)) .$

According to $- A_{1}^{'} = h (1 + β_{2}) > 0$ , we obtain that if $A_{1}^{'} A_{1} - 2 A_{2}^{'} > 0$ (this takes place when $λ > η (β_{1}, β_{2})$ ), this derivative is strictly positive. According to (17), it is valid for the interval defined by (18), so $φ_{2} (λ)$ and, as a consequence, function r increases in the case of $A_{1} < 0$ corresponding to (18) and its maximum takes place in the right boundary point $λ = L$ , if intervals $[l, L]$ and (18) have an intersection.
Thus, for all values of D, we can see that r reaches its maximum value at the boundaries of interval $[l, L]$ .

□
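The extremal property stated in Theorem 2 can be illustrated by evaluating r(s, λ) on a fine grid over [l, L] and comparing the grid maximum with the values at the endpoints. The sketch below uses assumed values of l, L and of the triples (h, β_1, β_2), all of which satisfy (7); the function names are hypothetical.

```python
import cmath


def spectral_radius(h, beta1, beta2, lam):
    """r(s, lam): largest modulus among the roots of (8)."""
    a1 = 1.0 + beta1 - h * (1.0 + beta2) * lam
    a2 = beta1 - h * beta2 * lam
    sq = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return max(abs((a1 + sq) / 2.0), abs((a1 - sq) / 2.0))


def max_on_grid(h, beta1, beta2, l, L, n=2001):
    """Largest value of r on a uniform grid over [l, L] (endpoints included)."""
    return max(spectral_radius(h, beta1, beta2, l + (L - l) * i / (n - 1))
               for i in range(n))


l, L = 1.0, 100.0                    # illustrative bounds of the spectrum
samples = [(0.015, 0.8, 0.0),        # (h, beta1, beta2); each satisfies (7)
           (0.010, 0.8, 0.5),
           (0.005, 0.3, 2.0)]

for h, b1, b2 in samples:
    grid_max = max_on_grid(h, b1, b2, l, L)
    end_max = max(spectral_radius(h, b1, b2, l),
                  spectral_radius(h, b1, b2, L))
    print(h, b1, b2, grid_max, end_max)
```

In each case the grid maximum coincides, up to rounding, with the larger of the two endpoint values, so only r(s, l) and r(s, L) need to be evaluated when the convergence rate is estimated.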

Notation 1.

Formulated theorems for the case of function (5) provide the conditions that guarantee global convergence [7]:

| | x^{k} - x^{*} | | \leq {(ρ + ε)}^{k} | | x^{0} - x^{*} | |, \forall ε \in (0, 1 - ρ), \forall k \geq 0,

where

ρ = max (r (s, l), r (s, L))

.

If the non-quadratic

f (x) \in F_{l, L}^{2, 1}

is considered, then these conditions provide a local convergence (see Theorem 1 from subsection 2.1.2 in [7]). Any sufficiently smooth function

f (x)

in the neighborhood of

x^{*}

can be presented as

f (x) \approx f (x^{*}) + \frac{1}{2} (\nabla^{2} f (x^{*}) (x - x^{*}), x - x^{*}),

and according to the following property:

f (x^{k}) - f (x^{*}) \leq \frac{L}{2} | | x^{k} - x^{*} {| |}^{2},

we can see that if ∃

δ > 0

,

| | x^{0} - x^{*} | | \leq δ

, then for method (4) the following inequality is obtained

\forall k \geq 0

:

f (x^{k}) - f (x^{*}) \leq \frac{L}{2} δ^{2} {(ρ + ε)}^{2 k}, \forall ε \in (0, 1 - ρ) .

Notation 2.

Theorem 2 provides an approach to obtain optimal parameters with the solution of the following problem for obtaining an optimal convergence rate:

ρ_{o p t} = min_{s \in Σ \subset R^{3}} max (r (s, l), r (s, L)),

(19)

where Σ is defined as:

Σ = \{(β_{1}, β_{2}, h) : β_{1} \in [0, 1), β_{2} \geq 0, h \in (0, \frac{2 (1 + β_{1})}{L (1 + 2 β_{2})}]\} .

(20)

Similar minimax problems arise in the theory of the standard HBM (3) [7] and multiparametric method in [18].

2.3. Optimal Parameters

In this subsection, we discuss the solution of the minimization problem (19), (20) and of the following problem, which is stated in order to analyze the effect of parameter

β_{2}

:

F (β_{1}, h) = max (r (h, β_{1}, β_{2}, l), r (h, β_{1}, β_{2}, L)) \to min_{Δ},

(21)

where

Δ = \{(β_{1}, h) : β_{1} \in [0, 1), h \in (0, \frac{2 (1 + β_{1})}{L (1 + 2 β_{2})}]\} .

So in (21)

β_{2}

is treated as an external parameter, which can be varied. In our computations, problem (21) is solved using the following approach: in the first stage, we obtain three ’good’ initial points in

Δ

by random search, and in the second stage, we apply the Nelder–Mead method in order to obtain the optimal point more precisely than in the first stage. For computations at any value of

β_{2}

, we use

10^{5}

random points in

Δ

and the accuracy

10^{- 5}

for the Nelder–Mead method. The use of a large number of random points provides the possibility of obtaining the initial points in the small neighborhood of the optimal point, and the points obtained with the Nelder–Mead method do not leave

Δ

. This approach to solving the problem is very simple to implement and eliminates the need for methods of constrained optimization. All computations were performed with codes implemented in Matlab 2021a.
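The first stage of this approach can be sketched as follows. The code is a plain-Python illustration and not the authors' Matlab implementation: the sampling scheme, the number of points, the seed, and the function names are assumptions, and the Nelder–Mead refinement stage is omitted. For β_2 = 0 the result is compared with the known optimal rate of the HBM (3).

```python
import cmath
import math
import random


def spectral_radius(h, beta1, beta2, lam):
    """Largest modulus among the roots of (8)."""
    a1 = 1.0 + beta1 - h * (1.0 + beta2) * lam
    a2 = beta1 - h * beta2 * lam
    sq = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return max(abs((a1 + sq) / 2.0), abs((a1 - sq) / 2.0))


def objective(beta1, h, beta2, l, L):
    """F(beta1, h) from (21) for a fixed beta2."""
    return max(spectral_radius(h, beta1, beta2, l),
               spectral_radius(h, beta1, beta2, L))


def random_search(beta2, l, L, n_points=20000, seed=0):
    """Sample points of Delta at random; return the best (F, beta1, h) found."""
    rng = random.Random(seed)
    best = (float("inf"), None, None)
    for _ in range(n_points):
        beta1 = rng.random()                                   # in [0, 1)
        h_max = 2.0 * (1.0 + beta1) / (L * (1.0 + 2.0 * beta2))
        h = h_max * (1.0 - rng.random())                       # in (0, h_max]
        value = objective(beta1, h, beta2, l, L)
        if value < best[0]:
            best = (value, beta1, h)
    return best


l, L = 1.0, 10.0                                               # kappa = 10
f_hb, beta1_hb, h_hb = random_search(0.0, l, L)
# Known optimal rate of the HBM (3) for a strongly convex quadratic function.
rho_hb = (math.sqrt(L / l) - 1.0) / (math.sqrt(L / l) + 1.0)
print(f_hb, rho_hb)
```

The random search can only approach the optimal value from above; in the procedure described in the text, the best sampled points would then serve as the initial simplex for the Nelder–Mead method.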

In Figure 1, the plots of optimal values of F are presented for the cases of interval

β_{2} \in [0, 1]

(Figure 1a) and

β_{2} \in [0, 100]

(Figure 1b) for four values of

κ

: 10,

10^{2}

,

10^{3}

, and

10^{5}

. As can be seen, for both intervals and all considered values of

κ

, the minimum value of

F_{o p t}

is attained at

β_{2} = 0

. The value of

F_{o p t}

becomes smaller at smaller values of

κ

. The last feature is also mentioned for the multi-parametric method of [18].

In addition, we try to compare the optimal convergence rate as a function of

κ

for method (4) with the optimal rates for the GD method (2), the HBM (3), and the following Nesterov methods:

(1): Nesterov’s accelerated gradient method for $f \in F_{l, L}^{1, 1}$ (Nesterov1) [6,28]:

$x^{k + 1} = y^{k} - h \nabla f (y^{k}), y^{k} = x^{k} + β (x^{k} - x^{k - 1}),$

$h_{o p t} = \frac{1}{L}, β_{o p t} = \frac{\sqrt{κ} - 1}{\sqrt{κ} + 1}, ρ_{o p t} = 1 - \frac{1}{\sqrt{κ}} .$
(2): Nesterov’s accelerated gradient method for a strongly convex quadratic function (Nesterov2) [28]:

$x^{k + 1} = y^{k} - h \nabla f (y^{k}), y^{k} = x^{k} + β (x^{k} - x^{k - 1}),$

$h_{o p t} = \frac{4}{3 L + l}, β_{o p t} = \frac{\sqrt{3 κ + 1} - 2}{\sqrt{3 κ + 1} + 2}, ρ_{o p t} = 1 - \frac{2}{\sqrt{3 κ + 1}} .$
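The closed-form optimal rates quoted above for the GD method (2), the HBM (3), and the two Nesterov methods can be tabulated with a short helper. This is a sketch for orientation only; the function name and the dictionary keys are labels introduced here.

```python
import math


def optimal_rates(kappa):
    """Closed-form optimal rates quoted in the text, as functions of kappa = L / l."""
    sk = math.sqrt(kappa)
    return {
        "GD": (kappa - 1.0) / (kappa + 1.0),
        "HBM": (sk - 1.0) / (sk + 1.0),
        "Nesterov1": 1.0 - 1.0 / sk,
        "Nesterov2": 1.0 - 2.0 / math.sqrt(3.0 * kappa + 1.0),
    }


for kappa in (10.0, 1e2, 1e3):
    rates = optimal_rates(kappa)
    print(kappa, {name: round(rho, 4) for name, rho in rates.items()})
```

For κ = 100, for example, the formulas give the ordering HBM < Nesterov2 < Nesterov1 < GD, so the HBM has the smallest rate among these four.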

The numerical solution to problem (19) is realized using the same method as for problem (21), but for the Nelder–Mead method, four ’good’ points are obtained with a random search. The range of $β_{2} \geq 0$ is restricted to $[0, 0.5]$ , in accordance with the behavior illustrated in Figure 1.

Plots of

ρ_{o p t}

are presented in Figure 2. As can be seen, the minimum values of

ρ_{o p t}

took place for methods (3) and (4), and they were very close. So, from the results of the computations, the following conclusion can be drawn: for the quadratic function

f (x)

, parameter

β_{2}

does not provide an additional acceleration effect in comparison with the standard HBM (3).

2.4. Equivalent ODE

For an additional analysis of the influence of

β_{2}

on the convergence of (4), we consider an approach based on the ODE, which is constructed as a continuous analogue of the iterative method. At present, this approach is widely used for the analysis of optimization methods [7,21,22,23,24,32,33].

Let method (4) for quadratic function

f (x) = \frac{a}{2} x^{2}

, where

x \in R

,

a > 0

, be considered. This function can be treated as a quadratic approximation of a sufficiently smooth function that attains its minimum value, zero, at the point

x = 0

. Application of (4) leads to the following difference equation:

x^{k + 1} = x^{k} - h a (x^{k} + β_{2} (x^{k} - x^{k - 1})) + β_{1} (x^{k} - x^{k - 1}) .

(22)

Let us introduce function

x (t)

, where t is defined as

t = k \sqrt{h}

, so

x (t) \approx x^{\frac{t}{\sqrt{h}}} = x^{k}

and

x (t + \sqrt{h}) \approx x^{k + 1}

,

x (t - \sqrt{h}) \approx x^{k - 1}

. Equation (22) can be rewritten as

\frac{x^{k + 1} - x^{k}}{\sqrt{h}} = - \sqrt{h} a x^{k} + (β_{1} - h a β_{2}) \frac{x^{k} - x^{k - 1}}{\sqrt{h}} .

(23)

Let the new parameters

γ_{1} > 0

,

γ_{2} \geq 0

be introduced:

β_{1} = 1 - γ_{1} \sqrt{h}

,

γ_{2} = \sqrt{h} β_{2}

and the following new variable is considered:

m^{k + 1} = \frac{x^{k + 1} - x^{k}}{\sqrt{h}} .

So, (23) is rewritten as

\frac{m^{k + 1} - m^{k}}{\sqrt{h}} = - a x^{k} - (γ_{1} + a γ_{2}) m^{k} .

(24)

For

h \to 0

, we find that (24) is rewritten as

\dot{m} = - a x - (γ_{1} + a γ_{2}) m

and with the use of

m = \dot{x}

, we obtain the following second-order ODE:

\ddot{x} = - a x - (γ_{1} + a γ_{2}) \dot{x} .

(25)

The case of HBM corresponds to

γ_{2} = 0

[7] and the ODE describes the dynamics of a material point with unit mass under a force with a potential represented by

f (x)

and under a resistive force with coefficient

γ_{1}

. Thus, if

γ_{2} \neq 0

, we have the following mechanical meaning of

β_{2}

: this presents an additional damping effect on the solution of the ODE (25) and, as a consequence, on the behavior of method (4). With the use of proper values of

β_{2}

, we can realize the damping of oscillations related to the non-monotonic convergence of the method. This is typical for the case of

κ ≫ 1

[13]. In Section 3, this will also be illustrated for the minimization of non-quadratic convex and non-convex functions.
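The damping effect can be observed directly on the difference Equation (22). The sketch below uses assumed values a = 1, h = 0.01, γ_1 = 0.5 and compares γ_2 = 0 with γ_2 = 2; for ODE (25) these correspond to the damping coefficients 0.5 (below the critical value 2\sqrt{a} = 2) and 2.5 (above it). The start-up x^{-1} = x^{0} and the use of sign changes as an oscillation count are choices made for this illustration.

```python
import math


def iterate_eq22(a, h, beta1, beta2, iters, x0=1.0):
    """Run the scalar recurrence (22), assuming x^{-1} = x^{0} = x0."""
    x_prev, x = x0, x0
    history = [x]
    for _ in range(iters):
        x_next = x - h * a * (x + beta2 * (x - x_prev)) + beta1 * (x - x_prev)
        x_prev, x = x, x_next
        history.append(x)
    return history


def sign_changes(values):
    """Number of sign changes along the sequence (a crude oscillation count)."""
    return sum(1 for u, v in zip(values, values[1:]) if u * v < 0.0)


a, h = 1.0, 0.01
gamma1 = 0.5
beta1 = 1.0 - gamma1 * math.sqrt(h)        # beta1 = 1 - gamma1 * sqrt(h)

for gamma2 in (0.0, 2.0):
    beta2 = gamma2 / math.sqrt(h)          # from gamma2 = sqrt(h) * beta2
    xs = iterate_eq22(a, h, beta1, beta2, iters=500)
    print(gamma2, gamma1 + a * gamma2, sign_changes(xs))
```

With γ_2 = 0 the iterates cross the minimizer x = 0 repeatedly, whereas with γ_2 = 2 the roots of (8) are real and positive and the iterates approach zero without changing sign.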

3. Numerical Experiments and Discussion

In this section, we tried to apply method (4) to the minimization of non-quadratic functions that arise in test problems for optimization solvers and in machine learning. The main purpose of these numerical experiments was to demonstrate the effect of

β_{2}

on the convergence of method (4) in comparison with the standard HBM (3). The initial point for all test examples (except the RNN) was chosen as a fixed (not random) point, for better illustration of the convergence process. It was chosen far from the minimum points, but not so far that the methods required a large number of iterations.

In the numerical examples, method (4) was compared only with the HBM (3): since (4) is treated as an improvement of the HBM, this comparison demonstrates the practical effect of the improvement.

3.1. Rosenbrock Function

Let the 2D Rosenbrock function be considered:

f (x_{1}, x_{2}) = {(1 - x_{1})}^{2} + 100 {(x_{2} - x_{1}^{2})}^{2} .

This function has a minimum at the point

x^{*} = (1, 1)

. For the numerical simulation, we used the following values:

x^{0} = (1, 3)

,

h = 2 \times 10^{- 4}

,

β_{1} = 0.97

,

β_{2} = 1

. The descent trajectories for the methods (3) and (4) are presented in Figure 3a. The plots of the dependence of the logarithm of error, computed as

f (x^{k}) - f (x^{*})

, on the iteration number are presented in Figure 3b. Both plots show that the inclusion of

β_{2}

led to the damping of oscillations typical for the HBM, and, as a consequence, to a faster entry of the trajectory in the neighborhood of the minimum point.

The Rosenbrock function considered in this example can be classified as a ravine function, so traditional gradient methods (without the application of the ravine method) converge slowly to the minimum point and need many iterations. As can be seen from Figure 3b, both methods reached a neighborhood of the minimum point with good accuracy, but method (4) converged faster owing to the damping of the oscillations.
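The ravine character of the function can be quantified by the condition number of the Hessian at the minimum point. The sketch below assumes the standard form of the Rosenbrock function, f = (1 - x_{1})^{2} + 100 (x_{2} - x_{1}^{2})^{2}, and uses the closed-form eigenvalues of a symmetric 2 × 2 matrix; the helper names are hypothetical.

```python
import math


def rosenbrock_hessian(x1, x2):
    """Hessian entries of f = (1 - x1)^2 + 100 * (x2 - x1^2)^2 (standard form, assumed)."""
    h11 = 2.0 - 400.0 * (x2 - x1 ** 2) + 800.0 * x1 ** 2
    h12 = -400.0 * x1
    h22 = 200.0
    return h11, h12, h22


def sym2x2_eigenvalues(h11, h12, h22):
    """Eigenvalues of the symmetric matrix [[h11, h12], [h12, h22]], smallest first."""
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    return mean - radius, mean + radius


lam_min, lam_max = sym2x2_eigenvalues(*rosenbrock_hessian(1.0, 1.0))
kappa = lam_max / lam_min
print("Hessian eigenvalues at (1, 1):", lam_min, lam_max)
print("local condition number:", kappa)
```

Under this assumption the two eigenvalues differ by more than three orders of magnitude, which corresponds to the case κ ≫ 1 in which oscillations of the HBM are typical.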

3.2. Himmelblau Function

For the minimization of the non-convex Himmelblau function

f (x_{1}, x_{2}) = {(x_{1}^{2} + x_{2} - 11)}^{2} + {(x_{1} + x_{2}^{2} - 7)}^{2},

which has four local minima, the following parameters were used:

h = 0.01

,

β_{1} = 0.95

,

β_{2} = 1

. For the initial point

x^{0} = (0, 0)

both methods converged to the local minimum

x^{*} = (3, 2)

. The trajectories are presented in Figure 4a, and the plots of the error logarithm are presented in Figure 4b. As can be seen, the damping effect achieved with a proper choice of

β_{2}

led to a faster convergence in comparison with the standard HBM.
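As a quick check of the setting described above, the following sketch evaluates the Himmelblau function and its analytic gradient at the initial point (0, 0) and at the point (3, 2). The function names are hypothetical; the formulas follow directly from the definition of f.

```python
def himmelblau(x1, x2):
    """Himmelblau function as defined above."""
    return (x1 ** 2 + x2 - 11.0) ** 2 + (x1 + x2 ** 2 - 7.0) ** 2


def himmelblau_grad(x1, x2):
    """Analytic gradient of the Himmelblau function."""
    a = x1 ** 2 + x2 - 11.0
    b = x1 + x2 ** 2 - 7.0
    return 4.0 * x1 * a + 2.0 * b, 2.0 * a + 4.0 * x2 * b


print("f(3, 2) =", himmelblau(3.0, 2.0), " grad =", himmelblau_grad(3.0, 2.0))
print("f(0, 0) =", himmelblau(0.0, 0.0), " grad =", himmelblau_grad(0.0, 0.0))
```

At (3, 2) both the function value and the gradient vanish, so this point is a stationary point with f = 0. At the origin the negative gradient is (14, 22), so the first steps of a gradient-based method point into the first quadrant, where (3, 2) is located.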

3.3. Styblinski–Tang Function

Let the following non-convex function be considered:

f (x) = \frac{1}{2} \sum_{i = 1}^{d} (x_{i}^{4} - 16 x_{i}^{2} + 5 x_{i}),

which has a local minimum at

x^{*} = (- 2.903534, \dots, - 2.903534)

and

f (x^{*}) \approx - 39.16617 \cdot d

. For the case of

d = 2

, we used

x^{0} = (- 1, - 4)

,

h = 0.02

,

β_{1} = 0.99

,

β_{2} = 1

. The trajectories for both methods are presented in Figure 5a and plots of the logarithms of error are presented in Figure 5b. As can be seen, for this situation, parameter

β_{2} \neq 0

had a positive influence on the convergence. For

d = 100

, we used the initial vector

x^{0} = (- 1, \dots, - 1)

and the parameters

h = 0.03

,

β_{1} = 0.95

,

β_{2} = 1

. Plots of the dependence of error on iteration number in log–log scale are presented in Figure 6. As can be seen, method (4) for

β_{2} = 1

converged to

x^{*}

faster than the HBM.
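Because the Styblinski–Tang function is a sum of identical one-dimensional terms, its minimizer and minimum value can be checked coordinate-wise. The sketch below applies Newton's method to the derivative of a single term, starting from -3; the starting guess, the number of iterations, and the function names are illustrative choices.

```python
def st_term(t):
    """Contribution of one coordinate to the Styblinski-Tang function."""
    return 0.5 * (t ** 4 - 16.0 * t ** 2 + 5.0 * t)


def st_term_derivative(t):
    """First derivative of st_term."""
    return 0.5 * (4.0 * t ** 3 - 32.0 * t + 5.0)


def st_term_second_derivative(t):
    """Second derivative of st_term."""
    return 0.5 * (12.0 * t ** 2 - 32.0)


t = -3.0  # starting guess near the minimizer quoted in the text
for _ in range(20):
    # Newton step for the stationarity condition st_term_derivative(t) = 0
    t -= st_term_derivative(t) / st_term_second_derivative(t)

print("stationary point per coordinate:", t)
print("objective value per coordinate: ", st_term(t))
print("second derivative there:        ", st_term_second_derivative(t))
```

The value of f at the d-dimensional point with all coordinates equal to t is d times the printed per-coordinate value, which can be compared with the values quoted above.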

3.4. Zakharov Function

This convex function is presented as

f (x) = \sum_{i = 1}^{d} x_{i}^{2} + {(\sum_{i = 1}^{d} 0.5 i x_{i})}^{2} + {(\sum_{i = 1}^{d} 0.5 i x_{i})}^{4} .

It has a unique minimum point

x^{*} = 0

. For

d = 2

, we chose

x^{0}

as

(4, 2)

and performed computations with the following parameter values:

h = 10^{- 4}

,

β_{1} = 0.985

,

β_{2} = 15

. The trajectories are presented in Figure 7a and plots of the error logarithm dependence on the iteration number are presented in Figure 7b. As can be seen, the selected value of

β_{2}

led to a damping of oscillations typical for the HBM and led to a faster entry of the trajectory into the neighborhood of

x^{*}

. For

d = 10

, computations were performed for

x^{0}

, selected as the vector of ones,

h = 10^{- 6}

,

β_{1} = 0.99

,

β_{2} = 4

. Plots of the dependence of error on the iteration number in log–log axes are presented in Figure 8. As can be seen, the value of

β_{2}

led to the damping of oscillations, as in the 2D case.

3.5. Non-Convex Function in Multidimensional Space

Let the following function be considered:

f (x) = \sum_{i = 1}^{10^{6}} \frac{x_{i}^{2}}{1 + x_{i}^{2}} .

(26)

This function has a unique minimum point

x^{*} = 0

. We performed computations with

x^{0}

chosen as a vector of ones and for

h = 0.1

,

β_{1} = 0.95

,

β_{2} = 1

. Plots of the error’s dependence on the iteration number in log–log axes are presented in Figure 9. As in the previous examples, the inclusion of

β_{2}

led to a faster convergence in comparison with the standard HBM.
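Function (26) is separable, and all coordinates of the initial vector are equal, so every coordinate follows the same scalar recursion and f(x^{k}) equals 10^{6} times the value of t^{2} / (1 + t^{2}) at that scalar iterate. The sketch below exploits this to compare the two methods at negligible cost. It assumes that the HBM (3) has its usual form and that method (4) evaluates the gradient at the extrapolated point x^{k} + β_{2} (x^{k} - x^{k-1}); the stopping rule and the function names are hypothetical.

```python
def phi_grad(t):
    """Derivative of phi(t) = t^2 / (1 + t^2), one term of function (26)."""
    return 2.0 * t / (1.0 + t * t) ** 2


def iterations_to_tolerance(h, beta1, beta2, tol=1e-8, max_iter=10000):
    """Scalar run of the assumed two-step iteration started from t = 1.

    Returns the first k with t_k^2 + t_{k-1}^2 < tol, or None if not reached.
    """
    t_prev, t = 1.0, 1.0  # zero initial momentum (an assumption)
    for k in range(1, max_iter + 1):
        step = t - t_prev
        t_next = t - h * phi_grad(t + beta2 * step) + beta1 * step
        t_prev, t = t, t_next
        # two consecutive iterates are tested so that a zero crossing of an
        # oscillating trajectory is not mistaken for convergence
        if t * t + t_prev * t_prev < tol:
            return k
    return None


k_hbm = iterations_to_tolerance(h=0.1, beta1=0.95, beta2=0.0)
k_two = iterations_to_tolerance(h=0.1, beta1=0.95, beta2=1.0)
print("HBM (3):    iterations =", k_hbm)
print("method (4): iterations =", k_two)
```

The reduction to one dimension holds only because the initial coordinates coincide; for a generic initial vector each coordinate would have to be iterated separately.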

3.6. Smoothed Elastic Net Regularization

The following function that arises in machine learning was considered [34]:

f (x) = \frac{1}{2} \| A x - b \|_{2}^{2} + α ν_{τ} (x) + \frac{γ}{2} \| x \|_{2}^{2},

where

x \in R^{d}

,

b \in R^{m}

is the vector of values,

dim (A) = m \times d

is a matrix of features,

α > 0

,

γ > 0

are the regularization parameters, function

ν_{τ} (x)

,

τ > 0

is the smooth approximation of

ℓ_{1}

-norm (so-called pseudo-Huber function [35]):

ν_{τ} (x) = \sum_{i = 1}^{d} (\sqrt{τ^{2} + x_{i}^{2}} - τ) .

As mentioned in [34,35]

f (x) \in F_{l, L}

, where

l = γ + min (eig (A^{T} A))

,

L \approx {(1 + \sqrt{m / d})}^{2} + γ + α / τ

. Datasets, represented by A and b for various values of m and d, were generated using the function randn() in Matlab: matrix A was generated as a random matrix from the Gaussian distribution normalized by

\sqrt{d}

, and b was simulated as a random vector from the same distribution. Computations were performed with the following parameter values:

τ = 10^{- 4}

,

α = γ = 10^{- 2}

. The step h and

β_{1}

were computed as optimal values for the quadratic case, and

β_{2}

was set to 0.5. The condition number

κ

for all model datasets was approximately equal to

10^{4}

. The error was computed as

f (x^{k}) - f (x^{*})

, where

x^{*}

was the benchmark solution, obtained by method (4) for

2 \times 10^{4}

iterations. For all cases,

x^{0}

was chosen as a vector of ones. In Figure 10, the plots of the dependence of error on the iteration number are presented in log–log axes. As can be seen, the presence of

β_{2}

led to an improvement in convergence.
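The role of the pseudo-Huber term can be illustrated directly. The sketch below implements ν_{τ} and its gradient and compares ν_{τ} with the ℓ_{1}-norm: each summand differs from |x_{i}| by at most τ, and the second derivative of a summand at zero equals 1 / τ, which is consistent with the term α / τ in the estimate of L above. The sample vector and the function names are illustrative.

```python
import math


def pseudo_huber(x, tau):
    """nu_tau(x) = sum_i (sqrt(tau^2 + x_i^2) - tau)."""
    return sum(math.sqrt(tau * tau + xi * xi) - tau for xi in x)


def pseudo_huber_grad(x, tau):
    """Gradient of nu_tau: x_i / sqrt(tau^2 + x_i^2) in each coordinate."""
    return [xi / math.sqrt(tau * tau + xi * xi) for xi in x]


tau = 1e-4  # value of tau used in the experiments above
x = [-2.0, -1e-3, 0.0, 1e-3, 2.0]  # illustrative sample points
l1 = sum(abs(xi) for xi in x)
nu = pseudo_huber(x, tau)
gap = l1 - nu
grad = pseudo_huber_grad(x, tau)
print("l1 norm:", l1, " pseudo-Huber:", nu, " gap:", gap)
print("gradient:", grad)
```

The gradient components stay strictly inside (-1, 1) and pass smoothly through zero, in contrast to the subgradient of the ℓ_{1}-norm, which jumps at the origin.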

3.7. Logistic Regression

For the binary classification, the following convex function related to the model of logistic regression is widely used:

f (x) = \sum_{i = 1}^{m} log (1 + exp (- y_{i} ξ_{i}^{T} x)),

where

ξ_{i}

are the rows of matrix

Ξ

,

dim (Ξ) = m \times d

and

y_{i} \in \{- 1, 1\}

,

i = \overline{1, m}

. Matrix

Ξ

and vector y represent the training dataset.

For the computations, we used two datasets: SONAR (

m = 208

,

d = 60

) and CINA0 (

m = 16,033

,

d = 132

). The first was used for a comparison of different methods in [36]. The second is a well-known test dataset, which can be downloaded from https://www.causality.inf.ethz.ch/challenge.php?page=datasets (accessed on 14 March 2024). The error was computed as

f (x^{k}) - f (x^{*})

. For the SONAR dataset, the values

h = 0.1

,

β_{1} = 0.9999

, and

β_{2} = 10

were used, and a benchmark solution was obtained with method (4) after

2 \times 10^{4}

iterations. For CINA0, the following parameters were used:

h = 10^{- 6}

,

β_{1} = 0.99

,

β_{2} = 2

and a benchmark solution was obtained for

5 \times 10^{3}

iterations of method (4). For both datasets,

x^{0}

was chosen as a vector of zeroes.

In Figure 11, plots of the dependence of error on the iteration number in log–log axes are presented. As can be seen, the addition of

β_{2} \neq 0

led to the damping of oscillations typical for the standard HBM.
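Gradient-based methods need the objective and its gradient. The sketch below evaluates the logistic regression objective and its gradient in a numerically stable way (the exponential is only ever taken of a non-positive argument). The tiny dataset is made up for illustration and is unrelated to SONAR or CINA0; the function names are hypothetical.

```python
import math


def logistic_loss(x, rows, labels):
    """f(x) = sum_i log(1 + exp(-y_i * <xi_i, x>)), evaluated stably."""
    total = 0.0
    for xi, yi in zip(rows, labels):
        z = yi * sum(a * b for a, b in zip(xi, x))
        # log(1 + exp(-z)) = max(-z, 0) + log(1 + exp(-|z|))
        total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total


def logistic_grad(x, rows, labels):
    """Gradient: sum_i of -y_i * xi_i / (1 + exp(y_i * <xi_i, x>))."""
    g = [0.0] * len(x)
    for xi, yi in zip(rows, labels):
        z = yi * sum(a * b for a, b in zip(xi, x))
        # s = 1 / (1 + exp(z)), computed without overflow
        if z >= 0.0:
            e = math.exp(-z)
            s = e / (1.0 + e)
        else:
            s = 1.0 / (1.0 + math.exp(z))
        for j, a in enumerate(xi):
            g[j] -= yi * s * a
    return g


# Made-up dataset with m = 4 samples and d = 2 features, for illustration only.
rows = [[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2], [2.0, 1.0]]
labels = [1, -1, -1, 1]
x0 = [0.0, 0.0]  # vector of zeroes, as in the experiments
print("f(x0) =", logistic_loss(x0, rows, labels))
print("grad f(x0) =", logistic_grad(x0, rows, labels))
```

At the zero vector every summand equals log 2, so f(x^{0}) = m log 2 regardless of the data, which gives a simple sanity check for an implementation.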

3.8. Recurrent Neural Network

Let us consider the model recurrent neural network (RNN) used for the analysis of phrase tone. For details of its architecture and realization, see https://python-scripts.com/recurrent-neural-network (accessed on 16 March 2024). This RNN was realized using the following recurrent relations:

h_{s} = tanh (W_{x h} x_{s} + W_{h h} h_{s - 1} + b_{h}), \quad s = \overline{1, M}, \qquad y = W_{h y} h_{M} + b_{y},

where M is the number of words of vocabulary in the phrase;

x_{s}

is a vector, which represents the s-th word in the phrase;

h_{s}

is a vector used for iterations in the hidden layer; y is the output vector;

W_{x h}

,

W_{h h}

,

W_{h y}

are the matrices of weights; and

b_{h}

and

b_{y}

are the vectors of biases. The vector of probabilities of the ‘good’ or ‘bad’ tone of the phrase was computed as

softmax (y)

. The training dataset consisted of 67 phrases from the vocabulary, with 19 unique words. The following dimensions of vectors were used:

dim (x) = 19

,

dim (y) = 2

, the dimension of h was chosen as 64 (the maximum number of words from vocabulary in the phrase; this number can be varied).

As a result of forward propagation, we obtained a 2D vector of probabilities for the phrase tone, computed with the use of the softmax function. The loss function used for the training of this RNN was computed as

L (X, θ) = H_{μ} (μ, p (X; θ)),

where X is a matrix of vectors

x_{1}, \dots, x_{M}

, which represents the phrase with M words,

μ \in \{0, 1\}

is the label of the phrase represented by X;

p (X) = softmax (y (X))

is the probability of the phrase tone;

H_{μ}

is a proper component of a cross-entropy function

H (ν, p) = - (ν log (p) + (1 - ν) log (1 - p));

and

θ \in R^{d}

is a vector of parameters of RNN. The objective function is written as

f (θ) = \frac{1}{N} \sum_{i = 1}^{N} L (X_{i}, θ),

where

N = 67

is the size of the training dataset (number of phrases). With all considered dimensions, we minimized the function of

d = 5506

variables.
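The number of variables follows from the sizes of the weight matrices and bias vectors in the recurrent relations above. The sketch below counts them, assuming dense matrices W_{x h}, W_{h h}, W_{h y} with the shapes implied by dim (x) = 19, dim (h) = 64 and dim (y) = 2; the function name is hypothetical.

```python
def rnn_parameter_count(input_dim, hidden_dim, output_dim):
    """Number of scalar parameters in W_xh, W_hh, W_hy, b_h and b_y."""
    w_xh = hidden_dim * input_dim    # maps x_s to the hidden layer
    w_hh = hidden_dim * hidden_dim   # maps h_{s-1} to h_s
    w_hy = output_dim * hidden_dim   # maps h_M to the output y
    b_h = hidden_dim
    b_y = output_dim
    return w_xh + w_hh + w_hy + b_h + b_y


d = rnn_parameter_count(input_dim=19, hidden_dim=64, output_dim=2)
print("d =", d)  # 1216 + 4096 + 128 + 64 + 2 = 5506
```

The count is dominated by W_{h h}, so the dimension of the optimization problem grows quadratically with the size of the hidden layer.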

For minimization, we applied deterministic methods, in line with the theoretical part of this paper, although stochastic methods are used in most works on the training of neural networks. The computations were performed with

h = 0.05

,

β_{1} = 0.9

and we varied the value of

β_{2}

in order to analyze its effect on the convergence. The numerical experiment was run for 250 random initializations of weights and biases, and the computations were performed for

3 \times 10^{3}

epochs. In Figure 12, the plots of the dependence of the objective function value on the epoch number, averaged over all random initializations, are presented for the standard GD (2), the HBM (3), and method (4) in the case of

β_{2} = 1

. As can be seen, methods with momentum led to a faster convergence in comparison with the standard GD, as mentioned by many authors (e.g., see [19]), and the presence of

β_{2}

led to a faster convergence to the minimum in practice. In Figure 13, the plots obtained for different values of

β_{2}

are presented. As can be seen, the value of

β_{2}

had an effect on the convergence of method (4).

4. Conclusions

In this paper, we analyzed the properties of method (4) in theory and in practice. Despite the results presented in [28,29], this method required further analysis, which was the aim of this work.

The following new results were obtained:

It was demonstrated that, in the case of a quadratic function, method (4) can be easily investigated using the first Lyapunov method. As a result of its application, the convergence conditions presented in Theorem 1 were obtained. These conditions reduce to the conditions for the HBM (3) in the case of $β_{2} = 0$ (see [7]). For functions from $F_{l, L}^{2, 1}$ , such conditions can be treated as conditions of local convergence.
In contrast to the HBM, optimal parameters for method (4) can only be obtained numerically by the solution of the 3D constrained problems (19) and (20). As demonstrated, for the quadratic case, the optimal value of $β_{2}$ was equal to zero, so method (4) did not provide additional acceleration in comparison to the standard HBM.
The ‘mechanical’ role of $β_{2}$ was demonstrated by the consideration of the ODE (25), which is equivalent to (4) in the 1D case. This ODE describes the descent process in the neighborhood of $x^{*}$ . As can be seen from (25), the presence of $β_{2}$ provides additional damping of the oscillations associated with the non-monotone convergence of the HBM [13].
In numerical examples from different applications, it was demonstrated that, with proper values of $β_{2}$ , the oscillation amplitudes typical of the HBM can be reduced.

The following remarks on future investigations can be made:

In this paper, a local convergence analysis was presented. For $f (x) \in F_{l, L}^{1, 1}$ , global convergence for a specific choice of the parameters was demonstrated in [29]. It is important to obtain general conditions on the parameters that guarantee global convergence. As is known for the HBM (e.g., see [28]), the convergence conditions obtained for strongly convex quadratic functions can lead to a lack of global convergence for $f (x) \in F_{l, L}^{1, 1}$ .
The analysis of method (4) was performed for the case of constant values of $β_{1}$ and $β_{2}$ . However, as is known [18], it is effective to use methods with adaptive momentum, whose value depends on k, in order to improve the convergence. Thus, the extension of method (4) to the case of adaptive parameters is a promising direction for future research.
In this paper, all methods were considered in their deterministic formulations. However, in modern problems, especially those arising in machine learning, stochastic gradient methods are used because of the size of the datasets. Therefore, the extension of method (4) and its modifications to stochastic optimization has potential for future investigation, especially for applications in machine learning.

Author Contributions

Conceptualization, G.V.K.; methodology, G.V.K.; software, G.V.K. and V.Y.S.; validation, G.V.K. and V.Y.S.; formal analysis, G.V.K.; investigation, G.V.K. and V.Y.S.; writing—original draft preparation, G.V.K.; writing—review and editing, G.V.K. and V.Y.S.; visualization, G.V.K. and V.Y.S.; supervision, G.V.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are included in the article.

Acknowledgments

The authors wish to thank anonymous reviewers for their useful comments and discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bishop, C. Pattern Recognition and Machine Learning; Springer: Berlin, Germany, 2006.
2. Leonard, D.; van Long, N.; Ngo, V.L. Optimal Control Theory and Static Optimization in Economics; Cambridge University Press: Cambridge, UK, 1992.
3. Saad, Y. Iterative Methods for Sparse Linear Systems; SIAM: Philadelphia, PA, USA, 2003.
4. Ljung, L. System Identification: Theory for the User; Prentice Hall PTR: Hoboken, NJ, USA, 1999.
5. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
6. Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course; Springer: Berlin, Germany, 2004.
7. Polyak, B. Introduction to Optimization; Optimization Software Inc.: New York, NY, USA, 1987.
8. Polyak, B. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17.
9. Ghadimi, E.; Feyzmahdavian, H.R.; Johansson, M. Global convergence of the heavy-ball method for convex optimization. In Proceedings of the 2015 European Control Conference (ECC), Linz, Austria, 15–17 July 2015; pp. 310–315.
10. Aujol, J.-F.; Dossal, C.; Rondepierre, A. Convergence rates of the heavy ball method for quasi-strongly convex optimization. SIAM J. Optim. 2022, 32, 1817–1842.
11. Bhaya, A.; Kaszkurewicz, E. Steepest descent with momentum for quadratic functions is a version of the conjugate gradient method. Neural Netw. 2004, 17, 65–71.
12. Goujaud, B.; Taylor, A.; Dieuleveut, A. Quadratic minimization: From conjugate gradients to an adaptive heavy-ball method with Polyak step-sizes. In Proceedings of the OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), New Orleans, LA, USA, 3 December 2022.
13. Danilova, M.; Kulakova, A.; Polyak, B. Non-monotone behavior of the heavy ball method. In Difference Equations and Discrete Dynamical Systems with Applications. ICDEA 2018. Springer Proceedings in Mathematics and Statistics; Bohner, M., Siegmund, S., Simon Hilscher, R., Stehlik, P., Eds.; Springer: Berlin, Germany, 2020; pp. 213–230.
14. Danilova, M.; Malinovskiy, G. Averaged heavy-ball method. Comput. Res. Model. 2022, 14, 277–308.
15. Josz, C.; Lai, L.; Li, X. Convergence of the momentum method for semialgebraic functions with locally Lipschitz gradients. SIAM J. Optim. 2023, 33, 3012–3037.
16. Wang, H.; Luo, Y.; An, W.; Sun, Q.; Xu, J.; Zhang, L. PID controller-based stochastic optimization acceleration for deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5079–5091.
17. Ma, J.; Yarats, D. Quasi-hyperbolic momentum and Adam for deep learning. In Proceedings of the ICLR 2019: International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
18. Gitman, I.; Lang, H.; Zhang, P.; Xiao, L. Understanding the role of momentum in stochastic gradient methods. In Proceedings of the NeurIPS 2019: Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
19. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, PMLR, Atlanta, GA, USA, 17–19 June 2013; Volume 28, pp. 1139–1147.
20. Kidambi, R.; Netrapalli, P.; Jain, P.; Kakade, S. On the insufficiency of existing momentum schemes for Stochastic Optimization. In Proceedings of the NeurIPS 2018: Neural Information Processing Systems, Montreal, QC, Canada, 2–8 December 2018.
21. Attouch, H.; Fadili, J. From the ravine method to the Nesterov method and vice versa: A dynamical system perspective. SIAM J. Optim. 2022, 32, 2074–2101.
22. Attouch, H.; Laszlo, S.C. Newton-like inertial dynamics and proximal algorithms governed by maximally monotone operators. SIAM J. Optim. 2020, 30, 3252–3283.
23. He, X.; Hu, R.; Fang, Y.P. Convergence rates of inertial primal-dual dynamical methods for separable convex optimization problems. SIAM J. Control Optim. 2020, 59, 3278–3301.
24. Alecsa, C.D.; Laszlo, S.C. Tikhonov regularization of a perturbed heavy ball system with vanishing damping. SIAM J. Optim. 2021, 31, 2921–2954.
25. Diakonikolas, J.; Jordan, M.I. Generalized momentum-based methods: A Hamiltonian perspective. SIAM J. Optim. 2021, 31, 915–944.
26. Yan, Y.; Yang, T.; Li, Z.; Lin, Q.; Yang, Y. A unified analysis of stochastic momentum methods for deep learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, 13–19 July 2018.
27. Van Scoy, B.; Freeman, R.; Lynch, K. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Syst. Lett. 2018, 2, 49–54.
28. Lessard, L.; Recht, B.; Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 2016, 26, 57–95.
29. Cyrus, S.; Hu, B.; Van Scoy, B.; Lessard, L. A robust accelerated optimization algorithm for strongly convex functions. In Proceedings of the 2018 Annual American Control Conference (ACC), Milwaukee, WI, USA, 27–29 June 2018.
30. Gantmacher, F.R. The Theory of Matrices; Chelsea Publishing Company: New York, NY, USA, 1984.
31. Gopal, M. Control Systems: Principles and Design; McGraw Hill: New York, NY, USA, 2002.
32. Su, W.; Boyd, S.; Candès, E.J. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. J. Mach. Learn. Res. 2016, 17, 1–43.
33. Luo, H.; Chen, L. From differential equation solvers to accelerated first-order methods for convex optimization. Math. Program. 2022, 195, 735–781.
34. Eftekhari, A.; Vandereycken, B.; Vilmart, G.; Zygalakis, K.C. Explicit stabilised gradient descent for faster strongly convex optimisation. BIT Numer. Math. 2021, 61, 119–139.
35. Fountoulakis, K.; Gondzio, J. A second-order method for strongly convex ℓ₁-regularization problems. Math. Program. 2016, 156, 189–219.
36. Scieur, D.; d’Aspremont, A.; Bach, F. Regularized nonlinear acceleration. Math. Program. 2020, 179, 47–83.

Figure 1. Plots of the dependence of optimal values of F on the value of

β_{2}

: (a)

β_{2} \in [0, 1]

; (b)

β_{2} \in [0, 100]

.

Figure 2. Plots of the dependence of the optimal convergence rate on logarithm of

κ

.

Figure 3. Plots of the descent trajectories (a) and dependence of the error logarithm on iteration number (b) for the minimization of the 2D Rosenbrock function. Blue line corresponds to the HBM, red line—to method (4).

Figure 4. Plots of the descent trajectories (a) and dependence of the error logarithm on iteration number (b) for the minimization of the Himmelblau function. Blue line corresponds to the HBM, red line—to method (4).

Figure 5. Plots of the descent trajectories (a) and dependence of the error logarithm on iteration number (b) for the minimization of the Styblinski–Tang function. Blue line corresponds to the HBM, red line—to method (4).

Figure 6. Plots of the dependence of error on iteration number for minimization of Styblinski–Tang function for

d = 100

in log–log axes. Blue line corresponds to the HBM, red line—to method (4).

Figure 7. Plots of the descent trajectories (a) and the dependence of the error logarithm on the iteration number (b) for the minimization of the Zakharov function. Blue line corresponds to the HBM, red line—to method (4).

Figure 8. Plots of the dependence of the error on iteration number for the minimization of the Zakharov function for

d = 10

in log–log axes. Blue line corresponds to the HBM, red line—to method (4).

Figure 9. Plots of the dependence of error on iteration number for minimization of function (26) in log–log axes. Blue line corresponds to the HBM, red line — to method (4).

Figure 10. Plots of the dependence of the error on the iteration number for the regression problem with smoothed elastic net regularization in log–log axes for different model datasets. Blue line corresponds to the HBM, red line—to method (4).

Figure 11. Plots of the dependence of error on the iteration number for the logistic regression problem in log–log axes for datasets SONAR (a) and CINA0 (b). Blue line corresponds to the HBM, red line—to method (4).

Figure 12. Plots of the dependence of the objective function value on the epoch number for the problem of RNN training.

Figure 13. Plots of the dependence of the objective function value on epoch number for the problem of RNN training for method (4) at different values of

β_{2}

.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Krivovichev, G.V.; Sergeeva, V.Y. Analysis of a Two-Step Gradient Method with Two Momentum Parameters for Strongly Convex Unconstrained Optimization. Algorithms 2024, 17, 126. https://doi.org/10.3390/a17030126
