1. Introduction
We consider the problem of minimizing a convex differentiable function f(x), x ∈ Rn, where Rn is a finite-dimensional Euclidean space. When the function is highly degenerate, Newton-type minimization methods are required, for example, modifications of Newton’s method or quasi-Newton methods.
Although a stable quadratic representation of the function exists in a neighborhood of the minimum, most iterations of a minimization method take place outside the extremum region. It therefore seems relevant to study the accelerating properties of methods that change the space metric under conditions where the quadratic properties of the function are unstable.
Numerous studies on the convergence rate of Newton and quasi-Newton methods in the extremum region have been conducted; some of them are given in [1,2,3,4,5,6,7,8]. The results obtained in [9,10] concern the convergence rate of quasi-Newton minimization methods under the assumption that the method operates in the extremum region of the function. The authors of [11] aimed at accelerating the symmetric rank-1 quasi-Newton method with Nesterov’s gradient. The convergence rate of the incremental quasi-Newton method was investigated in [12,13]. Large-scale optimization through sampled versions of quasi-Newton methods was considered in [14,15]. The convergence rates of randomized and greedy variants of Newtonian and quasi-Newton methods were presented in [16,17,18,19,20,21,22,23,24].
As the object of minimization, we use strongly convex functions with a Lipschitz gradient [25]. When second derivatives exist, these constants bound the spread of the Hessian eigenvalues in the minimization region [25]. The ratio ρ/L ≤ 1 of the strong convexity constant ρ and the Lipschitz constant L determines the convergence rate of gradient minimization methods, with indicator q ≈ 1 − ρ/L of the approach to the extremum in terms of function value [26].
By the presence of a removable linear background, we mean the existence of a linear coordinate transformation V ∈ Rn×n that significantly increases the ratio of the constants in the new coordinate system: ρV/LV ≫ ρ/L. The advantages of the gradient method in the new coordinate system, with indicator q ≈ 1 − ρV/LV, are obvious. However, this estimate is not attainable directly, since the transformation V is not known.
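The effect of such a transformation can be checked numerically. The sketch below uses an illustrative diagonal quadratic and the (here known) choice V = A^{−1/2}; these are assumptions for illustration only, since in general V is unknown:

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x with an ill-conditioned Hessian A.
# rho and L are the extreme Hessian eigenvalues, so rho/L is tiny.
A = np.diag([1.0, 100.0, 10000.0])
eig = np.linalg.eigvalsh(A)
ratio = eig[0] / eig[-1]                       # rho / L = 1e-4

# Change of variables x = V y with V = A^{-1/2}: the Hessian of f(Vy)
# is V^T A V = I, so rho_V / L_V = 1 and the linear background is removed.
V = np.diag(1.0 / np.sqrt(np.diag(A)))
eig_v = np.linalg.eigvalsh(V.T @ A @ V)
ratio_v = eig_v[0] / eig_v[-1]
print(ratio, ratio_v)                          # ratio ~ 1e-4, ratio_v ~ 1
```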
This research continues previous studies [27,28] and aims to study the ability of Newton’s method and of the relaxation subgradient method (RSM) with optimization of the parameters of rank-two correction of metric matrices [27] to eliminate the linear background that worsens convergence, under the assumption that a transformation V with the properties noted above exists. Similar studies for quasi-Newton methods were carried out in [29].
Newton’s method is invariant with respect to linear coordinate transformations, which allows one to obtain an estimate of its convergence rate with indicator q ≈ 1 − ρV²/LV². This makes it possible to conclude that Newton’s method can exclude from the function being minimized a linear background that worsens convergence and that can be eliminated by a linear transformation of coordinates. In what follows, this estimate serves as a standard, and the ability of a method to exclude the linear background, as Newton’s method does, will be called its Newtonian property. The main goal of this work is to substantiate the presence of the Newtonian property in the RSM with a change in the space metric [27]. As shown in [29], the noted Newtonian property is inherent in quasi-Newton methods.
There are a number of directions for constructing non-smooth optimization methods, some of which are given in [25,30,31]. The works [32,33,34] considered an approach to creating smooth approximations of non-smooth functions. Methods of this class are applicable to a wide range of problems. A number of effective approaches in non-smooth optimization arose from the creation of the first subgradient methods with space dilation [35,36], in the class of minimization methods relaxing both in function value and in distance to the extremum [25,37,38].
The first RSMs were proposed in [39,40,41]. In [36], an effective RSM with space dilation in the direction of the subgradient difference (RSMSD) was developed. Subsequent work on creating effective RSMs is associated with identifying the origin of RSMSD and its theoretical justification [42,43]. Formalization of the model of subgradient sets and the use of ideas and algorithms from machine learning [44] made it possible to identify the principles of organizing RSMs with space dilation [43] and to obtain a theoretical basis for their creation. It turned out that the problem of finding the descent direction in an RSM can be reduced to the problem of solving a system of inequalities on subgradient sets, formulated mathematically as minimizing a quality functional. In this case, the convergence rate of the minimization method is determined by the properties of the learning algorithm.
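As a toy illustration of this reduction, the sketch below solves the system ⟨s, g⟩ > 0 over a finite set G of subgradients with a perceptron-style learning rule. This rule is a much simpler stand-in for the space-dilation learning algorithms discussed here; G is an arbitrary separable example:

```python
import numpy as np

# Find s with <s, g> > 0 for every g in a finite "subgradient set" G.
# The perceptron rule moves s toward any violated inequality; for a
# separable G it terminates after finitely many corrections.
G = np.array([[1.0, 0.2], [0.8, -0.3], [0.9, 0.1]])

s = np.zeros(2)
for _ in range(100):
    violated = [g for g in G if s @ g <= 0]
    if not violated:
        break                                  # s solves the system
    s = s + violated[0]                        # perceptron correction
assert all(s @ g > 0 for g in G)
print(s)
```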
The principle of RSM organization does not rely on second derivatives of the function. The method under study is similar in structure to quasi-Newton methods, and its formulas for transforming metric matrices are similar in structure to those of the quasi-Newton DFP method. The purpose of transforming metric matrices in an RSM is to find a metric matrix that maps subgradients into a direction forming an acute angle with all subgradients in a neighborhood of the current minimum approximation. Using this direction enables us to leave this neighborhood.
The studied RSM with optimization of the parameters of rank-two metric matrix correction [27] results from improving the RSM of [43]. In the RSM of [43], the problem of finding the descent direction is reduced to solving a system of inequalities so as to develop a descent direction that forms an acute angle with the set of subgradients of a certain neighborhood of the current minimum. In this case, the descent direction is found as in quasi-Newton methods, by multiplying a matrix by the subgradient. Compared with the algorithm of [43], a faster algorithm for solving systems of inequalities was proposed in [27], which was confirmed by a computational experiment in [27] for the RSM built on this basis.
In this work, a qualitative analysis of the formulas for choosing the algorithm parameters from [27] is carried out, and on this basis a new method for finding the matrix transformation parameters is proposed. In contrast to the RSMs of [27,42], where the convergence of the algorithm is justified under strict restrictions on the transformation parameters of the metric matrices, here estimates of the convergence rate on smooth functions are obtained for a wide range of matrix transformation parameters. Therefore, one can tune the method to a certain class of problems by selecting the parameters of the metric matrix transformation.
For the studied RSM, we show that the method is invariant under linear coordinate transformations, and we obtain an estimate of its convergence rate on strongly convex functions with a Lipschitz gradient. Newton’s method’s ability to eliminate the high degree of conditioning caused by the linear background is also inherent in the subgradient method under study. At the same time, the convergence rate estimates for Newton’s method and for the method under study reflect the characteristics of the ill-conditioned problem in qualitatively similar ways.
To solve both smooth and non-smooth problems, universal algorithms have been developed and implemented as a practical realization of an idealized version of the method. Special test functions have been developed to detect the Newtonian property of the proposed methods. The first of them simulates a random change in the properties of the function. In another, a targeted change is made in the elongation of the function’s level lines along the coordinate axes as the extremum is approached. In one of the functions, the axes of level-line elongation change due to movement along an ellipsoidal ravine.
In the computational experiment, the quasi-Newton BFGS method and the investigated universal subgradient methods are compared on the proposed test functions. The results indicate the effectiveness of the developed methods in minimizing smooth, highly ill-conditioned functions and their ability to exclude a linear background that worsens convergence. Depending on the type of function, different methods dominate, which allows us to conclude that the subgradient method is applicable alongside quasi-Newton methods for minimizing smooth functions with a high degree of conditioning.
The rest of the paper is organized as follows. In Section 2, the accelerating properties of Newton’s method under conditions of instability of the second derivatives of the function are considered. In Section 3, a subgradient method is presented that solves the problem of forming the descent direction. The convergence rate of the subgradient method on strongly convex functions with Lipschitz gradient is discussed in Section 4. Features of the implementation of the subgradient method are presented in Section 5. The results of a numerical study on smooth functions are shown in Section 6. Section 7 concludes the work.
2. Accelerating Properties of Newton’s Method under Conditions of Instability of Second Derivatives
Denote . For non-smooth functions, we will denote a vector from the subgradient set . Due to the coincidence of the gradient and subgradient on smooth functions, we will also use this notation for smooth functions .
Condition 1. We will assume that the function being minimized f(x), x ∈ Rn, is differentiable and strongly convex on Rn, i.e., there exists ρ > 0 such that the inequality
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − α(1 − α)(ρ/2)‖x − y‖²
holds for all x, y ∈ Rn and α ∈ [0, 1], and the gradient satisfies the Lipschitz condition
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, L > 0. (1)
Functions which fulfill Condition 1 satisfy the relations [25]:
where x* is the minimum point and f* = f(x*) is the function value at the minimum point.
The iteration of the gradient-consistent method with exact one-dimensional descent has the form:
xk+1 = xk − γksk, k = 0, 1, 2, …, (5)
γk = arg minγ≥0 f(xk − γsk), (6)
where the initial point is x0 and sk is a search direction.
Theorem 1. Let the function satisfy Condition 1. Then, the sequence of iterations j = 0, 1,…, k of the process (5), (6) is estimated as:
Proof of Theorem 1. We present the exact value of the function reduction indicator at iteration in the form:
Let us estimate the numerator and the denominator in (9). According to (2), for the denominator we obtain:
According to (6), fk+1 is the minimum of a one-dimensional function whose gradient is . Since this one-dimensional function also satisfies Condition 1, estimating the numerator in (9) with the use of inequality (4), we obtain:
Using (9) and (10), we obtain (8):
□
Based on Theorem 1, the convergence rate indicator of the gradient method (5), (6) with the choice of descent direction , according to (8) and (11), has the form:
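For a quadratic objective this indicator can be observed directly. The sketch below (an illustrative diagonal Hessian, not one of the paper’s test functions) runs scheme (5), (6) with the steepest descent direction and checks that each iteration reduces f − f* at least by the factor q = 1 − ρ/L:

```python
import numpy as np

# f(x) = 0.5 x^T A x, minimum f* = 0 at x* = 0; rho = 1, L = 100.
A = np.diag([1.0, 10.0, 100.0])
q = 1.0 - 1.0 / 100.0                          # q = 1 - rho/L

f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, 1.0, 1.0])
for _ in range(50):
    g = A @ x                                  # gradient = descent direction
    gamma = (g @ g) / (g @ A @ g)              # exact one-dimensional descent
    x_new = x - gamma * g
    assert f(x_new) <= q * f(x) + 1e-15        # per-iteration reduction <= q
    x = x_new
print(f(x))
```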
Let us consider an estimate of the convergence rate of Newton’s method under Condition 1 and the assumption of the existence of second derivatives of the function.
Theorem 2. Let the function be twice differentiable and satisfy Condition 1. Then, for a sequence of iterations j = 0, 1,…, k of process (5), (6) with the choice of Newton’s method direction
the following estimate holds:
Proof of Theorem 2. Under Condition 1, the Hessian satisfies the constraints [25]:
Denote , and is a symmetric matrix such that , .
To use Theorem 1, we estimate for direction (13) subject to constraints (15):
Using the last estimate in (8), we obtain estimate (14). □
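On a quadratic function the content of this estimate is easy to observe: the Newton direction with exact one-dimensional descent reaches the minimum in a single iteration, regardless of the eigenvalue spread (illustrative data below):

```python
import numpy as np

# Badly conditioned quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.diag([1.0, 1.0e4])
b = np.array([1.0, -2.0])
x = np.zeros(2)

g = A @ x - b                                  # gradient at x0
s = np.linalg.solve(A, g)                      # Newton direction (13)
gamma = (g @ s) / (s @ A @ s)                  # exact step; equals 1 here
x1 = x - gamma * s
print(np.linalg.norm(A @ x1 - b))              # gradient vanishes at x1
```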
Let function f(x) satisfy Condition 1. Define the transformation of variables:
where P ∈ Rn×n is a non-singular matrix. In the new coordinate system, the function to be minimized takes the form:
The resulting function also satisfies Condition 1, with strong convexity constant ρp and Lipschitz constant Lp.
Let V ∈ Rn×n be a non-singular matrix such that, for the strong convexity and Lipschitz constants of the functions with and P ∈ Rn×n, the inequality holds:
Transformation (18) subsequently plays the role of a selected coordinate system, the best in terms of the convergence rate of gradient methods. Since the gradient method, unlike Newton’s method, is not invariant under a linear coordinate transformation, we cannot use the strong convexity and Lipschitz constants of the preferred coordinate system (18).
Theorem 3. Let the function be twice differentiable and satisfy Condition 1. Then, for the sequence of iterations j = 0, 1,…, k of process (5), (6) with the choice of the Newton’s method direction (13), the following estimate holds:
corresponding to the selected coordinate system (18), which has property (19).
Proof of Theorem 3. The iteration of Newton’s method (5), (6), (13) with exact one-dimensional descent (6) has the form:
The characteristics of the functions and , taking into account (16) and (17), are related by:
After transferring process (21) to the new coordinate system, we obtain its coincidence with the method in the new coordinate system:
When the initial points of Newton’s method in the two coordinate systems are related by , then, according to (23), sequences of points related by and equal values of the functions are generated. Moreover, since the method with exact one-dimensional minimization (6) is considered, due to the extremum condition:
the equality will hold. Due to the invariance of Newton’s method with respect to the linear transformation of coordinates (16), when the initial conditions are related, Newton’s method generates identical sequences of function values in the two coordinate systems. Applying the estimate in the coordinate system , taking into account the results of Theorem 2, we obtain estimate (20). □
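The invariance used in the proof can be verified numerically: running Newton steps on f(x) and on fP(y) = f(Py) from related initial points y0 = P⁻¹x0 yields equal function values at every iteration. The quadratic f and the matrix P below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 50.0, 2500.0])               # Hessian of f
b = rng.normal(size=3)
P = rng.normal(size=(3, 3)) + 4.0 * np.eye(3)  # non-singular transform

f = lambda x: 0.5 * x @ A @ x - b @ x
Ap, bp = P.T @ A @ P, P.T @ b                  # data of f_P(y) = f(P y)

x = np.ones(3)
y = np.linalg.solve(P, x)                      # related initial points
for _ in range(3):
    x = x - np.linalg.solve(A, A @ x - b)      # Newton step on f
    y = y - np.linalg.solve(Ap, Ap @ y - bp)   # Newton step on f_P
    fp = 0.5 * y @ Ap @ y - bp @ y
    assert abs(f(x) - fp) <= 1e-8 * (1.0 + abs(f(x)))  # equal values
print(f(x))
```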
According to (19), the last estimate determines the advantages of Newton’s method over the gradient method in the case of:
Considering that, when solving practical problems, most iterations of the method often occur under conditions of significant Hessian variation (15), estimate (20), subject to condition (24), explains the advantages of Newton’s method. In this case, no additional restrictions on the second derivatives beyond smoothness are required.
3. Subgradient Minimization Method
Here, we describe the subgradient method of [27], which solves the problem of forming a descent direction such that one-dimensional minimization along it yields a new current minimum approximation outside a certain neighborhood of the current minimum. The appropriate direction is a vector consistent with all subgradients at points of a certain neighborhood of the current minimum approximation. In the case of smooth functions, the descent direction is matched with the set of gradients of the neighborhood obtained at iterations of the method.
In relaxation processes of the ε-subgradient type, successive approximations are constructed according to the formulas [39,40,41,43,45]:
The descent direction sk+1 is selected from a set , where is the ε-subgradient set at a point and is a set of feasible directions. Denote the subgradient set at a point x by . If the set S(G) is not empty, then, according to its definition, any vector s ∈ S(G) is a solution to the system of inequalities:
that is, it specifies the normal of a plane separating the origin from the set G. One of the solutions to (26) is the vector η(G) of minimal length from G. For example, in the ε-steepest descent method, [41]. Due to the absence of an explicit description of the ε-subgradient set, the vector s satisfying condition (26) is used in (25) as the descent direction, and the set G here is the hull of the subgradients obtained on the descent trajectory [39,40,41].
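For a finite set G, the minimal-length vector η(G) can be approximated by a short Frank–Wolfe loop over convex-hull weights. This is an illustrative computation only (the paper’s own algorithms work differently); G below is an arbitrary separable example:

```python
import numpy as np

# G: rows are subgradients; eta(G) is the minimum-norm point of conv(G).
G = np.array([[2.0, 1.0], [1.0, -1.0], [3.0, 0.5]])

w = np.full(len(G), 1.0 / len(G))              # convex-hull weights
for k in range(2000):
    eta = w @ G                                # current hull point
    i = int(np.argmin(G @ eta))                # Frank-Wolfe vertex
    step = 2.0 / (k + 2.0)
    w = (1.0 - step) * w
    w[i] += step                               # move weight to vertex i
eta = w @ G
assert all(G @ eta > 0)                        # eta solves inequalities (26)
print(eta, np.linalg.norm(eta))
```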
On smooth functions, the elements of the set G are the gradients of the current minimum neighborhood. Figure 1 shows the set G with the designations of its elements, which will be given below.
Denote by ηG a vector of minimum length from the set G, , , , , , . For a certain set G, we will also use the noted characteristics indicating the set as an argument, for example, η(G), r(G).
We will assume that the following assumption holds for the set G.
Assumption 1. The set G is convex, closed, and bounded ( < ∞), and satisfies the separability condition, i.e., .
Let us introduce the relation θ(M) and its inverse function m(θ). Thus, . For some bounded θ, define the relations:
The vector s* is a solution to the system of inequalities (26). The parameters ρ and RS characterize the thickness of the set G in the direction μ. The quantity RS, by its definition, determines the thickness of the set G and significantly affects the convergence rate of learning algorithms with space dilation. When the thickness of the set is zero, i.e., , we have the case of a flat set.
The quantity determines the complexity of solving system (26). The transformation parameters of the metric matrices of the subgradient method are found according to expression (28).
In this work, two versions of the subgradient method are presented. The first involves an exact one-dimensional search; for this version, estimates of the convergence rate on smooth functions will be obtained. The second version is intended for practical implementation, where a rough one-dimensional search is used. To integrate the method for solving systems of inequalities into the minimization algorithms correctly and to comply with the restrictions imposed on its operations, we outline it below. In the subgradient methods under study, the following Algorithm 1 for solving the system of inequalities (26) is used to estimate the parameters of the separating plane.
Algorithm 1 [27]. Algorithm for solving a system of inequalities
1. Assume k = 0, H0 = I, q ≥ 1. Set θA such that:
and .
2. Set and , which is the current approximation of the solution to the system of inequalities . Find a vector such that:
If no such vector exists, then the solution has been found; stop the algorithm.
3. Compute the vectors:
Here, the vector pk is found from the orthogonality condition on the vectors and :
Compute , where:
Find the parameter:
Find the parameters according to (28):
We obtain a new approximation of the metric matrix , where:
4. Assign k = k + 1. Go to step 2.
Constraint (29) on the set G, when Algorithm 1 is applied within the minimization method, imposes restrictions on the subgradient sets of the non-smooth minimization problem. In the case of smooth minimization problems, one can choose the parameter satisfying (29) arbitrarily; it is selected experimentally in order to optimize the algorithm’s efficiency.
Denote . It was proven in [27] that Algorithm 1 converges in a finite number of iterations on a set G satisfying Assumption 1, for algorithm parameters V0 and θA for which the restrictions and (29) are satisfied. In this case, the number of iterations does not exceed k0, the minimal integer from the range of values of k satisfying the inequality:
From the above estimate, we conclude that larger values of correspond to a smaller number of iterations k0, which means that the desired direction will be found in fewer iterations. The last estimate is based on the worst-case scenario, when all . In fact, according to the results of a computational experiment in [27], a minimization algorithm based on Algorithm 1 with parameter (35) is more effective than one with fixed parameters .
The version of the minimization algorithm presented in this section uses exact one-dimensional descent and is intended for estimating its convergence rate on smooth functions. A practically implementable version without exact one-dimensional descent will be presented in the next section. Here, as in the practically implemented version, there are no resets of the parameters of the algorithm for solving systems of inequalities in the form of setting Hk = I; such resets are used in the theoretical version of the algorithm from [27], where they are necessary for justifying the convergence of the minimization algorithm on non-smooth functions. In the version used in practice for minimizing both smooth and non-smooth functions, this reset is absent, but there are minor changes to the diagonal elements of the matrix , excluding its poor conditioning and scaling, and , excluding excessive reduction of its elements. Therefore, the described version of the algorithm is closest to the implemented versions designed to minimize smooth and non-smooth functions. As before, we use the notation for both the gradient and the subgradient at a point xk.
At step 2 of Algorithm 1, the vector is given arbitrarily, and a vector with property (30) is found in the set. In the minimization algorithm, we set at the point of the current minimum approximation, determine the descent direction , and find the new minimum approximation .
In the case of exact one-dimensional minimization, the equality holds for the gradient at the point . Therefore, in Algorithm 1 built into the minimization algorithm, we can take the vectors and as the new pair of vectors in (30); due to exact one-dimensional descent, an inequality similar to (30) will be satisfied for them. Since the vector in Algorithm 1 is chosen arbitrarily, at the next iteration of the minimization algorithm the vector can be chosen. An idealized version of such a minimization algorithm is Algorithm 2, described below. Estimating the convergence rate of this algorithm is the goal of our work.
In the case of inexact one-dimensional descent, it is assumed that the one-dimensional minimum has been localized, that is, a point has been obtained such that the subgradient at the extreme point satisfies inequality (30) (Figure 2). The subgradient will be used for the transformation of the matrix . Figure 2 shows the point with the smallest function value found, which at the next iteration becomes the new current minimum point with the minimization direction . A presentation of the practical version of the algorithm and its numerical analysis will be given in subsequent sections.
The following minimization algorithm assumes exact one-dimensional descent. An infinite sequence of points is constructed until the gradient becomes zero. For this version of the algorithm, an estimate of the convergence rate is obtained.
Algorithm 2. Minimization algorithm
1. Assume k = 0, H0 = I, q ≥ 1. Set θA such that:
and . Compute . If , then the minimum point is found; stop the algorithm.
2. Find a new minimum approximation:
3. Compute the gradient based on the condition:
If , then is the minimum point; stop the algorithm.
4. Compute the vectors , :
Here, the vector is found from (32) based on the orthogonality of and . Compute according to formula (33), where:
Find according to (34) and , as in (35). We obtain a new approximation of the metric matrix .
5. Assign k = k + 1. Go to step 2.
Here, the built-in method for solving the system of inequalities (26) consists of the transformations carried out at step 4 under condition (39). The solution to system (26) at an iteration is the vector , which is used as the new descent direction.
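To convey the structure of steps 2–4, here is a schematic sketch: descent along s = Hg with exact one-dimensional search, followed by a rank-two correction of the metric matrix driven by the (sub)gradient difference. The DFP-style update below is only a stand-in for the paper’s own transformation formulas (whose parameters are chosen differently), and the quadratic test problem is an illustrative assumption:

```python
import numpy as np

A = np.diag([1.0, 100.0])                      # quadratic test problem
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

x, H = np.zeros(2), np.eye(2)
g = grad(x)
for _ in range(10):
    s = H @ g                                  # descent direction
    gamma = (g @ s) / (s @ A @ s)              # exact one-dimensional search
    x = x - gamma * s
    g_new = grad(x)
    if np.linalg.norm(g_new) < 1e-10:
        break
    dx, y = -gamma * s, g_new - g              # step and gradient difference
    H = H + np.outer(dx, dx) / (dx @ y) \
          - np.outer(H @ y, H @ y) / (y @ H @ y)  # DFP-like rank-two update
    g = g_new
print(np.linalg.norm(grad(x)))                 # near zero after a few steps
```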
In [27], the optimization of the choice of the parameters is related to the characteristics of the subgradient sets of the function. It is assumed in [27] that one can choose a parameter MA corresponding to the real characteristic Mε of the union of subgradient sets of a certain ε-neighborhood of the current minimum point, which satisfies the relation:
In the case of smooth functions, since the subgradient coincides with the gradient and the subgradient set contains a single element, the gradient, it is easy to satisfy condition (42): for small ε, the characteristics of the subgradient set:
change insignificantly because the gradient satisfies the Lipschitz condition. Therefore, for small ε:
which makes it possible to consider the algorithm for sufficiently large values of ε-neighborhoods that satisfy condition (42).
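The smallness of this spread is easy to check numerically: for a gradient with Lipschitz constant L, the gradients over an ε-ball differ by at most 2Lε (illustrative quadratic below):

```python
import numpy as np

A = np.diag([1.0, 100.0])                      # grad f(x) = A x, L = 100
L, eps = 100.0, 1e-3
x0 = np.array([1.0, 1.0])

rng = np.random.default_rng(1)
pts = x0 + (eps / np.sqrt(2)) * rng.uniform(-1.0, 1.0, size=(100, 2))
grads = pts @ A                                # gradients in the eps-ball
diam = max(np.linalg.norm(p - r) for p in grads for r in grads)
assert diam <= 2.0 * L * eps                   # Lipschitz bound on the spread
print(diam)
```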
The smaller , the more efficiently the algorithm for solving systems of inequalities works [27]. But for small values of , according to (35), the values of will be very large. This leads to large changes in the matrix H (36), which negatively affects the efficiency of the minimization method because of the difficulties arising from the degeneration of the metric matrices. Therefore, in the minimization algorithm, the smallest value of has to be bounded and made consistent with the accuracy of the one-dimensional search. To do this, a constraint on the parameter is introduced into (34):
As a result, we obtain restrictions on the parameters of the matrix transformation in (36):
The relation , subject to the restrictions (43) on , is monotonically increasing on the segment . Hence the constraints:
From here and (44), (45), the following inequalities follow:
For the parameters , according to (44), (45), and (46), the following inequalities hold:
The presented algorithm for solving systems of inequalities, and the minimization algorithm based on it, also converge for fixed parameters [27]. As the computational experiment shows, the convergence rate of the method for solving systems of inequalities, and of the minimization method based on it [27], is significantly higher if the parameters are adjusted to the current situation according to (35).
4. On the Convergence Rate of the Subgradient Method on Strongly Convex Functions with Lipschitz Gradient
As earlier, x* is a minimum point of the function f(x), f* = f(x*), fk = f(xk), and for a differentiable function satisfying Condition 1. Denote , Sp(A) is the trace of matrix A, and det A is the determinant of matrix A. For an arbitrary matrix A > 0, we denote by A1/2 a symmetric matrix for which A1/2 > 0 and A1/2A1/2 = A. For the characteristics of the matrices , we use the result from [27], valid for arbitrary parameters satisfying condition (48).
Lemma 1 [27]. Let , the matrix be obtained as a result of the transformation , where the parameters satisfy condition (48), and let equality (38) be satisfied for arbitrary vectors . Then, and:
The following theorem shows that the movement resulting from iterations (5), (6) leads to a decrease in the function.
Theorem 4. Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2,… given by the process (5), (6), the following estimate holds:
where .
Proof of Theorem 4. For a strongly convex function, inequality (2) is satisfied. Taking this inequality into account, we obtain:
Inequality (3) is also valid for the one-dimensional function:
From here, taking into account the exact one-dimensional search, inequality (3), and Lipschitz condition (1), the estimate follows:
Transform (54) using the last relation and the inequality .
Recurrent use of the last inequality leads to estimate (53). □
Let us estimate the convergence rate of Algorithm 2 under more general restrictions on the parameters .
This implies the constraint:
The following theorem substantiates the linear convergence rate of Algorithm 2 under constraints (55).
Theorem 5. Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2,… given by Algorithm 2 with bounded initial matrix H0: (1) with an arbitrary parameter satisfying (55), the following estimate holds:
(2) with the parameters specified in Algorithm 2, the estimate is:
Proof of Theorem 5. Based on (50), we obtain (51). Transforming (51) taking into account , we obtain an estimate for the trace of the matrices Ak:
Due to exact one-dimensional descent (38), the following condition is satisfied:
which, together with the positive definiteness of the matrices, proves the inequality:
Hence, taking into account , where is the maximum eigenvalue of the matrix , we obtain:
Based on the last estimate, inequality (59) is transformed to the form:
Based on the relationship between the arithmetic and geometric means of the eigenvalues of a matrix A > 0, we have . From this and (60), (52), in the case of restrictions (55) on the parameters , we obtain:
and in the case of choosing the parameters as in Algorithm 2, taking into account (47), we obtain the estimate:
Based on the ratio , the last inequalities transform to the form:
Due to condition (55), . Taking logarithms of (61) and (62) and taking into account the last inequalities, we find:
This implies:
which, together with estimate (53) of Theorem 4, proves (57) and (58). □
Estimating the convergence rate of Algorithm 2 under the more general constraints (55) on the parameters makes it possible to use parameters different from those generated in Algorithm 2. The paper presents a computational experiment in which the parameters of Algorithm 2 were changed as follows:
Here, the parameter c was set as follows: c = {0.2; 0.1; 0.05}. The computational experiment revealed that in ill-conditioned problems such changes increase the efficiency of the minimization method, including in non-smooth optimization problems. For non-smooth problems, there is no theoretical justification of convergence under transformation (63).
The obtained estimates do not explain the high convergence rate of the method on, for example, quadratic functions. To justify the accelerating properties of the method, we need to show its invariance with respect to linear transformations of coordinates and then use estimate (58) in the coordinate system with the maximal ratio ρ/L. Such a possibility exists, for example, for quadratic functions, where this ratio equals 1.
Let us establish a relation between the characteristics of Algorithm 2 when it is used to minimize the functions and from (17).
Theorem 6. Let the initial conditions of Algorithm 2, used to minimize the functions and defined in (17), be related by the equalities:
Then, the characteristics of these processes are related by the relations:
Proof of Theorem 6. For the derivatives of the functions and , the relation holds. From this and assumption (64), (65) follows for k = 0. Assume that equalities (65) are satisfied for all k = 0, 1,…, i. Let us show that they hold for k = i + 1. From (38) with k = i, after multiplication by P on the left and taking into account the proven equalities (65), we obtain:
Hence, by the definition of the function fp, at the stage of one-dimensional minimization (38), the equality is satisfied. Therefore, the right side of (66) is the implementation of step (38) in the new coordinate system. Hence:
Multiplying (36) with the current indices on the left by P and on the right by PT, taking into account (67), we obtain:
where the right side is the implementation of formula (36) in the new coordinate system. The denominators of the last formula establish the relationship:
Using the last equalities and formulas (41), (33) of Algorithm 2, we obtain:
Finally, we obtain . Consequently, equalities (65) are also valid for k = i + 1. Continuing the induction, we obtain the proof of Theorem 6. □
For the function , denote the strong convexity constant by ρp and the Lipschitz constant by Lp. Introduce the function K(P) = ρp/Lp. Denote by V a coordinate transformation matrix such that K(V) ≥ K(P) for an arbitrary non-singular matrix P.
Theorem 7. Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2,… given by Algorithm 2 with the initial matrix H0 bounded according to (56):
(1) with an arbitrary parameter satisfying (55), the following estimate holds:
(2) with the parameters specified in Algorithm 2, the estimate is:
where m0 and M0 are the minimum and maximum eigenvalues of the matrix in the selected coordinate system (18) having property (19).
Proof of Theorem 7. According to the results of Theorem 6, we can choose an arbitrary coordinate system to estimate the convergence rate of the minimization process of Algorithm 2. Therefore, we use estimates (57) and (58) in the coordinate system with the matrix P = V and obtain estimates (68) and (69). □
The first term in square brackets characterizes the constant in estimating the convergence rate of the method, and the second term characterizes the costs of setting up the metric matrix.
For the steepest descent method (scheme (5), (6) with (11)) on functions satisfying Condition 1, the order of the convergence rate is determined by expression (12). Given that , the estimate for Newton's method (20) is , and for the quasi-Newton method [27] it is:
Consequently, estimates (68) and (69) for the subgradient method turn out to be preferable to (12). This situation arises, for example, when minimizing quadratic functions whose Hessians have a large spread of eigenvalues.
Thus, Algorithm 2 on strongly convex functions, without assuming the existence of second derivatives, has accelerating properties compared to the steepest descent method.
For sufficiently small values of the ratio, the average convergence rate of the subgradient method is given below:
The second term in square brackets of estimate (71) characterizes the stage of adjusting the metric matrix of Algorithm 2. Analysis of expression (71) shows that estimate (69) is qualitatively similar to estimate (20) for Newton's method; the difference between the second-derivative information used in (20) and the gradient information used in (69) manifests itself through the factor 1/n in (69).
To test its effectiveness, it makes sense to implement Algorithm 2 and conduct numerical testing in order to identify its applicability to minimizing smooth, highly ill-conditioned functions alongside effective quasi-Newton methods.
5. Aspects of the Subgradient Method Implementation
In the case of inexact one-dimensional descent in operation (25) of the minimization algorithm, it is assumed that the one-dimensional minimum has been localized; that is, a point has been obtained such that the subgradient uk+1 at the extreme point zk+1 satisfies inequality (30):
which is shown in Figure 2. The subgradient uk+1 is used to transform the matrix Hk. Figure 2 shows a point xk+1 with a smaller function value on the localization segment between the points xk and zk+1, which, at the next iteration, becomes the new current minimum point with the direction of minimization .
In Algorithm 1, at each iteration, a vector is chosen arbitrarily, and then a vector is chosen such that . In the minimization algorithm with one-dimensional minimization from a point x along the direction s = Hg, when localizing the minimum, we obtain a point for which an inequality similar to (30) is satisfied, and a point inside the localization segment with a smaller function value, which we take as the new minimum approximation from which the next one-dimensional descent is carried out. The gradients gx = g(x) and g1 = g(x1) at the points x and x1 are used together for the matrix transformation. Thus, in the practical version of the minimization algorithm, the vectors gx, g1, g(xm) are used, corresponding in meaning to the vectors from Algorithm 1.
We use a one-dimensional minimization procedure based on these principles, outlined in [27,43]. Its set of input parameters is , where x is the point of the current minimum approximation, s is the descent direction, is the initial search step, , and the necessary condition for the possibility of reducing the function along the direction must be satisfied. Its output parameters are . Here, is the step to the point of a new minimum approximation: is the step along s such that, at the point , for the subgradient the inequality holds. This subgradient is used in the learning algorithm. The output parameter h1 is the initial descent step for the next iteration; it is adjusted to reduce the number of calls to the procedure for calculating the function and subgradient. In the minimization algorithm, the vector is used to solve a system of inequalities, and the point serves as the point of a new minimum approximation. We denote the call to the procedure as OM(;). Here is a brief description of it.
Let us introduce the one-dimensional function . To localize its minimum, we take an increasing sequence . Here, qM > 1 is a step-increasing parameter; in most cases, qM = 3 is specified. Denote by l the first index i at which the relation is satisfied. We determine the parameters of the localization segment of the one-dimensional minimum:
and find a minimum point through cubic approximation of the function [46] on the localization segment, using the values of the one-dimensional function and its derivative. Calculate:
We calculate the initial descent step for the next iteration using the rule:
Here, qm < 1 is a descent-step decreasing parameter, which, in most cases, is set to qm = 0.8. In the vast majority of applications, the parameter set {qM = 3, qm = 0.8} is satisfactory. When solving complex problems with highly elongated level surfaces, the parameter should be increased: qm → 1. The subgradient method implementation is presented in Algorithm 3.
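The localization and cubic-refinement scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper `cubic_min` uses the standard Hermite cubic-interpolation minimizer (the paper's exact formulas are given in [46]), and the names `om`, `phi`, `hm`, `h1` mirror the notation of the text.

```python
import numpy as np

def cubic_min(a, b, pa, pb, dpa, dpb):
    # Minimizer of the Hermite cubic fitted to the values (pa, pb) and
    # derivatives (dpa, dpb) at the bracket ends a, b; clipped to [a, b].
    d1 = dpa + dpb - 3.0 * (pa - pb) / (a - b)
    rad = d1 * d1 - dpa * dpb
    if rad < 0.0:                          # no real minimizer: fall back to midpoint
        return 0.5 * (a + b)
    d2 = np.sign(b - a) * np.sqrt(rad)
    t = b - (b - a) * (dpb + d2 - d1) / (dpb - dpa + 2.0 * d2)
    return min(max(t, min(a, b)), max(a, b))

def om(f, grad, x, s, h0, qM=3.0, qm=0.8, max_expand=50):
    """One-dimensional descent from x along s: localize the minimum with the
    increasing steps h0 * qM**i, then refine once by cubic interpolation."""
    phi = lambda t: f(x + t * s)
    dphi = lambda t: grad(x + t * s) @ s
    assert dphi(0.0) < 0.0, "s must be a descent direction"
    a, t = 0.0, h0
    for _ in range(max_expand):            # increasing sequence localizes the minimum
        if dphi(t) >= 0.0 or phi(t) >= phi(a):
            break
        a, t = t, qM * t
    hm = cubic_min(a, t, phi(a), phi(t), dphi(a), dphi(t))
    if phi(hm) > phi(a):                   # keep the best localized point
        hm = a
    h1 = qm * hm if hm > 0.0 else qm * h0  # initial step for the next iteration
    return x + hm * s, hm, h1
```

On a one-dimensional quadratic the cubic fit is exact, so a single refinement lands on the minimizer; on general functions the bracket guarantees the refined point stays inside the localization segment.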
Algorithm 3. Subgradient method implementation
1. Assume k = 0, the initial matrix H0 = I, q ≥ 1, and the number of iterations kmax for stopping the algorithm. Set ΘA satisfying inequality (37) and the parameter . Compute . Set the initial step of the one-dimensional search h0 and a small . If , then x0 is a minimum point; stop the algorithm.
2. If then correct the matrix:
Set
Find a new minimum approximation:
According to the description of the OM procedure, here the subgradient vector satisfies the condition . If , then is the minimum point; stop the algorithm. If k > kmax, then stop the algorithm.
3. Compute the vectors by (31). Here, the vector is found from the orthogonality condition (32) of the vectors and . Then, compute by (33), where is calculated by formula (41). Find according to (34) and the parameters by (35). We obtain a new approximation of the metric matrix . If dmax ≤ ε, then carry out the scaling.
4. Assign k = k + 1. Go to Step 2.
Here, the built-in method for solving the system of inequalities (26) consists of the transformations carried out at Step 3 under condition (39). The current approximation of the solution to system (26) at iteration k is the vector sk (74), which is used as the new descent direction.
The algorithm uses soft matrix updating with small changes in the diagonal elements in the case of large angles (72) between the vectors sk and gk. Since the matrix transformations reduce its elements, a scaling transformation (75) is carried out to compensate for this effect; it does not affect the computational process. Taking into account the scaling of the descent direction (74), the one-dimensional search step, which is adjusted in the one-dimensional minimization procedure, is scaled simultaneously with the matrix.
Along with formula (33), we used a simplified version of calculating the value of , which enables us to analyze the qualitative nature of formula (33). Using the symmetric matrix , we form the vectors and assume the equality . Hence, due to the equality a = b − c, the vectors a, b, c form an isosceles triangle (Figure 3). Since the lengths of the projections of the vectors b and c onto the vector a are the same, the equality holds. Therefore:
and the factor from (33) can be transformed as follows:
From here and (33), we obtain:
At the last steps of the transformation in (76), we used the expression introduced earlier in (27). As shown in [27], Algorithm 2 also remains operable when formula (27) is used to calculate the transformation coefficients of the matrices (49) instead of . The approximate formula (76) reflects the qualitative nature of the relation . According to Figure 3, larger angles between the vectors b and c correspond to smaller values of the ratio , which, according to (76), reduces the value of and, accordingly, leads to an increase in the parameter and an insignificant decrease in the parameter at Step 3 of Algorithm 3. We used the simplified expression for from (76) in Algorithm 3.
Below, we present examples of solving test problems using the quasi-Newton BFGS method and Algorithms 2 and 3.
6. Results of Numerical Study on Smooth Functions
Algorithms 2 and 3 were implemented with parameters and , providing the product . These values were used in Algorithms 2 and 3 with the dynamic parameter selection method. The methods used the one-dimensional search described above. For comparison, the quasi-Newton BFGS method was implemented with a one-dimensional search procedure using cubic interpolation [46]. In all methods, the function and gradient were calculated simultaneously.
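For reference, the inverse-Hessian update of the BFGS baseline is the standard rank-two formula. The sketch below is our own illustration, not the authors' implementation; it shows the update driving an exact line search on a quadratic f(x) = ½ xᵀAx, where BFGS with H0 = I terminates in at most n iterations.

```python
import numpy as np

def bfgs_update(H, s, y):
    """Standard BFGS update of the inverse-Hessian approximation H from the
    step s = x_{k+1} - x_k and the gradient difference y = g_{k+1} - g_k."""
    sy = s @ y
    if sy <= 1e-12 * np.linalg.norm(s) * np.linalg.norm(y):
        return H                      # skip the update if curvature is not positive
    rho = 1.0 / sy
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Exact line search on f(x) = 0.5 x^T A x: the minimizer is reached in <= n steps.
A = np.diag([1.0, 10.0])              # ill-conditioned toy Hessian
x, H = np.array([1.0, 1.0]), np.eye(2)
for _ in range(5):
    g = A @ x
    if np.linalg.norm(g) < 1e-12:
        break
    d = -H @ g
    t = -(g @ d) / (d @ A @ d)        # exact minimizing step along d
    s = t * d
    x_new = x + s
    H = bfgs_update(H, s, A @ x_new - g)
    x = x_new
```

With exact line searches on a quadratic, the directions generated this way coincide with conjugate gradient directions, which is why the quadratic case in Table 1 favors BFGS.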
Table 1, Table 2, Table 3, Table 4 and Table 5 show the number of function and gradient evaluations required to achieve the designated accuracy by function . The initial minimization point x0 and the value ε are given in the description of each function.
The purpose of testing is to study experimentally the ability of the subgradient method and the quasi-Newton method to eliminate the background that slows down the convergence rate, i.e., the background removable by some linear transformation that normalizes the elongation of the function level surfaces in different directions, as predicted theoretically by estimate (69) of Theorem 7.
Since the use of subgradient methods with a variable space metric and of quasi-Newton methods is justified primarily on highly ill-conditioned functions, where conjugate gradient methods do not work, the test functions were selected from this standpoint. Since the quasi-Newton method is based on a quadratic model of the function, its local convergence rate in a neighborhood of the current minimum is largely determined by how effective it is in minimizing ill-conditioned quadratic functions. Therefore, the research was primarily carried out on quadratic functions and functions derived from them.
If the function is twice differentiable, then the eigenvalues of the Hessian are bounded by the interval [ρ, L] defined by the strong convexity parameter and the Lipschitz parameter. We did not use second derivatives in our proofs. Nevertheless, when designing the tests, we used the representation of a quadratic function and analyzed its conditionality in terms of its eigenvalues. The test functions simulate the oscillatory behavior of the second derivatives in two ways. The first is a drift of the corresponding eigenvalue from one value to another. In the second, we imposed random noise on the length of the gradient vector, which affects the gradient-difference computations in subgradient Algorithms 2 and 3 (40) and in the quasi-Newton method:
Imposing the described ways of simulating Hessian oscillations on a basic quadratic function with given eigenvalue characteristics yields a controlled degree of degeneracy of the problem: we can specify, on the one hand, the scaling that the methods under study should remove and, on the other, the degree of oscillation of the scales, simulating changes of the second-derivative matrices within specified limits.
The following is accepted as the basic quadratic function:
The eigenvalues ai of this function have the limits . In this case, the methods under study have to remove the basic (trend) scaling specified by the coefficients of this function. To simulate random fluctuations of the second derivatives, a function f2 was created. To calculate the function values, the basic function was used, and its gradients were distorted randomly according to the following scheme:
where ξ ∈ [−1, 1] is a random number uniformly distributed on the segment [−1, 1] and r = 0.3. Such a function will be denoted as .
Here, the parameters are the base-function parameters and the gradient distortion parameter. Note that distorting the gradients significantly reduces the accuracy of the one-dimensional search, where gradients are used to estimate directional derivatives in the cubic approximation.
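The distortion scheme can be sketched as a wrapper around an exact gradient oracle. Since the paper's exact formula is not reproduced here, the multiplicative form (1 + rξ)·g(x) below is our assumed reading of "distorting the gradient length" with ξ uniform on [−1, 1] and r = 0.3:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_grad(grad, r=0.3):
    """Wrap an exact gradient oracle: scale the gradient length by (1 + r*xi),
    where xi is uniform on [-1, 1]. This multiplicative form is an assumption;
    the paper's own distortion formula is not reproduced in this text."""
    def g(x):
        xi = rng.uniform(-1.0, 1.0)
        return (1.0 + r * xi) * grad(x)
    return g
```

Under this scheme the direction of the distorted gradient coincides with the exact one; only its length fluctuates within ±30%, which is exactly what degrades the directional-derivative estimates in the cubic line search.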
In the third function, additional variables ci were used to change the scales of ai for each of the variables. Near the extremum, this function has the form:
Far from the extremum, we obtain a function in which the coefficients bi are used in reverse order:
The scales of the coefficients ci change within the following range: .
The point x0 = (100, 100, …, 100) was chosen as the initial point for all the above functions. Additionally, the following nonlinear functions were used for testing and analysis.
Function f4 has ellipsoidal level surfaces corresponding to a quadratic function. Function f5 has a multidimensional ellipsoidal ravine; minimization proceeds along this curvilinear ravine to the minimum point.
The stopping criterion was:
Minimization results are presented in Table 1, Table 2, Table 3, Table 4, Table 5 and Table 6.
Table 1, Table 2, Table 3, Table 4 and Table 5 show the results of minimizing the five presented functions for various dimensions. These tables allow us to analyze the effect of removing the basic background using the subgradient and quasi-Newton methods. The cells contain:
N_it—number of iterations (one-dimensional searches along the direction);
nfg—number of calls to the procedure for simultaneous calculation of a function and gradient.
Table 1 shows the results of minimizing the quadratic function f1, intended for the basic scaling of the variables. This function is a background that must be removed by the method's metric matrix. The nfg costs of the subgradient methods here are approximately twice those of the BFGS method.
Table 1. Function f1 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 370 | 784 | 331 | 696 | 125 | 276 |
200 | 527 | 1070 | 538 | 1096 | 243 | 523 |
300 | 738 | 1424 | 746 | 1430 | 348 | 746 |
400 | 934 | 1740 | 944 | 1779 | 447 | 948 |
500 | 1122 | 2084 | 1135 | 2129 | 542 | 1146 |
600 | 1298 | 2359 | 1301 | 2474 | 634 | 1334 |
700 | 1434 | 2645 | 1454 | 2695 | 724 | 1525 |
800 | 1564 | 2842 | 1598 | 2965 | 811 | 1710 |
900 | 1698 | 3056 | 1727 | 3166 | 897 | 1884 |
1000 | 1821 | 3280 | 1839 | 3429 | 982 | 2061 |
Table 2 shows the results for the function f2. Algorithms 2 and 3 show approximately the same results, while the results for the BFGS method are approximately two times worse. Gradient noise has a detrimental effect on the accuracy of the one-dimensional search with cubic interpolation, which uses function gradients, and the reduced accuracy of the one-dimensional descent negatively affects the BFGS method.
Table 2. Function f2 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 357 | 771 | 360 | 783 | 561 | 1321 |
200 | 568 | 1198 | 565 | 1185 | 1083 | 2564 |
300 | 769 | 1607 | 761 | 1607 | 1486 | 3514 |
400 | 952 | 1995 | 975 | 2041 | 1827 | 4335 |
500 | 1132 | 2323 | 1152 | 2405 | 2222 | 5279 |
600 | 1306 | 2673 | 1345 | 2753 | 2587 | 6152 |
700 | 1470 | 3048 | 1489 | 3068 | 2802 | 6652 |
800 | 1599 | 3311 | 1666 | 3419 | 3167 | 7566 |
900 | 1733 | 3581 | 1783 | 3689 | 3543 | 8442 |
1000 | 1876 | 3866 | 1930 | 3992 | 3584 | 8577 |
Table 3 shows the results of minimizing function f3. Algorithms 2 and 3 show approximately the same results, while the results for the BFGS method are approximately five times worse. In this problem, the variables are rescaled as the extremum is approached. Possibly, this is due to differences in the degree to which the ratio enters the estimates: for the BFGS method, according to (70), it is , and for the subgradient methods, according to (71), it is .
Table 3. Function f3 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 407 | 900 | 415 | 911 | 2654 | 6901 |
200 | 681 | 1461 | 696 | 1482 | 4780 | 11,885 |
300 | 951 | 1950 | 965 | 1980 | 6373 | 15,385 |
400 | 1202 | 2415 | 1221 | 2480 | 7571 | 17,917 |
500 | 1441 | 2837 | 1458 | 2912 | 8297 | 19,434 |
600 | 1653 | 3257 | 1674 | 3294 | 8968 | 20,900 |
700 | 1864 | 3672 | 1898 | 3710 | 9572 | 22,214 |
800 | 2061 | 4016 | 2108 | 4090 | 9914 | 22,967 |
900 | 2258 | 4343 | 2288 | 4411 | 10,391 | 24,001 |
1000 | 2457 | 4686 | 2481 | 4761 | 10,645 | 24,500 |
Table 4 shows the results of minimizing function f4. Algorithms 2 and 3 show approximately the same results. The non-quadraticity of the function, while the topology of its level surfaces remains equivalent to that of a quadratic function, significantly affects the convergence rate of the BFGS method and, to a much lesser extent, that of Algorithms 2 and 3.
Table 4. Function f4 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 154 | 267 | 156 | 295 | 953 | 2226 |
200 | 266 | 443 | 261 | 438 | 2012 | 4682 |
300 | 377 | 619 | 362 | 604 | 3136 | 7282 |
400 | 454 | 737 | 456 | 741 | 4314 | 10,027 |
500 | 556 | 889 | 573 | 929 | 5523 | 12,815 |
600 | 669 | 1078 | 672 | 1095 | 6747 | 15,658 |
700 | 762 | 1215 | 778 | 1259 | 7990 | 18,537 |
800 | 877 | 1400 | 870 | 1413 | 9243 | 21,430 |
900 | 968 | 1545 | 971 | 1558 | 10,541 | 24,455 |
1000 | 1094 | 1752 | 1089 | 1765 | 11,746 | 27,226 |
Table 5 shows the minimization results for function f5. Algorithm 2 is slightly better than Algorithm 3. This function also turned out to be difficult for the BFGS method: like f4, it contains fourth-degree polynomials, which, unlike for the subgradient methods, significantly affect the BFGS convergence rate.
Table 5. Function f5 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 498 | 1116 | 432 | 989 | 1170 | 2847 |
200 | 558 | 1286 | 450 | 1051 | 1417 | 3396 |
300 | 609 | 1423 | 496 | 1196 | 1700 | 4118 |
400 | 705 | 1687 | 442 | 1100 | 1862 | 4465 |
500 | 686 | 1653 | 388 | 980 | 1964 | 4722 |
600 | 613 | 1499 | 429 | 1091 | 2081 | 4955 |
700 | 581 | 1434 | 433 | 1106 | 2228 | 5315 |
800 | 451 | 1176 | 394 | 1048 | 2180 | 5200 |
900 | 533 | 1361 | 430 | 1135 | 2412 | 5727 |
1000 | 554 | 1430 | 435 | 1188 | 2490 | 5957 |
In Table 6, the results for functions f1–f5 at n = 1000 are presented. The results show the effectiveness of the methods on all functions under study simultaneously. Conclusions regarding the effectiveness of the methods were made earlier.
Table 6. Functions minimization results at n = 1000.
Function | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
f1 | 1821 | 3280 | 1839 | 3429 | 982 | 2061 |
f2 | 1876 | 3822 | 1930 | 3992 | 3584 | 8577 |
f3 | 2457 | 4686 | 2481 | 4761 | 10,645 | 24,500 |
f4 | 1094 | 1752 | 1089 | 1765 | 11,746 | 27,226 |
f5 | 554 | 1430 | 435 | 1188 | 2490 | 5957 |
Table 7 shows the results of Algorithm 3 on the first three functions for n = 1000 with parameters changed according to (71) and different values of c = {0.2; 0.1; 0.05}. On ill-conditioned problems, such changes increase the efficiency of the minimization method. These examples show the possibility of tuning the method parameters for a certain fixed set of optimization problems.
Regarding the convergence rate of minimization methods, the following conclusions can be drawn:
For functions close in properties to quadratic (f1), the quasi-Newton BFGS method significantly exceeds subgradient Algorithms 2 and 3 in terms of convergence rate.
In the case of significant interference imposed on the gradients of the function (f2), subgradient Algorithms 2 and 3 are more effective than the BFGS method.
Variability of scales across variables (f3) affects the convergence rate of subgradient methods to a lesser extent than that of the BFGS method.
The presence of polynomial degrees higher than 2 in the minimized function affects the convergence rate of subgradient methods to a lesser extent than the BFGS method.
A computational experiment showed the possibility of adjusting the parameters of the method in accordance with theoretical principles. Therefore, the efficiency of the method can be increased on a certain fixed set of optimization problems.
The computational experiment confirmed the theoretically predicted ability of subgradient methods to exclude the background that slows down the convergence rate.
Based on the theoretical principles and experimental results, we can conclude that the presented subgradient methods complement quasi-Newton methods when solving smooth optimization problems.
7. Conclusions
The conditionality of the minimization problem determines the spread of the elongation of level surfaces in different directions, and thus the complexity of solving the problem. In minimization practice, it often turns out to be possible to reduce the elongation of the level surfaces by some linear transformation of coordinates. The paper studies the ability of Newton's method and of the subgradient method with parameter optimization via a change of the space metric to eliminate the ill-conditioning of the problem by means of a linear transformation.
The paper proves that under conditions of instability of the second derivatives of the function in the minimization domain, the estimate of the convergence rate of Newton’s method is determined by the strong convexity parameter and Lipschitz parameter in the coordinate system where their ratio is maximum. This means the method’s ability to exclude the linear background, which increases the conditionality degree of the problem. The estimate of convergence rate serves as a standard for assessing the capabilities of the subgradient method being studied.
The paper studies RSM with parameter optimization of the rank-two correction of metric matrices on smooth, strongly convex functions with a Lipschitz gradient, without assuming the existence of second derivatives. Under broad assumptions on the transformation parameters of the metric matrices, an estimate of the convergence rate of the studied RSM and an estimate of its ability to exclude a removable linear background are obtained. These estimates turn out to be qualitatively similar to the estimates for Newton's method.
A practical version of RSM and test functions have been developed that simulate the presence of a removable linear background. A computational experiment was carried out in which the quasi-Newton BFGS method and the subgradient method under study were compared on various types of smooth functions. The testing results indicate the effectiveness of the subgradient method in minimizing smooth functions with a high degree of conditionality of the problem and its ability to eliminate the linear background that worsens the convergence.
Depending on the type of function, one or another method dominates, which allows us to conclude that the subgradient method is applicable along with quasi-Newton methods when solving problems of minimizing smooth functions with a high degree of conditionality.