Next Article in Journal
Enhanced Unmanned Aerial Vehicle Localization in Dynamic Environments Using Monocular Simultaneous Localization and Mapping and Object Tracking
Previous Article in Journal
Hermite Finite Element Method for One-Dimensional Fourth-Order Boundary Value Problems
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Newtonian Property of Subgradient Method with Optimization of Metric Matrix Parameter Correction

1
Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, 31, Krasnoyarskii Rabochii Prospekt, Krasnoyarsk 660037, Russia
2
Department of Applied Mathematics, Kemerovo State University, 6 Krasnaya Street, Kemerovo 650043, Russia
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(11), 1618; https://doi.org/10.3390/math12111618
Submission received: 30 March 2024 / Revised: 14 May 2024 / Accepted: 16 May 2024 / Published: 22 May 2024
(This article belongs to the Section Mathematics and Computer Science)

Abstract

:
The work proves that under conditions of instability of the second derivatives of the function in the minimization region, the estimate of the convergence rate of Newton’s method is determined by the parameters of the irreducible part of the conditionality degree of the problem. These parameters represent the degree of difference between eigenvalues of the matrices of the second derivatives in the coordinate system, where this difference is minimal, and the resulting estimate of the convergence rate subsequently acts as a standard. The paper studies the convergence rate of the relaxation subgradient method (RSM) with optimization of the parameters of two-rank correction of metric matrices on smooth strongly convex functions with a Lipschitz gradient without assumptions about the existence of second derivatives of the function. The considered RSM is similar in structure to quasi-Newton minimization methods. Unlike the latter, its metric matrix is not an approximation of the inverse matrix of second derivatives but is adjusted in such a way that it enables one to find the descent direction that takes the method beyond a certain neighborhood of the current minimum as a result of one-dimensional minimization along it. This means that the metric matrix enables one to turn the current gradient into a direction that is gradient-consistent with the set of gradients of some neighborhood of the current minimum. Under broad assumptions on the parameters of transformations of metric matrices, an estimate of the convergence rate of the studied RSM and an estimate of its ability to exclude removable linear background are obtained. The obtained estimates turn out to be qualitatively similar to estimates for Newton’s method. In this case, the assumption of the existence of second derivatives of the function is not required. A computational experiment was carried out in which the quasi-Newton BFGS method and the subgradient method under study were compared on various types of smooth functions. The testing results indicate the effectiveness of the subgradient method in minimizing smooth functions with a high degree of conditionality of the problem and its ability to eliminate the linear background that worsens the convergence.

1. Introduction

We consider a problem of minimization of convex differentiable function f(x), xRn, where Rn is finite-dimensional Euclidean space. Under conditions of a high degree of a function degeneracy, it is necessary to use Newton-type minimization methods, for example, modifications of Newton’s method, or quasi-Newton methods.
While in the minimum neighborhood, there is a stable quadratic representation of the function, most iterations of the minimization method take place outside the extremum area, so it seems relevant to study the accelerating properties of methods with changing the space metric under conditions of instability of the quadratic properties of the function.
Numerous studies on the convergence rate of Newton methods and quasi-Newton methods in the extremum region have been conducted, and some of them are given in [1,2,3,4,5,6,7,8]. The results obtained in [9,10] refer to the convergence rate of quasi-Newton minimization methods under the assumption that the method operates in the extremum area of the function. The authors of [11] aimed at accelerating the symmetric rank-1 quasi-Newton method with Nesterov’s gradient. A convergence rate of incremental quasi-Newton method was investigated in [12,13]. Large-scale optimization through the sampled versions of quasi-Newton method was considered in [14,15]. Also, the convergence rate of randomized and greedy variants of Newtonian methods and quasi-Newton methods were presented in [16,17,18,19,20,21,22,23,24].
As an object of minimization, we use strongly convex functions with a Lipschitz gradient [25]. In the case of the existence of second derivatives, these constants limit the spread of the Hessian eigenvalues in the minimization region [25]. The ratio ρ/L ≤ 1 of the strong convexity constant ρ and Lipschitz constant L determines the convergence rate of gradient minimization methods with an indicator q ≈ 1 − ρ/L of approach to the extremum by function [26].
As the presence of a removable linear background, we will understand the existence of a linear coordinate transformation VRnxn that allows us to significantly increase the ratio of constants in the new coordinate system ρV/LV >> ρ/L. The advantages of the gradient method in the new coordinate system with the indicator q ≈ 1 − ρV/LV are obvious. However, this estimate is not feasible, since the transformation V is not known.
This research is a continuation of previous studies [27,28] and aimed at studying the capabilities of the Newton’s method and the relaxation subgradient method with optimization of the parameters of rank-two correction of metric matrices [27] to eliminate the linear background that worsens the convergence in the conditions of the existence of transformation V with the properties noted above. Similar studies for quasi-Newton methods were carried out in [29].
Newton’s method is invariant with respect to linear coordinate transformation and allows one to obtain an estimate of the convergence rate for Newton’s method with indicator q ≈ 1 − ρ2V/L2V. This makes it possible to draw a conclusion about the ability of Newton’s method to exclude from the function being minimized a linear background that worsens the convergence, which is eliminated using a linear transformation of coordinates. In what follows, this estimate serves as a standard, and the ability of a certain method, like Newton’s method, to exclude linear background will be called its Newtonian property. The main goal of the work is to substantiate the presence of the Newtonian property in RSM with a change in the space metric [27]. As shown in [29], the noted Newtonian property is inherent in quasi-Newton methods.
There are a number of directions for constructing non-smooth optimization methods, some of which are given in [25,30,31]. The works [32,33,34] considered an approach to creating smooth approximations for non-smooth functions. Methods of this class are applicable to a wide range of problems. A number of effective approaches in the field of non-smooth optimization arose as a result of the creation of the first subgradient methods with space dilation [35,36], in the class of minimization methods relaxing both in function and in distance to the extremum [25,37,38].
The first RSMs were proposed in [39,40,41]. In [36], an effective RSM with space dilation in the direction of the subgradient difference (RSMSD) was developed. Subsequent work on the creation of effective RSMs is associated with identifying the origin of RSMSD and its theoretical justification [42,43]. Formalization of the model of subgradient sets and the use of ideas and machine learning algorithms [44] made it possible to identify the principles of organizing RSM with space dilation [43] and obtain a theoretical basis for their creation. It turned out that the problem of finding the descent direction in RSM can be reduced to the problem of solving a system of inequalities on subgradient sets and mathematically formulated as a solution to the problem of minimizing a quality functional. In this case, the convergence rate of the minimization method is determined by the properties of the learning algorithm.
The principle of RSM organizing does not rely on second derivative of functions. The method under study is similar in structure to the quasi-Newton methods, and its formulas for transforming metric matrices are similar in structure to the formulas of the quasi-Newton DFP method. The purpose of converting metric matrices in RSM is to find a metric matrix that transforms subgradients into a direction that forms an acute angle with all subgradients in the neighborhood of the current minimum approximation. Using this direction enables us to go beyond this neighborhood.
Studied RSM with optimization of the parameters of rank-two metric matrix correction [27] is the result of RSM improvement from [43]. The problem of finding the descent direction in the RSM from [43] is reduced to the problem of solving a system of inequalities to develop a descent direction that forms an acute angle with the set of subgradients of a certain neighborhood of the current minimum. In this case, the descent direction is found similarly to how it is done in quasi-Newton methods, by multiplying the matrix by the subgradient. In [27], compared with the algorithm from [43], a faster algorithm for solving systems of inequalities was proposed, which was confirmed by a computational experiment in [27] for RSM on this basis.
In this work, a qualitative analysis of formulas for choosing algorithm parameters from [27] is carried out, and on this basis, a new method is proposed for finding matrix transformation parameters. In contrast to RSM from [27,42], where the algorithm convergence is justified under strict restrictions on the transformation parameters of metric matrices, in this work, estimates of the algorithm convergence rate on smooth functions are obtained for a wide range of matrix transformation parameters. Therefore, one can customize the method to solve problems of a certain class by selecting parameters for converting metric matrices.
For the studied RSM, it is shown that the method is invariant under linear coordinate transformation. An estimate of its convergence rate on strongly convex functions with a Lipschitz gradient is obtained. The property of Newton’s method is to eliminate the high degree of conditionality of the minimization problem caused by the linear background, which is also inherent in the subgradient method under study. At the same time, estimates of the convergence rate of Newton’s method and the method under study are qualitatively similar in reflecting the influence of the characteristics of the ill-conditioned problem.
To solve both smooth and non-smooth problems, universal algorithms have been developed and implemented, which are the practical implementation of an idealized version of the method. To detect the Newtonian property in the proposed methods, special test functions have been developed. The first of them simulates the random nature of changes in the properties of a function. In another function, a targeted change is made in the elongation of the function level lines along the coordinate axes as it approaches the extremum. In one of the functions, the axes of level lines elongation change due to movement along an ellipsoidal ravine.
In the computational experiment, a comparison is made of the quasi-Newtonian BFGS method and the investigated universal subgradient methods on the proposed test functions. The testing results indicate the effectiveness of the developed methods in minimizing smooth functions with a high degree of conditionality and their ability to exclude linear background that worsens convergence. Depending on the type of function, different methods dominate, which allows us to conclude that the subgradient method is applicable along with quasi-Newton methods in solving problems of minimizing smooth functions with a high degree of conditionality.
The rest of the paper is organized as follows. In Section 2, the accelerating properties of Newton’s method under conditions of instability of second derivatives of the function are considered. In Section 3, a subgradient method is presented that solves the problem of forming the direction of descent. The convergence rate of the subgradient method on strongly convex functions with Lipschitz gradient is discussed in Section 4. Features of the implementation of the subgradient method are presented in Section 5. The results of a numerical study on smooth functions are shown in Section 6. Section 7 concludes the work.

2. Accelerating Properties of Newton’s Method under Conditions of Instability of Second Derivatives

Denote  f k = f x k . For non-smooth functions, we will denote a vector from the subgradient set  g k = g x k f x k . Due to the coincidence of the gradient and subgradient on smooth functions, we will also use this notation for smooth functions  g k = g x k = f x k .
Condition 1.
We will assume that the function being minimized f(x), x ∈   R n  is differentiable and strongly convex in  R n , i.e., there exists ρ > 0 such that the inequality:
f α x + 1 α y α f ( x ) + ( 1 α ) f ( y ) α ( 1 α ) ρ x y 2 / 2 ,
holds for all x,y ∈  R n  and α ∈ [0, 1], and the gradient  g x = f x  satisfies the Lipschitz condition
g ( x ) g ( y ) L x y         x , y R n ,         L > 0 .
Functions which fulfill Condition 1 satisfy the relations [25]:
f x f g x 2 2 ρ ,     x R n ,
f x f ρ x x 2 2 ,     x R n ,
g x 2   2 L ( f x f ) ,     x R n ,
where x* is the minimum point and f* = f(x*) is the function value at the minimum point.
The iteration of the gradient-consistent method with exact one-dimensional descent has the form:
x k + 1 = x k β k s k ,       s k , g ( x k ) > 0 ,
β k = a r g m i n β 0 f x k β s k ,
where the initial point is x0 and sk is a search direction.
Theorem 1.
Let the function satisfy Condition 1. Then, the sequence of iterations j = 0, 1,…, k of the process (5), (6) is estimated as:
f k + 1 f Q k ( f 0 f ) ,   Q k = j = 0 k q j .
q j = 1 ρ g j , s j 2 L g j 2 s j 2
Proof of Theorem 1.
We present the exact value of the function reduction indicator  q k  at iteration in the form:
q k = f k + 1 f f k f = f k + 1 f k + f k f f k f = 1 f k f k + 1 f k f .
Let us make estimates for numerator and denominator in (9). According to (2), for the denominator, we obtain:
f k f g k 2 2 ρ .
According to (6), fk+1 is the minimum of a one-dimensional function whose gradient is  g ( x k ) , s k / s k . Due to the fact that this one-dimensional function also satisfies Condition 1, to estimate the numerator in (9), taking into account inequality (4), we obtain:
f k f k + 1 g ( x k ) , s k 2 2 L s k 2 ,
Using (9) and (10), we obtain (8):
q k = 1 f k f k + 1 f k f q k = 1 ρ g k , s k 2 L s k 2 g k 2 .
Based on Theorem 1, the convergence rate indicator for the gradient method (5), (6) with a choice of descent direction
s k = f ( x k )
according to (8) and (11), will have the form:
q k = 1 ρ L .
Let us consider an estimate of the convergence rate of Newton’s method under Condition 1 and the assumption of the existence of second derivatives of the function.
Theorem 2.
Let the function be twice differentiable and satisfy Condition 1. Then, for a sequence of iterations j = 0, 1,…, k of process (5), (6) with the choice of Newton’s method direction
s k = 2 f x k 1 f x k ,
the estimation takes place:
f k + 1 f Q k ( f 0 f ) ,       Q k = j = 0 k q j ,         q j = 1 ρ 2 L 2 ,   j = 0 , 1 , , k .
Proof of Theorem 2.
Hessian  2 f x k  under Condition 1 satisfies the constraints [25]:
ρ z , z 2 f x z , z L z , z ,       z R n .
Denote  H k = 2 f x k 1 , and  H k 0.5  is a symmetric matrix such that  H k = H k 0.5 H k 0.5 z = H k 0.5 g k .
To use Theorem 1, we estimate  g k , s k 2 / g k 2 s k 2  for direction (13) subject to constraints (15):
g k , s k 2 g k 2 s k 2 = g k , H k g k 2 g k 2 H k g k , H k g k = z , z 2 z , H k 1 z H k z , z ρ L .
Using the last estimate in (8), we obtain estimate (14). □
Let function f(x) satisfy Condition 1. Define the transformation of variables:
x ^ = P x ,
where PRn·n is a non-singular matrix. In the new coordinate system, the function to be minimized takes the form:
f x = f P 1 x ^ = f p x ^ .
The resulting function also satisfies Condition 1 with the strong convexity constant ρp and Lipschitz constants Lp.
Let VRn·n be a non-singular matrix such that for the strong convexity and Lipschitz constants of functions  f V x ^  with
x ^ = V x
and  f p x ^ ,   f o r   a n   a r b i t r a r y   n o n s i n g u l a r   m a t r i x  PRn·n, the inequality takes place:
ρ V L V ρ p L p
Transformation (18) subsequently plays the role of a selected coordinate system, the best in terms of the convergence rate of gradient methods. Due to the fact that the gradient method, unlike Newton’s method, is not invariant under a linear coordinate transformation, we cannot use the strong convexity and Lipschitz constants in the preferred coordinate system (18).
Theorem 3.
Let the function be twice differentiable and satisfy Condition 1. Then, for the sequence of iterations j = 0, 1,…, k process (5), (6) with the choice of the Newton’s method direction (13), the following estimate holds:
f k + 1 f Q k ( f 0 f ) ,       Q k = j = 0 k q j ,     q j = 1 ρ V 2 L V 2 ,   j = 0 , 1 , , k ,
corresponding to the selected coordinate system (18), which has property (19).
Proof of Theorem 3.
The iteration of Newton’s method (5), (6), (13) with exact one-dimensional descent (6), has the form:
x k + 1 = x k β k 2 f x k 1 f x k .
Characteristics of functions  f x  and  f p x ^ ,  taking into account (16) and (17), are related by:
f p x ^ = f x ,         f ^ p x ^ = P T f x ,         2 f ^ p x ^ = P T 2 f x P 1 .
After transferring process (21) to a new coordinate system, we obtain its coincidence with the method in the new coordinate system:
P x k + 1 = P x k β k P 2 f x k 1 P T P T f x k = x ^ k β k P T 2 f x k P 1 1 P T f x k = x ^ k β k 2 f ^ p x ^ k 1 f ^ p x ^ k .
In the case of the relation of the initial points  x ^ 0 = P x 0  for Newton’s method in different coordinate systems, according to (23), at  β k = β ^ k  sequences of points related by the  x ^ k = P x k  and equal values of the functions  f p ( x ^ k ) = f ( x k )  are generated. Moreover, taking into account the fact that the method with exact one-dimensional minimization (6) is considered, due to the extremum condition:
s k , f ( x k ) = P 1 P s k , f ( x k + 1 ) = P s k , P T f ( x k + 1 ) = s ^ k , f ^ p ( x ^ k + 1 ) = 0
equality  β k = β ^ k  will hold. Due to the invariance of Newton’s method with respect to the linear transformation of coordinates (16), when the initial conditions  x ^ 0 = P x 0  are related, Newton’s method generates identical sequences of function values in different coordinate systems. Applying the estimate in the coordinate system  x ^ = V x , taking into account the results of Theorem 2, we obtain estimate (20). □
The last estimate according to (19) determines the advantages of Newton’s method compared to the gradient method in the case of:
ρ V 2 L V 2 ρ L .
Taking into account the fact that when solving practical problems, most of the iterations of the method often occur under conditions of significant Hessian variation (15), estimate (20), subject to condition (24), explains the advantages of Newton’s method. In this case, no additional restrictions on the second derivatives under smoothness conditions are required.

3. Subgradient Minimization Method

Here, we will give an exposition of the subgradient method [27], which solves the problem of forming the descent direction, which makes it possible to obtain a new point of the current minimum approximation by means of one-dimensional minimization along it outside a certain neighborhood of the current minimum. In this case, the appropriate direction is a vector consistent with all subgradients at points in a certain neighborhood of the current minimum approximation. In the case of smooth functions, the descent direction is matched with a set of neighborhood gradients obtained at iterations of the method.
In relaxation processes of the ε-subgradient type, successive approximations are constructed according to the formulas [39,40,41,43,45]:
x k + 1 = x k γ k s k + 1 ,   γ k = a r g min γ f ( x k γ s k + 1 )
The descent direction sk+1 is selected from a set  S ( ε f x k ) , where  ε f x k  is ε-subgradient set at a point  x k  and  S ( G ) = { s R n | min g G ( s , g ) > 0 } ,   G R n  is a set of feasible directions. Denote a subgradient set at a point x by  f ( x ) f ε = 0 ( x ) . If the set S(G) is not empty, then, according to its definition, any vector sS(G) is a solution to the set of inequalities:
s , g > 0 ,   g G ,
that is, it specifies the normal of the separating plane of the origin and the set G. One of the solutions to (26) is a vector η(G) of minimal length from G. For example, in the ε-steepest descent method,  s k + 1 = η ( ε f x k )  [41]. Due to the absence of an explicit definition of the ε-subgradient set, in (25), the vector s that satisfies condition (26) is used as the descent direction, and the set G here is the shell of subgradients obtained on the descent trajectory [39,40,41].
The elements of the set G on smooth functions are the gradients of the current minimum neighborhood. Figure 1 shows the set G with the designations of its elements, which will be given below.
Denote by ηG a vector of minimum length from the set G ρ G = η G μ G = η G / η G s = μ G / ρ G R G = max g G g R s = max g G ( μ G , g ) M G = R S / ρ G . For a certain set G, we will also use the noted characteristics indicating the set as an argument, for example, η(G), r(G).
We will assume that the following assumption holds for the set G.
Assumption 1.
Set G is convex, closed, limited ( R G  < ∞), and satisfies the separability condition, i.e.,  ρ G > 0 .
Let us introduce the relation θ(M) and its inverse function m(θ).
θ M = ( M 1 ) 2 / ( M + 1 ) 2 ,     m θ = ( 1 + θ 1 2 ) / ( 1 θ 1 2 ) ,
Thus,  m θ ( M ) = M . For some limited θ, define the relations:
a ( θ ) = 1 / ( 2 θ ) ,     b ( θ ) = 1 / ( 2 ( 1 θ ) ) ,     0 < θ < 1 / 2 .
Vector s* is a solution to the system of inequalities (26). Parameters ρ and RS characterize the thickness of the set G in the direction μ. The quantity RS, according to its definition, determines the thickness of the set G and significantly affects the convergence rate of learning algorithms with space dilation. When the thickness of the set is zero, when  R s = ρ , we have the case of a flat set.
The quantity  M G  determines the complexity of solving system (26). The transformation parameters of the metric matrices of the subgradient method are found according to expression (28).
In this work, two versions of the subgradient method will be presented. The first of them involves an exact one-dimensional search. For this version of the algorithm, estimates of the convergence rate on smooth functions will be obtained. The second version of the minimization algorithm is intended for practical implementation, where a rough one-dimensional search is used. To correctly integrate a method for solving systems of inequalities into minimization algorithms and comply with the restrictions imposed on its actions, we need to outline it. In the subgradient methods under study, the following Algorithm 1 for solving the system of inequalities (26) is used to estimate the separating plane parameters.
Algorithm 1 [27]. Algorithm for solving a system of inequalities
1. Assume k = 0, H0 = I, q ≥ 1. Set θA such that:
θ ( M G ) θ A < 1 / 2
and  M A m ( θ A ) .
2. Set  g k G  and  s k = H k g k , which is the current approximation of the solution to the system of inequalities  s , g > 0   g G . Find a vector  u k G  such that:
H k g k , u k 0 .
If such a vector does not exist, then the solution  s k = H k g k  is found; stop the algorithm.
3. Compute vectors:
y k = g k u k ,       p k = g k + t k y k ,   where   t k = y k , H k g k y k , H k y k .
Here, the vector pk, is found from the condition of vectors  v k = H k p k  and  y k  orthogonality:
y k , H k p k = y k , v k = 0 .
Compute  θ g k ( M A ) , where:
θ g k M = 1 + y k , H y k ( M 1 ) 2 p k , H p k 1 + C k y k , H y k ( M 1 ) 2 1 ,
C k = min y k , H k g k , y k , H k u k .
Find the parameter:
θ k = θ A / q 2   ,         i f   θ g k M A θ A / q 2   , θ A ,           i f   θ g k M A θ A , θ g k M A ,               o t h e r w i s e .
Find the parameters according to (28):
α k 2 = a θ k ,   β k 2 = b ( θ k ) .
We obtain a new approximation of the metric matrix  H k + 1 = ( H k ,   α k , β k ,   y k , p k ) , where:
H ,   α , β ,   y , p = H 1 1 α 2 H y y T H T y , H y 1 1 β 2 H p p T H T p , H p .
4. Assign k = k + 1. Go to step 2.
Constraint (29) for the set G in the case of applying Algorithm 1 in the minimization method imposes restrictions on the subgradient sets of the non-smooth minimization problem. In the case of smooth minimization problems, one can arbitrarily choose the parameter satisfying (29). This parameter is selected experimentally in order to optimize the algorithm efficiency.
Denote  α A 2 = a θ A ,   β A 2 = b ( θ A ) . It was proven in [27] that Algorithm 1 converges in a finite number of iterations on a set G, satisfying Assumption 1, and for algorithm parameters V0 and θA, for which the restrictions  0 < V 0 ρ G 2 / R G 2  and (29) are satisfied. In this case, the number of iterations does not exceed k0—the minimum integer number from the range of values k satisfying the inequality:
25 k ( q 2 α A 2 1 ) n V 0 2 [ ( α A 2 β A 2 ) k / n 1 ] < 1
From the above estimate, the conclusion can be made that larger values of  α A 2 β A 2  correspond to fewer number of iterations k0, which means that the desired direction will be found in fewer number of iterations. The last estimate is based on the worst-case scenario, when all  α k 2 β k 2 = α A 2 β A 2 . In fact, according to the results of a computational experiment in [27], a minimization algorithm based on Algorithm 1 with parameter (35) is more effective than with fixed parameters  α A 2 β A 2 .
The version of the minimization algorithm presented in this section uses exact one-dimensional descent and is intended to estimate its convergence rate on smooth functions. A practically implementable version of the algorithm without exact one-dimensional descent will be presented in the next section. Here, as well as in the practically implemented version of the algorithm, there are no updates for the parameters of the algorithm for solving systems of inequalities in the form of setting Hk = I, which are used in the theoretical version of the algorithm from [27], necessary for the theoretical justification of the convergence of the minimization algorithm on non-smooth functions. In the version of the minimization algorithm used in practice when minimizing both smooth and non-smooth functions, the above update is absent, but there are minor changes to the diagonal elements of the matrix  H k H k + λ I , excluding its poor conditionality and scaling, and  H k c H k   c > 1 , excluding excessive reduction of its elements. Therefore, the described version of the algorithm is closest to the implemented versions designed to minimize smooth and non-smooth functions. As before, to denote both the gradient and the subgradient at some point xk, we will use the notation  g k = g ( x k ) f ( x k ) .
At Step 2 of Algorithm 1, vector  g k G  is given arbitrarily, vector  u k G  having property (30) is found in the set. In the minimization algorithm, we assume  g k f ( x k )  at the point of current minimum approximation, determine the descent direction  s k = H k g k , and find the new minimum approximation  x k + 1 = x k γ k s k + 1 .
In the case of exact one-dimensional minimization, the equality  g k + 1 , s k = g k + 1 , H k g k  holds for the gradient at a point  x k + 1 . Therefore, in Algorithm 1, built into the minimization algorithm, we can take the vectors  H k g k  and  u k = g k + 1  as a new pair of vectors in (30), for which, due to exact one-dimensional descent, an inequality similar to (30) will be satisfied  H k g k , g k + 1 = 0 .  Due to the arbitrary choice of vector  g k  in Algorithm 1, at the next iteration in the minimization algorithm, the vector  g k + 1  can be chosen. An idealized version of such a minimization algorithm is Algorithm 2 described below. Estimation of the convergence rate of this algorithm is the goal of our work.
In the case of inexact one-dimensional descent, it is assumed that the one-dimensional minimum has been localized, that is, a point  z k + 1 = x k γ z s k  has been obtained such that the subgradient  u k + 1  at the extreme point  z k + 1  satisfies the inequality (30)  H k g k , u k + 1 = s k , u k + 1 0   (Figure 2). Subgradient  u k + 1  will be used for the matrix  H k  transformation. Figure 2 shows the point  x k + 1  with the smallest found function value, which, at the next iteration, will become the new current minimum point with the direction of minimization  s k + 1 = H k + 1 g k + 1 . A presentation of the practical version of the algorithm and its numerical analysis will be given in subsequent sections.
The following minimization algorithm assumes exact one-dimensional descent. An infinite sequence of points  x k  is constructed until the gradient becomes zero. For this version of the algorithm, an estimate of the convergence rate has been made.
Algorithm 2. Minimization algorithm
1. Assume k = 0, H0 = I, q ≥ 1. Set θA such that:
θ A < 1 / 2
and  M A m ( θ A ) . Compute  g 0 = f ( x 0 ) , If  g 0 = 0  then the minimum point is found, stop the algorithm.
2. Find a new minimum approximation:
x k + 1 = x k γ k s k ,   s k = H k g k ,   γ k = a r g m i n γ f x k γ s k .
3. Compute the gradient  g k + 1 = f ( x k + 1 )  based on the condition:
g k + 1 s k 0 .
If  g k + 1 = 0 ,  then  x k + 1  is the minimum point; stop the algorithm.
4. Compute vectors  y k p k :
y k = g k g k + 1 ,     p k = g k + 1 + t k y k ,         t k = y k , H k g k + 1 y k , H k y k .
Here, the vector  p k  is found from (32) based on the orthogonality of  H k p k  and  y k .
Compute  θ g k ( M A )  according to formula (33), where:
C k = min y k , H k g k , y k , H k g k + 1 .
Find  θ k  according to (34) and  α k 2 = a θ k ,   β k 2 = b θ k , as in (35). We obtain a new approximation of the metric matrix  H k + 1 = ( H k ,   α k , β k ,   y k , p k ) ,
5. Assign k = k + 1. Go to step 2
Here, the built-in method for solving the system of inequalities (26) is the transformations carried out at step 4 under condition (39). The solution to system (26) at iteration is the vector  H k + 1 g k + 1 , which is used as the new descent direction.
In [27], the optimization of the parameters’  α k , β k  choice is related to the characteristics of the subgradient sets of the function. In [27], it is assumed that it is possible to choose a parameter MA corresponding to the real characteristic Mε, which is the union of subgradient sets of a certain ε-neighborhood of the current minimum point satisfies the relation:
θ ( M ε ) = θ ( M A ) < 1 / 2
In the case of smooth functions, due to the fact that the subgradient coincides with the gradient, and the subgradient set contains a single element—the gradient—it is easy to satisfy condition (42), since for small ε, the characteristics of the subgradient set:
R ( f ( x k ) ) = ρ ( f ( x k ) ) = 1 ,   M = M ( f ( x k ) ) = 1 ,
due to the fact that the gradient satisfies the Lipschitz condition, they change insignificantly. Therefore, for small ε:
M ε = M ( ε f ( x k ) ) = R S ( ε f ( x k ) ) / ρ ( ε f ( x k ) ) 1 ,   θ ( M ε ) = ( M ε 1 ) 2 / ( M ε + 1 ) 2 0 ,
which makes it possible to consider the algorithm for sufficiently large values of ε-neighborhoods that satisfy condition (42).
The smaller  θ M i s , the more efficient the algorithm for solving systems of inequalities works [27]. But for small values of  θ ( M ) , according to (35), the values of  α 2  will be very large. This will lead to large changes in the matrix H (36), which negatively affects the efficiency of the minimization method due to the difficulties that arise with the degeneration of metric matrices. Therefore, in the minimization algorithm, the smallest value  θ k  has to be limited and consistent with the accuracy of the one-dimensional search. To do this, a constraint on the parameter  θ k  is introduced into (34):
θ ( M A ) / q 2 θ k θ ( M A ) < 1 / 2 .
As a result, we obtain restrictions on the parameters of matrix transformation in (36):
α max 2 = 1 2 θ ( M A ) / q 2 α k 2 1 2 θ ( M A ) = α min 2 ,
β min 2 = 1 2 ( 1 θ ( M A ) / q 2 ) β k 2 1 2 ( 1 θ ( M A ) ) = β max 2 .
Relation  θ k ( 1 θ k )  subject to restrictions on  θ k  (43) is monotonically increasing on the segment  0 < θ k < 1 / 2 . Hence the constraints:
1 4 ( 1 θ ( M A ) / q 2 ) ( θ ( M A ) / q 2 ) α k 2 β k 2 = 1 4 θ k ( 1 θ k ) 1 4 θ ( M A ) ( 1 θ ( M A ) ) .
From here and (44), (45) the inequalities follow:
α max 2 β min 2 α k 2 β k 2 α min 2 β max 2 .
For the parameters  α k , β k  according to (44), (45), and (46), the inequalities hold:
α k > 1 , 0 < β k 1 ,     α k · β k > 1 .
The presented algorithm for solving systems of inequalities and the minimization algorithm also converge for fixed parameters  α k = α c o n s t , β k = β c o n s t  [27].
α k 2 = 1 2 θ ( M A ) = α min 2 ,       β k 2 = 1 2 ( 1 θ ( M A ) ) = β max 2 .
As a computational experiment shows, the convergence rate of the method for solving systems of inequalities and the minimization method based on it [27] is significantly higher if parameters  α k , β k  adjusted depending on the current situation (35) are used.

4. On the Convergence Rate of the Subgradient Method on Strongly Convex Functions with Lipschitz Gradient

As earlier, x* is a minimum point of the function f(x), f* = f(x*), fk = f(xk) and  g k = g x k = f ( x k )  for a differentiable function satisfying Condition 1. Denote  A k = H k 1 , Sp(A) is a trace of matrix A, det A is a determinant of matrix A. For an arbitrary matrix A > 0, we denote A1/2 as a symmetric matrix for which A1/2 > 0 and A1/2 A1/2 = A. For characteristics of matrices  A k ,   H k  we used the result from [27], valid for arbitrary parameters  α k , β k  satisfying condition (48).
Lemma 1
[27]. Let   H k > 0 , matrix  H k + 1  obtained as a result of transformation  H k + 1 = ( H k ,   α k , β k ,   y k , p k ) , where parameters  α k , β k  satisfy condition (48), and for arbitrary vectors  y k 0 ,   p k 0 , equality (38) is satisfied. Then,  H k + 1 > 0  and:
A k + 1 = A k + ( α k 2 1 ) y k y k T y k , H k y k + ( β k 2 1 ) p k p k T p k , H k p k ,
S p ( A k + 1 ) = S p ( A k ) + ( α k 2 1 ) y k , y k y k , H k y k + ( β k 2 1 ) p k , p k p k , H k p k ,
d e t H k + 1 = det H k / α k 2 β k 2 ,   d e t A k + 1 = α k 2 β k 2 d e t A k .
The following theorem shows that the presence of motion as a result of iterations (5), (6) lead to a decrease in the function.
Theorem 4.
Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2… given by the process (5), (6) the following estimation takes place:
f k + 1 f ( f 0 f ) exp [ ρ 2 L 2 i = 0 k y i 2 g i 2 ] ,  
where  y i = g i + 1 g i .
Proof of Theorem 4.
For a strongly convex function, inequality (2) is satisfied. Taking this inequality into account, we obtain:
f k + 1 f = ( f k f )   f k f k + 1 = ( f k f ) 1 f k f k + 1 f k f ( f k f ) 1 2 ρ f k f k + 1 g k 2
Inequality (3) is also valid for the one-dimensional function:
ϕ ( t ) = f ( x k t s k / s k ) .
From here, taking into account the exact one-dimensional search, inequality (3) and Lipschitz condition (1), the estimate follows:
f k f k + 1 ρ | | x k x k + 1 | | 2 / 2 ρ | y k | 2 2 L 2 .
Transform (54) using the last relation and inequality  exp c 1 c ,   c 0 .
f k + 1 f ( f k f ) ( 1 ρ 2 y k 2 L 2 g k 2 ) ( f k f ) exp ( ρ 2 y k 2 L 2 g k 2 ) .
Recurrent use of the last inequality leads to estimate (53). □
Let us estimate the convergence rate of Algorithm 2 under more general restrictions on the parameters  α k 2 β k 2 .
a M α k 2 a m ,     1 β k 2 b m ,     a m 1 / b m .
This implies the constraint:
a M α k 2 β k 2 a m b m .
The following theorem substantiates the linear convergence rate of Algorithm 2 under constraints (55).
Theorem 5.
Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2… given by the Algorithm 2 with limited initial matrix H0:
m 0 H 0 z , z / z , z M 0 ,
(1) with an arbitrary parameter  α k 2 β k 2  satisfying (55), the following estimation takes place:
f k + 1 f ( f 0 f ) exp { ρ 2 ( k + 1 ) L 2 n [ 2 ln ( a m b m ) ( a M 1 ) + n ln ( m 0 / M 0 ) ( k + 1 ) ( a M 1 ) ] } ,
(2) with parameters   α k 2 β k 2  specified in Algorithm 2, the estimation is:
f k + 1 f ( f 0 f ) exp { ρ 2 ( k + 1 ) L 2 n [ 2 ln ( α min 2 β max 2 ) ( α max 2 1 ) + n ln ( m 0 / M 0 ) ( k + 1 ) ( α max 2 1 ) ] } .
Proof of Theorem 5.
Based on (50), we obtain (51). Transform (51) taking into account  β k 2 1 0 , we obtain an estimate for the trace of matrices Ak:
S p ( A k + 1 ) S P ( A k ) 1 + ( α k 2 1 ) y k , y k S p ( A k ) H k y k , y k
Due to exact one-dimensional descent (38), the following condition is satisfied:
s k , g k + 1 = H k g k , g k + 1 = 0 ,
which, together with the positive definiteness of the matrices, proves the inequality:
  H k y k , y k = H k g k , g k + H k g k + 1 , g k + 1 2 H k g k , g k + 1 H k g k , g k .
Hence, taking into account  S p ( A k ) M k , where  M k  is the maximum eigenvalue of the matrix  A k , we obtain:
S p A k H k y k , y k S P A k H k g k , g k S p A k M k g k , g k g k , g k .
Based on the last estimate, inequality (59) is transformed to the form:
S p ( A k + 1 ) S P ( A k ) 1 + ( α k 2 1 ) y k 2 g ( x k ) 2 .
Based on the relationship between the arithmetic mean and geometric mean of the matrix A > 0 eigenvalues, we have    S p ( A ) / n d e t ( A ) 1 / n . From here and (60), (52) in the case of restrictions on parameters  α k 2 β k 2  (55), we obtain:
S p ( A 0 ) n i = 0 k 1 + ( α k 2 1 ) y i 2 g ( x i ) 2     S p ( A k + 1 ) n ( det A k + 1 ) 1 / n = i = 0 k α i 2 β i 2 d e t ( A 0 )     1 / n ( a m b m ) k + 1 d e t ( A 0 ) 1 / n
and in case of choosing parameters  α k 2 β k 2 , as in Algorithm 2, taking into account (47), we obtain an estimate:
S p ( A 0 ) n i = 0 k 1 + ( α k 2 1 ) y i 2 g ( x i ) 2     S p ( A k + 1 ) n ( det A k + 1 ) 1 / n = i = 0 k α i 2 β i 2 d e t ( A 0 )     1 / n ( α m i n 2 β m a x 2 ) k + 1 d e t ( A 0 ) 1 / n .
The last inequalities based on ratio  1 + p e x p ( p ) , transform to the form:
S p ( A 0 ) n e x p ( a M 1 ) i = 0 k y i 2 g ( x i ) 2 ( a m b m ) ( k + 1 ) / n d e t ( A 0 ) 1 / n ,
S p ( A 0 ) n e x p ( α m a x 2 1 ) i = 0 k y i 2 g ( x i ) 2 ( α m i n 2 β m a x 2 ) ( k + 1 ) / n d e t ( A 0 ) 1 / n .
Due to condition (55)  S p ( A 0 ) / n 1 / m 0 ,   det A 1 n 1 / M 0 . Taking logarithms of (61) and (62), taking into account the last inequalities, we find:
[ ( a M 1 ) i = 0 k y i 2 g ( x i ) 2 ] ( k + 1 ) ln ( a m b m ) / n + ln ( 1 / M 0 ) ln ( 1 m 0 ) , [ ( α max 2 1 ) i = 0 k y i 2 g ( x i ) 2 ] ( k + 1 ) ln ( α min 2 β max 2 ) / n + ln ( 1 / M 0 ) ln ( 1 m 0 ) ,
This implies:
i = 0 k y i 2 g ( x i ) 2 2 ( k + 1 ) ln ( a m b m ) n ( a M 1 ) + ln ( m 0 / M 0 ) ( a M 1 ) , i = 0 k y i 2 g ( x i ) 2 2 ( k + 1 ) ln ( α min 2 β max 2 ) n ( α max 2 1 ) + ln ( m 0 / M 0 ) ( α max 2 1 ) ,
which, together with estimate (53) of Theorem 4, proves (57) and (58). □
Estimating the convergence rate of Algorithm 2 under more general constraints (55) on parameters  α k 2 β k 2  makes it possible to use parameters different from those generated in Algorithm 2. The paper presents a computational experiment where the parameters of Algorithm 2 were changed as follows:
α k 2 α k 2 × β k 2 / c ,   β k 2 c .
Here, parameters c were set as follows: c = {0.2; 0.1; 0.05}. As a result of a computational experiment, it was revealed that in ill-conditioned problems such changes increase the efficiency of the minimization method, including in non-smooth optimization problems. For non-smooth problems, there is no theoretical justification for convergence under transformation (63).
The obtained estimates do not explain the fact of the high convergence rate the method, for example, on quadratic functions. To justify the accelerating properties of the method, we need to show its invariance with respect to the linear transformation of coordinates and then use estimate (58) in the coordinate system with maximal ratio ρ/L. A similar possibility exists, for example, in the case of quadratic functions, where this ratio will be equal to 1.
Let us establish a relation between the characteristics of Algorithm 2, used to minimize the functions  f x  and  f p x ^  from (17).
Theorem 6.
Let the initial conditions of Algorithm 2, used to minimize the functions  f x  and  f p x ^ , defined in (17), be related by the equalities:
x ^ 0 = P x 0 ,   H ^ 0 = P H 0 P T .
Then, the characteristics of these processes are related by the relations:
f P ( x ^ k ) = f ( x k ) ,   x ^ k = P x k ,   f P ( x ^ k ) = P T f ( x k ) ,   H ^ k = P H k P T ,   k = 0 , 1 , 2 ,
Proof of Theorem 6.
For derivatives of functions  f x  and  f p x ^ , relation  f p x ^ = P T f ( x )  holds. From this and assumption (64) follows (65) for k = 0. Let us assume that equalities (65) are satisfied for all k = 0,1,…,i. Let us show their feasibility for k = i + 1. From (38) with k = i after multiplication by P on the left, taking into account the proven equalities (65), we obtain:
P x i + 1 = P x i γ i P H i P T P T f ( x i ) = x ¯ i γ i H ^ i f P ( x ^ i ) .
Hence, according to the definition of the function fp, at the stage of one-dimensional minimization (38), the equality  γ i = γ ¯ i  is satisfied. Therefore, the right side of (66) is the implementation of step (38) in the new coordinate system. Hence:
x ^ i = P x i , f P ( x ^ i ) = P T f ( x i ) , y ^ i = f P ( x ^ i + 1 ) f P ( x ^ i ) = P T y i .
Multiplying (36) with the current indices on the left by P, and on the right by PT, taking into account (67), we obtain:
P H i + 1 P T = P H i P T 1 1 α i 2 P H i P T P T y i y i T P 1 P H T i P T y i , P 1 P H i P T P T y i 1 1 β i 2 P H i P T P T p i p i T P 1 P H T i P T p i , P 1 P H k P T P T p i = H ¯ i 1 1 α i 2 H ^ i y ^ i y i ^ T H ^ T i H ^ i y ^ i , y ^ i 1 1 β i 2 H ^ i p ^ i p i ^ T H ^ T i H ^ i p ^ i , p ^ i ,
where the right side is the implementation of formula (36) in the new coordinate system. The denominators of the last formula establish a relationship:
y i , P 1 P H i P T P T y i = H ^ i y ^ i , y ^ i ,       p i , P 1 P H k P T P T p i = H ^ i p ^ i , p ^ i .
Using the last equalities and formulas (41), (33) of Algorithm 2, we obtain:
α i 2 = α ^ i 2 , β i 2 = β ^ i 2 .
Finally, we obtain  P H i + 1 P T = H ^ i + 1 . Consequently, equalities (65) will also be valid for k = i + 1. Continuing the induction process, we obtain the proof of Theorem 6. □
For function  f p x ^  denote strong convexity constant by ρp, Lipschitz constant by Lp. Introduce the function K(P) = ρp/Lp. Denote by V the coordinate transformation matrix such that K(V) ≥ K(P) for an arbitrary non-singular matrices P.
Theorem 7.
Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0,1,2… given by the Algorithm 2 with limited initial matrix H0 according to (56)
(1) with an arbitrary parameter  α k 2 β k 2  satisfying (55), the following estimation takes place:
f k + 1 f ( f 0 f ) exp { ρ V 2 ( k + 1 ) L V 2 n [ 2 ln ( a m b m ) ( a M 1 ) + n ln ( m 0 / M 0 ) ( k + 1 ) ( a M 1 ) ] } ,
(2) with parameters   α k 2 β k 2  specified in Algorithm 2, the estimation is:
f k + 1 f ( f 0 f ) exp { ρ V 2 ( k + 1 ) L V 2 n [ 2 ln ( α min 2 β max 2 ) ( α max 2 1 ) + n ln ( m 0 / M 0 ) ( k + 1 ) ( α max 2 1 ) ] } .
where m0, M0 are the minimum and maximum eigenvalues of the matrix   H ^ 0 = V H 0 V T  in the selected coordinate system (18) having property (19).
Proof of Theorem 7.
According to the results of Theorem 6, we can choose an arbitrary coordinate system to estimate the convergence rate of the minimization process of Algorithm 2. Therefore, we use estimates (57) and (58) in a coordinate system with the matrix P = V and obtain estimates (68) and (69). □
The first term in square brackets characterizes the constant in estimating the convergence rate of the method, and the second term characterizes the costs of setting up the metric matrix.
For the steepest descent method (scheme (5), (6) with (11)) on functions satisfying Condition 1, the order of the convergence rate is determined by expression (12)  q k = 1 ρ / L . Given that  l V 2 / L V 2 l / L  estimate for Newton’s method (20) is  q k = 1 ρ V 2 / L V 2 , for quasi-Newton method [27] is:
q k = 1 ρ V 3 / ( 2 L V 3 )
and estimates (68) and (69) for the subgradient method turn out to be preferable to (12). This situation arises, for example, when minimizing quadratic functions whose Hessians have a large spread of eigenvalues.
Thus, Algorithm 2 on strongly convex functions, without assuming the existence of second derivatives, has accelerating properties compared to the steepest descent method.
For sufficiently small values of the  l V 2 / L V 2  ratio, the average convergence rate of the subgradient method is given below:
q ¯ k 1 ρ V 2 n L V 2 × [ 2 ln ( α min 2 β max 2 ) ( α max 2 1 ) + n ln ( m / M ) ( k + 1 ) ( α max 2 1 ) ] 1 2 ρ V 2 n L V 2 × ln ( α min 2 β max 2 ) ( α max 2 1 ) .
The second term in square brackets of estimate (71) characterizes the stage of adjusting the metric matrix of Algorithm 2. From the analysis of expression (71), we can conclude that the qualitative nature of estimate (69) is similar to estimate (20) for Newton’s method, which takes into account the difference between information in the form of a matrix of second derivatives in (20) and a gradient in (69) through the presence of the factor 1/n in (69).
To test the effectiveness of the algorithm, it makes sense to implement Algorithm 2 and conduct numerical testing in order to identify its application possibilities in solving problems of minimizing smooth functions along with effective quasi-Newton methods for solving minimization problems with a high degree of conditionality.

5. Aspects of the Subgradient Method Implementation

In the case of inexact one-dimensional descent, in operation (25) of the minimization algorithm, it is assumed that the one-dimensional minimum has been localized, that is, a point  z k + 1 = x k γ z s k  has been obtained such that the subgradient uk+1 at the extreme point zk+1 satisfies inequality (30):
H k g k , u k + ! = s k , u k + 1 0 ,
which is shown in Figure 2.
The subgradient uk+1 is used to transform the matrix Hk. Figure 2 shows a point xk+1 with a smaller function value on the localization segment between points xk and uk+1
f ( x k ) f ( x k + 1 ) f ( u k + 1 ) ,
which, at the next iteration, will become the new current minimum point with the direction of minimization  s k + 1 = H k + 1 g k + 1 .
In Algorithm 1, at each iteration vector,  g k G  is chosen arbitrarily, and then, vector  u k G  such that  H k g k , u k 0 . In the minimization algorithm with one-dimensional minimization, from a point x along the direction s = Hg, when carrying out localization of the minimum, we obtain a point  x 1 = x γ 1 s  for which a similar (30) inequality is satisfied, and a point  x m = x γ m s  inside the localization segment with a smaller function value, which we take in the minimization algorithm as a new minimum approximation from which a new one-dimensional descent will then be carried out. Gradients gx = g(x) and g1 = g(x1) at the points x and x1 are used together for matrix transformation. Thus, in the practical version of the minimization algorithm, vectors gx, g1, g(xm) will be used, corresponding in meaning to the vectors  g k G , u k G ,   g k + 1 G  from Algorithm 1.
We use the one-dimensional minimization procedure based on these principles, outlined in [27,43]. Its set of input parameters is  x , s , g x , f x , h 0 , where x is the point of the current minimum approximation, s is the descent direction,  h 0  is the initial search step,  f x = f x ,   g x f ( x ) , and the necessary condition for the possibility of reducing the function along the direction  g x , s > 0  must be satisfied. Its output parameters are  γ m , f m , g m , γ 1 , g 1 , h 1 . Here,  γ m  is the step to the point of a new minimum approximation:
x m = x γ m s ,       f m = f x m ,       g m f ( x m ) ,
γ 1  is the step along s such that at the point  x 1 = x γ 1 s  for the subgradient  g 1 f ( x 1 )  inequality  g 1 , s 0  holds. This subgradient is used in the learning algorithm. The output parameter h1 is the initial descent step for the next iteration. The step h1 is adjusted to reduce the number of calls to the procedure for calculating the function and subgradient.
In the minimization algorithm, the vector  g 1 f ( x 1 )  is used to solve a system of inequalities, and the point  x m = x γ m s  as the point of a new minimum approximation.
We denote the call to the procedure as OM( x , s , g x , f x , h 0 ;   γ m , f m , g m , g 1 , h 1 ). Here is a brief description of it.
Let us introduce a one-dimensional function  φ β = f ( x β s ) . To localize its minimum, we take an increasing sequence  β 0 = 0 ,   β i = h 0 q M i 1 ,   i 1 . Here, qM > 1 is a step increasing parameter. In most cases, it is specified qM = 3. Denote  z i = x β i s ,     r f z i ,   i = 0 , 1 , 2 , ,    l is number of i at which the relation  r i , s 0  is first time satisfied. Let us determine the parameters of the localization segment  γ 0 , γ 1  of one-dimensional minimum:  γ 0 = β l 1 ,   f 0 = f z l 1 ,   g 0 = r l 1 ,     γ 1 = β l ,     f 1 = f z l ,   g 1 = r l 1  and find a minimum point  γ  through cubic approximation of the function [46] on the localization segment, using the values of the one-dimensional function and its derivative. Calculate:
γ m = { 0.1 γ 1 ,         i f     l = 1     a n d     γ 0.1 γ 1 , γ 1 ,                     i f     γ 1 γ 0.2 ( γ 1 γ 0 ) , γ 0 ,                   i f     l > 1     a n d     γ γ 0 0.2 ( γ 1 γ 0 ) , γ ,                   o t h e r w i s e .
We calculate the initial descent step for the next iteration using the rule:
h 1 = q m h 0 ( γ 1 / h 0 ) 1 / 2 .
Here, qm < 1 is descent step decreasing parameter, which, in most cases, is set as qm = 0.8. In the vast majority of applications, the set of parameters {qM = 3, qm = 0.8} is satisfactory. When solving complex problems with a high degree of level surfaces elongation, the parameter should be increased: qm → 1. Subgradient method implementation is presented in Algorithm 3.
Algorithm 3. Subgradient method implementation
1. Assume k = 0, initial matrix H0 = I, q ≥ 1, the number of iterations kmax to stop the algorithm. Set ΘA satisfying inequality (37), and parameter  M A m ( θ A ) . Compute  g 0 f ( x 0 ) . Set the initial step of a one-dimensional search h0 and small  ε = 10 10 .   If   g 0 = 0  and then the x0 is a minimum point, stop the algorithm.
2. If
s k , g k s k × g k ε ,
then correct the matrix:
H k = H k + 10 ε d m a x I ,   d m a x = max i { H i i , k } ,   i = 1 , 2 , , n .
Set 
s k = H k g k / H k g k , g k 1 / 2 .
Find a new minimum approximation:
OM ( x k , s k , g k , f k , h k ;   γ k + 1 , f k + 1 , g k + 1 , u k + 1 , h k + 1 ) .
According to the description of the OM procedure, here the subgradient vector  u k + 1  satisfies the condition  H k g k , u k + 1 0 .
If  g k + 1 = 0  then  x k + 1  is the minimum point, stop the algorithm.
If k > kmax then stop the algorithm.
3. Compute vectors  y k ,   p k  by (31).
y k = g k u k + 1 ,             t k = y k , H k g k + 1 y k , H k y k ,               p k = g k + 1 + t k y k .
Here, vector  p k  is found from the orthogonality condition (32) of the vectors  H k p k  and  y k . Then, compute  θ g k ( M A )  by (33), where  C k  is calculated by formula (41).
Find  θ k  according to (34) and parameters  α k 2 = a θ k ,   β k 2 = b ( θ k )  by (35). We obtain a new approximation of the metric matrix  H k + 1 = ( H k ,   α k , β k ,   y k , p k ) ,
If dmaxε, then carry out scaling
H k + 1 = H k + 1 / d max ,       h k + 1 = h k + 1 d max ,       d max = max i = 1 , 2 , , n { H i i , k + 1 }
4. Assign k = k + 1. Go to step 2.
Here, the built-in method for solving the system of inequalities (26) is the transformations carried out at Step 3 under condition (39). The current approximation of the solution to system (26) at the iteration is vector sk (74), which is used as the new descent direction.
The algorithm uses soft matrix updating due to small changes in diagonal elements in the case of large angles (72) between vectors sk and gk. Due to the fact that as a result of matrix transformations, its elements are reduced to compensate for this effect, a scaling transformation (75) is carried out, which does not affect the computational process. Taking into account the scaling of the descent direction (74), simultaneously with the scaling of the matrix, the one-dimensional search step is also scaled, which is adjusted in the one-dimensional minimization procedure.
Along with formula (33), we used a simplified version of calculating the value of  θ g k ( M A ) , which enables us to analyze the qualitative nature of formula (33). Using symmetric matrix  H k 1 / 2 , we form vectors  a = H k 1 / 2 y k ,   b = H k 1 / 2 g k + 1 ,   c = H k 1 / 2 g k ,   p = H k 1 / 2 p k  and assume equality  a = b . Hence, due to the equality a = b − c, the vectors a, b, c form an isosceles triangle (Figure 3).
Due to the fact that the lengths of the vectors b, c projections onto the vector a are the same, the equality  a , b = a , c = a , a / 2  holds. Therefore:
C k = min y k , H k g k , y k , H k g k + 1 = y k , H k g k = a , c = a , a / 2 = y k , H k y k / 2
and the factor from (33) can be transformed as follows:
1 + C k y k , H k y k M 1 = 1 + y k , H k y k 2 y k , H k y k M 1 = M + 1 2 .
From here and (33), we obtain:
θ g k M = 1 + y k , H y k ( M 1 ) 2 p k , H p k 1 + C k y k , H k y k ( M 1 ) 2 1 = 1 + ( M + 1 ) 2 y k , H k y k 4 ( M 1 ) 2 p k , H k p k ) 1                                                         ( M + 1 ) 2 y k , H k y k 4 ( M 1 ) 2 p k , H k p k ) 1 = ( M 1 ) 2 ( M + 1 ) 2 × 4 p k , H k p k y k , H k y k = θ M × 4 p k , H k p k y k , H k y k = θ M × 4 p , p a , a
At the last steps of the transformation in (76), we used the expression  θ M = M 1 2 / M + 1 2  introduced earlier in (27). As shown in [27], Algorithm 2 is also operable when using formula (27)  θ M A = θ A  to calculate the transformation coefficients of matrices (49) instead of  θ g k ( M A ) .
Approximate formula (76) reflects the qualitative nature of the relation  θ g k ( M A ) . According to Figure 3, larger angles between vectors b, c correspond to smaller values of the ratio  p , p / a , a , which, according to (76), reduces the value of  θ g k ( M A )  and, accordingly, leads to an increase in the parameter  α k 2  and an insignificant decrease in the parameter  β k 2  at Step 3 of Algorithm 3. We used a simplified expression for  θ g k ( M )  from (76) in Algorithm 3.
Below, we present examples of solving test problems using the quasi-Newton BFGS method Algorithms 2 and 3.

6. Results of Numerical Study on Smooth Functions

Algorithms 2 and 3 were implemented with parameters  θ A = 0.04356  and  M A = 1.52755 , providing the following product:  α 2 β 2 = 1 / ( 4 θ A ( 1 θ A ) ) = 6 . These values were used in Algorithms 2 and 3 with dynamic parameters  α k 2 β k 2  selection method. The methods used the one-dimensional search described above. For comparison, the quasi-Newtonian BFGS method was implemented with a one-dimensional search procedure using cubic interpolation [46]. In all methods, the function and gradient were calculated simultaneously.
Table 1, Table 2, Table 3, Table 4 and Table 5 show the number of calculations of function and gradient values required to achieve the designated accuracy by the function  f x k f ε . The initial point of minimization x0 and the value ε are given in the description of the function.
The purpose of testing is to experimentally study the ability of subgradient method and quasi-Newton method to eliminate the background that slows down the convergence rate, which is eliminated through some linear transformation that normalizes the elongation of function level surfaces in different directions, which is predicted theoretically by the estimate (69) of Theorem 7.
Since the use of subgradient methods with a variable space metric and of quasi-Newton methods is justified primarily for functions with a high degree of conditionality, where conjugate gradient methods fail, the test functions were selected from this standpoint. Because the quasi-Newton method is based on a quadratic model of the function, its local convergence rate in a neighborhood of the current minimum is largely determined by how effective it is in minimizing ill-conditioned quadratic functions. Therefore, the study was primarily carried out on quadratic functions and on functions derived from them.
If the function is twice differentiable, the eigenvalues of the Hessian lie in the interval [ρ, L] determined by the strong convexity parameter and the Lipschitz constant of the gradient. Our proofs above do not use second derivatives. Nevertheless, when designing the tests, we relied on the representation of a quadratic function and on the analysis of its conditionality in terms of its eigenvalues. The test functions simulate the oscillatory behavior of the second derivatives in two ways. The first is a drift of the corresponding eigenvalue from one value to another. In the second, random noise is imposed on the length of the gradient vector, which affects the computation of the gradient difference in subgradient Algorithms 2 and 3 (40) and in the quasi-Newton method:
$$y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$$
With the described methods of simulating oscillations of the Hessian imposed on some basic quadratic function with given eigenvalue characteristics, we obtain a problem with a known degree of degeneracy: we can specify, on the one hand, the scaling that the methods under study should eliminate and, on the other hand, the degree of oscillation of the scales simulating changes of the matrices of second derivatives within prescribed limits.
The following is accepted as the basic quadratic function:
$$f_1(x, [a_{\max}]) = \frac{1}{2}\sum_{i=1}^{n} a_i x_i^2, \qquad a_i = a_{\max}^{\frac{i-1}{n-1}}.$$
The eigenvalues $a_i$ of this function satisfy $\lambda_{\min} = 1$, $\lambda_{\max} = a_{\max}$. In this case, the methods under study have to remove the basic (trend) scaling specified by the coefficients of this function. To simulate random fluctuations of the second derivatives, a function $f_2$ was created. Its values are those of the basic function $f_1(x, [a_{\max}])$, while its gradients are distorted randomly according to the following scheme:
$$\nabla f_2 = \nabla f_1 \cdot (1 + r\,\xi),$$
where $\xi$ is a random number uniformly distributed on the segment $[-1, 1]$ and $r = 0.3$. Such a function is denoted by $f_2(x, [a_{\max}, r = 0.3])$.
Here, the parameters are those of the base function and the gradient distortion parameter. It should be noted that the distortion of the gradients significantly reduces the accuracy of the one-dimensional search, where gradients are used to estimate directional derivatives in the cubic approximation.
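A minimal sketch (Python) of how the test pair $f_1$/$f_2$ described above could be implemented is given below; the helper names make_f1 and make_f2, and the decision to draw one scalar $\xi$ per gradient call, are our own assumptions, not part of the paper's code.

```python
import numpy as np

def make_f1(n, a_max):
    # f1(x) = 0.5 * sum_i a_i * x_i^2, with a_i = a_max ** ((i - 1) / (n - 1)), i = 1..n
    a = a_max ** (np.arange(n) / (n - 1))
    def fg(x):
        return 0.5 * np.dot(a, x * x), a * x   # function value and gradient
    return fg

def make_f2(n, a_max, r=0.3, seed=1):
    # Same values as f1, but the returned gradient is scaled by (1 + r*xi),
    # xi uniform on [-1, 1], which imitates oscillations of the second derivatives.
    fg1 = make_f1(n, a_max)
    rng = np.random.default_rng(seed)
    def fg(x):
        f, g = fg1(x)
        return f, g * (1.0 + r * rng.uniform(-1.0, 1.0))
    return fg

fg = make_f2(n=100, a_max=1e8)
f, g = fg(np.full(100, 100.0))   # evaluation at the initial point x0
```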
In the third function, additional coefficients $c_i$ are used to change the scales $a_i$ of the individual variables (so that, near and far from the extremum, $f_3$ behaves as the quadratic forms given below):
$$c_i = \frac{b_{\max}}{b_i}\,\frac{x_i^2}{1+x_i^2} + b_i\left(1 - \frac{x_i^2}{1+x_i^2}\right), \qquad b_i = b_{\max}^{\frac{i-1}{n-1}}.$$
This function near the extremum will have the form:
$$f_3(x, [a_{\max}, b_{\max}]) \approx \frac{1}{2}\sum_{i=1}^{n} a_i b_i x_i^2.$$
Far from the extremum, we obtain a function in which the coefficients bi are used in reverse order:
$$f_3(x, [a_{\max}, b_{\max}]) \approx \frac{1}{2}\sum_{i=1}^{n} a_i \frac{b_{\max}}{b_i} x_i^2.$$
The scales introduced by the coefficients $c_i$ vary within the range $\lambda_{\min}^c = 1$, $\lambda_{\max}^c = b_{\max}$.
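The limiting behavior of the coefficients $c_i$ can be checked with a short sketch (Python) based on the reconstruction of the formula above; the helper name c_coeffs is ours.

```python
import numpy as np

def c_coeffs(x, b_max):
    # c_i = (b_max / b_i) * x_i^2 / (1 + x_i^2) + b_i * (1 - x_i^2 / (1 + x_i^2)),
    # with b_i = b_max ** ((i - 1) / (n - 1))
    n = len(x)
    b = b_max ** (np.arange(n) / (n - 1))
    w = x * x / (1.0 + x * x)
    return (b_max / b) * w + b * (1.0 - w)

print(c_coeffs(np.full(5, 1e-6), b_max=1e2))  # close to b_i (near the extremum)
print(c_coeffs(np.full(5, 1e+6), b_max=1e2))  # close to b_max / b_i (far from it)
```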
The point x0 = (100, 100, …, 100) was chosen as the initial point for all the functions above. Additionally, the following nonlinear functions were also used for testing and analysis.
Function f4 has ellipsoidal level surfaces corresponding to a quadratic function.
$$f_4(x) = \left(\sum_{i=1}^{n} i^2 x_i^2\right)^2, \qquad x_0 = (1, 1, \ldots, 1).$$
Function f5 has a multidimensional ellipsoidal ravine. Minimization occurs when moving along this curvilinear ravine to the minimum point.
The stopping criterion was:
$$f(x_k) - f^* \le \varepsilon = 10^{-10}.$$
The minimization results are presented in Table 1, Table 2, Table 3, Table 4, Table 5 and Table 6. Table 1, Table 2, Table 3, Table 4 and Table 5 show the results of minimizing the five presented functions for various dimensions. These tables make it possible to analyze the effect of removing the basic background by the subgradient and quasi-Newton methods. The table cells contain: N_it—the number of iterations (one-dimensional searches along the direction); nfg—the number of calls to the procedure for the simultaneous computation of the function and gradient.
Table 1 shows the results of minimizing the quadratic function f1, which specifies the basic scaling of the variables. This function constitutes a background that must be removed by the method's metric matrix. The nfg costs of the subgradient methods here are approximately twice those of the BFGS method.
Table 1. Function $f_1(x, [a_{\max} = 10^8])$ minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg
100 | 370 | 784 | 331 | 696 | 125 | 276
200 | 527 | 1070 | 538 | 1096 | 243 | 523
300 | 738 | 1424 | 746 | 1430 | 348 | 746
400 | 934 | 1740 | 944 | 1779 | 447 | 948
500 | 1122 | 2084 | 1135 | 2129 | 542 | 1146
600 | 1298 | 2359 | 1301 | 2474 | 634 | 1334
700 | 1434 | 2645 | 1454 | 2695 | 724 | 1525
800 | 1564 | 2842 | 1598 | 2965 | 811 | 1710
900 | 1698 | 3056 | 1727 | 3166 | 897 | 1884
1000 | 1821 | 3280 | 1839 | 3429 | 982 | 2061
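For readers who want a rough point of comparison, a minimal sketch (Python) of such an experiment with SciPy's stock BFGS implementation is shown below. This is not the cubic-interpolation variant [46] used in the paper, so iteration counts will not match the table; the gradient-norm tolerance is our own proxy for the stopping rule $f(x_k) - f^* \le \varepsilon$.

```python
import numpy as np
from scipy.optimize import minimize

n, a_max, eps = 100, 1e8, 1e-10
a = a_max ** (np.arange(n) / (n - 1))
fun = lambda x: 0.5 * np.dot(a, x * x)       # f1, whose minimum value is 0
jac = lambda x: a * x
x0 = np.full(n, 100.0)

# Since the smallest eigenvalue is 1, ||grad f|| <= sqrt(2*eps) guarantees f <= eps.
res = minimize(fun, x0, jac=jac, method="BFGS",
               options={"gtol": np.sqrt(2 * eps), "maxiter": 100000})
print(res.nit, res.nfev, res.fun <= eps)
```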
Table 2 shows the results for function f2. Algorithms 2 and 3 show approximately the same results, while the results of the BFGS method are approximately two times worse. Gradient noise has a detrimental effect on the accuracy of the one-dimensional search with cubic interpolation, which uses function gradients, and the reduced accuracy of the one-dimensional descent negatively affects the BFGS method.
Table 2. Function $f_2(x, [a_{\max} = 10^8, r = 0.3])$ minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg
100 | 357 | 771 | 360 | 783 | 561 | 1321
200 | 568 | 1198 | 565 | 1185 | 1083 | 2564
300 | 769 | 1607 | 761 | 1607 | 1486 | 3514
400 | 952 | 1995 | 975 | 2041 | 1827 | 4335
500 | 1132 | 2323 | 1152 | 2405 | 2222 | 5279
600 | 1306 | 2673 | 1345 | 2753 | 2587 | 6152
700 | 1470 | 3048 | 1489 | 3068 | 2802 | 6652
800 | 1599 | 3311 | 1666 | 3419 | 3167 | 7566
900 | 1733 | 3581 | 1783 | 3689 | 3543 | 8442
1000 | 1876 | 3866 | 1930 | 3992 | 3584 | 8577
Table 3 shows the results of minimizing function f3. Algorithms 2 and 3 show approximately the same results, while the results of the BFGS method are approximately five times worse. In this problem, the variables are rescaled as the extremum is approached. Possibly, the difference is explained by the degree to which the ratio $\rho_V/L_V$ enters the estimates: for the BFGS method, according to (70), it is $\rho_V^3/(2L_V^3)$, whereas for the subgradient methods, according to (71), it is $\rho_V^2/(2L_V^2)$.
Table 3. Function $f_3(x, [a_{\max} = 10^8, b_{\max} = 10^2])$ minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg
100 | 407 | 900 | 415 | 911 | 2654 | 6901
200 | 681 | 1461 | 696 | 1482 | 4780 | 11,885
300 | 951 | 1950 | 965 | 1980 | 6373 | 15,385
400 | 1202 | 2415 | 1221 | 2480 | 7571 | 17,917
500 | 1441 | 2837 | 1458 | 2912 | 8297 | 19,434
600 | 1653 | 3257 | 1674 | 3294 | 8968 | 20,900
700 | 1864 | 3672 | 1898 | 3710 | 9572 | 22,214
800 | 2061 | 4016 | 2108 | 4090 | 9914 | 22,967
900 | 2258 | 4343 | 2288 | 4411 | 10,391 | 24,001
1000 | 2457 | 4686 | 2481 | 4761 | 10,645 | 24,500
Table 4 shows the results of minimizing function f4. Algorithms 2 and 3 show approximately the same results. The absence of quadraticity of the function, while the topology of its level surfaces remains equivalent to that of a quadratic function, affects the convergence rate of Algorithms 2 and 3 to a much lesser extent than that of the BFGS method; for the latter, the lack of quadraticity significantly slows convergence.
Table 4. Function $f_4$ minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg
100 | 154 | 267 | 156 | 295 | 953 | 2226
200 | 266 | 443 | 261 | 438 | 2012 | 4682
300 | 377 | 619 | 362 | 604 | 3136 | 7282
400 | 454 | 737 | 456 | 741 | 4314 | 10,027
500 | 556 | 889 | 573 | 929 | 5523 | 12,815
600 | 669 | 1078 | 672 | 1095 | 6747 | 15,658
700 | 762 | 1215 | 778 | 1259 | 7990 | 18,537
800 | 877 | 1400 | 870 | 1413 | 9243 | 21,430
900 | 968 | 1545 | 971 | 1558 | 10,541 | 24,455
1000 | 1094 | 1752 | 1089 | 1765 | 11,746 | 27,226
Table 5 shows the minimization results for function f5. Algorithm 2 is slightly better than Algorithm 3. This function also turned out to be difficult for the BFGS method. Like f4, it contains fourth-degree polynomials, which significantly affect the convergence rate of the BFGS method, in contrast to the subgradient methods.
Table 5. Function $f_5$ minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg
100 | 498 | 1116 | 432 | 989 | 1170 | 2847
200 | 558 | 1286 | 450 | 1051 | 1417 | 3396
300 | 609 | 1423 | 496 | 1196 | 1700 | 4118
400 | 705 | 1687 | 442 | 1100 | 1862 | 4465
500 | 686 | 1653 | 388 | 980 | 1964 | 4722
600 | 613 | 1499 | 429 | 1091 | 2081 | 4955
700 | 581 | 1434 | 433 | 1106 | 2228 | 5315
800 | 451 | 1176 | 394 | 1048 | 2180 | 5200
900 | 533 | 1361 | 430 | 1135 | 2412 | 5727
1000 | 554 | 1430 | 435 | 1188 | 2490 | 5957
Table 6 presents the results for functions f1–f5 at n = 1000; it shows the effectiveness of the methods on all the functions under study at once. Conclusions regarding the effectiveness of the methods were drawn above.
Table 6. Functions $f_1$–$f_5$ minimization results at n = 1000.
Function | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg
f1 | 1821 | 3280 | 1839 | 3429 | 982 | 2061
f2 | 1876 | 3822 | 1930 | 3992 | 3584 | 8577
f3 | 2457 | 4686 | 2481 | 4761 | 10,645 | 24,500
f4 | 1094 | 1752 | 1089 | 1765 | 11,746 | 27,226
f5 | 554 | 1430 | 435 | 1188 | 2490 | 5957
Table 7 shows the results of Algorithm 3 on the first three functions for n = 1000 with the parameters changed as $\alpha_k^2 \to \alpha_k^2 \beta_k^2 / c$, $\beta_k^2 \to c$ in accordance with (71), for the values c = {0.2; 0.1; 0.05}. On ill-conditioned problems, such changes increase the efficiency of the minimization method. These examples show that the method parameters can be tuned for a certain fixed set of optimization problems, as the sketch after this paragraph illustrates.
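For clarity, the change of parameters used in Table 7 only redistributes the product $\alpha_k^2 \beta_k^2$ between the two factors, as the following sketch (Python) shows; the function name rebalance and the sample values are ours.

```python
# The reassignment alpha_k^2 -> alpha_k^2 * beta_k^2 / c, beta_k^2 -> c leaves the
# product alpha_k^2 * beta_k^2 unchanged and only redistributes it between the factors.
def rebalance(alpha2, beta2, c):
    return alpha2 * beta2 / c, c

alpha2, beta2 = 2.0, 3.0                  # illustrative values with product 6
for c in (0.2, 0.1, 0.05):
    new_a2, new_b2 = rebalance(alpha2, beta2, c)
    print(c, new_a2 * new_b2)             # the product stays equal to 6.0
```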
Regarding the convergence rate of minimization methods, the following conclusions can be drawn:
  • For functions close in properties to quadratic (f1), the quasi-Newton BFGS method significantly exceeds subgradient Algorithms 2 and 3 in terms of convergence rate.
  • In the case of significant interference imposed on the gradients of the function (f2), subgradient Algorithms 2 and 3 are more effective than the BFGS method.
  • Variability of scales across variables (f3) affects the convergence rate of subgradient methods to a lesser extent than the BFGS method.
  • The presence of polynomial degrees higher than 2 in the minimized function affects the convergence rate of subgradient methods to a lesser extent than the BFGS method.
  • A computational experiment showed the possibility of adjusting the parameters of the method in accordance with theoretical principles. Therefore, the efficiency of the method can be increased on a certain fixed set of optimization problems.
  • The computational experiment confirmed the theoretically predicted ability of the subgradient methods to exclude the background that slows down the convergence rate.
Based on the theoretical principles and experimental results, we can conclude that the presented subgradient methods complement quasi-Newton methods when solving smooth optimization problems.

7. Conclusions

The conditionality of a minimization problem reflects the spread of the elongations of the level surfaces in different directions and determines the complexity of solving the problem. In minimization practice, it is often possible to reduce this elongation by a linear transformation of coordinates. The paper studies the ability of Newton's method and of the subgradient method with optimization of the space-metric parameters to eliminate the removable part of the problem's conditionality by means of such a linear transformation.
The paper proves that, under conditions of instability of the second derivatives of the function in the minimization domain, the estimate of the convergence rate of Newton's method is determined by the strong convexity parameter and the Lipschitz constant of the gradient in the coordinate system where their ratio is maximal. This reflects the method's ability to exclude the linear background that increases the conditionality degree of the problem. The resulting estimate of the convergence rate serves as a standard for assessing the capabilities of the subgradient method under study.
The paper studies the RSM with optimization of the parameters of the rank-two correction of metric matrices on smooth, strongly convex functions with a Lipschitz gradient, without assumptions about the existence of second derivatives of the function. Under broad assumptions on the transformation parameters of the metric matrices, an estimate of the convergence rate of the studied RSM and an estimate of its ability to exclude the removable linear background are obtained. The obtained estimates turn out to be qualitatively similar to the estimates for Newton's method.
A practical version of RSM and test functions have been developed that simulate the presence of a removable linear background. A computational experiment was carried out in which the quasi-Newton BFGS method and the subgradient method under study were compared on various types of smooth functions. The testing results indicate the effectiveness of the subgradient method in minimizing smooth functions with a high degree of conditionality of the problem and its ability to eliminate the linear background that worsens the convergence.
Depending on the type of function, one or another method dominates, which allows us to conclude that the subgradient method is applicable along with quasi-Newton methods when solving problems of minimizing smooth functions with a high degree of conditionality.

Author Contributions

Conceptualization, V.K. and E.T.; methodology, V.K. and E.T.; software, V.K.; validation, L.K. and E.T.; formal analysis, E.T.; investigation, E.T.; resources, L.K.; data curation, V.K.; writing—original draft preparation, E.T. and V.K.; writing—review and editing, E.T. and L.K.; visualization, V.K. and E.T.; supervision, L.K.; project administration, L.K.; funding acquisition L.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Higher Education of the Russian Federation (State Contract FEFE-2023-0004).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Jensen, T.L.; Diehl, M. An Approach for Analyzing the Global Rate of Convergence of Quasi-Newton and Truncated-Newton Methods. J. Optim. Theory Appl. 2017, 172, 206–221. [Google Scholar] [CrossRef]
  2. Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k^2). Sov. Math. Dokl. 1983, 27, 372–376. [Google Scholar]
  3. Rodomanov, A.; Nesterov, Y. Rates of superlinear convergence for classical quasi-Newton methods. Math. Program. 2022, 194, 159–190. [Google Scholar] [CrossRef]
  4. Rodomanov, A.; Nesterov, Y. New Results on Superlinear Convergence of Classical Quasi-Newton Methods. J. Optim. Theory Appl. 2021, 188, 744–769. [Google Scholar] [CrossRef]
  5. Jin, Q.; Mokhtari, A. Non-asymptotic superlinear convergence of standard quasi-Newton methods. Math. Program. 2023, 200, 425–473. [Google Scholar] [CrossRef]
  6. Davis, K.; Schulte, M.; Uekermann, B. Enhancing Quasi-Newton Acceleration for Fluid-Structure Interaction. Math. Comput. Appl. 2022, 27, 40. [Google Scholar] [CrossRef]
  7. Hong, D.; Li, G.; Wei, L.; Li, D.; Li, P.; Yi, Z. A self-scaling sequential quasi-Newton method for estimating the heat transfer coefficient distribution in the air jet impingement. Int. J. Therm. Sci. 2023, 185, 108059. [Google Scholar] [CrossRef]
  8. Argyros, I.; George, S. On a unified convergence analysis for Newton-type methods solving generalized equations with the Aubin property. J. Complex. 2024, 81, 101817. [Google Scholar] [CrossRef]
  9. Dennis, J.E.; Schnabel, R.B. Numerical Methods for Unconstrained Optimization and Nonlinear Equations; SIAM: Philadelphia, PA, USA, 1996. [Google Scholar]
  10. Polak, E. Computational Methods in Optimization; Mir: Moscow, Russia, 1974. [Google Scholar]
  11. Indrapriyadarsini, S.; Mahboubi, S.; Ninomiya, H.; Kamio, T.; Asai, H. Accelerating Symmetric Rank-1 Quasi-Newton Method with Nesterov’s Gradient for Training Neural Networks. Algorithms 2022, 15, 6. [Google Scholar] [CrossRef]
  12. Mokhtari, A.; Eisen, M.; Ribeiro, A. An incremental quasi-Newton method with a local superlinear convergence rate. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4039–4043. [Google Scholar] [CrossRef]
  13. Mokhtari, A.; Eisen, M.; Ribeiro, A. IQN: An incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 2018, 28, 1670–1698. [Google Scholar] [CrossRef]
  14. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528. [Google Scholar] [CrossRef]
  15. Berahas, A.S.; Jahani, M.; Richtárik, P.; Takác, M. Quasi-Newton Methods for Machine Learning: Forget the Past, Just Sample. Optim. Methods Softw. 2022, 37, 1668–1704. [Google Scholar] [CrossRef]
  16. Mokhtari, A.; Ribeiro, A. Regularized stochastic BFGS algorithm. IEEE Trans. Signal Proc. 2014, 62, 1109–1112. [Google Scholar] [CrossRef]
  17. Gower, R.; Richtárik, P. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 2017, 38, 1380–1409. [Google Scholar] [CrossRef]
  18. Gao, W.; Goldfarb, D. Quasi-Newton methods: Superlinear convergence without line searches for self-concordant functions. Optim. Methods Softw. 2019, 34, 194–217. [Google Scholar] [CrossRef]
  19. Byrd, R.H.; Hansen, S.L.; Nocedal, J.; Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim 2016, 26, 1008–1031. [Google Scholar] [CrossRef]
  20. Meng, S.; Vaswani, S.; Laradji, I.; Schmidt, M.; Lacoste-Julien, S. Fast and Furious Convergence: Stochastic Second Order Methods Under Interpolation. 2019. Available online: https://arxiv.org/pdf/1910.04920.pdf (accessed on 30 March 2024).
  21. Zhou, C.; Gao, W.; Goldfarb, D. Stochastic adaptive quasi-Newton methods for minimizing expected values. In Proceedings of the 34th ICML (PMLR), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 4150–4159. [Google Scholar]
  22. Makmuang, D.; Suppalap, S.; Wangkeeree, R. The regularized stochastic Nesterov’s accelerated Quasi-Newton method with applications. J. Comput. Appl. Math. 2023, 428, 115190. [Google Scholar] [CrossRef]
  23. Rodomanov, A.; Nesterov, Y. Greedy quasi-Newton methods with explicit superlinear convergence. SIAM J. Optim. 2021, 31, 785–811. [Google Scholar] [CrossRef]
  24. Lin, D.; Ye, H.; Zhang, Z. Explicit Convergence Rates of Greedy and Random Quasi-Newton Methods. J. Mach. Learn. Res. 2022, 23, 1–40. [Google Scholar]
  25. Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987. [Google Scholar]
  26. Karmanov, V. Mathematical Programming; Mir: Moscow, Russia, 1989. [Google Scholar]
  27. Krutikov, V.N.; Stanimirović, P.S.; Indenko, O.N.; Tovbis, E.M.; Kazakovtsev, L.A. Optimization of Subgradient Method Parameters Based on Rank-Two Correction of Metric Matrices. J. Appl. Ind. Math. 2022, 16, 427–439. [Google Scholar] [CrossRef]
  28. Krutikov, V.; Gutova, S.; Tovbis, E.; Kazakovtsev, L.; Semenkin, E. Relaxation Subgradient Algorithms with Machine Learning Procedures. Mathematics 2022, 10, 3959. [Google Scholar] [CrossRef]
  29. Krutikov, V.; Tovbis, E.; Stanimirović, P.; Kazakovtsev, L. On the Convergence Rate of Quasi-Newton Methods on Strongly Convex Functions with Lipschitz Gradient. Mathematics 2023, 11, 4715. [Google Scholar] [CrossRef]
  30. Shor, N.Z. Application of the gradient descent method for solving network transportation problems. In Scientific Seminar on Theoretic and Applied Problems of Cybernetics and Operations Research; Nauch. Sovet po Kibernetike Akad. Nauk: Kiev, Ukraine, 1962; pp. 9–17. [Google Scholar]
  31. Polyak, B. A general method for solving extremum problems. Sov. Math. Dokl. 1967, 8, 593–597. [Google Scholar]
  32. Gol’shtein, E.G.; Nemirovskii, A.S.; Nesterov, Y.E. The level method and its generalizations and applications. Ekon. Mat. Metody 1983, 31, 164–180. (In Russian) [Google Scholar]
  33. Nesterov, Y. Universal gradient methods for convex optimization problems. Math. Program. Ser. A. 2015, 152, 381–404. [Google Scholar] [CrossRef]
  34. Gasnikov, A.V.; Nesterov, Y.E. Universal Method for Stochastic Composite Optimization. arXiv 2016, arXiv:1604.05275. [Google Scholar] [CrossRef]
  35. Nemirovskii, A.S.; Yudin, D.B. Complexity of Problems and Efficiency of Methods in Optimization; Nauka: Moscow, Russia, 1979. [Google Scholar]
  36. Shor, N. Minimization Methods for Nondifferentiable Functions; Springer: Berlin, Germany, 1985. [Google Scholar]
  37. Polyak, B.T. Minimization of nonsmooth functional. Zh. Vychisl. Mat. Mat. Fiz. 1969, 9, 509–521. [Google Scholar]
  38. Krutikov, V.N.; Samoilenko, N.S.; Meshechkin, V.V. On the Properties of the Method of Minimization for Convex Functions with Relaxation on the Distance to Extremum. Autom. Remote Control 2019, 80, 102–111. [Google Scholar] [CrossRef]
  39. Wolfe, P. Note on a method of conjugate subgradients for minimizing nondifferentiable functions. Math. Program. 1974, 7, 380–383. [Google Scholar] [CrossRef]
  40. Lemarechal, C. An extension of Davidon methods to non-differentiable problems. Math. Program. Study 1975, 3, 95–109. [Google Scholar]
  41. Dem’yanov, V.F.; Vasil’ev, L.V. Non-Differentiable Optimization; Nauka: Moscow, Russia, 1981. (In Russian) [Google Scholar]
  42. Skokov, V.A. Note on minimization methods employing space stretching. Cybern. Syst. Anal. 1974, 10, 689–692. [Google Scholar] [CrossRef]
  43. Krutikov, V.N.; Gorskaya, T.A. A family of subgradient relaxation methods with rank 2 correction of metric matrices. Ekon. Mat. Metody 2009, 45, 37–80. [Google Scholar]
  44. Tsypkin, Y.Z. Foundations of the Theory of Learning Systems; Academic Press: New York, NY, USA, 1973. [Google Scholar]
  45. Nurminsky, E.A.; Tien, D. Method of conjugate subgradients with constrained memory. Autom. Remote Control 2014, 75, 646–656. [Google Scholar] [CrossRef]
  46. Bunday, B.D. Basic Optimization Methods; Edward Arnold: London, UK, 1984. [Google Scholar]
Figure 1. The set G and its characteristics [28].
Figure 2. Selection of subgradient vectors for inexact one-dimensional descent in the method of solving systems of inequalities.
Figure 3. Properties of vectors a, b, c, p.
Table 7. Functions $f_1$–$f_3$ minimization results at n = 1000, Algorithm 3, changed parameters $\alpha_k^2$, $\beta_k^2$, variants of parameter c.
Variant | f1 N_it | f1 nfg | f2 N_it | f2 nfg | f3 N_it | f3 nfg
no changes | 1821 | 3280 | 1876 | 3866 | 2457 | 4686
c = 0.2 | 1655 | 2941 | 1800 | 3800 | 2240 | 4215
c = 0.1 | 1561 | 2798 | 1796 | 3756 | 2036 | 3815
c = 0.05 | 1625 | 2974 | 1877 | 3905 | 2175 | 4038