Article

Properties and Application of Incomplete Orthogonalization in the Directions of Gradient Difference in Optimization Methods

1 Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, 31, Krasnoyarskii Rabochii Prospekt, Krasnoyarsk 660037, Russia
2 Department of Applied Mathematics, Kemerovo State University, 6 Krasnaya Street, Kemerovo 650043, Russia
3 Laboratory “Hybrid Methods of Modeling and Optimization in Complex Systems”, Siberian Federal University, 79 Svobodny Prospekt, Krasnoyarsk 660041, Russia
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(24), 4036; https://doi.org/10.3390/math13244036
Submission received: 10 November 2025 / Revised: 8 December 2025 / Accepted: 16 December 2025 / Published: 18 December 2025

Abstract

This paper considers the problem of unconstrained minimization of smooth functions. Despite the high efficiency of quasi-Newton methods such as BFGS, their performance degrades in ill-conditioned problems with unstable or rapidly varying Hessians—for example, in functions with curved ravine structures. This necessitates alternative approaches that rely not on second-derivative approximations but on the topological properties of level surfaces. As a new methodological framework, we propose using a procedure of incomplete orthogonalization in the directions of gradient differences, implemented through the iterative least-squares method (ILSM). Two new methods are constructed based on this approach: a gradient method with the ILSM metric (HY_g) and a modification of the Hestenes–Stiefel conjugate gradient method with the same metric (HY_XS). Both methods are shown to have linear convergence on strongly convex functions and finite convergence on quadratic functions. A numerical experiment was conducted on a set of test functions. The results show that the proposed methods significantly outperform BFGS (by a factor of 2 for HY_g and 3.5 for HY_XS in the number of iterations) when solving ill-conditioned problems with varying Hessians or complex level-surface topologies, while providing comparable or better performance even in high-dimensional problems. This confirms the potential of using topology-based metrics alongside classical quasi-Newton strategies.

1. Introduction

We consider the problem of minimizing a differentiable function f(x) on a finite-dimensional Euclidean space. This problem can be stated as
$$f(x) \to \min, \quad x \in \mathbb{R}^n.$$
Well-known methods [1,2] were developed to solve the unconstrained minimization problem, including the gradient method, which is based on the idea of a local linear approximation of the function. Conjugate gradient methods (CGM) generate search directions that are consistent with the geometry of the minimized function. In practice, CGM shows faster convergence than gradient descent algorithms, so it is widely used in machine learning.
Quasi-Newton methods (QNM) are based on the idea of using a matrix of second derivatives reconstructed from the gradients of a function. The commonly used matrix update formula for quasi-Newton methods is BFGS [3,4,5,6]. Two quasi-Newton algorithms for solving unconstrained optimization problems based on two modified secant relations to achieve reliable approximations of the Hessian matrices of the objective function were presented in [7].
Ultra-high dimensions and strong nonlinearity can lead to extremely complex optimization landscapes, for which gradient-based optimization solvers will perform poorly or fail easily [8]. Without denying the merits of the gradient descent method, it must be said that it turns out to be very slow when moving along a ravine, and as the number of variables of the objective function increases, such behavior of the method becomes typical. Performance of QNM degrades in ill-conditioned problems with unstable or rapidly varying Hessians—for example, in functions with curved ravine structures.
Relaxation subgradient methods (RSM) have been widely used in optimization practice for many years and have found practical application in such areas as signal and image processing [9,10], classification [11], network design [12], maintenance routing [13], dynamic process modeling [14], and many others.
In our earlier studies, it turned out that the problem of finding the direction of descent in the RSM can be reduced to the problem of solving a system of inequalities on subgradient sets and mathematically formulated as a solution to the problem of minimizing some quality functional. In this case, the properties of the learning algorithm determined the convergence rate of the minimization method.
In relaxation processes of ε-subgradient type, successive approximations are constructed as follows:
$$x_{k+1} = x_k - \gamma_k s_k, \quad \gamma_k = \arg\min_{\gamma} f(x_k - \gamma s_k), \quad k = 0, 1, 2, \ldots \tag{1}$$
Here, k is the iteration number, $\gamma_k$ is the stepsize, and the descent direction $s_k$ is selected from the set $S(\partial_\varepsilon f(x_k))$ [1,15,16,17,18,19], where $\partial_\varepsilon f(x_k)$ is the $\varepsilon$-subgradient set at the point $x_k$, $S(G) = \{\, s \in \mathbb{R}^n \mid \min_{g \in G} (s, g) > 0 \,\}$, $G \subset \mathbb{R}^n$, is the set of feasible directions, and $g$ denotes gradients.
Denote by $\partial f(x) \equiv \partial_{\varepsilon=0} f(x)$ the subgradient set at a point x. If the set S(G) is not empty, then any vector $s \in S(G)$ is a solution to the system of inequalities:
$$(s, g) > 0, \quad \forall g \in G; \tag{2}$$
in other words, it defines the normal of a hyperplane separating the origin from the set G. Here, (s, g) is the dot product of vectors. One of the solutions to (2) is the vector of minimal length of G, denoted η(G).
The subgradient method was first proposed by Shor [15,20]. A number of effective approaches arose as a result of the development of the first subgradient methods with a space dilation [21,22]. The first relaxation subgradient methods were suggested in [23,24,25].
In Ref. [26], a method for solving convex nondifferentiable optimization problems, relying on the basic philosophy of the conjugate gradient method and coinciding with it in the case of quadratic functions, was presented. The authors of [27] propose a family of adaptive subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. A spectral-step subgradient method for solving nonsmooth unconstrained optimization problems was demonstrated in [28]. In Ref. [29], a subgradient method based on step size adjustment is developed to solve systems of nonsmooth equations. Subgradient projection methods are often applied to large-scale problems with decomposition techniques. Adaptive projection subgradient methods were proposed in [30,31]. The authors of [32] consider a projected-subgradient type method derived from using a general distance-like function instead of the usual Euclidean squared distance.
As in QNM, RSM uses relations to calculate the required characteristics. These relations are equalities whose solution determines the desired values. Because of the multiplicity of these equalities, it is necessary to solve them by gradient-type methods as the information becomes available. Thus, the iterative least squares method (ILSM) can be used in this case.
In this paper, we consider modifications of subgradient methods applied to solving smooth minimization problems. In this case, the system of inequalities (2) will be formed by a set of gradients of some neighborhood of the current minimum. The descent direction s satisfying (2) will allow the method to go beyond this neighborhood as a result of the minimization step along this direction.
The main contributions of this paper are as follows:
(1) A methodological approach was proposed based on a procedure of incomplete orthogonalization in the directions of gradient differences, implemented through the iterative least-squares method. This approach uses the structural characteristics of the level surfaces instead of second-derivative approximations.
(2) Two methods were constructed based on this approach: a gradient method with the ILSM metric and a modification of the Hestenes–Stiefel conjugate gradient method with the same metric. Both methods were implemented and numerically studied in comparison with the quasi-Newton BFGS method on various types of smooth functions. The test results indicate the efficiency of the proposed methods, especially when solving poorly conditioned problems with complex curved ravines and unstable characteristics of the second derivatives.
(3) The noted linear transformation of coordinates eliminates the linear background that worsens the convergence of the gradient method. In this work, we proved that algorithm HY_g, where the gradient method acquires accelerating abilities due to the use of the proposed metric transformation, has properties similar to those of Newton’s method, quasi-Newton methods, and subgradient methods with a change in the space metric. The qualitative nature of the convergence rate estimates for algorithm HY_g and Newton’s method coincides. Also, the equivalence of the conjugate gradient method with the new metric (HY_XS) to the conjugate gradient method on quadratic functions is proven.
The results obtained allow us to conclude that the studied methods can be used alongside quasi-Newton methods to solve poorly conditioned smooth optimization problems.
The rest of the paper is organized as follows. Section 2 describes our variant of the gradient minimization method. In Section 3, we analyze the properties and convergence rate of the proposed algorithm. In Section 4, we study the acceleration properties of the space dilation algorithm in the direction of the gradient difference. In Section 5, we propose the Hestenes–Stiefel method in a metric with incomplete orthogonalization in the direction of the gradient difference. In Section 6, we present the results of numerical experiments. Section 7 concludes the work.

2. Gradient Minimization Method with Incomplete Orthogonalization in the Direction of the Gradient Difference

For the gradients on the descent trajectory of (1), we use the notation $\nabla f(x_k) = g(x_k) = g_k$. The idea of creating an iterative method for solving the system of inequalities (2) is based on its transformation into a system of equalities [17,18,19].
To derive the formulas of the iterative process, we use an idealized model of the set G [17,18,19]. Let $G \subset \mathbb{R}^n$ belong to a certain hyperplane, and let the vector η(G) also be the vector of minimal length of this hyperplane. Then there is a solution to the system of equalities:
$$(s, g) = 1, \quad \forall g \in G, \tag{3}$$
which simultaneously satisfies (2). Here, s is a descent direction, g is the gradient. The solution of system (3) can be obtained as the solution of the system of equalities:
$$(s, g_i) = 1, \quad i = 0, 1, \ldots, k. \tag{4}$$
One of the possible solutions to the system (4) can be found in the form $s_{k+1} = \arg\min_s F_k(s)$, where $F_k(s)$ is the sum-of-squares function of the residuals:
$$F_k(s) = \sum_{i=0}^{k} w_i Q_i(s) + \frac{1}{2}\sum_{i=1}^{n} s_i^2, \quad Q_i(s) = \frac{1}{2}\bigl(1 - (s, g_i)\bigr)^2.$$
Here, $w_i$ are weighting factors. Such a solution, taking into account the regularizing component $\frac{1}{2}\sum_{i=1}^{n} s_i^2$, can be obtained by the iterative least squares method (ILSM) [33].
In Ref. [19], based on ILSM, an iterative process of finding the descent direction for the minimization method (1) using training information (4) was obtained using special weighting factors wi in F k ( s ) :
$$s_{k+1} = s_k + H_k g_k \frac{1 - (s_k, g_k)}{(g_k, H_k g_k)}, \quad s_0 = 0, \tag{5}$$
$$H_{k+1} = H_k - \left(1 - \frac{1}{\alpha_k^2}\right)\frac{H_k g_k g_k^T H_k^T}{(g_k, H_k g_k)}, \quad H_0 = I, \tag{6}$$
where $\alpha_k > 1$ is the space dilation parameter and $H_k$ is the metric matrix. In (5), the direction adjustment is chosen so that the learning relation $(s_{k+1}, g_k) = 1$ is fulfilled.
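To make the update concrete, a single ILSM training step (5)–(6) can be sketched as follows (a minimal illustration in Python; the function name and the choice α = 2 are our assumptions):

```python
import numpy as np

def ilsm_step(s, H, g, alpha=2.0):
    """One ILSM training step: direction update (5) and metric update (6).

    s, g : current direction and gradient (1-D arrays); H : metric matrix;
    alpha > 1 : space dilation parameter. By construction the returned
    direction satisfies the learning relation (s_new, g) = 1.
    """
    Hg = H @ g
    gHg = g @ Hg
    s_new = s + Hg * (1.0 - s @ g) / gHg                          # Eq. (5)
    H_new = H - (1.0 - 1.0 / alpha**2) * np.outer(Hg, Hg) / gHg   # Eq. (6)
    return s_new, H_new
```

Substituting the returned direction into the dot product with g gives $(s_{\text{new}}, g) = (s, g) + (1 - (s, g)) = 1$, which is exactly the learning relation stated above.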
Using (5), (6), we derive the iterative process obtained earlier on the basis of heuristic considerations for solving problems of non-smooth optimization (r-algorithm [15]) in the form proposed in [16].
To obtain the formulas of the iterative process at iteration k, we transform the data (4) by subtracting adjacent equalities. We will obtain a new system:
$$(s, y_i) = 0, \quad i = 0, 1, \ldots, k-1; \quad (s, g_k) = 1; \quad y_i = g_i - g_{i+1}. \tag{7}$$
For the part of the data with i = 0, 1, …, k − 1, we perform transformations (5), (6). Transformation (5) for $(s, y_i) = 0$ takes the following form:
$$s_{i+1} = s_i + H_i y_i \frac{0 - (s_i, y_i)}{(y_i, H_i y_i)}, \quad s_0 = 0. \tag{8}$$
As a result of transformations (8), regardless of the values of the matrices $H_i$, by virtue of the equality $s_0 = 0$ we obtain $s_i = 0$, i = 0, 1, …, k − 1. Together with (8), we sequentially carry out transformation (6):
$$H_{i+1} = H_i - \left(1 - \frac{1}{\alpha_k^2}\right)\frac{H_i y_i y_i^T H_i^T}{(y_i, H_i y_i)}, \quad H_0 = I, \quad i = 0, 1, \ldots, k-1. \tag{9}$$
Transformation (6) for the last equality $(s, g_k) = 1$ from (7) can be omitted because, as a result of transformation (5) for this equality, a vector $s_k$ collinear to the vector $H_k g_k$ is obtained. Therefore, the iterative minimization process using transformations (5), (6) for the data (7) at k = 0, 1, 2, … has the form (1), where the descent direction and the metric matrix are updated by the following formulas:
$$s_k = H_k g_k,$$
$$H_{k+1} = H_k - \left(1 - \frac{1}{\alpha_k^2}\right)\frac{H_k y_k y_k^T H_k^T}{(y_k, H_k y_k)}, \quad H_0 = I, \quad y_k = g_k - g_{k+1}, \tag{10}$$
with fixed values $\alpha_k^2 = \alpha^2 > 1$.
Thus, using the idea of obtaining the descent direction satisfying the system of inequalities (2), we derived the iterative process of the r-algorithm [15] in the form of [16], which was previously obtained based on heuristic considerations.
In Ref. [4], to speed up the process of finding the descent direction in method (5), (6), one of its special cases is presented. An algorithm for obtaining a direction $s_{k+1}$ satisfying, in contrast to (5), simultaneously the last two learning relations $(s_{k+1}, g_k) = 1$ and $(s_{k+1}, g_{k-1}) = 1$ is proposed and theoretically justified. In this algorithm, the formulas for obtaining the descent direction are as follows:
$$p_k = \begin{cases} g_k, & \text{if } \bigl(k \ge 1 \text{ and } (g_k, H_k g_{k-1}) > 0\bigr) \text{ or } k = 0, \\[4pt] g_k - g_{k-1}\dfrac{(g_k, H_k g_{k-1})}{(g_{k-1}, H_k g_{k-1})}, & \text{otherwise}, \end{cases} \tag{11}$$
$$s_{k+1} = s_k + H_k p_k \frac{1 - (s_k, g_k)}{(g_k, H_k p_k)}. \tag{12}$$
In this case, the metric matrix is updated according to formula (10) with a varying space dilation parameter $\alpha_k^2 > 1$. Note that, in order to ensure convergence when solving non-smooth problems, the choice of the descent direction is not always carried out according to Formulas (11) and (12).
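The branch logic of (11) together with the correction (12) can be sketched as follows (a hedged illustration; the function name is ours, and H is assumed positive definite):

```python
import numpy as np

def direction_update(k, s, H, g, g_prev):
    """Direction update via Eqs. (11)-(12).

    The returned direction satisfies (s_new, g) = 1. In the second branch
    of (11), (H p, g_prev) = 0, so the value of (s, g_prev) is preserved;
    if it equals 1 from the previous step, both learning relations hold.
    """
    if k == 0 or g @ (H @ g_prev) > 0:
        p = g                                                    # Eq. (11), first branch
    else:
        p = g - g_prev * (g @ (H @ g_prev)) / (g_prev @ (H @ g_prev))
    return s + (H @ p) * (1.0 - s @ g) / (g @ (H @ p))           # Eq. (12)
```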
Here and below, we denote an algorithm by the sequence of equations that define its actions. In Ref. [17], estimates of the optimal values of the parameter $\alpha_k^2 > 1$ are obtained, and it is shown that the algorithm (1)–(11)–(12)–(10) in the quadratic case generates a sequence of conjugate descent vectors. In the case of smooth functions, it becomes possible to use the conjugate gradient method in the algorithm (1)–(11)–(12)–(10) instead of Formulas (11), (12) for generating the descent direction.
The transformation of the ILSM metric (10) in the minimization algorithm has the property of partially orthogonalizing the descent vectors to the directions of adjacent gradient differences on the descent trajectory. In the case of minimizing a quadratic function, the use of transformations (10) to form the descent direction increases the degree of conjugacy.
In Ref. [18], estimates of the linear convergence rate for the algorithm (1)–(9)–(10) are obtained on strongly convex functions with a Lipschitz gradient. It is proven that on such functions this method has accelerating properties qualitatively similar to those of Newton’s method. The fact that Newton’s method [18] and quasi-Newton methods have accelerating properties is explained by their use of either matrices of second derivatives or their approximations. The accelerating properties of the algorithm (1)–(9)–(10) are due to the property of the matrices (10) of partially orthogonalizing the descent vectors to the directions of adjacent gradient differences on the descent trajectory.
It was shown in [18] that the minimization algorithm (1) with the calculation of the descent direction based on (11), (12), and the transformation of matrices (10) is identical to the conjugate gradient method when minimizing quadratic functions. In this paper, the Hestenes–Stiefel conjugate gradient method with metric matrices (10) was developed to solve problems of minimizing smooth functions. The finite convergence of such a method on quadratic functions is proved.

3. Properties and Convergence Rate of the Gradient Minimization Method with Incomplete Orthogonalization in the Direction of the Gradient Difference

Here, we analyze the properties of the algorithm (1)–(9)–(10). The algorithm consists of the following three steps:
1. Compute the next minimum point:
$$x_{k+1} = x_k - \gamma_k s_k, \quad \gamma_k = \arg\min_{\gamma} f(x_k - \gamma s_k), \quad k = 0, 1, 2, \ldots;$$
2. Compute the descent direction $s_k = H_k g_k$;
3. Update the metric matrix:
$$H_{k+1} = H_k - \left(1 - \frac{1}{\alpha_k^2}\right)\frac{H_k y_k y_k^T H_k^T}{(y_k, H_k y_k)}, \quad H_0 = I, \quad y_k = g_k - g_{k+1}.$$
In compact form, the iteration and the metric matrix update are written as follows:
$$x_{k+1} = x_k - \gamma_k s_k, \quad s_k = H_k g_k, \quad \gamma_k = \arg\min_{\gamma} f(x_k - \gamma s_k), \quad k = 0, 1, 2, \ldots, \tag{13}$$
$$H_{k+1} = H_k - \left(1 - \frac{1}{\alpha_k^2}\right)\frac{H_k y_k y_k^T H_k^T}{(y_k, H_k y_k)}, \quad \alpha_M \ge \alpha_k \ge \alpha_m > 1, \quad y_k = g_k - g_{k+1}, \tag{14}$$
where k is the iteration number, $x_k$ is the current minimum point, $s_k$ is the descent direction, $H_k$ is the metric matrix at iteration k, $g_k$ is the gradient at iteration k, $\gamma_k$ is the stepsize, $\alpha_k$ is the space dilation parameter, $y_k$ is the gradient difference, $x_0$ is the initial point, and $H_0$ is the initial matrix.
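A minimal sketch of algorithm (13)–(14) on a quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$, where the exact one-dimensional descent has the closed form $\gamma_k = (s_k, g_k)/(s_k, A s_k)$ (the function name and the fixed value α = 2 are our assumptions, not the paper's implementation):

```python
import numpy as np

def hy_g_quadratic(A, b, x0, alpha=2.0, tol=1e-9, max_iter=2000):
    """Algorithm (13)-(14) on f(x) = x'Ax/2 - b'x with exact line search."""
    x, H = x0.astype(float), np.eye(len(x0))
    g = A @ x - b                                    # gradient of the quadratic
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        s = H @ g                                    # descent direction, Eq. (13)
        x = x - (s @ g) / (s @ (A @ s)) * s          # exact one-dimensional descent
        g_new = A @ x - b
        y = g - g_new                                # y_k = g_k - g_{k+1}
        Hy = H @ y
        H = H - (1 - 1/alpha**2) * np.outer(Hy, Hy) / (y @ Hy)   # Eq. (14)
        g = g_new
    return x, k
```

Since $H_k$ remains positive definite, every step is along a descent direction, and on an ill-conditioned diagonal test matrix the iterates approach the minimizer $x^* = A^{-1}b$.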
As will be shown below, method (13)–(14) is invariant with respect to a linear transformation of coordinates. By invariance, we mean the similarity of the process type in different systems. This means that when the iterations of the method are transferred to a new coordinate system, the iterative process is preserved. In this case, no conditions are imposed on the function; the only condition is its differentiability. For instance, in the case of a quadratic function, after transferring to a coordinate system where the Hessian has eigenvalues equal to one, the gradient differences $y_k = g_k - g_{k+1}$ will coincide with the method offsets $v_k = x_k - x_{k+1}$ during the iteration. Consequently, in the new coordinate system, the choice of the descent direction in (13) looks like the process of partial orthogonalization of the gradient to the previous descent directions using matrices (14). With complete orthogonalization, such a minimization process along the orthogonal system of directions for a quadratic function with unit Hessian will be finite [1].
The limiting variant of algorithm (13)–(14) for $\alpha_k^2 \to \infty$ takes the form of the previously known conjugate gradient method [1]. In this method, the descent directions are orthogonalized to all previous differences of adjacent gradients. This property does not allow the method to optimize in the subspace defined by the gradient differences, that is, it excludes it from the optimization process. Such a method will not be effective, especially for large dimensions. In algorithm (13)–(14), only a partial reduction of the components of the descent direction in the subspace of the preceding differences of adjacent gradients is performed. This does not block the possibility of minimization in the examined subspace. Transformation (14) suppresses to a greater extent the gradient components directed orthogonally to the ravine directions. This is one of the qualitative justifications explaining the accelerating properties of the transformation (14), on the basis of which the r-algorithm was developed [15].
Cases of low efficiency of quasi-Newton methods are associated with functions in which the Hessian eigenvalues decrease as they approach the minimum. In this case, the inverse matrices of quasi-Newton methods increase in the examined subspace, which makes it difficult for the method to enter the unexplored subspace. In algorithms (13)–(14), by reducing the influence of the examined subspace, the unexplored subspace always has an advantage. For this reason, for some classes of functions, algorithm (13)–(14) may be more effective than quasi-Newton methods, which will be confirmed by numerical tests.
Transformation (14) in algorithm (13)–(14) can be understood as a scaling transformation for the gradient method. This transformation may be useful as a scaler in other methods. In this paper, we will consider the use of transformation (14) in the Hestenes–Stiefel conjugate gradient method.
We investigate the convergence rate of algorithm (13)–(14). Let us consider the conditions under which the convergence rate and accelerating properties of the r-algorithm are estimated.
Condition 1. 
We assume that the function f(x), $x \in \mathbb{R}^n$, is differentiable and strongly convex on $\mathbb{R}^n$, i.e., there exists ρ > 0 such that for all $x, y \in \mathbb{R}^n$ and $\alpha \in [0, 1]$ the following inequality holds:
$$f(\alpha x + (1 - \alpha)y) \le \alpha f(x) + (1 - \alpha) f(y) - \alpha(1 - \alpha)\rho \|x - y\|^2 / 2,$$
and its gradient $g(x) = \nabla f(x)$ satisfies the Lipschitz condition:
$$\|g(x) - g(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^n, \quad L > 0.$$
Here, L is the Lipschitz constant.
Denote by x* the minimum point of the function f(x), f* = f(x*), $f_k = f(x_k)$. The iteration of a gradient-consistent method with exact one-dimensional descent has the form:
$$x_{k+1} = x_k - \beta_k s_k, \quad \beta_k = \arg\min_{\beta \ge 0} f(x_k - \beta s_k), \quad (s_k, g_k) > 0, \quad k = 0, 1, 2, \ldots, \tag{15}$$
where the initial point $x_0$ is given. Formula (15) does not specify the descent direction $s_k$; only the condition $(s_k, g_k) > 0$ is imposed on it. Note that the direction $s_k = H_k g_k$ from (13) also satisfies this condition due to the positive definiteness of the metric matrices.
The following theorem shows that changes in the gradient over the iterations of method (15) with exact one-dimensional descent lead to a decrease in the objective function.
Theorem 1 
[18]. Let the function satisfy Condition 1. Then, for the sequence $f_k$, k = 0, 1, 2, … specified by (15), the following estimate holds:
$$f_{k+1} - f^* \le (f_0 - f^*)\exp\left(-\frac{\rho^2}{L^2}\sum_{i=0}^{k}\frac{\|y_i\|^2}{\|g_i\|^2}\right), \tag{16}$$
where $y_i = g_{i+1} - g_i$, f* is the optimal function value, $f_0$ is the initial function value, L is the Lipschitz constant, and ρ is defined in Condition 1.
Here, we derive recurrent formulas for the determinants and traces of the metric matrices under consideration. Denote $A_k = H_k^{-1}$. Denote by Sp(A) the trace of a matrix A and by det(A) the determinant of a matrix A. For an arbitrary matrix A > 0, we denote by $A^{1/2}$ the symmetric matrix for which $A^{1/2} > 0$ and $A^{1/2}A^{1/2} = A$.
Lemma 1. 
Let $H_k > 0$ and let the matrix $H_{k+1}$ be obtained as a result of transformation (14):
$$H_{k+1} = H_k - \left(1 - \frac{1}{\alpha_k^2}\right)\frac{H_k y_k y_k^T H_k^T}{(y_k, H_k y_k)}, \quad \alpha_M \ge \alpha_k \ge \alpha_m > 1, \quad y_k = g_k - g_{k+1}.$$
Then $H_{k+1} > 0$ and
$$A_{k+1} = A_k + (\alpha_k^2 - 1)\frac{y_k y_k^T}{(y_k, H_k y_k)}, \tag{17}$$
$$\mathrm{Sp}\,A_{k+1} = \mathrm{Sp}\,A_k + (\alpha_k^2 - 1)\frac{(y_k, y_k)}{(y_k, H_k y_k)}, \tag{18}$$
$$\det H_{k+1} = \det H_k / \alpha_k^2, \quad \det A_{k+1} = \alpha_k^2 \det A_k. \tag{19}$$
Proof of Lemma 1. 
We transform the right-hand side of expression (14) using the notation introduced earlier:
$$H_{k+1} = H_k^{1/2}\left[I - \left(1 - \frac{1}{\alpha_k^2}\right)\frac{H_k^{1/2} y_k y_k^T H_k^{1/2}}{(H_k^{1/2} y_k, H_k^{1/2} y_k)}\right]H_k^{1/2} = H_k^{1/2}\left[I - \left(1 - \frac{1}{\alpha_k^2}\right)\frac{v_k v_k^T}{(v_k, v_k)}\right]H_k^{1/2}, \quad v_k = H_k^{1/2} y_k. \tag{20}$$
Inverting the matrices on the right-hand side of (20) (for a rank-one update this can be done via the Sherman–Morrison formula), we obtain (17). Formula (18) follows from (17) by taking traces. Computing the determinants of the factors in (20), we obtain (19). □
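The identities of Lemma 1 are easy to check numerically; the sketch below (our own illustration, with $\alpha_k = 1.5$) verifies (17)–(19) against transformation (14) for a random positive definite $H_k$:

```python
import numpy as np

# Check Lemma 1: build H_{k+1} by Eq. (14) and verify (17), (18), (19).
rng = np.random.default_rng(1)
n, alpha = 5, 1.5
B = rng.normal(size=(n, n))
Hk = B @ B.T + n * np.eye(n)          # a random matrix H_k > 0
y = rng.normal(size=n)                # plays the role of y_k
Hy = Hk @ y
Hk1 = Hk - (1 - 1/alpha**2) * np.outer(Hy, Hy) / (y @ Hy)     # Eq. (14)
Ak, Ak1 = np.linalg.inv(Hk), np.linalg.inv(Hk1)

# Eq. (17): rank-one update of the inverse (Sherman-Morrison)
assert np.allclose(Ak1, Ak + (alpha**2 - 1) * np.outer(y, y) / (y @ Hy))
# Eq. (18): take the trace of (17)
assert np.isclose(np.trace(Ak1), np.trace(Ak) + (alpha**2 - 1) * (y @ y) / (y @ Hy))
# Eq. (19): determinants scale by exactly alpha^2
assert np.isclose(np.linalg.det(Hk1), np.linalg.det(Hk) / alpha**2)
```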
The following theorem substantiates the linear convergence rate of algorithm (13)–(14).
Theorem 2. 
Let the function f(x) satisfy Condition 1. Then, for the sequence {$f_k$}, k = 0, 1, 2, … given by the algorithm (13)–(14), with a bounded initial matrix $H_0$:
$$m_0 \le \frac{(H_0 z, z)}{(z, z)} \le M_0, \quad \forall z \in \mathbb{R}^n, \ z \ne 0, \tag{21}$$
where $M_0$ and $m_0$ are the maximal and minimal eigenvalues of the matrix $H_0$, respectively, the following estimate holds:
$$f_{k+1} - f^* \le (f_0 - f^*)\exp\left(-\frac{\rho^2 (k+1)}{L^2 n}\left[\frac{\ln(\alpha_m^2)}{\alpha_M^2 - 1} + \frac{n \ln(m_0/M_0)}{(k+1)(\alpha_M^2 - 1)}\right]\right). \tag{22}$$
Proof of Theorem 2. 
We transform (18):
$$\mathrm{Sp}(A_{k+1}) = \mathrm{Sp}(A_k)\left[1 + (\alpha_k^2 - 1)\frac{(y_k, y_k)}{\mathrm{Sp}(A_k)(H_k y_k, y_k)}\right]. \tag{23}$$
Due to the exact one-dimensional descent in (13), the following condition is satisfied:
$$(s_k, g_{k+1}) = (H_k g_k, g_{k+1}) = 0.$$
The matrix Hk is positive definite, therefore, using the last equality, we obtain the following:
$$(H_k y_k, y_k) = (H_k g_k, g_k) + (H_k g_{k+1}, g_{k+1}) - 2(H_k g_k, g_{k+1}) \ge (H_k g_k, g_k).$$
Hence, taking into account the inequality
$$\mathrm{Sp}(A_k) \ge M_k,$$
where Mk is the maximum eigenvalue of the matrix Ak, we find:
$$\mathrm{Sp}(A_k)(H_k y_k, y_k) \ge \mathrm{Sp}(A_k)(H_k g_k, g_k) \ge \frac{\mathrm{Sp}(A_k)}{M_k}(g_k, g_k) \ge (g_k, g_k).$$
The last estimate allows us to reduce inequality (23) to the following form:
$$\mathrm{Sp}(A_{k+1}) \le \mathrm{Sp}(A_k)\left[1 + (\alpha_k^2 - 1)\frac{\|y_k\|^2}{\|g(x_k)\|^2}\right]. \tag{24}$$
Based on the ratio between the arithmetic mean and geometric mean of the eigenvalues of the matrix A > 0, we have
$$\mathrm{Sp}(A)/n \ge [\det(A)]^{1/n}.$$
Using the last inequality in (24), taking into account (19), we obtain:
$$\frac{\mathrm{Sp}(A_0)}{n}\prod_{i=0}^{k}\left[1 + (\alpha_M^2 - 1)\frac{\|y_i\|^2}{\|g(x_i)\|^2}\right] \ge \frac{\mathrm{Sp}(A_{k+1})}{n} \ge (\det(A_{k+1}))^{1/n} \ge \left[(\alpha_m^2)^{k+1}\det(A_0)\right]^{1/n}.$$
Based on the inequality
$$1 + p \le \exp(p),$$
we convert the latter relation to the following form:
$$\frac{\mathrm{Sp}(A_0)}{n}\exp\left((\alpha_M^2 - 1)\sum_{i=0}^{k}\frac{\|y_i\|^2}{\|g(x_i)\|^2}\right) \ge (\alpha_m^2)^{(k+1)/n}(\det(A_0))^{1/n}. \tag{25}$$
By virtue of condition (21), the following inequalities hold:
$$\mathrm{Sp}(A_0)/n \le 1/m_0, \quad (\det(A_0))^{1/n} \ge 1/M_0.$$
Taking these relations into account, we transform inequality (25):
$$\frac{1}{m_0}\exp\left((\alpha_M^2 - 1)\sum_{i=0}^{k}\frac{\|y_i\|^2}{\|g(x_i)\|^2}\right) \ge \frac{(\alpha_m^2)^{(k+1)/n}}{M_0}.$$
Taking the logarithm of the latter, we find:
$$(\alpha_M^2 - 1)\sum_{i=0}^{k}\frac{\|y_i\|^2}{\|g(x_i)\|^2} \ge \frac{(k+1)\ln(\alpha_m^2)}{n} + \ln\frac{1}{M_0} - \ln\frac{1}{m_0}.$$
From this follows the inequality:
$$\sum_{i=0}^{k}\frac{\|y_i\|^2}{\|g(x_i)\|^2} \ge \frac{(k+1)\ln(\alpha_m^2)}{n(\alpha_M^2 - 1)} + \frac{\ln(m_0/M_0)}{\alpha_M^2 - 1},$$
which, together with estimate (16) of Theorem 1, proves (22). □
The obtained convergence rate estimates do not explain the method’s high convergence rate, for example, on quadratic functions. To justify the method’s acceleration properties, we have to demonstrate its invariance under a linear coordinate transformation and then use estimate (22) in the coordinate system in which the ρ/L ratio is maximal. It is possible to increase this ratio, for example, in the case of quadratic functions, where its maximum value is 1.

4. Acceleration Properties of the Space Dilation Algorithm in the Direction of the Gradients Difference

To establish convergence rates faster than linear for minimization methods, one primarily exploits the small variability of the Hessian in the region of the extremum, which is estimated based on the Hessian properties. We examine the acceleration properties of the algorithm without any assumptions about the matrices of second derivatives of the function.
According to estimate (22), the convergence rate of the r-algorithm is determined by the magnitude of the ρ/L ratio. In our study, similar to study [18], we will assume the existence of a linear transformation of coordinates that increases the magnitude of this ratio.
Let a function f(x) satisfy Condition 1. We define a linear transformation of variables
$$\hat{x} = P x, \tag{26}$$
where $P \in \mathbb{R}^{n \times n}$ is a non-singular matrix. In the new coordinate system, the minimized function has the following form:
$$f_P(\hat{x}) = f(P^{-1}\hat{x}) = f(x). \tag{27}$$
The function (27) formed in this way also satisfies Condition 1, with strong convexity parameter $\rho_P$ and Lipschitz constant $L_P$.
Denote by $V \in \mathbb{R}^{n \times n}$ a non-singular matrix such that for the parameters $\rho_V$ and $L_V$ of the function $f_V(\hat{x})$ with
$$\hat{x} = V x \tag{28}$$
and the parameters $\rho_P$, $L_P$ of $f_P(\hat{x})$ for an arbitrary non-singular matrix $P \in \mathbb{R}^{n \times n}$ in (26), the following inequality holds:
$$\frac{\rho_V}{L_V} \ge \frac{\rho_P}{L_P}. \tag{29}$$
As was established in [18], transformation (28) plays the role of a distinguished coordinate system, which is the best from the point of view of the convergence rate of gradient methods. For a non-degenerate quadratic function, we can consider a coordinate system where ρ/L = 1 and the maximum and minimum eigenvalues are equal. In the general case, we only assume the possibility of eliminating the differences in the elongation of the level surfaces.
Next, we show that algorithm (13)–(14) applied to the function f(x) and algorithm (13)–(14) applied to the function $f_P(\hat{x})$ defined in (27), under appropriate initial conditions, construct sequences of minimization points related by transformation (26). This allows us to use the convergence rate estimates of algorithm (13)–(14) in the preferred coordinate system.
Theorem 3. 
Let the initial conditions of the algorithm (13)(14), applied to minimize the functions f(x) and  f P ( x ^ ) , defined in (27), be related by the equalities:
$$\hat{x}_0 = P x_0, \quad \hat{H}_0 = P H_0 P^T. \tag{30}$$
Then the characteristics of these processes are related by the following relations:
$$f_P(\hat{x}_k) = f(x_k), \quad \hat{x}_k = P x_k, \quad \nabla f_P(\hat{x}_k) = P^{-T}\nabla f(x_k), \quad \hat{H}_k = P H_k P^T, \quad k = 0, 1, 2, \ldots \tag{31}$$
Proof of Theorem 3. 
For the gradients of the functions $f_P(\hat{x})$ and f(x), the following relation holds:
$$\nabla f_P(\hat{x}) = P^{-T}\nabla f(x).$$
From this and (30), relation (31) follows for k = 0. Assume that equalities (31) are satisfied for all k = 0, 1, …, i. Let us show that they hold for k = i + 1. Multiplying (13) for k = i on the left by P and taking into account the proved equalities (31), we obtain:
$$P x_{i+1} = P x_i - \gamma_i P H_i P^T P^{-T}\nabla f(x_i) = \hat{x}_i - \gamma_i \hat{H}_i \nabla f_P(\hat{x}_i). \tag{32}$$
Hence, by the definition of the function $f_P$, at the stage of one-dimensional minimization the equality $\gamma_i = \bar{\gamma}_i$ holds. Therefore, the right-hand side of (32) is the implementation of the step in the new coordinate system. Consequently:
$$\hat{x}_{i+1} = P x_{i+1}, \quad \nabla f_P(\hat{x}_{i+1}) = P^{-T}\nabla f(x_{i+1}), \quad \hat{y}_i = \nabla f_P(\hat{x}_i) - \nabla f_P(\hat{x}_{i+1}) = P^{-T} y_i. \tag{33}$$
Multiplying (14) with the current indices on the left by P and on the right by $P^T$, and taking into account (33) and the equality
$$(y_i, P^{-1} P H_i P^T P^{-T} y_i) = (\hat{H}_i \hat{y}_i, \hat{y}_i),$$
we get
$$P H_{i+1} P^T = P H_i P^T - \left(1 - \frac{1}{\alpha_i^2}\right)\frac{P H_i P^T P^{-T} y_i y_i^T P^{-1} P H_i^T P^T}{(y_i, P^{-1} P H_i P^T P^{-T} y_i)} = \hat{H}_i - \left(1 - \frac{1}{\alpha_i^2}\right)\frac{\hat{H}_i \hat{y}_i \hat{y}_i^T \hat{H}_i^T}{(\hat{H}_i \hat{y}_i, \hat{y}_i)},$$
where the right side is the implementation of (14) in the new coordinate system.
Finally, we get $P H_{i+1} P^T = \hat{H}_{i+1}$. Therefore, equalities (31) are also valid for k = i + 1. Continuing the induction, we obtain the proof of the theorem. □
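Theorem 3 can also be illustrated numerically: running algorithm (13)–(14) on a quadratic f and on $f_P$ with the initial data (30) produces point sequences related by $\hat{x}_k = P x_k$. The sketch below is our own check (the quadratic test function and the value α = 2 are assumptions):

```python
import numpy as np

def run(A, b, x0, H0, steps=5, alpha=2.0):
    """Algorithm (13)-(14) on f(x) = x'Ax/2 - b'x; returns the iterates."""
    x, H = x0.astype(float), H0.copy()
    xs = [x.copy()]
    for _ in range(steps):
        g = A @ x - b
        s = H @ g
        x = x - (s @ g) / (s @ (A @ s)) * s          # exact line search, Eq. (13)
        y = g - (A @ x - b)
        Hy = H @ y
        H = H - (1 - 1/alpha**2) * np.outer(Hy, Hy) / (y @ Hy)   # Eq. (14)
        xs.append(x.copy())
    return xs

rng = np.random.default_rng(2)
n = 4
A = np.diag([1.0, 4.0, 9.0, 25.0])
b = rng.normal(size=n)
x0 = rng.normal(size=n)
P = rng.normal(size=(n, n)) + 3 * np.eye(n)          # a non-singular transform
Pinv = np.linalg.inv(P)

xs = run(A, b, x0, np.eye(n))                        # original coordinates
# f_P(x^) = f(P^{-1} x^): Hessian P^{-T} A P^{-1}, linear term P^{-T} b;
# initial data related by (30): x^_0 = P x_0, H^_0 = P H_0 P^T = P P^T.
xs_hat = run(Pinv.T @ A @ Pinv, Pinv.T @ b, P @ x0, P @ P.T)

for x, xh in zip(xs, xs_hat):
    assert np.allclose(P @ x, xh, atol=1e-7)         # relation (31)
```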
In the following theorem, we use algorithms (13)–(14) in a distinguished coordinate system (28) with property (29).
Theorem 4. 
Let the function f(x) satisfy Condition 1. Then, for the sequence {$f_k$}, k = 0, 1, 2, … given by the algorithm (13)–(14), with an initial matrix $H_0$ bounded according to (21), the following estimate holds:
$$f_{k+1} - f^* \le (f_0 - f^*)\exp\left(-\frac{\rho_V^2 (k+1)}{L_V^2 n}\left[\frac{\ln(\alpha_m^2)}{\alpha_M^2 - 1} + \frac{n \ln(\hat{m}_0/\hat{M}_0)}{(k+1)(\alpha_M^2 - 1)}\right]\right), \tag{34}$$
where $\hat{m}_0$, $\hat{M}_0$ are the minimum and maximum eigenvalues of the matrix $\hat{H}_0 = V H_0 V^T$, respectively, in the distinguished coordinate system (28) with property (29).
Proof of Theorem 4.
According to the results of Theorem 3, we can choose an arbitrary coordinate system to estimate the convergence rate of the minimization process in the algorithm. Therefore, using estimate (22) in a coordinate system with matrix P = V, we obtain estimates (34). □
The first term in square brackets (34) characterizes the constant in the method’s convergence rate estimate, and the second term represents the cost of adjusting the metric matrix.
Let us consider the acceleration effect of algorithm (13)–(14) compared to well-known methods: steepest descent and Newton’s method. For these methods, the convergence rate estimate is as follows:
$$f_{k+1} - f^* \le q^{k+1} (f_0 - f^*).$$
Under Condition 1 imposed on the function, for the steepest descent method, where $s_k = g_k$, the decay rate is as follows [1]:
$$q = 1 - \frac{\rho}{L}.$$
For Newton’s method, $s_k = [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$, and the decay rate is [18]:
$$q = 1 - \frac{\rho^2}{L^2}.$$
The convergence rate for Newton’s method, due to its invariance with respect to a linear transformation of coordinates, has the form [18]:
$$q = 1 - \rho_V^2 / L_V^2.$$
Given that
$$\rho_V / L_V \gg \rho / L,$$
and taking into account (36)–(38), the convergence rate of Newton’s method will be significantly higher than that of the steepest descent.
Estimate (34) for algorithm (13)–(14) is equivalent to the estimate for Newton’s method (37) in terms of the influence of the constants on the convergence rate. Under condition (39), the convergence rate estimate for algorithm (13)–(14) is preferable to that for the steepest descent method, which is confirmed below by a computational experiment.
Thus, algorithm (13)–(14) on strongly convex functions, without assuming the existence of second derivatives, exhibits acceleration properties compared to the steepest descent method.
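The practical gap between rates of the form (36) and (38) can be illustrated with a short numeric sketch; the ratios ρ/L = 10⁻⁴ and ρ_V/L_V = 10⁻¹ below are hypothetical values chosen only for illustration:

```python
import math

def iterations_needed(q: float, eps: float) -> int:
    """Smallest k with q**k <= eps, i.e. f_k - f* <= eps * (f_0 - f*)."""
    return math.ceil(math.log(eps) / math.log(q))

# Hypothetical constants: a linear change of coordinates improves the
# ratio of the strong-convexity and Lipschitz constants from 1e-4 to 1e-1.
q_sd = 1 - 1e-4           # steepest descent, q = 1 - rho/L
q_nw = 1 - (1e-1) ** 2    # invariant estimate, q = 1 - (rho_V / L_V)**2

k_sd = iterations_needed(q_sd, 1e-6)
k_nw = iterations_needed(q_nw, 1e-6)
print(k_sd, k_nw)  # the invariant rate needs orders of magnitude fewer steps
```

The point of the sketch is only the order-of-magnitude difference in iteration counts for the same target accuracy.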

5. Hestenes–Stiefel Method in a Metric with Incomplete Orthogonalization in the Direction of Gradient Difference

Transformation (14) in the algorithm (13)–(14) can act as a scaling transformation for the steepest descent. This transformation may also be useful in other methods.
In Ref. [17], it is shown that the sequence of minimum approximations generated by algorithm (1)–(11)–(12), with the matrix transformation (10), or equivalently with (14), coincides with the sequence generated by the conjugate gradient method. Transformations (11) and (12) are based on the simultaneous solution of the system of equalities $(s_{k+1}, g_k) = 1$ and $(s_{k+1}, g_{k-1}) = 1$. For their difference, the equality $(s_{k+1}, g_k) - (s_{k+1}, g_{k-1}) = (s_{k+1}, g_k - g_{k-1}) = 0$ holds. In the Hestenes–Stiefel method [34], the new direction is also chosen based on the equality $(s_{k+1}, g_k - g_{k-1}) = 0$. In the quadratic case, the new direction in the Hestenes–Stiefel method will be conjugate to the current descent direction. In the case of smooth functions, we can therefore replace transformations (11) and (12) for obtaining the descent direction with the transformations of the Hestenes–Stiefel method.
The Hestenes–Stiefel conjugate gradient method [34] has the form:
$$x_{k+1} = x_k - \gamma_k s_k, \qquad \gamma_k = \arg\min_{\gamma} f(x_k - \gamma s_k), \qquad k = 0, 1, 2, \ldots$$
$$s_0 = g_0, \qquad s_{k+1} = g_{k+1} - \frac{(g_{k+1}, y_k)}{(s_k, y_k)} s_k, \qquad y_k = g_{k+1} - g_k.$$
From (41), it follows that the new descent direction is orthogonal to the current gradient difference: $(s_{k+1}, y_k) = 0$. For a quadratic function, this implies conjugacy of adjacent descent vectors, regardless of the accuracy of the one-dimensional search.
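A minimal NumPy sketch of scheme (40)–(41) for a quadratic f(x) = (1/2)xᵀAx − bᵀx, where the exact one-dimensional minimizer along −s has the closed form γ = (g, s)/(s, As); the function name and problem sizes are illustrative:

```python
import numpy as np

def hestenes_stiefel(A, b, x0, tol=1e-10, max_iter=1000):
    """Hestenes-Stiefel scheme (40)-(41) for f(x) = 0.5 x'Ax - b'x."""
    x = x0.astype(float)
    g = A @ x - b                    # gradient of the quadratic
    s = g.copy()                     # s_0 = g_0
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        As = A @ s
        gamma = (g @ s) / (s @ As)   # exact line search along -s
        x = x - gamma * s
        g_new = A @ x - b
        y = g_new - g                # gradient difference y_k
        s = g_new - (g_new @ y) / (s @ y) * s   # transformation (41)
        # by construction the new direction satisfies (s_{k+1}, y_k) = 0
        assert abs(s @ y) <= 1e-8 * (np.linalg.norm(s) * np.linalg.norm(y) + 1.0)
        g = g_new
    return x

# Toy check on a small SPD system (sizes chosen only for illustration).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
b = rng.standard_normal(5)
x = hestenes_stiefel(A, b, np.zeros(5))
assert np.allclose(A @ x, b, atol=1e-6)
```

The inline assertion checks exactly the orthogonality property stated above at every step.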
In the Hestenes–Stiefel method with a space metric transformation (14), we also use the property ( s k + 1 , y k ) = 0 to obtain the descent vector transformation formulas:
$$x_{k+1} = x_k - \gamma_k s_k, \qquad \gamma_k = \arg\min_{\gamma} f(x_k - \gamma s_k), \qquad k = 0, 1, 2, \ldots$$
$$s_0 = H_0 g_0, \qquad s_{k+1} = H_k g_{k+1} - \frac{(H_k g_{k+1}, y_k)}{(s_k, y_k)} s_k, \qquad y_k = g_{k+1} - g_k,$$
$$H_0 = I, \qquad H_{k+1} = H_k - \left( 1 - \frac{1}{\alpha_k^2} \right) \frac{H_k y_k y_k^T H_k^T}{(y_k, H_k y_k)}, \qquad \alpha_M \ge \alpha_k \ge \alpha_m > 1.$$
Algorithm (42)–(44) has the properties of the conjugate gradient method. To justify this property, we need the following theorem.
Theorem 5. 
Let the matrices $H_i$, i = 1, 2, …, k, be obtained as a result of transformations (44), and let the vectors $y_i$, i = 0, 1, …, k − 1, used in (44) be orthogonal to the vector $g_{k+1}$. Then $H_k g_{k+1} = g_{k+1}$.
Proof of Theorem 5. 
We will carry out the proof by induction. The base case holds since $H_0 = I$ implies $H_0 g_{k+1} = g_{k+1}$. Suppose $H_i g_{k+1} = g_{k+1}$ for some i ≤ k − 1. Then, taking into account the orthogonality of the vectors $y_i$, i = 0, 1, …, k − 1, to the vector $g_{k+1}$, we obtain
$$H_{i+1} g_{k+1} = \left[ H_i - \left( 1 - \frac{1}{\alpha_i^2} \right) \frac{H_i y_i y_i^T H_i^T}{(y_i, H_i y_i)} \right] g_{k+1} = g_{k+1} - \left( 1 - \frac{1}{\alpha_i^2} \right) \frac{(y_i, H_i g_{k+1})}{(y_i, H_i y_i)} H_i y_i = g_{k+1} - \left( 1 - \frac{1}{\alpha_i^2} \right) \frac{(y_i, g_{k+1})}{(y_i, H_i y_i)} H_i y_i = g_{k+1}.$$
Continuing the process of induction, we obtain the statement of the theorem. □
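Theorem 5 can also be verified numerically in a few lines; the dimensions, seed, and fixed α below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 6, 2.0
g = rng.standard_normal(n)           # stands in for the gradient g_{k+1}

H = np.eye(n)
for _ in range(4):
    y = rng.standard_normal(n)
    y -= (y @ g) / (g @ g) * g       # make y orthogonal to g
    Hy = H @ y
    # transformation (44) with a fixed alpha
    H = H - (1 - 1 / alpha**2) * np.outer(Hy, Hy) / (y @ Hy)

# matrices built from vectors orthogonal to g leave g unchanged
assert np.allclose(H @ g, g)
```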
Regarding the convergence of algorithms (42)–(44) on quadratic functions, the following theorem holds.
Theorem 6. 
When minimizing quadratic functions, the sequences of points $x_k$, k = 0, 1, 2, …, generated by algorithms (40)–(41) and (42)–(44) from the same initial point and with initial matrix $H_0 = I$ coincide.
Proof of Theorem 6. 
We will prove the theorem by induction. Let the points $x_i$ of algorithms (40)–(41) and (42)–(44) coincide for i < k. In conjugate gradient methods, when minimizing a quadratic function with exact one-dimensional descent, the gradient vectors are mutually orthogonal. Therefore, the current gradient $g_i$ at a new point is the same for both algorithms and orthogonal to all previous gradients. The vector $s_i$ is obtained by formula (43) using the vector $H_{i-1} g_i$. The matrix $H_{i-1}$ is built from the vectors $y_j$, j = 0, 1, …, i − 2, which are in turn formed from the gradient vectors $g_j$, j = 0, 1, …, i − 1, all orthogonal to the vector $g_i$. Hence, by Theorem 5, $H_{i-1} g_i = g_i$. Using this equality in (43), we find that the descent direction coincides with that of the Hestenes–Stiefel method:
$$s_i = H_{i-1} g_i - \frac{(H_{i-1} g_i, y_{i-1})}{(s_{i-1}, y_{i-1})} s_{i-1} = g_i - \frac{(g_i, y_{i-1})}{(s_{i-1}, y_{i-1})} s_{i-1}.$$
Consequently, the new iteration of each process will be performed by minimization along the same directions and from the same point. Continuing the induction process further, we obtain a proof of the theorem. □
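Theorem 6 can be checked numerically. The sketch below implements scheme (42)–(44) for a quadratic (where the exact step again has a closed form), with α held at a single illustrative value from [α_m, α_M]; names and sizes are ours:

```python
import numpy as np

def hy_xs(A, b, x0, alpha=2.0, tol=1e-10, max_iter=1000):
    """Sketch of algorithm (42)-(44) on the quadratic f = 0.5 x'Ax - b'x."""
    n = len(x0)
    x = x0.astype(float)
    g = A @ x - b
    H = np.eye(n)                     # H_0 = I
    s = H @ g                         # s_0 = H_0 g_0
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        As = A @ s
        gamma = (g @ s) / (s @ As)    # exact step for the quadratic
        x = x - gamma * s
        g_new = A @ x - b
        y = g_new - g                 # y_k = g_{k+1} - g_k
        Hg = H @ g_new
        s = Hg - (Hg @ y) / (s @ y) * s                           # (43)
        Hy = H @ y
        H = H - (1 - 1 / alpha**2) * np.outer(Hy, Hy) / (y @ Hy)  # (44)
        g = g_new
    return x

# On a quadratic with H_0 = I the iterates should solve the system, in
# line with the conjugate gradient behavior stated in Theorem 6.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)
b = rng.standard_normal(4)
x = hy_xs(A, b, np.zeros(4))
assert np.allclose(A @ x, b, atol=1e-6)
```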
The level surface topology of the function being minimized may resemble that of a quadratic function in a neighborhood significantly larger than the one in which the function and its quadratic representation are close. In such cases, it can be expected that algorithms (13)–(14) and (42)–(44), which do not use Hessian approximations, will be more effective than quasi-Newton methods.

6. Numerical Experiment

To identify the potential of transformation (14) for solving problems of minimizing smooth functions and to test the effectiveness of the presented algorithms (13)–(14) and (42)–(44), a numerical study was conducted. The methods are compared with the quasi-Newton BFGS method.
All methods used in the study utilized the one-dimensional search described in [35], where the gradient and function value are used as information for organizing the method. This is particularly advantageous in conditions where the costs of computing the gradient and function are comparable.
In the following tables we denote:
(1) BFGS—the quasi-Newton BFGS method;
(2) HY_g—algorithm (13)–(14);
(3) HY_XS—algorithm (42)–(44).
When solving an ill-conditioned minimization problem with high accuracy, the quasi-Newton method BFGS is typically used, or, if possible, Newton’s method. For this reason, we chose the BFGS method as the benchmark for comparison with the methods under study.
The set of test functions includes a quadratic function. Since the function’s condition number is known, we can estimate its minimization complexity. In addition, the results of minimizing the quadratic function provide information about a method’s behavior in the neighborhood of the current iterate of a real function, where its quadratic representation is valid.
The tests include functions with both linear and curvilinear ravines. With curvilinear ravines, the Hessian eigenvectors change as we move toward the minimum, which leads to obsolescence of the metric matrices and a decrease in the convergence rate. Further successful progress requires reconfiguring the method’s parameters. On such functions we will observe the behavior of the method with the variability of the quadratic representation of the function as we move towards the minimum.
We also consider a function whose level surface topology matches that of the quadratic function. In this case, it is of interest to compare the algorithms (13)–(14) and (42)–(44) with quasi-Newton methods, which rely less on the properties of the Hessian matrices.
The final test problem reflects the variability of scaling across variables as we move towards the extremum. In this case, there is an active change in the level surface topology due to changes in the scales along the coordinate axes.
In all methods, the function and gradient were calculated simultaneously. The tables show the number of iterations and the total number of function and gradient calculations for the selected methods. The problem dimension in each experiment varies from 100 to 1000. The stopping criterion was $f(x_k) - f^* \le \varepsilon$.

6.1. Quadratic Function

We use the following quadratic function:
$$f_Q(x, [a_{\max}]) = \frac{1}{2} \sum_{i=1}^{n} a_i x_i^2, \qquad a_i = a_{\max}^{\frac{i-1}{n-1}}.$$
The eigenvalues $a_i$ of this function’s Hessian range from $\lambda_{\min} = 1$ to $\lambda_{\max} = a_{\max}$. The starting point is x0 = (100, 100, …, 100). The stopping criterion was $f(x_k) - f^* \le 10^{-10}$.
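For reference, the test function and its gradient can be written out directly; this sketch (with our own names f_Q and grad_f_Q) follows the definition above:

```python
import numpy as np

def coeffs(n: int, a_max: float) -> np.ndarray:
    """a_i = a_max ** ((i - 1) / (n - 1)) for 1-based i."""
    return a_max ** (np.arange(n) / (n - 1))

def f_Q(x: np.ndarray, a_max: float) -> float:
    """Quadratic test function f = 0.5 * sum_i a_i * x_i**2."""
    return 0.5 * float(np.sum(coeffs(x.size, a_max) * x * x))

def grad_f_Q(x: np.ndarray, a_max: float) -> np.ndarray:
    return coeffs(x.size, a_max) * x

x0 = np.full(100, 100.0)       # starting point (100, ..., 100)
a = coeffs(100, 1e4)
# the Hessian eigenvalues run from 1 to a_max
assert np.isclose(a[0], 1.0) and np.isclose(a[-1], 1e4)
```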
Table 1 and Table 2 show the results of function minimization for condition levels amax = 10^4 and amax = 10^8. The best results are given in bold.
For the function with the lower condition level, the BFGS quasi-Newton method does not perform enough iterations to build a suitable metric matrix. Perhaps for this reason, the HY_XS algorithm is equivalent in performance to BFGS. On this test, the HY_XS method outperforms the HY_g method.
In Table 2, the problem conditionality is higher, and the required number of iterations to solve the problem is commensurate with the problem dimension. BFGS manages to construct the metric matrix and, as a result, achieves a high convergence rate in the final iterations and outperforms other algorithms. Here, HY_XS outperforms HY_g.
This is an example of an ill-conditioned problem with a fixed Hessian, where the quasi-Newton method (BFGS) is significantly more efficient than other methods. Based on the results of this example, we can conclude that if the matrix of second derivatives in a real-world problem is ill-conditioned and does not change significantly over a sufficiently wide range, then the quasi-Newton method will outperform the HY_g and HY_XS algorithms.

6.2. Function with Multidimensional Ellipsoidal Ravine

The following function has a multidimensional ellipsoidal ravine. Minimization occurs when moving along a curvilinear ravine to the minimum point.
$$f_E(x, [a_{\max}, b_{\max}]) = (1 - x_1)^2 + a_{\max} \left( 1 - \sum_{i=1}^{n} x_i^2 / b_i^2 \right)^2, \qquad b_i = b_{\max}^{\frac{i-1}{n-1}}.$$
The stopping criterion was $f(x_k) - f^* \le 10^{-4}$.
Table 3 and Table 4 demonstrate the results of minimizing the function fE. In Table 3, the starting point is x01 = (−1, 0.1, …, 0.1); in Table 4, it is x02 = (−1, 2, 3, …, n). The best results are given in bold.
From the first starting point, the BFGS and HY_XS methods reach the minimum almost immediately, even though the conditioning level increases as the minimum is approached. Therefore, the BFGS and HY_XS algorithms are more efficient than the HY_g method.
At the minimum point, the function is degenerate. Therefore, as the degree of conditioning increases while the minimum is approached, the HY_g method, which does not use conjugate directions, turns out to be the worst performer.
From the second starting point, the methods initially enter the ravine far from the minimum point and move along the bottom of the curvilinear ravine. The elongation of the isosurfaces and the directions of elongation in the curvilinear ravine change, preventing the BFGS method from adjusting the metric matrix quickly and effectively. Here, the HY_XS and HY_g algorithms have an advantage. Moreover, the conjugacy of the descent directions in the HY_XS algorithm allows a solution to be obtained much faster in the neighborhood of the degenerate minimum point.

6.3. Function with Multidimensional Ellipsoidal Ravine and Non-Degenerate Minimum Point

The next function also has a multi-dimensional ellipsoidal ravine.
$$f_{EX}(x, [a_{\max}, b_{\max}]) = (1 - x_1)^2 + a_{\max} \left( 1 - \sum_{i=1}^{n} \frac{x_i^2}{b_i^2} \right)^2 + \frac{1}{2} \sum_{i=1}^{n} \frac{x_i^2}{b_i}, \qquad b_i = b_{\max}^{\frac{i-1}{n-1}}.$$
The starting point is x0 = (−1, 2, 3, …, n). The stopping criterion is $f(x_k) - f^* \le 10^{-10}$. The function has an additional quadratic term, so it is no longer degenerate at the minimum point. Due to this, gradient methods are able to find the minimum of fEX with higher accuracy than for fE. Table 5 shows the results of minimizing fEX. The best results are given in bold.
Despite the fact that the minimum point is non-degenerate, BFGS performs worse due to its slow progress along the ravine bottom. The HY_g algorithm spends significant time in the neighborhood of the minimum. In the HY_XS algorithm, the final minimization stage is completed more quickly thanks to the use of conjugate directions. Here, HY_XS, which combines more efficient progress along the bottom of a curvilinear ravine with the conjugacy of the descent vectors, proves more effective than the other algorithms.

6.4. Non-Quadratic Function

The level surfaces of the following function are topologically similar to those of a quadratic function.
$$f_{\hat{Q}2}(x, [a_{\max}]) = \left( \sum_{i=1}^{n} a_i x_i^2 \right)^2, \qquad a_i = a_{\max}^{\frac{i-1}{n-1}}.$$
The starting point is x0 = (1, 1, …, 1). The stopping criterion is $f(x_k) - f^* \le 10^{-10}$, which is equivalent to reducing the quadratic term $\sum_{i=1}^{n} a_i x_i^2$ to a value smaller than $10^{-5}$. Table 6 demonstrates the results of minimizing fQ^2. The best results are given in bold.
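A direct sketch of this function, its gradient by the chain rule, and the equivalence of the two thresholds (the function names are ours):

```python
import numpy as np

def f_Q2(x: np.ndarray, a_max: float) -> float:
    """f_{Q^2} = (sum_i a_i x_i**2)**2 with a_i = a_max**((i-1)/(n-1))."""
    a = a_max ** (np.arange(x.size) / (x.size - 1))
    return float(np.sum(a * x * x)) ** 2

def grad_f_Q2(x: np.ndarray, a_max: float) -> np.ndarray:
    a = a_max ** (np.arange(x.size) / (x.size - 1))
    S = float(np.sum(a * x * x))
    return 4.0 * S * a * x            # chain rule: 2 * S * (2 * a_i * x_i)

# stopping at f <= 1e-10 is the same as driving the inner sum below 1e-5
assert np.isclose(1e-5 ** 2, 1e-10)
```

Note that the gradient vanishes like the inner sum S itself near the minimum, which is the behavior discussed next.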
Here, a low convergence rate of the BFGS method is observed. As the method approaches the minimum, the Hessian elements tend to zero, and consequently the elements of the inverse matrix grow. In Ref. [17], it was noted that as the matrix of second derivatives grows in the explored subspace, the approximated matrices in quasi-Newton methods also grow. This hinders movement into the unexplored subspace, which slows the convergence rate. In the HY_g and HY_XS algorithms, on the contrary, the possibility of entering the unexplored subspace always increases, which explains the advantage of these algorithms on this function. The results for the HY_XS algorithm confirm the advantages of using the metric in the conjugate gradient method.

6.5. Function with Scaling by Variables

This function uses additional multipliers ci to change the scales ai for each variable.
$$f_{abc}(x, [a_{\max}, b_{\max}]) = \frac{1}{2} \sum_{i=1}^{n} a_i c_i x_i^2, \quad a_i = a_{\max}^{\frac{i-1}{n-1}}, \quad c_i = \frac{b_{\max}}{b_i} \cdot \frac{x_i^2}{1 + x_i^2} + b_i \cdot \frac{1}{1 + x_i^2}, \quad b_i = b_{\max}^{\frac{i-1}{n-1}}.$$
Near the extremum, this function has the form
$$f_{abc}(x, [a_{\max}, b_{\max}]) \approx \frac{1}{2} \sum_{i=1}^{n} a_i b_i x_i^2.$$
Far from the extremum, we obtain a function in which the coefficients appear in reverse order:
$$f_{abc}(x, [a_{\max}, b_{\max}]) \approx \frac{1}{2} \sum_{i=1}^{n} a_i \frac{b_{\max}}{b_i} x_i^2.$$
The change in the coefficient scales between these two limiting forms of the function can be represented by the transition
$$a_i \frac{b_{\max}}{b_i} \;\to\; a_i b_i.$$
It follows that the first coefficient decreases by a factor of $b_{\max}$ and the last coefficient increases by a factor of $b_{\max}$. The starting point is x0 = (100, 100, …, 100). The stopping criterion is $f(x_k) - f^* \le 10^{-10}$. Table 7 demonstrates the results of minimizing fabc. The best results are given in bold.
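Assuming the blending form of the multiplier read off above, c_i = (b_max/b_i)·x_i²/(1 + x_i²) + b_i/(1 + x_i²), its two limits can be checked directly (the values below are illustrative):

```python
import numpy as np

def c(xi: float, bi: float, b_max: float) -> float:
    """Scale multiplier c_i: blends b_max/b_i (far away) with b_i (near 0)."""
    w = xi**2 / (1 + xi**2)   # weight: -> 0 near the extremum, -> 1 far away
    return (b_max / bi) * w + bi * (1 - w)

b_max, bi = 1e3, 10.0
assert np.isclose(c(0.0, bi, b_max), bi)                      # near: c_i -> b_i
assert np.isclose(c(1e6, bi, b_max), b_max / bi, rtol=1e-6)   # far: c_i -> b_max/b_i
```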
The BFGS method performs the worst on this function. This is due to the high variability of the Hessian. Here, the space-scanning strategy that excludes previously explored subspaces, used in the HY_g and HY_XS algorithms, proves suitable. The use of vector conjugacy in the HY_XS algorithm has little effect on performance compared to the HY_g method.
It should be noted that the metrics in quasi-Newton methods (QNM) and in relaxation subgradient methods (RSM) have different meanings. There are obstacles to good convergence in QNM: the first is a rapid change in the directions of the Hessian eigenvectors, observed in the tests of Table 2 and Table 3; the second is a rapid change in the scales of the variables (Table 6 and Table 7). Another case of poor convergence is the tendency of the second derivatives to zero as the iterates approach the minimum (Table 5 illustrates this case).
The following Table 8 summarizes the results of minimization for all functions with n = 1000.
Let us draw the main conclusions based on the results of Table 8.
  • The BFGS method significantly outperforms the HY_g and HY_XS algorithms in ill-posed problems where the Hessian does not change significantly across the minimization space. This thesis is confirmed by the data for the function fQ with amax = 10^8. For such a function, BFGS manages to construct a suitable coordinate transformation, after which the algorithm’s convergence rate increases significantly. In such situations, the HY_XS method has an advantage over the HY_g algorithm due to its conjugate gradient properties.
  • Function fE has an ellipsoidal ravine and a degenerate minimum point. From the first starting point, with a steeper ravine than from the second starting point, the main obstacle is the degeneracy of the minimum. Algorithms BFGS and HY_XS, which use the properties of a quadratic function, complete the minimization stage in the extremum region significantly faster than HY_g. As the degeneracy of the ravine increases, BFGS’s progress along the ravine becomes slow due to the high variability of the Hessian. Algorithms HY_g and HY_XS are more successful here, and the properties of the conjugate gradient method of HY_XS lead to a significant acceleration of the convergence rate. A similar situation arises when minimizing the function fEX.
  • The level surfaces of the function fQ^2 are topologically equivalent to the level surfaces of a quadratic function. In this situation, based on an analysis of the data in Table 6, we can conclude that the metric transformation in algorithms HY_g and HY_XS is more effective than the metric transformation of the quasi-Newton method.
  • It can be seen that the costs of the HY_g method are approximately 2 times lower than those of the quasi-Newton method, and the costs of the HY_XS method are on average 3.5 times lower. Overall, the computational experiment confirms the acceleration properties of the algorithms HY_g and HY_XS, obtained by using the metric transformation (14), and their applicability alongside quasi-Newton methods in solving complex, ill-conditioned minimization problems.

7. Conclusions

The conditionality of the minimization problem determines the spread of the isosurface elongations in different directions, which in turn determines the complexity of the problem’s solution. In minimization practice, it is often possible to reduce the isosurface elongations through a linear coordinate transformation, thereby increasing the convergence rate of the gradient method used in the new coordinate system. For strongly convex functions with a Lipschitz gradient, such a coordinate transformation reduces the difference between the strong convexity and Lipschitz constants. If the function is twice differentiable, such a transformation reduces the spread of the Hessian’s eigenvalues.
The noted linear transformation of coordinates eliminates the linear background that worsens the convergence of the gradient method. Newton’s method, quasi-Newton methods, and subgradient methods with a change in the space metric possess this background-elimination property. In this work, we proved that algorithm HY_g, in which the gradient method acquires acceleration abilities through the metric transformation (14), has similar properties. The convergence rate estimates for algorithm HY_g and for Newton’s method are qualitatively the same.
Taking into account the efficiency of transformation (14), we propose the conjugate gradient method HY_XS, also based on this transformation. The equivalence of this method to the conjugate gradient method on quadratic functions is proven. To evaluate the efficiency of the proposed method on test functions, we compare it with the quasi-Newton BFGS and the algorithm HY_g.
The test results demonstrate the efficiency of algorithm HY_XS. On almost all test problems, its convergence rate is significantly higher than that of the HY_g method. A comparison with the quasi-Newton method shows that method HY_XS is more efficient in the case of a high degree of variability of the Hessian matrices of the function being minimized.
A computational experiment shows that methods HY_g and HY_XS are effective in solving ill-conditioned problems with complex curvilinear ravines and unstable characteristics of the second derivatives. The main conclusion is that these methods, along with quasi-Newton methods, are applicable to solving problems of minimizing smooth functions with a high degree of conditioning.
The present study was conducted on smooth functions. It would be of interest to extend these metric transformations to increase the convergence rate on non-smooth functions: speculatively, a metric transformation could equalize the elongations of the level surfaces in different directions in the non-smooth case as well. However, we leave a detailed study of this issue for future research.

Author Contributions

Conceptualization, V.K.; methodology, V.K., E.T. and S.G.; software, V.K.; validation, L.K., E.T. and S.G.; formal analysis, L.K. and S.G.; investigation, V.K.; resources, I.R. and L.K.; data curation, S.G.; writing—original draft preparation, V.K. and S.G.; writing—review and editing, E.T. and L.K.; visualization, V.K. and E.T.; supervision, V.K., I.R. and L.K.; project administration, L.K. and I.R.; funding acquisition, L.K. and I.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Higher Education of the Russian Federation, project no. FEFE-2023-0004.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987. [Google Scholar]
  2. Nocedal, J.; Wright, S. Numerical Optimization; Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006. [Google Scholar]
  3. Broyden, C.G. The convergence of a class of double-rank minimization algorithms. J. Inst. Math. Appl. 1970, 6, 76–79. [Google Scholar] [CrossRef]
  4. Fletcher, R. A new approach to variable metric algorithms. Comput. J. 1970, 13, 317–322. [Google Scholar] [CrossRef]
  5. Goldfarb, D. A family of variable metric methods derived by variational means. Math. Comput. 1970, 24, 23–26. [Google Scholar] [CrossRef]
  6. Shanno, D.F. Conditioning of quasi-Newton methods for function minimization. Math. Comput. 1970, 24, 647–656. [Google Scholar] [CrossRef]
  7. Javad Ebadi, M.; Fahs, A.; Fahs, H.; Dehghani, R. Competitive secant (BFGS) methods based on modified secant relations for unconstrained optimization. Optimization 2023, 72, 1691–1706. [Google Scholar] [CrossRef]
  8. Vaziri, A.; Fang, H. Optimal inferential control of convolutional neural networks. In Proceedings of the 2025 American Control Conference (ACC), Denver, CO, USA, 8–10 July 2025; pp. 2603–2610. [Google Scholar] [CrossRef]
  9. Rehman, H.; Peng, Z.-Y.; Yao, J.-C. Approximate subgradient extragradient methods for solving variational inequality problems: Convergence analysis and applications in signal and image processing. Commun. Nonlinear Sci. Numer. Simulat. 2026, 152, 109211. [Google Scholar] [CrossRef]
  10. Xie, Z.; Li, M. Accelerated subgradient extragradient methods with increasing self-adaptive step size for variational inequalities. Commun. Nonlinear Sci. Numer. Simulat. 2025, 147, 108794. [Google Scholar] [CrossRef]
  11. Sunthrayuth, P.; Adamu, A.; Muangchoo, K.; Ekvittayaniphon, S. Strongly convergent two-step inertial subgradient extragradient methods for solving quasi-monotone variational inequalities with applications. Commun. Nonlinear Sci. Numer. Simulat. 2025, 150, 108959. [Google Scholar] [CrossRef]
  12. Carrabs, F.; Gaudioso, M.; Miglionico, G. A two-point heuristic to calculate the stepsize in subgradient method with application to a network design problem. EURO J. Comput. Optim. 2024, 12, 100092. [Google Scholar] [CrossRef]
  13. Bulbul, K.G.; Kasimbeyli, R. Augmented Lagrangian based hybrid subgradient method for solving aircraft maintenance routing problem. Comput. Oper. Res. 2021, 132, 105294. [Google Scholar] [CrossRef]
  14. Zhang, Y.; Khan, K. Evaluating subgradients for convex relaxations of dynamic process models by adapting current tools. Comput. Chem. Eng. 2024, 180, 108462. [Google Scholar] [CrossRef]
  15. Shor, N.Z. Methods of Minimization of Non-Differentiable Functions and Their Applications; Naukova Dumka: Kiev, Ukraine, 1979. (In Russian) [Google Scholar]
  16. Skokov, V.A. Note on minimization methods employing space stretching. Cybern. Syst. Anal. 1974, 10, 689–692. [Google Scholar] [CrossRef]
  17. Krutikov, V.; Gutova, S.; Tovbis, E.; Kazakovtsev, L.; Semenkin, E. Relaxation Subgradient Algorithms with Machine Learning Procedures. Mathematics 2022, 10, 3959. [Google Scholar] [CrossRef]
  18. Tovbis, E.; Krutikov, V.; Kazakovtsev, L. Newtonian Property of Subgradient Method with Optimization of Metric Matrix Parameter Correction. Mathematics 2024, 12, 1618. [Google Scholar] [CrossRef]
  19. Krutikov, V.N.; Petrova, T. Relaxation method of minimization with space extension in the subgradient direction. Ekon. Mat. Met. 2003, 39, 106–119. [Google Scholar]
  20. Shor, N. Minimization Methods for Nondifferentiable Functions; Springer: Berlin/Heidelberg, Germany, 1985. [Google Scholar]
  21. Nemirovskii, A.S.; Yudin, D.B. Problem Complexity and Method Efficiency in Optimization; Wiley: Chichester, UK, 1983. [Google Scholar]
  22. Cao, H.; Song, Y.; Khan, K. Convergence of subtangent-based relaxations of nonlinear programs. Processes 2019, 7, 221. [Google Scholar] [CrossRef]
  23. Wolfe, P. Note on a method of conjugate subgradients for minimizing nondifferentiable functions. Math. Program. 1974, 7, 380–383. [Google Scholar] [CrossRef]
  24. Lemarechal, C. An extension of Davidon methods to non-differentiable problems. Math. Program. Study 1975, 3, 95–109. [Google Scholar]
  25. Demyanov, V.F. Nonsmooth optimization. In Nonlinear Optimization; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1981; Volume 1989, pp. 55–163. [Google Scholar]
  26. Nurminskii, E.A.; Thien, D. Method of conjugate subgradients with constrained memory. Autom. Remote Control 2014, 75, 646–656. [Google Scholar] [CrossRef]
  27. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  28. Loreto, M.; Xu, Y.; Kotval, D. A numerical study of applying spectral-step subgradient method for solving nonsmooth unconstrained optimization problems. Comput. Oper. Res. 2019, 104, 90–97. [Google Scholar] [CrossRef]
  29. Long, Q.; Wu, C.; Wang, X. A system of nonsmooth equations solver based upon subgradient method. Appl. Math. Comput. 2015, 251, 284–299. [Google Scholar] [CrossRef]
  30. Combettes, P.L. Convex set theoretic image recovery by extrapolated iterations of parallel subgradient projections. IEEE Trans. Image Process. 1997, 6, 493–506. [Google Scholar] [CrossRef] [PubMed]
  31. Yukawa, M.; Slavakis, K.; Yamada, I. Adaptive parallel quadratic-metric projection algorithms. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1665–1680. [Google Scholar] [CrossRef]
  32. Beck, A.; Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 2003, 31, 167–175. [Google Scholar] [CrossRef]
  33. Zhukovskii, E.L.; Liptser, R.S. On a recurrence method for computing normal solutions of linear algebraic equations. USSR Comp. Math. Math. Phys. 1972, 12, 1–18. (In Russian) [Google Scholar] [CrossRef]
  34. Hestenes, M.R.; Stiefel, E. Method of Conjugate Gradients for Solving Linear Systems. J. Res. Natl. Bur. Stand. 1952, 49, 409–436. [Google Scholar] [CrossRef]
  35. Bunday, B.D. Basic Optimization Methods; Edward Arnold Limited: London, UK, 1984. [Google Scholar]
Table 1. Results of function fQ(x, [amax = 10^4]) minimization.
n | BFGS iterations | BFGS f&g calculations | HY_g iterations | HY_g f&g calculations | HY_XS iterations | HY_XS f&g calculations
100 | 107 | 324 | 197 | 463 | 166 | 408
200 | 176 | 464 | 304 | 666 | 203 | 472
300 | 234 | 575 | 404 | 865 | 254 | 573
400 | 284 | 672 | 494 | 1045 | 292 | 650
500 | 328 | 760 | 577 | 1211 | 327 | 720
600 | 367 | 837 | 656 | 1370 | 360 | 787
700 | 403 | 909 | 728 | 1515 | 392 | 851
800 | 437 | 978 | 795 | 1649 | 424 | 915
900 | 466 | 1036 | 856 | 1771 | 454 | 975
1000 | 495 | 1093 | 912 | 1884 | 482 | 1032
Table 2. Results of function fQ(x, [amax = 10^8]) minimization.
n | BFGS iterations | BFGS f&g calculations | HY_g iterations | HY_g f&g calculations | HY_XS iterations | HY_XS f&g calculations
100 | 136 | 443 | 474 | 1111 | 476 | 1155
200 | 263 | 785 | 725 | 1668 | 651 | 1559
300 | 377 | 1086 | 967 | 2219 | 796 | 1838
400 | 481 | 1342 | 1217 | 2745 | 934 | 2118
500 | 576 | 1547 | 1455 | 3303 | 1080 | 2429
600 | 675 | 1783 | 1695 | 3863 | 1253 | 2818
700 | 769 | 2018 | 1932 | 4409 | 1414 | 3169
800 | 853 | 2198 | 2164 | 4942 | 1551 | 3484
900 | 944 | 2428 | 2389 | 5432 | 1686 | 3778
1000 | 1026 | 2608 | 2620 | 5996 | 1813 | 4067
Table 3. Results of function fE(x, [amax = 10^2, bmax = 10^3]) minimization from starting point x01.
n | BFGS iterations | BFGS f&g calculations | HY_g iterations | HY_g f&g calculations | HY_XS iterations | HY_XS f&g calculations
100 | 58 | 196 | 354 | 1008 | 46 | 125
200 | 114 | 372 | 682 | 1837 | 58 | 150
300 | 135 | 463 | 949 | 2495 | 37 | 118
400 | 154 | 485 | 1052 | 2691 | 37 | 95
500 | 154 | 496 | 1162 | 2991 | 49 | 132
600 | 185 | 599 | 1017 | 2524 | 39 | 118
700 | 185 | 573 | 917 | 2200 | 39 | 104
800 | 207 | 675 | 1168 | 2858 | 42 | 117
900 | 224 | 739 | 490 | 1171 | 40 | 110
1000 | 236 | 750 | 1042 | 2457 | 36 | 104
Table 5. Results of function fEX(x, [amax = 10^2, bmax = 10^3]) minimization.
n | BFGS iterations | BFGS f&g calculations | HY_g iterations | HY_g f&g calculations | HY_XS iterations | HY_XS f&g calculations
100 | 862 | 2198 | 281 | 817 | 295 | 1005
200 | 1354 | 3436 | 433 | 1097 | 349 | 1180
300 | 1704 | 4356 | 617 | 1438 | 490 | 1257
400 | 1913 | 4909 | 816 | 1848 | 623 | 1561
500 | 2167 | 5581 | 879 | 1946 | 454 | 1339
600 | 2315 | 5967 | 992 | 2173 | 649 | 1572
700 | 2471 | 6397 | 1154 | 2511 | 563 | 1540
800 | 2576 | 6683 | 1333 | 2888 | 679 | 1724
900 | 2683 | 6969 | 1441 | 3105 | 802 | 1934
1000 | 2765 | 7184 | 1601 | 3441 | 804 | 1951
Table 6. Results of function fQ^2(x, [amax = 10^4]) minimization.
n | BFGS iterations | BFGS f&g calculations | HY_g iterations | HY_g f&g calculations | HY_XS iterations | HY_XS f&g calculations
100 | 1028 | 2639 | 141 | 381 | 171 | 544
200 | 1848 | 4787 | 224 | 565 | 201 | 599
300 | 2511 | 6553 | 297 | 701 | 234 | 646
400 | 3056 | 7982 | 359 | 814 | 337 | 696
500 | 3527 | 9232 | 420 | 933 | 346 | 714
600 | 3973 | 10429 | 474 | 1040 | 355 | 732
700 | 4335 | 11378 | 520 | 1120 | 365 | 752
800 | 4644 | 12204 | 567 | 1209 | 378 | 778
900 | 4951 | 13034 | 611 | 1286 | 390 | 802
1000 | 5204 | 13697 | 653 | 1354 | 402 | 826
Table 7. Results of function fabc(x, [amax = 10⁴, bmax = 10³]) minimization.

| n | BFGS: iterations | BFGS: f/g calculations | HY_g: iterations | HY_g: f/g calculations | HY_XS: iterations | HY_XS: f/g calculations |
|---|---|---|---|---|---|---|
| 100 | 1819 | 4735 | 502 | 1407 | 587 | 1635 |
| 200 | 2992 | 7450 | 716 | 2033 | 763 | 2016 |
| 300 | 4004 | 10,030 | 922 | 2430 | 956 | 2498 |
| 400 | 4483 | 11,193 | 1154 | 2987 | 1124 | 2856 |
| 500 | 4961 | 12,371 | 1431 | 3651 | 1262 | 3205 |
| 600 | 5263 | 13,130 | 1397 | 3862 | 1422 | 3590 |
| 700 | 5577 | 13,949 | 1554 | 4241 | 1526 | 3842 |
| 800 | 5850 | 14,737 | 1858 | 4760 | 1677 | 4189 |
| 900 | 6026 | 15,107 | 1917 | 4949 | 1802 | 4489 |
| 1000 | 6154 | 15,439 | 2124 | 5376 | 1914 | 4783 |
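The per-dimension tables also allow an empirical estimate of how iteration counts scale with the problem size n. A minimal sketch, using the Table 7 (fabc) data and an ordinary least-squares fit of log(iterations) against log(n) (the variable names and the power-law model iters ≈ C·n^p are our illustrative assumptions, not part of the paper's analysis):

```python
import math

# Estimate the exponent p in iters ~ C * n**p by least squares in log-log space.
# Data: Table 7 (fabc), iteration counts for n = 100..1000.
ns = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
iters = {
    "BFGS":  [1819, 2992, 4004, 4483, 4961, 5263, 5577, 5850, 6026, 6154],
    "HY_g":  [502, 716, 922, 1154, 1431, 1397, 1554, 1858, 1917, 2124],
    "HY_XS": [587, 763, 956, 1124, 1262, 1422, 1526, 1677, 1802, 1914],
}

def loglog_slope(xs, ys):
    """OLS slope of log(y) regressed on log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

for method, ys in iters.items():
    print(f"{method}: p ~= {loglog_slope(ns, ys):.2f}")
```

For these data the fitted exponents come out roughly in the 0.5–0.65 range for all three methods, i.e., the growth with n is similar; the methods differ mainly in the constant factor, where HY_g and HY_XS hold a roughly threefold advantage over BFGS on fabc.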
Table 8. Results of function minimization, n = 1000.

| Function | BFGS: iterations | BFGS: f/g calculations | HY_g: iterations | HY_g: f/g calculations | HY_XS: iterations | HY_XS: f/g calculations |
|---|---|---|---|---|---|---|
| fQ, amax = 10⁴ | 495 | 1093 | 912 | 1884 | 482 | 1032 |
| fQ, amax = 10⁸ | 1026 | 2608 | 2620 | 5996 | 1813 | 4067 |
| fE, x₀¹ | 236 | 750 | 1042 | 2457 | 36 | 104 |
| fE, x₀² | 2631 | 6932 | 1540 | 3697 | 248 | 817 |
| fEX | 2765 | 7184 | 1601 | 3441 | 804 | 1951 |
| fQ2 | 5204 | 13,697 | 653 | 1354 | 402 | 826 |
| fabc | 6154 | 15,439 | 2124 | 5376 | 1914 | 4783 |
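The aggregate speedups can be unpacked per function from the n = 1000 summary in Table 8. The sketch below (iteration counts transcribed from the table; the dictionary keys are our shorthand) prints how many times fewer iterations HY_g and HY_XS need than BFGS on each problem; ratios above 1 favor the proposed methods:

```python
# Iteration-count ratio BFGS / method, per test function, from Table 8 (n = 1000).
table8 = {
    # function: (BFGS, HY_g, HY_XS) iteration counts
    "fQ, amax=1e4": (495, 912, 482),
    "fQ, amax=1e8": (1026, 2620, 1813),
    "fE, x0_1":     (236, 1042, 36),
    "fE, x0_2":     (2631, 1540, 248),
    "fEX":          (2765, 1601, 804),
    "fQ2":          (5204, 653, 402),
    "fabc":         (6154, 2124, 1914),
}

for name, (bfgs, hy_g, hy_xs) in table8.items():
    print(f"{name}: HY_g x{bfgs / hy_g:.2f}, HY_XS x{bfgs / hy_xs:.2f}")
```

The ratios make the problem dependence explicit: both proposed methods fall behind BFGS on the quadratic fQ with amax = 10⁸, and HY_g also loses on fE from x₀¹, whereas on the ill-conditioned fQ2 and fabc problems HY_g and HY_XS need several times fewer iterations than BFGS.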
Table 4. Results of function fE(x, [amax = 10², bmax = 10³]) minimization from starting point x₀².

| n | BFGS: iterations | BFGS: f/g calculations | HY_g: iterations | HY_g: f/g calculations | HY_XS: iterations | HY_XS: f/g calculations |
|---|---|---|---|---|---|---|
| 100 | 937 | 2465 | 242 | 689 | 74 | 288 |
| 200 | 1434 | 3729 | 364 | 1008 | 139 | 507 |
| 300 | 1729 | 4526 | 603 | 1575 | 146 | 513 |
| 400 | 1929 | 5042 | 661 | 1661 | 258 | 857 |
| 500 | 2127 | 5590 | 819 | 2019 | 180 | 602 |
| 600 | 2267 | 5945 | 840 | 2034 | 204 | 679 |
| 700 | 2356 | 6169 | 1017 | 2454 | 261 | 852 |
| 800 | 2428 | 6381 | 1160 | 2769 | 271 | 924 |
| 900 | 2532 | 6676 | 1355 | 3249 | 207 | 697 |
| 1000 | 2631 | 6932 | 1540 | 3697 | 248 | 817 |
Share and Cite

Krutikov, V.; Tovbis, E.; Gutova, S.; Rozhnov, I.; Kazakovtsev, L. Properties and Application of Incomplete Orthogonalization in the Directions of Gradient Difference in Optimization Methods. Mathematics 2025, 13, 4036. https://doi.org/10.3390/math13244036
