1. Introduction
We consider the problem of minimizing a differentiable function f(x) on a finite-dimensional Euclidean space. This problem can be stated as
min f(x), x ∈ Rn. (1)
Well-known methods [1,2] were developed to solve the unconstrained minimization problem, including the gradient method, which is based on a local linear approximation of the function. Conjugate gradient methods (CGM) generate search directions that are consistent with the geometry of the minimized function. In practice, CGM shows faster convergence than gradient descent algorithms, so it is widely used in machine learning.
Quasi-Newton methods (QNM) are based on the idea of using a matrix of second derivatives reconstructed from the gradients of a function. The commonly used matrix update formula for quasi-Newton methods is BFGS [3,4,5,6]. Two quasi-Newton algorithms for solving unconstrained optimization problems, based on two modified secant relations to achieve reliable approximations of the Hessian matrices of the objective function, were presented in [7].
Ultra-high dimensions and strong nonlinearity can lead to extremely complex optimization landscapes, on which gradient-based solvers perform poorly or fail easily [8]. Without denying the merits of the gradient descent method, it becomes very slow when moving along a ravine, and as the number of variables of the objective function increases, such behavior becomes typical. The performance of QNM degrades on ill-conditioned problems with unstable or rapidly varying Hessians, for example, on functions with curved ravine structures.
Relaxation subgradient methods (RSM) have been used in optimization practice for many years and have found application in such areas as signal and image processing [9,10], classification [11], network design [12], maintenance routing [13], dynamic process modeling [14], and many others.
In our earlier studies, we showed that the problem of finding the descent direction in RSM can be reduced to solving a system of inequalities on subgradient sets, formulated mathematically as the minimization of a certain quality functional. In this case, the properties of the learning algorithm determine the convergence rate of the minimization method.
In relaxation processes of the ε-subgradient type, successive approximations are constructed as follows:
xk+1 = xk − γk sk, k = 0, 1, 2, …
Here, k is the iteration number, γk is the stepsize, and the descent direction sk is selected from the set of feasible directions S(Gε(xk)) [1,15,16,17,18,19], where Gε(xk) is the ε-subgradient set at the point xk and g denotes its elements (subgradients).
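The relaxation step xk+1 = xk − γk sk can be sketched directly. The following is a minimal illustration, assuming the descent direction is simply a subgradient of the objective and using a diminishing stepsize; the function f(x) = |x1| + |x2| and all parameter choices here are hypothetical, not taken from the paper:

```python
import numpy as np

def relaxation_step(x, s, gamma):
    # One relaxation step of the process: x_{k+1} = x_k - gamma_k * s_k
    return x - gamma * s

# Illustrative nonsmooth objective f(x) = |x1| + |x2|; sign(x) is a subgradient.
x = np.array([4.0, -2.0])
for k in range(200):
    s = np.sign(x)                            # subgradient at the current point
    x = relaxation_step(x, s, 1.0 / (k + 1))  # diminishing stepsize gamma_k
print(np.abs(x).max())
```

With a diminishing stepsize, the iterates oscillate around the minimizer with shrinking amplitude, which is the behavior the relaxation process is designed to damp by a better choice of sk.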
Denote by G a subgradient set at a point x. If the set S(G) is not empty, then any vector s ∈ S(G) is a solution to the set of inequalities:
(s, g) > 0, ∀g ∈ G; (2)
in other words, it defines the normal of a plane separating the origin from the set G. Here, (s, g) is the dot product of vectors. One of the solutions to (2) is the vector of minimal length from G, denoted η(G).
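For a finite set G, the inequalities (2) can be checked directly. A small sketch follows; the set G here is made up purely for illustration:

```python
import numpy as np

def solves_inequalities(s, G):
    # Check (2): (s, g) > 0 for every g in the finite set G (rows of G).
    return bool(np.all(G @ s > 0))

# Hypothetical finite gradient set G (rows are vectors g).
G = np.array([[2.0, 1.0],
              [1.0, 2.0],
              [1.5, 1.5]])

# Vector of minimal length among the elements of G (eta(G) in the text).
eta = G[np.argmin(np.linalg.norm(G, axis=1))]
print(eta, solves_inequalities(eta, G))
```

If the inequalities hold, eta is the normal of a plane separating the origin from G, i.e., a valid descent direction for this set.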
The subgradient method was first proposed by Shor [15,20]. A number of effective approaches arose from the development of the first subgradient methods with space dilation [21,22]. The first relaxation subgradient methods were suggested in [23,24,25].
In Ref. [26], a method for convex nondifferentiable optimization problems was presented that relies on the basic philosophy of the conjugate gradient method and coincides with it in the case of quadratic functions. The authors of [27] propose a family of adaptive subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. A spectral-step subgradient method for nonsmooth unconstrained optimization problems was demonstrated in [28]. In Ref. [29], a subgradient method based on stepsize adjustment is developed to solve systems of nonsmooth equations. Subgradient projection methods are often applied to large-scale problems with decomposition techniques. Adaptive projection subgradient methods were proposed in [30,31]. The authors of [32] consider a projected-subgradient-type method derived from using a general distance-like function instead of the usual squared Euclidean distance.
As in QNM, RSM uses relations to calculate the required characteristics. These relations are equalities whose solution determines the desired values. Because these equalities are numerous and arrive sequentially, they have to be solved by gradient-type methods as information becomes available. Thus, the iterative least squares method (ILSM) is applicable in this case.
In this paper, we consider modifications of subgradient methods applied to solving smooth minimization problems. In this case, the system of inequalities (2) will be formed by a set of gradients of some neighborhood of the current minimum. The descent direction s satisfying (2) will allow the method to go beyond this neighborhood as a result of the minimization step along this direction.
The main contributions of this paper are as follows:
(1) A methodological approach was proposed based on a procedure of incomplete orthogonalization in the directions of gradient differences, implemented through the iterative least-squares method. This approach uses the structural characteristics of the level surfaces instead of second-derivative approximations.
(2) Two methods were constructed based on this approach: a gradient method with the ILSM metric and a modification of the Hestenes–Stiefel conjugate gradient method with the same metric. Both methods were implemented and numerically studied in comparison with the quasi-Newton BFGS method on various types of smooth functions. The test results indicate the efficiency of the proposed methods, especially when solving poorly conditioned problems with complex curved ravines and unstable characteristics of the second derivatives.
(3) The noted linear transformation of coordinates eliminates the linear background that worsens the convergence of the gradient method. In this work, we proved that the algorithm HY_g, in which the gradient method acquires accelerating abilities due to the proposed metric transformation, has properties similar to Newton’s method, quasi-Newton methods, and subgradient methods with a change in the space metric. The qualitative nature of the convergence rate estimates for algorithm HY_g and Newton’s method coincides. Also, the equivalence of the conjugate gradient method with the new metric (HY_XS) to the conjugate gradient method on quadratic functions is proven.
The results obtained allow us to conclude that it is possible to use the studied methods along with quasi-Newton methods to solve smooth optimization problems with a high degree of conditionality.
The rest of the paper is organized as follows. Section 2 describes our variation of the gradient minimization method. In Section 3, we analyze the properties and convergence rate of the proposed algorithm. In Section 4, we study the acceleration properties of the space dilation algorithm in the direction of the gradient difference. In Section 5, we propose the Hestenes–Stiefel method in a metric with incomplete orthogonalization in the direction of the gradient difference. In Section 6, we present the results of numerical experiments. Section 7 concludes the work.
2. Gradient Minimization Method with Incomplete Orthogonalization in the Direction of the Gradient Difference
For the gradients on the descent trajectory of (1), we use the notation gk = g(xk) = ∇f(xk). The idea of creating an iterative method for solving the system of inequalities (2) is based on its transformation into a system of equalities [17,18,19].
To derive the formulas of the iterative process, we use an idealized model of the set G [17,18,19]. Let G belong to a certain hyperplane, and let the vector η(G) also be the vector of minimal length of this hyperplane. Then there is a solution to the system of equalities:
(s, g) = 1, ∀g ∈ G, (3)
which simultaneously satisfies (2). Here, s is a descent direction and g is the gradient. The solution of system (3) can be obtained as the solution of the system of equalities:
(s, gi) = 1, i = 0, 1, …, k. (4)
One of the possible solutions to system (4) can be found in the form s* = arg min Φ(s), where Φ(s) is the sum-of-squares function of the residuals:
Φ(s) = Σi wi ((s, gi) − 1)^2.
Here, wi are weighting factors. Such a solution, taking into account a regularizing component, can be obtained by the iterative least squares method (ILSM) [33].
In Ref. [19], based on ILSM, an iterative process of finding the descent direction for the minimization method (1) from the training information (4) was obtained using special weighting factors wi in Φ(s), yielding formulas (5) and (6) for the direction and metric updates. Here, αk > 1 is the space dilation parameter and Hk is the metric matrix. In (5), the direction adjustment is set so that the learning relation (sk+1, gk) = 1 is fulfilled.
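The least-squares treatment of system (4) can be illustrated with a direct (non-iterative) weighted solve. This sketch uses generic weights and a small regularizer lam in place of the paper's specific ILSM recursion (5), (6), which is not reproduced in the extracted text:

```python
import numpy as np

def ils_direction(G, w, lam=1e-8):
    # Minimize sum_i w_i * ((s, g_i) - 1)^2 + lam * ||s||^2 via normal equations.
    A = G.T @ (w[:, None] * G) + lam * np.eye(G.shape[1])
    b = G.T @ w                     # right-hand side: targets are all equal to 1
    return np.linalg.solve(A, b)

G = np.array([[2.0, 0.0],
              [0.0, 2.0]])
s = ils_direction(G, np.array([1.0, 1.0]))
print(G @ s)                        # each (s, g_i) should be close to 1
```

The recursive ILSM of the paper solves the same weighted problem incrementally as new gradients arrive, rather than re-solving the normal equations from scratch.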
Using (5) and (6), we derive the iterative process obtained earlier on the basis of heuristic considerations for solving non-smooth optimization problems (the r-algorithm [15]) in the form proposed in [16].
To obtain the formulas of the iterative process at iteration k, we transform the data (4) by subtracting adjacent equalities, which yields a new system (7). For part of the data, at i = 0, 1, …, k − 1, we perform transformations (5) and (6); transformation (5) takes the form (8). As a result of transformations (8), regardless of the values of the matrices Hi, by virtue of the equality s0 = 0 we obtain si = 0, i = 0, 1, …, k − 1. Together with (8), we carry out transformation (6) sequentially, obtaining (9).
Transformation (6) for the last equality from (7) can be omitted because, as a result of transformation (5) for the last equality, a vector sk collinear to the vector Hkgk is obtained. Therefore, the iterative minimization process using transformations (5) and (6) for the data (7) at k = 0, 1, 2, … has the form (1), where the descent direction and the metric matrix are updated by formulas (10), with fixed values of the space dilation parameter αk = α.
Thus, using the idea of obtaining a descent direction satisfying the system of inequalities (2), we derived the iterative process of the r-algorithm [15] in the form of [16], which was previously obtained on the basis of heuristic considerations.
In Ref. [4], to speed up the process of finding the descent direction in method (5), (6), one of its special cases is presented. An algorithm is proposed and theoretically justified for obtaining a direction sk satisfying, in contrast to (5), the last two learning relations (sk, gk) = 1 and (sk, gk−1) = 1 simultaneously; the formulas for obtaining the descent direction are given by (11) and (12). In this case, the metric matrix is adjusted according to formulas (10) with a varying space dilation parameter αk. Note that, in order to ensure convergence when solving non-smooth problems, the choice of the descent direction is not always carried out according to formulas (11) and (12).
Here and below, we denote an algorithm by specifying the sequence of actions via the corresponding equation numbers. In Ref. [17], estimates of the optimal parameter values are obtained, and it is shown that the algorithm (1)–(11)–(12)–(10) in the quadratic case generates a sequence of conjugate descent vectors. In the case of smooth functions, it becomes possible to use in the algorithm (1)–(11)–(12)–(10) the conjugate gradient method instead of formulas (11), (12) for generating the descent direction.
The transformation of the ILSM metric (10) in the minimization algorithm has the property of partially orthogonalizing the descent vectors to the directions of the adjacent gradients difference on the descent trajectory. In the case of minimizing the quadratic function, the use of transformations (10) to form the descent direction increases the degree of conjugacy.
In Ref. [18], estimates of the linear convergence rate of the algorithm (1)–(9)–(10) are obtained on strongly convex functions with a Lipschitz gradient. It is proven that, on such functions, this method has accelerating properties qualitatively similar to those of Newton’s method. The accelerating properties of Newton’s method [18] and quasi-Newton methods are explained by their use of matrices of second derivatives or approximations thereof. The accelerating properties of the algorithm (1)–(9)–(10) are due to the property of the matrices (10) of partially orthogonalizing the descent vectors to the directions of adjacent gradient differences on the descent trajectory.
It was shown in [18] that the minimization algorithm (1) with the descent direction computed by (11), (12) and the matrix transformation (10) is identical to the conjugate gradient method when minimizing quadratic functions. In this paper, the Hestenes–Stiefel conjugate gradient method with metric matrices (10) is developed to solve problems of minimizing smooth functions. The finite convergence of this method on quadratic functions is proved.
3. Properties and Convergence Rate of the Gradient Minimization Method with Incomplete Orthogonalization in the Direction of the Gradient Difference
Here, we analyze the properties of the algorithm (1)–(9)–(10). The algorithm consists of the following three steps:
1. Compute the next minimum point;
2. Compute the descent direction;
3. Update the metric matrix.
The iteration, the correction vector pk, and the metric matrix update Hk+1 are as follows:
xk+1 = xk − γk sk, sk = Hk gk, γk = arg minγ f(xk − γ sk), (13)
Hk+1 = Hk − (1 − 1/α^2) pk pk^T/(yk, pk), pk = Hk yk, yk = gk+1 − gk, (14)
where k is the iteration number, xk is the current minimum point, sk is the descent direction, Hk is the metric matrix at iteration k, gk is the gradient at iteration k, γk is the stepsize, α is the space dilation parameter, yk is the gradient difference, x0 is the initial point, and H0 is the initial matrix.
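One possible reading of algorithm (13)–(14) can be sketched on a quadratic. The forms sk = Hk gk, the exact line search, the space-dilation update of H along the gradient difference, and the parameter α = 3 are assumptions of this sketch, not a definitive implementation of the paper's method:

```python
import numpy as np

def hy_g_sketch(A, x0, alpha=3.0, iters=50):
    # Gradient method in a changing metric on f(x) = 0.5 x^T A x (assumed forms).
    x, H = x0.astype(float), np.eye(len(x0))
    for _ in range(iters):
        g = A @ x                          # gradient of the quadratic
        if np.linalg.norm(g) < 1e-12:
            break
        s = H @ g                          # descent direction, assumed s_k = H_k g_k
        gamma = (g @ s) / (s @ A @ s)      # exact one-dimensional descent
        x_new = x - gamma * s
        y = A @ x_new - g                  # gradient difference y_k
        Hy = H @ y
        if y @ Hy > 1e-16:                 # space-dilation metric update (assumed form)
            H -= (1 - 1 / alpha**2) * np.outer(Hy, Hy) / (y @ Hy)
        x = x_new
    return x

A = np.diag([1.0, 100.0])                  # ill-conditioned quadratic
x = hy_g_sketch(A, np.array([100.0, 100.0]))
print(np.linalg.norm(x))
```

After the first zigzag step the metric suppresses the gradient component across the ravine, and the iterates drop toward the minimizer far faster than plain steepest descent would on this conditioning.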
As will be shown below, method (13)–(14) is invariant with respect to a linear transformation of coordinates. By invariance, we mean the similarity of the process in different coordinate systems: when the iterations of the method are transferred to a new coordinate system, the iterative process is preserved. No conditions are imposed on the function other than differentiability. For instance, in the case of a quadratic function, after transferring to a coordinate system where the Hessian has eigenvalues equal to one, the gradient differences yk will coincide with the method offsets xk+1 − xk during the iteration. Consequently, in the new coordinate system, the choice of the descent direction in (13) looks like a process of partial orthogonalization of the gradient to the previous descent directions using matrices (14). With complete orthogonalization, such a minimization process along an orthogonal system of directions for a quadratic function with unit Hessian would be finite [1].
The limiting variant of algorithm (13)–(14) for α → ∞ takes the form of the previously known conjugate gradient method [1]. In this method, the descent directions are orthogonalized to all previous differences of adjacent gradients. This property does not allow the method to optimize in the subspace defined by the gradient differences, that is, it excludes that subspace from the optimization process. Such a method is not effective, especially in large dimensions. In algorithm (13)–(14), only a partial reduction of the components of the descent direction in the subspace of the preceding differences of adjacent gradients is performed. This does not block the possibility of minimization in the examined subspace. Transformation (14) suppresses to a greater extent the gradient components directed orthogonally to the ravine directions. This is one of the qualitative justifications of the accelerating properties of transformation (14), on the basis of which the r-algorithm was developed [15].
Cases of low efficiency of quasi-Newton methods are associated with functions in which the Hessian eigenvalues decrease as they approach the minimum. In this case, the inverse matrices of quasi-Newton methods increase in the examined subspace, which makes it difficult for the method to enter the unexplored subspace. In algorithms (13)–(14), by reducing the influence of the examined subspace, the unexplored subspace always has an advantage. For this reason, for some classes of functions, algorithm (13)–(14) may be more effective than quasi-Newton methods, which will be confirmed by numerical tests.
Transformation (14) in algorithm (13)–(14) can be understood as a scaling transformation for the gradient method. This transformation may be useful as a scaler in other methods. In this paper, we will consider the use of transformation (14) in the Hestenes–Stiefel conjugate gradient method.
We investigate the convergence rate of algorithm (13)–(14). Let us consider the conditions under which the convergence rate and accelerating properties of the r-algorithm are estimated.
Condition 1.
We assume that the function f(x), x ∈ Rn, is differentiable and strongly convex on Rn, i.e., there exists ρ > 0 such that ∀x, y ∈ Rn the following inequality holds:
f(y) ≥ f(x) + (g(x), y − x) + ρ‖y − x‖^2/2,
and its gradient g(x) = ∇f(x) satisfies the Lipschitz condition:
‖g(x) − g(y)‖ ≤ L‖x − y‖, ∀x, y ∈ Rn.
Here, L is the Lipschitz parameter.
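Condition 1 can be checked numerically for a simple quadratic, for which ρ and L are the extreme eigenvalues of the Hessian; the test function below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([1.0, 3.0, 10.0])     # Hessian of the illustrative quadratic
f = lambda x: 0.5 * x @ A @ x
g = lambda x: A @ x
rho, L = 1.0, 10.0                # extreme eigenvalues of A

ok = True
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    # Strong convexity: f(y) >= f(x) + (g(x), y - x) + rho/2 * ||y - x||^2
    ok = ok and f(y) >= f(x) + g(x) @ (y - x) + 0.5 * rho * np.sum((y - x) ** 2) - 1e-9
    # Lipschitz gradient: ||g(x) - g(y)|| <= L * ||x - y||
    ok = ok and np.linalg.norm(g(x) - g(y)) <= L * np.linalg.norm(x - y) + 1e-9
print(ok)
```

For a quadratic both inequalities hold with equality-tight constants, so the check passes for any sample of points.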
Denote by x* the minimum point of the function f(x), f* = f(x*), fk = f(xk). The iteration of the gradient-type method with exact one-dimensional descent has the form:
xk+1 = xk − γk sk, γk = arg minγ f(xk − γ sk), (15)
where the initial point x0 is given. Formula (15) does not specify the descent direction sk; only the descent condition (sk, gk) > 0 is imposed on it. The direction sk from (13) also satisfies this condition due to the positive definiteness of the metric matrices.
The following theorem shows that the changes in the gradient over the iterations of method (15) with exact one-dimensional descent lead to a decrease in the function value.
Theorem 1
([18]). Let the function satisfy Condition 1. Then, for the sequence fk, k = 0, 1, 2, …, specified by (15), the estimate (16) takes place, where f* is the optimal function value, f0 is the initial function value, L is the Lipschitz parameter, and ρ is defined in Condition 1.
Here, we derive recurrent formulas for the determinants and traces of the metric matrices under consideration. Denote zk = Hk^{1/2} yk. Denote by Sp(A) the trace of a matrix A and by det(A) the determinant of a matrix A. For an arbitrary matrix A > 0, we denote by A^{1/2} the symmetric matrix for which A^{1/2}A^{1/2} = A and A^{1/2} > 0.
Lemma 1.
Let Hk > 0 and let the matrix Hk+1 be obtained as a result of the transformation (14):
Hk+1 = Hk − (1 − 1/α^2) (Hk yk)(Hk yk)^T/(yk, Hk yk).
Then, Hk+1 > 0 and
Hk+1^{−1} = Hk^{−1} + (α^2 − 1) yk yk^T/(yk, Hk yk), (17)
Sp(Hk+1^{−1}) = Sp(Hk^{−1}) + (α^2 − 1)(yk, yk)/(yk, Hk yk), (18)
det(Hk+1) = det(Hk)/α^2. (19)
Proof of Lemma 1.
We transform the right-hand side of (14) using the notation zk = Hk^{1/2} yk:
Hk+1 = Hk^{1/2}(I − (1 − 1/α^2) zk zk^T/(zk, zk)) Hk^{1/2}. (20)
Inverting the matrices in (20), we obtain (17). Formula (18) follows from (17). Calculating the determinants of the matrices in (20), we obtain (19). □
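Assuming the space-dilation form of transformation (14), Hk+1 = Hk − (1 − 1/α^2)(Hk yk)(Hk yk)^T/(yk, Hk yk), the inverse and determinant relations stated in Lemma 1 can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 5, 2.0
B = rng.standard_normal((n, n))
H = B @ B.T + n * np.eye(n)       # a positive definite H_k
y = rng.standard_normal(n)
Hy = H @ y

# Assumed form of transformation (14).
H1 = H - (1 - 1 / alpha**2) * np.outer(Hy, Hy) / (y @ Hy)

# (17): H_{k+1}^{-1} = H_k^{-1} + (alpha^2 - 1) y y^T / (y, H_k y)
inv_pred = np.linalg.inv(H) + (alpha**2 - 1) * np.outer(y, y) / (y @ Hy)
print(np.allclose(np.linalg.inv(H1), inv_pred))

# (19): det(H_{k+1}) = det(H_k) / alpha^2
print(np.isclose(np.linalg.det(H1), np.linalg.det(H) / alpha**2))
```

Both identities follow from the rank-one structure of the update (a Sherman–Morrison computation for (17), and the fact that the factor in (20) has determinant 1/α^2 for (19)).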
The following theorem substantiates the linear convergence rate of algorithm (13)–(14).
Theorem 2.
Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2, …, given by the algorithm (13)–(14) with a bounded initial matrix H0 satisfying (21), where M0 and m0 are the maximal and minimal eigenvalues of the matrix H0, respectively, the estimate (22) takes place.
Proof of Theorem 2.
Due to the exact one-dimensional descent in (13), the condition (gk+1, sk) = 0 is satisfied. The matrix Hk is positive definite; therefore, using this equality, we obtain (23). Hence, taking into account the bound on the quadratic form via Mk, the maximum eigenvalue of the matrix Ak, we find an estimate that reduces inequality (23) to the form (24). Based on the relation between the arithmetic and geometric means of the eigenvalues of a matrix A > 0,
det(A) ≤ (Sp(A)/n)^n,
and using this inequality in (24) together with (19), we obtain (25). By virtue of condition (21), the corresponding bounds on the eigenvalues of the metric matrices hold. Taking these relations into account, we transform inequality (25) and take its logarithm, which yields an inequality that, together with estimate (16) of Theorem 1, proves (22). □
The obtained convergence rate estimates do not explain the method’s high convergence rate, for example, on quadratic functions. To justify the method’s acceleration properties, we have to demonstrate its invariance under a linear coordinate transformation and then use estimate (22) in the coordinate system in which the ρ/L ratio is maximal. It is possible to increase this ratio, for example, in the case of quadratic functions, where its maximum value is 1.
4. Acceleration Properties of the Space Dilation Algorithm in the Direction of the Gradient Difference
To establish convergence rates faster than linear, minimization methods usually exploit the low variability of the Hessian in the neighborhood of the extremum, which is estimated on the basis of the Hessian properties. Here, we examine the acceleration properties of the algorithm without any assumptions about the function’s matrices of second derivatives.
According to estimate (22), the convergence rate of the r-algorithm is determined by the magnitude of the ρ/L ratio. In our study, similarly to [18], we assume the existence of a linear transformation of coordinates that increases the magnitude of this ratio.
Let a function f(x) satisfy Condition 1. We define a linear transformation of variables
y = Px, (26)
where P is a non-singular matrix. In the new coordinate system, the function being minimized takes the form (27). The function (27) formed in this way also satisfies Condition 1, with its own strong convexity parameter ρ and Lipschitz parameter L.
Denote by V a non-singular matrix that defines a distinguished transformation (28) such that, for the parameters ρ and L of the transformed functions, inequality (29) holds for an arbitrary non-singular matrix P in (26). As was established in [18], transformation (28) plays the role of a distinguished coordinate system, which is the best from the point of view of the convergence rate of gradient methods. For a non-degenerate quadratic function, we can consider a coordinate system where ρ/L = 1 and the maximum and minimum eigenvalues are equal. In the general case, we only assume the possibility of eliminating the differences in the elongation of the level surfaces.
Next, we will show that algorithm (13)–(14), applied to the function f(x), and algorithm (13)–(14), applied to the function defined in (27), under appropriate initial conditions, construct sequences of minimization points related by the transformation (26). This will allow us to use the estimates of the convergence rate of algorithm (13)–(14) in the preferred coordinate system.
Theorem 3.
Let the initial conditions of the algorithm (13)–(14), applied to minimize the functions f(x) and fP defined in (27), be related by the equalities (30). Then the characteristics of these processes are related by the relations (31).
Proof of Theorem 3.
For the gradients of the functions fP and f(x), the standard relation between gradients under the transformation (26) holds. From this and (30), (31) follows for k = 0. Assume that equalities (31) are satisfied ∀k = 0, 1, …, i; let us show that they hold for k = i + 1. From (13) for k = i, after multiplying on the left by P and taking into account the proved equalities (31), we obtain (32). Hence, according to the definition of the function fP, at the stage of one-dimensional minimization the stepsizes of the two processes coincide. Therefore, the right-hand side of (32) is the implementation of the step in the new coordinate system. Consequently, (33) holds. Multiplying (14) with the current indices on the left by P and on the right by P^T, taking into account (33), we find that the right-hand side is the implementation of (14) in the new coordinate system. Finally, we obtain the corresponding relation for the metric matrices, so equalities (31) are also valid for k = i + 1. Continuing the induction, we obtain the proof of the theorem. □
In the following theorem, we use algorithms (13)–(14) in a distinguished coordinate system (28) with property (29).
Theorem 4.
Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2, …, given by the algorithm (13)–(14) with a bounded initial matrix H0 according to (21), the estimate (34) takes place, where the minimum and maximum eigenvalues of the initial matrix are taken in the distinguished coordinate system (28) having the property (29).
Proof of Theorem 4.
According to the results of Theorem 3, we can choose an arbitrary coordinate system to estimate the convergence rate of the minimization process in the algorithm. Therefore, using estimate (22) in the coordinate system with matrix P = V, we obtain estimate (34). □
The first term in square brackets (34) characterizes the constant in the method’s convergence rate estimate, and the second term represents the cost of adjusting the metric matrix.
Let us consider the acceleration effect of algorithm (13)–(14) compared to well-known methods: steepest descent and Newton’s method. For these methods, the convergence rate estimate is as follows:
Under Condition 1 imposed on the function, for the steepest descent method, where sk = gk, the decay rate is given by (36) [1]. For Newton’s method, where the descent direction is built using the inverse Hessian, the decay rate is given by (37) [18]. The convergence rate estimate for Newton’s method, due to its invariance with respect to linear transformations of coordinates, has the form (38) [18]. Taking into account (36)–(38), the convergence rate of Newton’s method is significantly higher than that of the steepest descent method.
Estimate (34) for the algorithm (13)–(14) is equivalent to the estimate for Newton’s method (37) in terms of the influence of constants on the convergence rate. Under condition (39), the convergence rate estimate for the algorithm (13)–(14) is preferable to that for the steepest descent method, which is confirmed further by a computational experiment.
Thus, algorithm (13)–(14) on strongly convex functions, without assuming the existence of second derivatives, exhibits acceleration properties compared to the steepest descent method.
5. Hestenes–Stiefel Method in a Metric with Incomplete Orthogonalization in the Direction of Gradient Difference
Transformation (14) in the algorithm (13)–(14) can act as a scaling transformation for the steepest descent. This transformation may also be useful in other methods.
In Ref. [17], it is shown that the sequence of approximations of the minimum generated by the algorithm (1)–(11)–(12) with the matrix transformation (10), or equivalently (14), coincides with the sequence generated by the conjugate gradient method. Transformations (11), (12) are based on the simultaneous solution of the system of equalities (s, gk) = 1 and (s, gk−1) = 1; for their difference, the equality (s, yk−1) = 0 holds. In the Hestenes–Stiefel method [30], the new direction is also chosen based on the equality (sk+1, yk) = 0. In the quadratic case, the new direction in the Hestenes–Stiefel method is conjugate to the current descent direction. In the case of smooth functions, we can replace transformations (11), (12) for obtaining the descent direction with the transformations of the Hestenes–Stiefel method.
The Hestenes–Stiefel conjugate gradient method [34] has the form:
xk+1 = xk − γk sk, γk = arg minγ f(xk − γ sk), (40)
sk+1 = gk+1 − sk (yk, gk+1)/(yk, sk), s0 = g0. (41)
From (41), it follows that the new descent direction is orthogonal to the current gradient difference: (sk+1, yk) = 0. For a quadratic function, this implies conjugacy of adjacent descent vectors, regardless of the accuracy of the one-dimensional search. In the Hestenes–Stiefel method with the space metric transformation (14), we also use the property (sk+1, yk) = 0 to obtain the descent vector transformation formulas (42)–(44).
Algorithm (42)–(44) has the properties of the conjugate gradient method. To justify this property, we need the following theorem.
Theorem 5.
Let the matrices Hi, i = 1, 2, …, k, be obtained as a result of transformations (44), and let the vectors yi, i = 0, 1, …, k − 1, used in (44), be orthogonal to the vector gk+1. Then Hk gk+1 = H0 gk+1.
Proof of Theorem 5.
We carry out the proof by induction. Let Hi gk+1 = H0 gk+1 hold for i < k. Then, taking into account the orthogonality of the vectors yi, i = 0, 1, …, k − 1, to the vector gk+1, the transformation (44) leaves the product unchanged, i.e., Hk gk+1 = Hk−1 gk+1 = H0 gk+1. Continuing the induction, we obtain the statement of the theorem. □
Regarding the convergence of algorithm (42)–(44) on quadratic functions, the following theorem holds.
Theorem 6.
When minimizing quadratic functions, the sequences of points xk, k = 0, 1, 2, …, of algorithms (40)–(41) and (42)–(44) with the same initial points and initial matrix coincide.
Proof of Theorem 6.
We prove the theorem by induction. Let the points xi of algorithms (40)–(41) and (42)–(44) coincide for i < k. In conjugate gradient methods, when minimizing a quadratic function using exact one-dimensional descent, the gradient vectors are mutually orthogonal. Therefore, the current gradient gi at a new point is the same for both algorithms and is orthogonal to all previous gradients. The vector si is obtained by formula (43) using the matrix Hi−1. The matrix Hi−1 is built from the vectors yj, j = 0, 1, …, i − 2, which in turn involve the gradient vectors gj, j = 0, 1, …, i − 1, orthogonal to the vector gi. Hence, according to Theorem 5, we obtain Hi−1 gi = H0 gi. Using this equality in (43), we obtain the coincidence of the descent direction with that of the Hestenes–Stiefel method. Consequently, the new iteration of each process is performed by minimization along the same direction and from the same point. Continuing the induction, we obtain the proof of the theorem. □
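The coincidence stated in Theorem 6 can be illustrated numerically. Since formulas (42)–(44) are not reproduced in the extracted text, the metric variant below is a sketch that assumes the update (14) together with an HS-type direction built from Hg, which keeps (sk+1, yk) = 0:

```python
import numpy as np

def exact_step(A, x, s):
    # Exact one-dimensional descent for f(x) = 0.5 x^T A x.
    g = A @ x
    return (g @ s) / (s @ A @ s)

def hs(A, x0, iters):
    # Hestenes-Stiefel CG (40)-(41).
    x = x0.astype(float)
    g = A @ x
    s = g.copy()
    for _ in range(iters):
        x = x - exact_step(A, x, s) * s
        g_new = A @ x
        y = g_new - g
        s = g_new - s * (y @ g_new) / (y @ s)   # keeps (s, y) = 0
        g = g_new
    return x

def hs_metric(A, x0, iters, alpha=2.0):
    # Same method with the assumed metric update (14); H0 = I.
    x = x0.astype(float)
    H = np.eye(len(x0))
    g = A @ x
    s = H @ g
    for _ in range(iters):
        x = x - exact_step(A, x, s) * s
        g_new = A @ x
        y = g_new - g
        Hg = H @ g_new
        s_new = Hg - s * (y @ Hg) / (y @ s)     # HS-type direction in the metric
        Hy = H @ y
        H = H - (1 - 1 / alpha**2) * np.outer(Hy, Hy) / (y @ Hy)
        s, g = s_new, g_new
    return x

A = np.diag([1.0, 2.0, 3.0, 4.0])
x0 = np.ones(4)
print(np.allclose(hs(A, x0, 3), hs_metric(A, x0, 3), atol=1e-8))
```

On a quadratic, the orthogonality of the gradients makes H act as the identity on each new gradient (Theorem 5), so the two trajectories coincide step by step.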
The level surface topology of the function being minimized may be similar in many ways to the topology of the quadratic function in a neighborhood significantly larger than the neighborhood in which the function and its quadratic representation are similar. In such a case, it can be assumed that algorithms (13)–(14) and (42)–(44), which do not use approximations of the Hessians, will be more effective than quasi-Newton methods.
6. Numerical Experiment
To identify the potential of transformation (14) for solving problems of minimizing smooth functions and to test the effectiveness of the presented algorithms (13)–(14) and (42)–(44), a numerical study was conducted. The methods are compared with the quasi-Newton BFGS method.
All methods used in the study utilized the one-dimensional search described in [35], where the gradient and the function value are used as information for organizing the search. This is particularly advantageous when the costs of computing the gradient and the function are comparable.
In the following tables we denote:
- (1) BFGS—the quasi-Newton BFGS method;
- (2) HY_g—algorithm (13)–(14);
- (3) HY_XS—algorithm (42)–(44).
When solving an ill-conditioned minimization problem with high accuracy, the quasi-Newton method BFGS is typically used, or, if possible, Newton’s method. For this reason, we chose the BFGS method as the benchmark for comparison with the methods under study.
The set of test functions includes a quadratic function. Since we know the function’s condition number, we can estimate its complexity for minimization. Also, based on the results of minimizing the quadratic function, we obtain information about the method’s behavior in a certain neighborhood of the current minimization point of the real function, where its quadratic representation is valid.
The tests include functions with both linear and curvilinear ravines. With curvilinear ravines, the Hessian eigenvectors change as we move toward the minimum, which leads to obsolescence of the metric matrices and a decrease in the convergence rate. Further successful progress requires reconfiguring the method’s parameters. On such functions we will observe the behavior of the method with the variability of the quadratic representation of the function as we move towards the minimum.
We also consider a function whose level surface topology matches that of the quadratic function. In this case, it is of interest to compare the algorithms (13)–(14) and (42)–(44) with quasi-Newton methods, which rely less on the properties of the Hessian matrices.
The final test problem reflects the variability of scaling across variables as we move towards the extremum. In this case, there is an active change in the level surface topology due to changes in the scales along the coordinate axes.
In all methods, the function and gradient were calculated simultaneously. The tables show the number of iterations and the total number of function and gradient calculations for each method. The problem dimension in each experiment varies from 100 to 1000. The stopping criterion is specified for each problem below.
6.1. Quadratic Function
We use the following quadratic function:
The eigenvalues ai of this function lie within the limits λmin = 1 and λmax = amax. The starting point is x0 = (100, 100, …, 100). The stopping criterion was .
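As an assumed concrete form (the paper's exact formula is not reproduced here), a diagonal quadratic with coefficients spread between 1 and amax matches the stated eigenvalue limits; its condition number is then exactly amax:

```python
import numpy as np

# Assumed form: f(x) = (1/2) * sum_i a_i * x_i^2, with coefficients a_i
# log-spaced between 1 and a_max, consistent with lambda_min = 1 and
# lambda_max = a_max.
n, a_max = 100, 1e4
a = np.logspace(0.0, np.log10(a_max), n)   # Hessian eigenvalues, 1 .. a_max
x0 = np.full(n, 100.0)                     # stated starting point

cond = a.max() / a.min()                   # condition number of the Hessian
f0 = 0.5 * a @ (x0 * x0)                   # initial function value
print(cond, f0)
```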
Table 1 and Table 2 show the results of function minimization for different degrees of conditionality, amax = 10^4 and amax = 10^8. Best results are given in bold.
For the function with a low degree of conditionality, the BFGS quasi-Newton method does not perform enough iterations to construct a suitable metric matrix. Perhaps for this reason, the HY_XS algorithm is equivalent in performance to BFGS. On this test, the HY_XS method outperforms the HY_g method.
In Table 2, the problem conditionality is higher, and the number of iterations required to solve the problem is commensurate with the problem dimension. BFGS manages to construct the metric matrix and, as a result, achieves a high convergence rate in the final iterations and outperforms the other algorithms. Here, HY_XS again outperforms HY_g.
This is an example of an ill-conditioned problem with a fixed Hessian, where the quasi-Newton method (BFGS) is significantly more efficient than other methods. Based on the results of this example, we can conclude that if the matrix of second derivatives in a real-world problem is ill-conditioned and does not change significantly over a sufficiently wide range, then the quasi-Newton method will outperform the HY_g and HY_XS algorithms.
6.2. Function with Multidimensional Ellipsoidal Ravine
The following function has a multidimensional ellipsoidal ravine. Minimization occurs when moving along a curvilinear ravine to the minimum point.
The stopping criterion was .
Table 3 and Table 4 demonstrate the results of minimizing function fE. In Table 3, the starting point is x01 = (−1, 0.1, …, 0.1); in Table 4, it is x02 = (−1, 2, 3, …, n). Best results are given in bold.
From the first starting point, x01, the BFGS and HY_XS methods reach the minimum almost immediately, even though the conditioning level increases as the minimum is approached. Therefore, the BFGS and HY_XS algorithms are more efficient than the HY_g method.
At the minimum point, the function is degenerate. Therefore, as the degree of conditioning increases while the minimum is approached, the HY_g method, which does not use conjugate directions, turns out to be the worst performer.
From the second starting point, x02, the methods initially enter the ravine far from the minimum point and move along the bottom of the curvilinear ravine. The elongation of the isosurfaces and the directions of elongation change along the curvilinear ravine, preventing the BFGS method from adjusting the metric matrix quickly and effectively. Here, the HY_XS and HY_g algorithms have the advantage. Moreover, the conjugacy of the descent directions in the HY_XS algorithm allows a solution to be obtained much faster in the neighborhood of the degenerate minimum point.
6.3. Function with Multidimensional Ellipsoidal Ravine and Non-Degenerate Minimum Point
The next function also has a multi-dimensional ellipsoidal ravine.
The starting point is x0 = (−1, 2, 3, …, n). The stopping criterion is . The function has an additional quadratic term, so it ceases to be degenerate at the minimum point. Due to this, gradient methods are able to find the minimum of function fEX with higher accuracy than for function fE.
Table 5 shows the results of minimizing function fEX. Best results are given in bold.
Despite the minimum point being non-degenerate, BFGS performs worse due to its slow progress along the ravine bottom. The HY_g algorithm spends significant time in the neighborhood of the minimum before reaching it. In the HY_XS algorithm, the minimization stage in the minimum region is completed more quickly thanks to conjugate directions. Here, HY_XS, which combines more efficient progress along the bottom of a curvilinear ravine with the conjugacy of the descent vectors, proves more effective than the other algorithms.
6.4. Non-Quadratic Function
The level surfaces of the following function are topologically similar to those of a quadratic function.
The starting point is . The stopping criterion is . This is equivalent to reducing the quadratic term to a value smaller than 10^−5.
Table 6 demonstrates the results of minimizing function fQ^2. Best results are given in bold.
Here, a low convergence rate of the BFGS method is observed. As the method approaches the minimum, the Hessian elements tend to zero; consequently, the elements of the inverse matrix grow. In Ref. [17], it was noted that as the matrix of second derivatives in the surveyed space grows, the approximated matrices in quasi-Newton methods also grow. This complicates the exit into the unexplored subspace, which slows the convergence rate. In the HY_g and HY_XS algorithms, on the contrary, the possibility of entering the unexplored subspace always increases, which explains their advantage on this function. The results for the HY_XS algorithm confirm the benefit of using the metric in the conjugate gradient method.
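The vanishing of the second derivatives near the minimum can be checked numerically. Assuming, for illustration, a function of the form g(x) = q(x)^2 with q a convex quadratic, both terms of the Hessian H_g(x) = 2∇q(x)∇q(x)ᵀ + 2q(x)H_q go to zero at the minimizer:

```python
import numpy as np

a = np.array([1.0, 100.0])            # assumed diagonal quadratic q(x) = (1/2) a.x^2

def hess_g(x):
    """Exact Hessian of g(x) = q(x)^2 for the diagonal quadratic q above."""
    q = 0.5 * a @ (x * x)
    gq = a * x
    return 2.0 * np.outer(gq, gq) + 2.0 * q * np.diag(a)

# Largest Hessian entry shrinks like t^2 as x = t*(1,1) approaches the minimizer 0.
for t in [1.0, 1e-2, 1e-4]:
    print(t, np.abs(hess_g(t * np.ones(2))).max())
```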
6.5. Function with Scaling by Variables
This function uses additional variables ci to change the scales ai of each variable.
Near the extremum, this function takes the form
Far from the extremum, we obtain a function in which the coefficients are used in reverse order:
Changes in the scales of the coefficients according to the presented limiting variants of the function can be represented by the following transition:
It follows that the first coefficient decreases by a factor of , and the last coefficient increases by a factor of . The starting point is . The stopping criterion is .
Table 7 demonstrates the results of minimizing function fabc. Best results are given in bold.
The BFGS method performs the worst on this function, owing to the high variability of the Hessian. Here, the space-scanning strategy that excludes previously explored subspaces, used in the HY_g and HY_XS algorithms, proves suitable. The use of vector conjugacy in the HY_XS algorithm has little effect on its performance compared to the HY_g method.
It should be noted that the metrics in QNM and in RSM have different meanings. There are two obstacles to good convergence in QNM: the first is a rapid change in the directions of the Hessian eigenvectors, observed in the tests of Table 2 and Table 3; the second is a rapid change in the scales of the variables (Table 6 and Table 7). Another case of poor convergence is the tendency of the second derivatives toward zero as the minimum is approached (Table 5 illustrates this case).
Table 8 below summarizes the minimization results for all functions with n = 1000. Let us draw the main conclusions from these results.
The BFGS method significantly outperforms the HY_g and HY_XS algorithms on ill-conditioned problems in which the Hessian does not change significantly with position in the minimization space. This thesis is confirmed by the data for the function fQ with amax = 10^8. For such a function, BFGS manages to construct a suitable coordinate transformation, after which the algorithm's convergence rate increases significantly. In such situations, the HY_XS method has an advantage over the HY_g algorithm due to its conjugate gradient properties.
Function fE has an ellipsoidal ravine and a degenerate minimum point. From the first starting point, where the ravine is steeper than from the second, the main obstacle is the degeneracy of the minimum. Algorithms BFGS and HY_XS, which use the properties of a quadratic function, complete the minimization stage in the extremum region significantly faster than HY_g. As the degeneracy of the ravine increases, BFGS's progress along the ravine becomes slow due to the high variability of the Hessian. Algorithms HY_g and HY_XS are more successful here, and the conjugate gradient properties of HY_XS lead to a significant acceleration of the convergence rate. A similar situation arises when minimizing the function fEX.
The level surfaces of function fQ^2 are topologically equivalent to the level surfaces of a quadratic function. In this situation, based on an analysis of the data in Table 6, we can conclude that the metric transformation in algorithms HY_g and HY_XS is more effective than the metric transformation of the quasi-Newton method.
The computational overhead of the HY_g method is approximately half that of the quasi-Newton method, and the overhead of the HY_XS method is, on average, 3.5 times lower. Overall, the computational experiment confirms the acceleration properties of algorithms HY_g and HY_XS, obtained through the metric transformation (14), and their applicability alongside quasi-Newton methods to complex, ill-conditioned minimization problems.
7. Conclusions
The conditionality of the minimization problem determines the spread of the isosurface elongations in different directions, which in turn determines the complexity of the problem’s solution. In minimization practice, it is often possible to reduce the isosurface elongations through a linear coordinate transformation, thereby increasing the convergence rate of the gradient method used in the new coordinate system. For strongly convex functions with a Lipschitz gradient, such a coordinate transformation reduces the difference between the strong convexity and Lipschitz constants. If the function is twice differentiable, such a transformation reduces the spread of the Hessian’s eigenvalues.
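The effect of such a linear coordinate transformation can be sketched numerically: for a quadratic with Hessian A, the change of variables y = A^(1/2)x yields a unit Hessian, collapsing the eigenvalue spread. The matrix A below is an assumed example, not one of the paper's test problems.

```python
import numpy as np

# Build a random SPD matrix A with condition number 1e4.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A = Q @ np.diag([1.0, 10.0, 100.0, 1000.0, 10000.0]) @ Q.T

# Change of variables y = A^(1/2) x, i.e. the Hessian in the new
# coordinates is A^(-1/2) A A^(-1/2) = I.
w, V = np.linalg.eigh(A)
A_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
H_new = A_inv_sqrt @ A @ A_inv_sqrt

print(np.linalg.cond(A), np.linalg.cond(H_new))   # eigenvalue spread collapses
```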
The noted linear coordinate transformation eliminates the linear background that worsens the convergence of the gradient method. Newton's method, quasi-Newton methods, and subgradient methods with a change of the space metric all possess this background-eliminating property. In this work, we proved that algorithm HY_g, in which the gradient method acquires accelerating abilities through the metric transformation (14), has similar properties. The convergence rate estimates for algorithm HY_g and Newton's method are qualitatively of the same nature.
Taking into account the efficiency of transformation (14), we propose the conjugate gradient method HY_XS, also based on this transformation. The equivalence of this method to the conjugate gradient method on quadratic functions is proven. To evaluate the efficiency of the proposed method on test functions, we compare it with the quasi-Newton BFGS and the algorithm HY_g.
The test results demonstrate the efficiency of algorithm HY_XS. On almost all test problems, its convergence rate is significantly higher than that of the HY_g method. A comparison with the quasi-Newton method shows that method HY_XS is more efficient in the case of a high degree of variability of the Hessian matrices of the function being minimized.
A computational experiment shows that methods HY_g and HY_XS are effective in solving ill-conditioned problems with complex curvilinear ravines and unstable characteristics of the second derivatives. The main conclusion is that these methods, along with quasi-Newton methods, are applicable to solving problems of minimizing smooth functions with a high degree of conditioning.
The present study was conducted on smooth functions. It would be of interest to extend these metric transformations to increase the convergence rate on non-smooth functions as well. Speculatively, the metric transformation can be expected to equalize the elongations of the level surfaces in different directions in the non-smooth case too. However, we leave a detailed study of this issue for future research.