1. Introduction
We consider the problem of minimizing a convex differentiable function f(x), x ∈ Rn, where Rn is a finite-dimensional Euclidean space. When the function is highly degenerate, Newton-type minimization methods are required, for example, modifications of Newton’s method or quasi-Newton methods.
Although a stable quadratic representation of the function exists in a neighborhood of the minimum, most iterations of a minimization method take place outside the extremum region. It therefore seems relevant to study the accelerating properties of methods that change the space metric under conditions where the quadratic properties of the function are unstable.
Numerous studies on the convergence rate of Newton and quasi-Newton methods in the extremum region have been conducted; some of them are given in [1,2,3,4,5,6,7,8]. The results obtained in [9,10] concern the convergence rate of quasi-Newton minimization methods under the assumption that the method operates in the extremum region of the function. The authors of [11] aimed at accelerating the symmetric rank-1 quasi-Newton method with Nesterov’s gradient. The convergence rate of the incremental quasi-Newton method was investigated in [12,13]. Large-scale optimization through sampled versions of quasi-Newton methods was considered in [14,15]. The convergence rates of randomized and greedy variants of Newtonian and quasi-Newton methods were presented in [16,17,18,19,20,21,22,23,24].
As the object of minimization, we use strongly convex functions with a Lipschitz gradient [25]. When second derivatives exist, these constants bound the spread of the Hessian eigenvalues in the minimization region [25]. The ratio ρ/L ≤ 1 of the strong convexity constant ρ and the Lipschitz constant L determines the convergence rate of gradient minimization methods, with indicator q ≈ 1 − ρ/L of the approach to the extremum in terms of function value [26].
By the presence of a removable linear background, we mean the existence of a linear coordinate transformation V ∈ Rn×n that significantly increases the ratio of the constants in the new coordinate system: ρV/LV ≫ ρ/L. The advantages of the gradient method in the new coordinate system, with indicator q ≈ 1 − ρV/LV, are obvious. However, this estimate is not attainable directly, since the transformation V is not known.
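The effect of such a transformation can be checked numerically. The sketch below uses an illustrative diagonal quadratic and the (here known) choice V = A^{−1/2}; these are assumptions for illustration only, since in general V is unknown:

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x with an ill-conditioned Hessian A.
# rho and L are the extreme Hessian eigenvalues, so rho/L is tiny.
A = np.diag([1.0, 100.0, 10000.0])
eig = np.linalg.eigvalsh(A)
ratio = eig[0] / eig[-1]                       # rho / L = 1e-4

# Change of variables x = V y with V = A^{-1/2}: the Hessian of f(Vy)
# is V^T A V = I, so rho_V / L_V = 1 and the linear background is removed.
V = np.diag(1.0 / np.sqrt(np.diag(A)))
eig_v = np.linalg.eigvalsh(V.T @ A @ V)
ratio_v = eig_v[0] / eig_v[-1]
print(ratio, ratio_v)                          # ratio ~ 1e-4, ratio_v ~ 1
```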
This research continues previous studies [27,28] and aims to study the ability of Newton’s method and of the relaxation subgradient method (RSM) with optimization of the parameters of rank-two correction of metric matrices [27] to eliminate the linear background that worsens convergence, under the assumption that a transformation V with the properties noted above exists. Similar studies for quasi-Newton methods were carried out in [29].
Newton’s method is invariant with respect to linear coordinate transformations, which allows one to obtain an estimate of its convergence rate with indicator q ≈ 1 − ρV²/LV². This makes it possible to conclude that Newton’s method can exclude from the function being minimized a linear background that worsens convergence and that can be eliminated by a linear transformation of coordinates. In what follows, this estimate serves as a standard, and the ability of a method to exclude the linear background, as Newton’s method does, will be called its Newtonian property. The main goal of this work is to substantiate the presence of the Newtonian property in the RSM with a change in the space metric [27]. As shown in [29], the noted Newtonian property is inherent in quasi-Newton methods.
There are a number of directions for constructing non-smooth optimization methods, some of which are given in [25,30,31]. The works [32,33,34] considered an approach to creating smooth approximations of non-smooth functions. Methods of this class are applicable to a wide range of problems. A number of effective approaches in non-smooth optimization arose from the creation of the first subgradient methods with space dilation [35,36], in the class of minimization methods relaxing both in function value and in distance to the extremum [25,37,38].
The first RSMs were proposed in [39,40,41]. In [36], an effective RSM with space dilation in the direction of the subgradient difference (RSMSD) was developed. Subsequent work on creating effective RSMs is associated with identifying the origin of RSMSD and its theoretical justification [42,43]. Formalization of the model of subgradient sets and the use of ideas and algorithms from machine learning [44] made it possible to identify the principles of organizing RSMs with space dilation [43] and to obtain a theoretical basis for their creation. It turned out that the problem of finding the descent direction in an RSM can be reduced to the problem of solving a system of inequalities on subgradient sets, formulated mathematically as minimizing a quality functional. In this case, the convergence rate of the minimization method is determined by the properties of the learning algorithm.
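As a toy illustration of this reduction, the sketch below solves the system ⟨s, g⟩ > 0 over a finite set G of subgradients with a perceptron-style learning rule. This rule is a much simpler stand-in for the space-dilation learning algorithms discussed here; G is an arbitrary separable example:

```python
import numpy as np

# Find s with <s, g> > 0 for every g in a finite "subgradient set" G.
# The perceptron rule moves s toward any violated inequality; for a
# separable G it terminates after finitely many corrections.
G = np.array([[1.0, 0.2], [0.8, -0.3], [0.9, 0.1]])

s = np.zeros(2)
for _ in range(100):
    violated = [g for g in G if s @ g <= 0]
    if not violated:
        break                                  # s solves the system
    s = s + violated[0]                        # perceptron correction
assert all(s @ g > 0 for g in G)
print(s)
```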
The principle of RSM organization does not rely on second derivatives of the function. The method under study is similar in structure to quasi-Newton methods, and its formulas for transforming metric matrices are similar in structure to those of the quasi-Newton DFP method. The purpose of transforming metric matrices in an RSM is to find a metric matrix that maps subgradients into a direction forming an acute angle with all subgradients in a neighborhood of the current minimum approximation. Using this direction enables us to leave this neighborhood.
The studied RSM with optimization of the parameters of rank-two metric matrix correction [27] results from improving the RSM of [43]. In the RSM of [43], the problem of finding the descent direction is reduced to solving a system of inequalities so as to develop a descent direction that forms an acute angle with the set of subgradients of a certain neighborhood of the current minimum. In this case, the descent direction is found as in quasi-Newton methods, by multiplying a matrix by the subgradient. Compared with the algorithm of [43], a faster algorithm for solving systems of inequalities was proposed in [27], which was confirmed by a computational experiment in [27] for the RSM built on this basis.
In this work, a qualitative analysis of the formulas for choosing the algorithm parameters from [27] is carried out, and on this basis a new method for finding the matrix transformation parameters is proposed. In contrast to the RSMs of [27,42], where the convergence of the algorithm is justified under strict restrictions on the transformation parameters of the metric matrices, here estimates of the convergence rate on smooth functions are obtained for a wide range of matrix transformation parameters. Therefore, one can tune the method to a certain class of problems by selecting the parameters of the metric matrix transformation.
For the studied RSM, we show that the method is invariant under linear coordinate transformations, and we obtain an estimate of its convergence rate on strongly convex functions with a Lipschitz gradient. Newton’s method’s ability to eliminate the high degree of conditioning caused by the linear background is also inherent in the subgradient method under study. At the same time, the convergence rate estimates for Newton’s method and for the method under study reflect the characteristics of the ill-conditioned problem in qualitatively similar ways.
To solve both smooth and non-smooth problems, universal algorithms have been developed and implemented as a practical realization of an idealized version of the method. Special test functions have been developed to detect the Newtonian property of the proposed methods. The first of them simulates a random change in the properties of the function. In another, a targeted change is made in the elongation of the function’s level lines along the coordinate axes as the extremum is approached. In one of the functions, the axes of level-line elongation change due to movement along an ellipsoidal ravine.
In the computational experiment, the quasi-Newton BFGS method and the investigated universal subgradient methods are compared on the proposed test functions. The results indicate the effectiveness of the developed methods in minimizing smooth, highly ill-conditioned functions and their ability to exclude a linear background that worsens convergence. Depending on the type of function, different methods dominate, which allows us to conclude that the subgradient method is applicable alongside quasi-Newton methods for minimizing smooth functions with a high degree of conditioning.
The rest of the paper is organized as follows. In Section 2, the accelerating properties of Newton’s method under conditions of instability of the second derivatives of the function are considered. In Section 3, a subgradient method is presented that solves the problem of forming the descent direction. The convergence rate of the subgradient method on strongly convex functions with Lipschitz gradient is discussed in Section 4. Features of the implementation of the subgradient method are presented in Section 5. The results of a numerical study on smooth functions are shown in Section 6. Section 7 concludes the work.
2. Accelerating Properties of Newton’s Method under Conditions of Instability of Second Derivatives
Denote . For non-smooth functions, we will denote a vector from the subgradient set . Due to the coincidence of the gradient and subgradient on smooth functions, we will also use this notation for smooth functions .
Condition 1. We will assume that the function being minimized f(x), x ∈ Rn, is differentiable and strongly convex on Rn, i.e., there exists ρ > 0 such that the inequality
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − α(1 − α)(ρ/2)‖x − y‖²
holds for all x, y ∈ Rn and α ∈ [0, 1], and the gradient satisfies the Lipschitz condition
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, L > 0. (1)
Functions which fulfill Condition 1 satisfy the relations [25]:
where x* is the minimum point and f* = f(x*) is the function value at the minimum point.
The iteration of the gradient-consistent method with exact one-dimensional descent has the form:
xk+1 = xk − γksk, k = 0, 1, 2, …, (5)
γk = arg minγ≥0 f(xk − γsk), (6)
where the initial point is x0 and sk is a search direction.
Theorem 1. Let the function satisfy Condition 1. Then, the sequence of iterations j = 0, 1,…, k of the process (5), (6) is estimated as:
Proof of Theorem 1. We present the exact value of the function reduction indicator at iteration in the form:
Let us estimate the numerator and the denominator in (9). According to (2), for the denominator we obtain:
According to (6), fk+1 is the minimum of a one-dimensional function whose gradient is . Since this one-dimensional function also satisfies Condition 1, estimating the numerator in (9) with the use of inequality (4), we obtain:
Using (9) and (10), we obtain (8):
□
Based on Theorem 1, the convergence rate indicator of the gradient method (5), (6) with the choice of descent direction , according to (8) and (11), has the form:
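For a quadratic objective this indicator can be observed directly. The sketch below (an illustrative diagonal Hessian, not one of the paper’s test functions) runs scheme (5), (6) with the steepest descent direction and checks that each iteration reduces f − f* at least by the factor q = 1 − ρ/L:

```python
import numpy as np

# f(x) = 0.5 x^T A x, minimum f* = 0 at x* = 0; rho = 1, L = 100.
A = np.diag([1.0, 10.0, 100.0])
q = 1.0 - 1.0 / 100.0                          # q = 1 - rho/L

f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, 1.0, 1.0])
for _ in range(50):
    g = A @ x                                  # gradient = descent direction
    gamma = (g @ g) / (g @ A @ g)              # exact one-dimensional descent
    x_new = x - gamma * g
    assert f(x_new) <= q * f(x) + 1e-15        # per-iteration reduction <= q
    x = x_new
print(f(x))
```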
Let us consider an estimate of the convergence rate of Newton’s method under Condition 1 and the assumption of the existence of second derivatives of the function.
Theorem 2. Let the function be twice differentiable and satisfy Condition 1. Then, for a sequence of iterations j = 0, 1,…, k of process (5), (6) with the choice of Newton’s method direction
the following estimate holds:
Proof of Theorem 2. Under Condition 1, the Hessian satisfies the constraints [25]:
Denote , and is a symmetric matrix such that , .
To use Theorem 1, we estimate for direction (13) subject to constraints (15):
Using the last estimate in (8), we obtain estimate (14). □
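On a quadratic function the content of this estimate is easy to observe: the Newton direction with exact one-dimensional descent reaches the minimum in a single iteration, regardless of the eigenvalue spread (illustrative data below):

```python
import numpy as np

# Badly conditioned quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.diag([1.0, 1.0e4])
b = np.array([1.0, -2.0])
x = np.zeros(2)

g = A @ x - b                                  # gradient at x0
s = np.linalg.solve(A, g)                      # Newton direction (13)
gamma = (g @ s) / (s @ A @ s)                  # exact step; equals 1 here
x1 = x - gamma * s
print(np.linalg.norm(A @ x1 - b))              # gradient vanishes at x1
```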
Let function f(x) satisfy Condition 1. Define the transformation of variables:
where P ∈ Rn×n is a non-singular matrix. In the new coordinate system, the function to be minimized takes the form:
The resulting function also satisfies Condition 1, with strong convexity constant ρp and Lipschitz constant Lp.
Let V ∈ Rn×n be a non-singular matrix such that, for the strong convexity and Lipschitz constants of the functions with and P ∈ Rn×n, the inequality holds:
Transformation (18) subsequently plays the role of a selected coordinate system, the best in terms of the convergence rate of gradient methods. Since the gradient method, unlike Newton’s method, is not invariant under a linear coordinate transformation, we cannot use the strong convexity and Lipschitz constants of the preferred coordinate system (18).
Theorem 3. Let the function be twice differentiable and satisfy Condition 1. Then, for the sequence of iterations j = 0, 1,…, k of process (5), (6) with the choice of the Newton’s method direction (13), the following estimate holds:
corresponding to the selected coordinate system (18), which has property (19).
Proof of Theorem 3. The iteration of Newton’s method (5), (6), (13) with exact one-dimensional descent (6) has the form:
The characteristics of the functions and , taking into account (16) and (17), are related by:
After transferring process (21) to the new coordinate system, we obtain its coincidence with the method in the new coordinate system:
When the initial points of Newton’s method in the two coordinate systems are related by , then, according to (23), sequences of points related by and equal values of the functions are generated. Moreover, since the method with exact one-dimensional minimization (6) is considered, due to the extremum condition:
the equality will hold. Due to the invariance of Newton’s method with respect to the linear transformation of coordinates (16), when the initial conditions are related, Newton’s method generates identical sequences of function values in the two coordinate systems. Applying the estimate in the coordinate system , taking into account the results of Theorem 2, we obtain estimate (20). □
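The invariance used in the proof can be verified numerically: running Newton steps on f(x) and on fP(y) = f(Py) from related initial points y0 = P⁻¹x0 yields equal function values at every iteration. The quadratic f and the matrix P below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 50.0, 2500.0])               # Hessian of f
b = rng.normal(size=3)
P = rng.normal(size=(3, 3)) + 4.0 * np.eye(3)  # non-singular transform

f = lambda x: 0.5 * x @ A @ x - b @ x
Ap, bp = P.T @ A @ P, P.T @ b                  # data of f_P(y) = f(P y)

x = np.ones(3)
y = np.linalg.solve(P, x)                      # related initial points
for _ in range(3):
    x = x - np.linalg.solve(A, A @ x - b)      # Newton step on f
    y = y - np.linalg.solve(Ap, Ap @ y - bp)   # Newton step on f_P
    fp = 0.5 * y @ Ap @ y - bp @ y
    assert abs(f(x) - fp) <= 1e-8 * (1.0 + abs(f(x)))  # equal values
print(f(x))
```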
According to (19), the last estimate determines the advantages of Newton’s method over the gradient method in the case of:
Considering that, when solving practical problems, most iterations of the method often occur under conditions of significant Hessian variation (15), estimate (20), subject to condition (24), explains the advantages of Newton’s method. In this case, no additional restrictions on the second derivatives beyond smoothness are required.
3. Subgradient Minimization Method
Here, we describe the subgradient method of [27], which solves the problem of forming a descent direction such that one-dimensional minimization along it yields a new current minimum approximation outside a certain neighborhood of the current minimum. The appropriate direction is a vector consistent with all subgradients at points of a certain neighborhood of the current minimum approximation. In the case of smooth functions, the descent direction is matched with the set of gradients of the neighborhood obtained at iterations of the method.
In relaxation processes of the ε-subgradient type, successive approximations are constructed according to the formulas [39,40,41,43,45]:
The descent direction sk+1 is selected from a set , where is the ε-subgradient set at a point and is a set of feasible directions. Denote the subgradient set at a point x by . If the set S(G) is not empty, then, according to its definition, any vector s ∈ S(G) is a solution to the system of inequalities:
that is, it specifies the normal of a plane separating the origin from the set G. One of the solutions to (26) is the vector η(G) of minimal length from G. For example, in the ε-steepest descent method, [41]. Due to the absence of an explicit description of the ε-subgradient set, the vector s satisfying condition (26) is used in (25) as the descent direction, and the set G here is the hull of the subgradients obtained on the descent trajectory [39,40,41].
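For a finite set G, the minimal-length vector η(G) can be approximated by a short Frank–Wolfe loop over convex-hull weights. This is an illustrative computation only (the paper’s own algorithms work differently); G below is an arbitrary separable example:

```python
import numpy as np

# G: rows are subgradients; eta(G) is the minimum-norm point of conv(G).
G = np.array([[2.0, 1.0], [1.0, -1.0], [3.0, 0.5]])

w = np.full(len(G), 1.0 / len(G))              # convex-hull weights
for k in range(2000):
    eta = w @ G                                # current hull point
    i = int(np.argmin(G @ eta))                # Frank-Wolfe vertex
    step = 2.0 / (k + 2.0)
    w = (1.0 - step) * w
    w[i] += step                               # move weight to vertex i
eta = w @ G
assert all(G @ eta > 0)                        # eta solves inequalities (26)
print(eta, np.linalg.norm(eta))
```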
On smooth functions, the elements of the set G are the gradients of the current minimum neighborhood. Figure 1 shows the set G with the designations of its elements, which will be given below.
Denote by ηG a vector of minimum length from the set G, , , , , , . For a certain set G, we will also use the noted characteristics indicating the set as an argument, for example, η(G), r(G).
We will assume that the following assumption holds for the set G.
Assumption 1. The set G is convex, closed, and bounded ( < ∞), and satisfies the separability condition, i.e., .
Let us introduce the relation θ(M) and its inverse function m(θ). Thus, . For some bounded θ, define the relations:
The vector s* is a solution to the system of inequalities (26). The parameters ρ and RS characterize the thickness of the set G in the direction μ. The quantity RS, by its definition, determines the thickness of the set G and significantly affects the convergence rate of learning algorithms with space dilation. When the thickness of the set is zero, i.e., , we have the case of a flat set.
The quantity determines the complexity of solving system (26). The transformation parameters of the metric matrices of the subgradient method are found according to expression (28).
In this work, two versions of the subgradient method are presented. The first involves an exact one-dimensional search; for this version, estimates of the convergence rate on smooth functions will be obtained. The second version is intended for practical implementation, where a rough one-dimensional search is used. To integrate the method for solving systems of inequalities into the minimization algorithms correctly and to comply with the restrictions imposed on its operations, we outline it below. In the subgradient methods under study, the following Algorithm 1 for solving the system of inequalities (26) is used to estimate the parameters of the separating plane.
Algorithm 1 [27]. Algorithm for solving a system of inequalities
1. Assume k = 0, H0 = I, q ≥ 1. Set θA such that:
and .
2. Set and , which is the current approximation of the solution to the system of inequalities . Find a vector such that:
If no such vector exists, then the solution has been found; stop the algorithm.
3. Compute the vectors:
Here, the vector pk is found from the orthogonality condition on the vectors and :
Compute , where:
Find the parameter:
Find the parameters according to (28):
We obtain a new approximation of the metric matrix , where:
4. Assign k = k + 1. Go to step 2.
Constraint (29) on the set G, when Algorithm 1 is applied within the minimization method, imposes restrictions on the subgradient sets of the non-smooth minimization problem. In the case of smooth minimization problems, one can choose the parameter satisfying (29) arbitrarily; it is selected experimentally in order to optimize the algorithm’s efficiency.
Denote . It was proven in [27] that Algorithm 1 converges in a finite number of iterations on a set G satisfying Assumption 1, for algorithm parameters V0 and θA for which the restrictions and (29) are satisfied. In this case, the number of iterations does not exceed k0, the minimal integer from the range of values of k satisfying the inequality:
From the above estimate, we conclude that larger values of correspond to a smaller number of iterations k0, which means that the desired direction will be found in fewer iterations. The last estimate is based on the worst-case scenario, when all . In fact, according to the results of a computational experiment in [27], a minimization algorithm based on Algorithm 1 with parameter (35) is more effective than one with fixed parameters .
The version of the minimization algorithm presented in this section uses exact one-dimensional descent and is intended for estimating its convergence rate on smooth functions. A practically implementable version without exact one-dimensional descent will be presented in the next section. Here, as in the practically implemented version, there are no resets of the parameters of the algorithm for solving systems of inequalities in the form of setting Hk = I; such resets are used in the theoretical version of the algorithm from [27], where they are necessary for justifying the convergence of the minimization algorithm on non-smooth functions. In the version used in practice for minimizing both smooth and non-smooth functions, this reset is absent, but there are minor changes to the diagonal elements of the matrix , excluding its poor conditioning and scaling, and , excluding excessive reduction of its elements. Therefore, the described version of the algorithm is closest to the implemented versions designed to minimize smooth and non-smooth functions. As before, we use the notation for both the gradient and the subgradient at a point xk.
At step 2 of Algorithm 1, the vector is given arbitrarily, and a vector with property (30) is found in the set. In the minimization algorithm, we set at the point of the current minimum approximation, determine the descent direction , and find the new minimum approximation .
In the case of exact one-dimensional minimization, the equality holds for the gradient at the point . Therefore, in Algorithm 1 built into the minimization algorithm, we can take the vectors and as the new pair of vectors in (30); due to exact one-dimensional descent, an inequality similar to (30) will be satisfied for them. Since the vector in Algorithm 1 is chosen arbitrarily, at the next iteration of the minimization algorithm the vector can be chosen. An idealized version of such a minimization algorithm is Algorithm 2, described below. Estimating the convergence rate of this algorithm is the goal of our work.
In the case of inexact one-dimensional descent, it is assumed that the one-dimensional minimum has been localized, that is, a point has been obtained such that the subgradient at the extreme point satisfies inequality (30) (Figure 2). The subgradient will be used for the transformation of the matrix . Figure 2 shows the point with the smallest function value found, which at the next iteration becomes the new current minimum point with the minimization direction . A presentation of the practical version of the algorithm and its numerical analysis will be given in subsequent sections.
The following minimization algorithm assumes exact one-dimensional descent. An infinite sequence of points is constructed until the gradient becomes zero. For this version of the algorithm, an estimate of the convergence rate is obtained.
Algorithm 2. Minimization algorithm
1. Assume k = 0, H0 = I, q ≥ 1. Set θA such that:
and . Compute . If , then the minimum point is found; stop the algorithm.
2. Find a new minimum approximation:
3. Compute the gradient based on the condition:
If , then is the minimum point; stop the algorithm.
4. Compute the vectors , :
Here, the vector is found from (32) based on the orthogonality of and . Compute according to formula (33), where:
Find according to (34) and , as in (35). We obtain a new approximation of the metric matrix .
5. Assign k = k + 1. Go to step 2.
Here, the built-in method for solving the system of inequalities (26) consists of the transformations carried out at step 4 under condition (39). The solution to system (26) at an iteration is the vector , which is used as the new descent direction.
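To convey the structure of steps 2–4, here is a schematic sketch: descent along s = Hg with exact one-dimensional search, followed by a rank-two correction of the metric matrix driven by the (sub)gradient difference. The DFP-style update below is only a stand-in for the paper’s own transformation formulas (whose parameters are chosen differently), and the quadratic test problem is an illustrative assumption:

```python
import numpy as np

A = np.diag([1.0, 100.0])                      # quadratic test problem
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

x, H = np.zeros(2), np.eye(2)
g = grad(x)
for _ in range(10):
    s = H @ g                                  # descent direction
    gamma = (g @ s) / (s @ A @ s)              # exact one-dimensional search
    x = x - gamma * s
    g_new = grad(x)
    if np.linalg.norm(g_new) < 1e-10:
        break
    dx, y = -gamma * s, g_new - g              # step and gradient difference
    H = H + np.outer(dx, dx) / (dx @ y) \
          - np.outer(H @ y, H @ y) / (y @ H @ y)  # DFP-like rank-two update
    g = g_new
print(np.linalg.norm(grad(x)))                 # near zero after a few steps
```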
In [27], the optimization of the choice of the parameters is related to the characteristics of the subgradient sets of the function. It is assumed in [27] that one can choose a parameter MA corresponding to the real characteristic Mε of the union of subgradient sets of a certain ε-neighborhood of the current minimum point, which satisfies the relation:
In the case of smooth functions, since the subgradient coincides with the gradient and the subgradient set contains a single element, the gradient, it is easy to satisfy condition (42): for small ε, the characteristics of the subgradient set:
change insignificantly because the gradient satisfies the Lipschitz condition. Therefore, for small ε:
which makes it possible to consider the algorithm for sufficiently large values of ε-neighborhoods that satisfy condition (42).
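The smallness of this spread is easy to check numerically: for a gradient with Lipschitz constant L, the gradients over an ε-ball differ by at most 2Lε (illustrative quadratic below):

```python
import numpy as np

A = np.diag([1.0, 100.0])                      # grad f(x) = A x, L = 100
L, eps = 100.0, 1e-3
x0 = np.array([1.0, 1.0])

rng = np.random.default_rng(1)
pts = x0 + (eps / np.sqrt(2)) * rng.uniform(-1.0, 1.0, size=(100, 2))
grads = pts @ A                                # gradients in the eps-ball
diam = max(np.linalg.norm(p - r) for p in grads for r in grads)
assert diam <= 2.0 * L * eps                   # Lipschitz bound on the spread
print(diam)
```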
The smaller , the more efficiently the algorithm for solving systems of inequalities works [27]. But for small values of , according to (35), the values of will be very large. This leads to large changes in the matrix H (36), which negatively affects the efficiency of the minimization method because of the difficulties arising from the degeneration of the metric matrices. Therefore, in the minimization algorithm, the smallest value of has to be bounded and made consistent with the accuracy of the one-dimensional search. To do this, a constraint on the parameter is introduced into (34):
As a result, we obtain restrictions on the parameters of the matrix transformation in (36):
The relation , subject to the restrictions (43) on , is monotonically increasing on the segment . Hence the constraints:
From here and (44), (45), the following inequalities follow:
For the parameters , according to (44), (45), and (46), the following inequalities hold:
The presented algorithm for solving systems of inequalities, and the minimization algorithm based on it, also converge for fixed parameters [27]. As the computational experiment shows, the convergence rate of the method for solving systems of inequalities, and of the minimization method based on it [27], is significantly higher if the parameters are adjusted to the current situation according to (35).
4. On the Convergence Rate of the Subgradient Method on Strongly Convex Functions with Lipschitz Gradient
As earlier, x* is a minimum point of the function f(x), f* = f(x*), fk = f(xk), and for a differentiable function satisfying Condition 1. Denote , Sp(A) is the trace of matrix A, and det A is the determinant of matrix A. For an arbitrary matrix A > 0, we denote by A1/2 a symmetric matrix for which A1/2 > 0 and A1/2A1/2 = A. For the characteristics of the matrices , we use the result from [27], valid for arbitrary parameters satisfying condition (48).
Lemma 1 [27]. Let , the matrix be obtained as a result of the transformation , where the parameters satisfy condition (48), and let equality (38) be satisfied for arbitrary vectors . Then, and:
The following theorem shows that the movement resulting from iterations (5), (6) leads to a decrease in the function.
Theorem 4. Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2,… given by the process (5), (6), the following estimate holds:
where .
Proof of Theorem 4. For a strongly convex function, inequality (2) is satisfied. Taking this inequality into account, we obtain:
Inequality (3) is also valid for the one-dimensional function:
From here, taking into account the exact one-dimensional search, inequality (3), and Lipschitz condition (1), the estimate follows:
Transform (54) using the last relation and the inequality .
Recurrent use of the last inequality leads to estimate (53). □
Let us estimate the convergence rate of Algorithm 2 under more general restrictions on the parameters .
This implies the constraint:
The following theorem substantiates the linear convergence rate of Algorithm 2 under constraints (55).
Theorem 5. Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2,… given by Algorithm 2 with bounded initial matrix H0: (1) with an arbitrary parameter satisfying (55), the following estimate holds:
(2) with the parameters specified in Algorithm 2, the estimate is:
Proof of Theorem 5. Based on (50), we obtain (51). Transforming (51) taking into account , we obtain an estimate for the trace of the matrices Ak:
Due to exact one-dimensional descent (38), the following condition is satisfied:
which, together with the positive definiteness of the matrices, proves the inequality:
Hence, taking into account , where is the maximum eigenvalue of the matrix , we obtain:
Based on the last estimate, inequality (59) is transformed to the form:
Based on the relationship between the arithmetic and geometric means of the eigenvalues of a matrix A > 0, we have . From this and (60), (52), in the case of restrictions (55) on the parameters , we obtain:
and in the case of choosing the parameters as in Algorithm 2, taking into account (47), we obtain the estimate:
Based on the ratio , the last inequalities transform to the form:
Due to condition (55), . Taking logarithms of (61) and (62) and taking into account the last inequalities, we find:
This implies:
which, together with estimate (53) of Theorem 4, proves (57) and (58). □
Estimating the convergence rate of Algorithm 2 under the more general constraints (55) on the parameters makes it possible to use parameters different from those generated in Algorithm 2. The paper presents a computational experiment in which the parameters of Algorithm 2 were changed as follows:
Here, the parameter c was set as follows: c = {0.2; 0.1; 0.05}. The computational experiment revealed that in ill-conditioned problems such changes increase the efficiency of the minimization method, including in non-smooth optimization problems. For non-smooth problems, there is no theoretical justification of convergence under transformation (63).
The obtained estimates do not explain the high convergence rate of the method on, for example, quadratic functions. To justify the accelerating properties of the method, we need to show its invariance with respect to linear transformations of coordinates and then use estimate (58) in the coordinate system with the maximal ratio ρ/L. Such a possibility exists, for example, for quadratic functions, where this ratio equals 1.
Let us establish a relation between the characteristics of Algorithm 2 when it is used to minimize the functions and from (17).
Theorem 6. Let the initial conditions of Algorithm 2, used to minimize the functions and defined in (17), be related by the equalities:
Then, the characteristics of these processes are related by the relations:
Proof of Theorem 6. For the derivatives of the functions and , the relation holds. From this and assumption (64), (65) follows for k = 0. Assume that equalities (65) are satisfied for all k = 0, 1,…, i. Let us show that they hold for k = i + 1. From (38) with k = i, after multiplication by P on the left and taking into account the proven equalities (65), we obtain:
Hence, by the definition of the function fp, at the stage of one-dimensional minimization (38), the equality is satisfied. Therefore, the right side of (66) is the implementation of step (38) in the new coordinate system. Hence:
Multiplying (36) with the current indices on the left by P and on the right by PT, taking into account (67), we obtain:
where the right side is the implementation of formula (36) in the new coordinate system. The denominators of the last formula establish the relationship:
Using the last equalities and formulas (41), (33) of Algorithm 2, we obtain:
Finally, we obtain . Consequently, equalities (65) are also valid for k = i + 1. Continuing the induction, we obtain the proof of Theorem 6. □
For the function , denote the strong convexity constant by ρp and the Lipschitz constant by Lp. Introduce the function K(P) = ρp/Lp. Denote by V a coordinate transformation matrix such that K(V) ≥ K(P) for an arbitrary non-singular matrix P.
Theorem 7. Let the function f(x) satisfy Condition 1. Then, for the sequence {fk}, k = 0, 1, 2,… given by Algorithm 2 with the initial matrix H0 bounded according to (56):
(1) with an arbitrary parameter satisfying (55), the following estimate holds:
(2) with the parameters specified in Algorithm 2, the estimate is:
where m0 and M0 are the minimum and maximum eigenvalues of the matrix in the selected coordinate system (18) having property (19).
Proof of Theorem 7. According to the results of Theorem 6, we can choose an arbitrary coordinate system to estimate the convergence rate of the minimization process of Algorithm 2. Therefore, we use estimates (57) and (58) in the coordinate system with the matrix P = V and obtain estimates (68) and (69). □
The first term in square brackets characterizes the constant in estimating the convergence rate of the method, and the second term characterizes the costs of setting up the metric matrix.
For the steepest descent method (scheme (5), (6) with (11)) on functions satisfying Condition 1, the order of the convergence rate is determined by expression (12). Given that , the estimate for Newton's method (20) is , and for the quasi-Newton method [27] it is:
Consequently, estimates (68) and (69) for the subgradient method turn out to be preferable to (12). This situation arises, for example, when minimizing quadratic functions whose Hessians have a large spread of eigenvalues.
Thus, Algorithm 2 on strongly convex functions, without assuming the existence of second derivatives, has accelerating properties compared to the steepest descent method.
For sufficiently small values of the ratio, the average convergence rate of the subgradient method is given below:
The second term in square brackets of estimate (71) characterizes the stage of adjusting the metric matrix of Algorithm 2. Analysis of expression (71) shows that estimate (69) is qualitatively similar to estimate (20) for Newton's method; the difference between the second-derivative information used in (20) and the gradient information used in (69) manifests itself through the factor 1/n in (69).
To test its effectiveness, it makes sense to implement Algorithm 2 and conduct numerical testing in order to identify its applicability to minimizing smooth, highly ill-conditioned functions alongside effective quasi-Newton methods.
5. Aspects of the Subgradient Method Implementation
In the case of inexact one-dimensional descent in operation (25) of the minimization algorithm, it is assumed that the one-dimensional minimum has been localized; that is, a point has been obtained such that the subgradient uk+1 at the extreme point zk+1 satisfies inequality (30):
which is shown in Figure 2. The subgradient uk+1 is used to transform the matrix Hk. Figure 2 shows a point xk+1 with a smaller function value on the localization segment between the points xk and zk+1, which, at the next iteration, becomes the new current minimum point with the direction of minimization .
In Algorithm 1, at each iteration, a vector is chosen arbitrarily, and then a vector is chosen such that . In the minimization algorithm with one-dimensional minimization from a point x along the direction s = Hg, when localizing the minimum, we obtain a point for which an inequality similar to (30) is satisfied, and a point inside the localization segment with a smaller function value, which we take as the new minimum approximation from which the next one-dimensional descent is carried out. The gradients gx = g(x) and g1 = g(x1) at the points x and x1 are used together for the matrix transformation. Thus, in the practical version of the minimization algorithm, the vectors gx, g1, g(xm) are used, corresponding in meaning to the vectors from Algorithm 1.
We use a one-dimensional minimization procedure based on these principles, outlined in [27,43]. Its set of input parameters is , where x is the point of the current minimum approximation, s is the descent direction, is the initial search step, , and the necessary condition for the possibility of reducing the function along the direction must be satisfied. Its output parameters are . Here, is the step to the point of a new minimum approximation: is the step along s such that, at the point , for the subgradient the inequality holds. This subgradient is used in the learning algorithm. The output parameter h1 is the initial descent step for the next iteration; it is adjusted to reduce the number of calls to the procedure for calculating the function and subgradient. In the minimization algorithm, the vector is used to solve a system of inequalities, and the point serves as the point of a new minimum approximation. We denote the call to the procedure as OM(;). Here is a brief description of it.
Let us introduce the one-dimensional function . To localize its minimum, we take an increasing sequence . Here, qM > 1 is a step-increasing parameter; in most cases, qM = 3 is specified. Denote by l the first index i at which the relation is satisfied. We determine the parameters of the localization segment of the one-dimensional minimum:
and find a minimum point through cubic approximation of the function [46] on the localization segment, using the values of the one-dimensional function and its derivative. Calculate:
We calculate the initial descent step for the next iteration using the rule:
Here, qm < 1 is a descent-step decreasing parameter, which, in most cases, is set to qm = 0.8. In the vast majority of applications, the parameter set {qM = 3, qm = 0.8} is satisfactory. When solving complex problems with highly elongated level surfaces, the parameter should be increased: qm → 1. The subgradient method implementation is presented in Algorithm 3.
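The localization and cubic-refinement scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper `cubic_min` uses the standard Hermite cubic-interpolation minimizer (the paper's exact formulas are given in [46]), and the names `om`, `phi`, `hm`, `h1` mirror the notation of the text.

```python
import numpy as np

def cubic_min(a, b, pa, pb, dpa, dpb):
    # Minimizer of the Hermite cubic fitted to the values (pa, pb) and
    # derivatives (dpa, dpb) at the bracket ends a, b; clipped to [a, b].
    d1 = dpa + dpb - 3.0 * (pa - pb) / (a - b)
    rad = d1 * d1 - dpa * dpb
    if rad < 0.0:                          # no real minimizer: fall back to midpoint
        return 0.5 * (a + b)
    d2 = np.sign(b - a) * np.sqrt(rad)
    t = b - (b - a) * (dpb + d2 - d1) / (dpb - dpa + 2.0 * d2)
    return min(max(t, min(a, b)), max(a, b))

def om(f, grad, x, s, h0, qM=3.0, qm=0.8, max_expand=50):
    """One-dimensional descent from x along s: localize the minimum with the
    increasing steps h0 * qM**i, then refine once by cubic interpolation."""
    phi = lambda t: f(x + t * s)
    dphi = lambda t: grad(x + t * s) @ s
    assert dphi(0.0) < 0.0, "s must be a descent direction"
    a, t = 0.0, h0
    for _ in range(max_expand):            # increasing sequence localizes the minimum
        if dphi(t) >= 0.0 or phi(t) >= phi(a):
            break
        a, t = t, qM * t
    hm = cubic_min(a, t, phi(a), phi(t), dphi(a), dphi(t))
    if phi(hm) > phi(a):                   # keep the best localized point
        hm = a
    h1 = qm * hm if hm > 0.0 else qm * h0  # initial step for the next iteration
    return x + hm * s, hm, h1
```

On a one-dimensional quadratic the cubic fit is exact, so a single refinement lands on the minimizer; on general functions the bracket guarantees the refined point stays inside the localization segment.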
Algorithm 3. Subgradient method implementation
1. Assume k = 0, the initial matrix H0 = I, q ≥ 1, and the number of iterations kmax for stopping the algorithm. Set ΘA satisfying inequality (37) and the parameter . Compute . Set the initial step of the one-dimensional search h0 and a small . If , then x0 is a minimum point; stop the algorithm.
2. If then correct the matrix:
Set
Find a new minimum approximation:
According to the description of the OM procedure, here the subgradient vector satisfies the condition . If , then is the minimum point; stop the algorithm. If k > kmax, then stop the algorithm.
3. Compute the vectors by (31). Here, the vector is found from the orthogonality condition (32) of the vectors and . Then, compute by (33), where is calculated by formula (41). Find according to (34) and the parameters by (35). We obtain a new approximation of the metric matrix . If dmax ≤ ε, then carry out the scaling.
4. Assign k = k + 1. Go to Step 2.
Here, the built-in method for solving the system of inequalities (26) consists of the transformations carried out at Step 3 under condition (39). The current approximation of the solution to system (26) at iteration k is the vector sk (74), which is used as the new descent direction.
The algorithm uses soft matrix updating with small changes in the diagonal elements in the case of large angles (72) between the vectors sk and gk. Since the matrix transformations reduce its elements, a scaling transformation (75) is carried out to compensate for this effect; it does not affect the computational process. Taking into account the scaling of the descent direction (74), the one-dimensional search step, which is adjusted in the one-dimensional minimization procedure, is scaled simultaneously with the matrix.
Along with formula (33), we used a simplified version of calculating the value of , which enables us to analyze the qualitative nature of formula (33). Using the symmetric matrix , we form the vectors and assume the equality . Hence, due to the equality a = b − c, the vectors a, b, c form an isosceles triangle (Figure 3). Since the lengths of the projections of the vectors b and c onto the vector a are the same, the equality holds. Therefore:
and the factor from (33) can be transformed as follows:
From here and (33), we obtain:
At the last steps of the transformation in (76), we used the expression introduced earlier in (27). As shown in [27], Algorithm 2 also remains operable when formula (27) is used to calculate the transformation coefficients of the matrices (49) instead of . The approximate formula (76) reflects the qualitative nature of the relation . According to Figure 3, larger angles between the vectors b and c correspond to smaller values of the ratio , which, according to (76), reduces the value of and, accordingly, leads to an increase in the parameter and an insignificant decrease in the parameter at Step 3 of Algorithm 3. We used the simplified expression for from (76) in Algorithm 3.
Below, we present examples of solving test problems using the quasi-Newton BFGS method and Algorithms 2 and 3.
6. Results of Numerical Study on Smooth Functions
Algorithms 2 and 3 were implemented with parameters and , providing the product . These values were used in Algorithms 2 and 3 with the dynamic parameter selection method. The methods used the one-dimensional search described above. For comparison, the quasi-Newton BFGS method was implemented with a one-dimensional search procedure using cubic interpolation [46]. In all methods, the function and gradient were calculated simultaneously.
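For reference, the inverse-Hessian update of the BFGS baseline is the standard rank-two formula. The sketch below is our own illustration, not the authors' implementation; it shows the update driving an exact line search on a quadratic f(x) = ½ xᵀAx, where BFGS with H0 = I terminates in at most n iterations.

```python
import numpy as np

def bfgs_update(H, s, y):
    """Standard BFGS update of the inverse-Hessian approximation H from the
    step s = x_{k+1} - x_k and the gradient difference y = g_{k+1} - g_k."""
    sy = s @ y
    if sy <= 1e-12 * np.linalg.norm(s) * np.linalg.norm(y):
        return H                      # skip the update if curvature is not positive
    rho = 1.0 / sy
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Exact line search on f(x) = 0.5 x^T A x: the minimizer is reached in <= n steps.
A = np.diag([1.0, 10.0])              # ill-conditioned toy Hessian
x, H = np.array([1.0, 1.0]), np.eye(2)
for _ in range(5):
    g = A @ x
    if np.linalg.norm(g) < 1e-12:
        break
    d = -H @ g
    t = -(g @ d) / (d @ A @ d)        # exact minimizing step along d
    s = t * d
    x_new = x + s
    H = bfgs_update(H, s, A @ x_new - g)
    x = x_new
```

With exact line searches on a quadratic, the directions generated this way coincide with conjugate gradient directions, which is why the quadratic case in Table 1 favors BFGS.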
Table 1, Table 2, Table 3, Table 4 and Table 5 show the number of function and gradient evaluations required to achieve the designated accuracy by function . The initial minimization point x0 and the value ε are given in the description of each function.
The purpose of testing is to study experimentally the ability of the subgradient method and the quasi-Newton method to eliminate the background that slows down the convergence rate, i.e., the background removable by some linear transformation that normalizes the elongation of the function level surfaces in different directions, as predicted theoretically by estimate (69) of Theorem 7.
Since the use of subgradient methods with a variable space metric and of quasi-Newton methods is justified primarily on highly ill-conditioned functions, where conjugate gradient methods do not work, the test functions were selected from this standpoint. Since the quasi-Newton method is based on a quadratic model of the function, its local convergence rate in a neighborhood of the current minimum is largely determined by how effective it is in minimizing ill-conditioned quadratic functions. Therefore, the research was primarily carried out on quadratic functions and functions derived from them.
If the function is twice differentiable, then the eigenvalues of the Hessian are bounded by the interval [ρ, L] defined by the strong convexity parameter and the Lipschitz parameter. We did not use second derivatives in our proofs. Nevertheless, when designing the tests, we used the representation of a quadratic function and analyzed its conditionality in terms of its eigenvalues. The test functions simulate the oscillatory behavior of the second derivatives in two ways. The first is a drift of the corresponding eigenvalue from one value to another. In the second, we imposed random noise on the length of the gradient vector, which affects the gradient-difference computations in subgradient Algorithms 2 and 3 (40) and in the quasi-Newton method:
Imposing the described ways of simulating Hessian oscillations on a basic quadratic function with given eigenvalue characteristics yields a controlled degree of degeneracy of the problem: we can specify, on the one hand, the scaling that the methods under study should remove and, on the other, the degree of oscillation of the scales, simulating changes of the second-derivative matrices within specified limits.
The following is accepted as the basic quadratic function:
The eigenvalues ai of this function have the limits . In this case, the methods under study have to remove the basic (trend) scaling specified by the coefficients of this function. To simulate random fluctuations of the second derivatives, a function f2 was created. To calculate the function values, the basic function was used, and its gradients were distorted randomly according to the following scheme:
where ξ ∈ [−1, 1] is a random number uniformly distributed on the segment [−1, 1] and r = 0.3. Such a function will be denoted as .
Here, the parameters are the base-function parameters and the gradient distortion parameter. Note that distorting the gradients significantly reduces the accuracy of the one-dimensional search, where gradients are used to estimate directional derivatives in the cubic approximation.
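The distortion scheme can be sketched as a wrapper around an exact gradient oracle. Since the paper's exact formula is not reproduced here, the multiplicative form (1 + rξ)·g(x) below is our assumed reading of "distorting the gradient length" with ξ uniform on [−1, 1] and r = 0.3:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_grad(grad, r=0.3):
    """Wrap an exact gradient oracle: scale the gradient length by (1 + r*xi),
    where xi is uniform on [-1, 1]. This multiplicative form is an assumption;
    the paper's own distortion formula is not reproduced in this text."""
    def g(x):
        xi = rng.uniform(-1.0, 1.0)
        return (1.0 + r * xi) * grad(x)
    return g
```

Under this scheme the direction of the distorted gradient coincides with the exact one; only its length fluctuates within ±30%, which is exactly what degrades the directional-derivative estimates in the cubic line search.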
In the third function, additional variables ci were used to change the scales of ai for each of the variables. Near the extremum, this function has the form:
Far from the extremum, we obtain a function in which the coefficients bi are used in reverse order:
The scales of the coefficients ci change within the following range: .
The point x0 = (100, 100, …, 100) was chosen as the initial point for all the above functions. Additionally, the following nonlinear functions were used for testing and analysis.
Function f4 has ellipsoidal level surfaces corresponding to a quadratic function. Function f5 has a multidimensional ellipsoidal ravine; minimization proceeds along this curvilinear ravine to the minimum point.
The stopping criterion was:
Minimization results are presented in Table 1, Table 2, Table 3, Table 4, Table 5 and Table 6.
Table 1, Table 2, Table 3, Table 4 and Table 5 show the results of minimizing the five presented functions for various dimensions. These tables allow us to analyze the effect of removing the basic background using the subgradient and quasi-Newton methods. The cells contain:
N_it—number of iterations (one-dimensional searches along the direction);
nfg—number of calls to the procedure for simultaneous calculation of a function and gradient.
Table 1 shows the results of minimizing the quadratic function f1, intended for the basic scaling of the variables. This function is a background that must be removed by the method's metric matrix. The nfg costs of the subgradient methods here are approximately twice those of the BFGS method.
Table 1. Function f1 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 370 | 784 | 331 | 696 | 125 | 276 |
200 | 527 | 1070 | 538 | 1096 | 243 | 523 |
300 | 738 | 1424 | 746 | 1430 | 348 | 746 |
400 | 934 | 1740 | 944 | 1779 | 447 | 948 |
500 | 1122 | 2084 | 1135 | 2129 | 542 | 1146 |
600 | 1298 | 2359 | 1301 | 2474 | 634 | 1334 |
700 | 1434 | 2645 | 1454 | 2695 | 724 | 1525 |
800 | 1564 | 2842 | 1598 | 2965 | 811 | 1710 |
900 | 1698 | 3056 | 1727 | 3166 | 897 | 1884 |
1000 | 1821 | 3280 | 1839 | 3429 | 982 | 2061 |
Table 2 shows the results for the function f2. Algorithms 2 and 3 show approximately the same results, while the results for the BFGS method are approximately two times worse. Gradient noise has a detrimental effect on the accuracy of the one-dimensional search with cubic interpolation, which uses function gradients, and the reduced accuracy of the one-dimensional descent negatively affects the BFGS method.
Table 2. Function f2 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 357 | 771 | 360 | 783 | 561 | 1321 |
200 | 568 | 1198 | 565 | 1185 | 1083 | 2564 |
300 | 769 | 1607 | 761 | 1607 | 1486 | 3514 |
400 | 952 | 1995 | 975 | 2041 | 1827 | 4335 |
500 | 1132 | 2323 | 1152 | 2405 | 2222 | 5279 |
600 | 1306 | 2673 | 1345 | 2753 | 2587 | 6152 |
700 | 1470 | 3048 | 1489 | 3068 | 2802 | 6652 |
800 | 1599 | 3311 | 1666 | 3419 | 3167 | 7566 |
900 | 1733 | 3581 | 1783 | 3689 | 3543 | 8442 |
1000 | 1876 | 3866 | 1930 | 3992 | 3584 | 8577 |
Table 3 shows the results of minimizing function f3. Algorithms 2 and 3 show approximately the same results, while the results for the BFGS method are approximately five times worse. In this problem, the variables are rescaled as the extremum is approached. Possibly, this is due to differences in the degree to which the ratio enters the estimates: for the BFGS method, according to (70), it is , and for the subgradient methods, according to (71), it is .
Table 3. Function f3 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 407 | 900 | 415 | 911 | 2654 | 6901 |
200 | 681 | 1461 | 696 | 1482 | 4780 | 11,885 |
300 | 951 | 1950 | 965 | 1980 | 6373 | 15,385 |
400 | 1202 | 2415 | 1221 | 2480 | 7571 | 17,917 |
500 | 1441 | 2837 | 1458 | 2912 | 8297 | 19,434 |
600 | 1653 | 3257 | 1674 | 3294 | 8968 | 20,900 |
700 | 1864 | 3672 | 1898 | 3710 | 9572 | 22,214 |
800 | 2061 | 4016 | 2108 | 4090 | 9914 | 22,967 |
900 | 2258 | 4343 | 2288 | 4411 | 10,391 | 24,001 |
1000 | 2457 | 4686 | 2481 | 4761 | 10,645 | 24,500 |
Table 4 shows the results of minimizing function f4. Algorithms 2 and 3 show approximately the same results. The non-quadraticity of the function, while the topology of its level surfaces remains equivalent to that of a quadratic function, significantly affects the convergence rate of the BFGS method and, to a much lesser extent, that of Algorithms 2 and 3.
Table 4. Function f4 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 154 | 267 | 156 | 295 | 953 | 2226 |
200 | 266 | 443 | 261 | 438 | 2012 | 4682 |
300 | 377 | 619 | 362 | 604 | 3136 | 7282 |
400 | 454 | 737 | 456 | 741 | 4314 | 10,027 |
500 | 556 | 889 | 573 | 929 | 5523 | 12,815 |
600 | 669 | 1078 | 672 | 1095 | 6747 | 15,658 |
700 | 762 | 1215 | 778 | 1259 | 7990 | 18,537 |
800 | 877 | 1400 | 870 | 1413 | 9243 | 21,430 |
900 | 968 | 1545 | 971 | 1558 | 10,541 | 24,455 |
1000 | 1094 | 1752 | 1089 | 1765 | 11,746 | 27,226 |
Table 5 shows the minimization results for function f5. Algorithm 2 is slightly better than Algorithm 3. This function also turned out to be difficult for the BFGS method: like f4, it contains fourth-degree polynomials, which, unlike for the subgradient methods, significantly affect the BFGS convergence rate.
Table 5. Function f5 minimization results.
n | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
100 | 498 | 1116 | 432 | 989 | 1170 | 2847 |
200 | 558 | 1286 | 450 | 1051 | 1417 | 3396 |
300 | 609 | 1423 | 496 | 1196 | 1700 | 4118 |
400 | 705 | 1687 | 442 | 1100 | 1862 | 4465 |
500 | 686 | 1653 | 388 | 980 | 1964 | 4722 |
600 | 613 | 1499 | 429 | 1091 | 2081 | 4955 |
700 | 581 | 1434 | 433 | 1106 | 2228 | 5315 |
800 | 451 | 1176 | 394 | 1048 | 2180 | 5200 |
900 | 533 | 1361 | 430 | 1135 | 2412 | 5727 |
1000 | 554 | 1430 | 435 | 1188 | 2490 | 5957 |
In Table 6, the results for functions f1–f5 at n = 1000 are presented. The results show the effectiveness of the methods on all functions under study simultaneously. Conclusions regarding the effectiveness of the methods were made earlier.
Table 6. Functions minimization results at n = 1000.
Function | Algorithm 3 N_it | Algorithm 3 nfg | Algorithm 2 N_it | Algorithm 2 nfg | BFGS N_it | BFGS nfg |
---|---|---|---|---|---|---|
f1 | 1821 | 3280 | 1839 | 3429 | 982 | 2061 |
f2 | 1876 | 3822 | 1930 | 3992 | 3584 | 8577 |
f3 | 2457 | 4686 | 2481 | 4761 | 10,645 | 24,500 |
f4 | 1094 | 1752 | 1089 | 1765 | 11,746 | 27,226 |
f5 | 554 | 1430 | 435 | 1188 | 2490 | 5957 |
Table 7 shows the results of Algorithm 3 on the first three functions for n = 1000 with parameters changed according to (71) and different values of c = {0.2; 0.1; 0.05}. On ill-conditioned problems, such changes increase the efficiency of the minimization method. These examples show the possibility of tuning the method parameters for a certain fixed set of optimization problems.
Regarding the convergence rate of minimization methods, the following conclusions can be drawn:
For functions close in properties to quadratic (f1), the quasi-Newton BFGS method significantly exceeds subgradient Algorithms 2 and 3 in terms of convergence rate.
In the case of significant interference imposed on the gradients of the function (f2), subgradient Algorithms 2 and 3 are more effective than the BFGS method.
Variability of scales across variables (f3) affects the convergence rate of subgradient methods to a lesser extent than that of the BFGS method.
The presence of polynomial degrees higher than 2 in the minimized function affects the convergence rate of subgradient methods to a lesser extent than the BFGS method.
A computational experiment showed the possibility of adjusting the parameters of the method in accordance with theoretical principles. Therefore, the efficiency of the method can be increased on a certain fixed set of optimization problems.
The computational experiment confirmed the theoretically predicted ability of subgradient methods to exclude the background that slows down the convergence rate.
Based on the theoretical principles and experimental results, we can conclude that the presented subgradient methods complement quasi-Newton methods when solving smooth optimization problems.
7. Conclusions
The conditionality of the minimization problem determines the spread of the elongation of level surfaces in different directions, and thus the complexity of solving the problem. In minimization practice, it often turns out to be possible to reduce the elongation of the level surfaces by some linear transformation of coordinates. The paper studies the ability of Newton's method and of the subgradient method with parameter optimization via a change of the space metric to eliminate the ill-conditioning of the problem by means of a linear transformation.
The paper proves that under conditions of instability of the second derivatives of the function in the minimization domain, the estimate of the convergence rate of Newton’s method is determined by the strong convexity parameter and Lipschitz parameter in the coordinate system where their ratio is maximum. This means the method’s ability to exclude the linear background, which increases the conditionality degree of the problem. The estimate of convergence rate serves as a standard for assessing the capabilities of the subgradient method being studied.
The paper studies RSM with parameter optimization of the rank-two correction of metric matrices on smooth, strongly convex functions with a Lipschitz gradient, without assuming the existence of second derivatives. Under broad assumptions on the transformation parameters of the metric matrices, an estimate of the convergence rate of the studied RSM and an estimate of its ability to exclude a removable linear background are obtained. These estimates turn out to be qualitatively similar to the estimates for Newton's method.
A practical version of RSM and test functions have been developed that simulate the presence of a removable linear background. A computational experiment was carried out in which the quasi-Newton BFGS method and the subgradient method under study were compared on various types of smooth functions. The testing results indicate the effectiveness of the subgradient method in minimizing smooth functions with a high degree of conditionality of the problem and its ability to eliminate the linear background that worsens the convergence.
Depending on the type of function, one or another method dominates, which allows us to conclude that the subgradient method is applicable along with quasi-Newton methods when solving problems of minimizing smooth functions with a high degree of conditionality.