1. Background
Our main focus is a special form of accelerated gradient methods that contain two vector directions. Aiming to develop as efficient an optimization method as possible, we explored an approach that combines several different search vectors. Computationally, iterations with two directions have become the preferred choice. In [1], the authors suggested the following double-direction iterative method for solving non-differentiable problems:
In (
1),
is the iterative step length, while
and
are two differently defined vector directions. These three parameters are determined by the procedures listed below, denoted as the curve search algorithm, the algorithm for deriving vector direction
, and the algorithm for deriving vector direction
, respectively.
Curve search algorithm
where
is the smallest integer from
such that
In (
3),
is estimated as
,
stands for the second-order Dini upper-directional derivative at
in direction
d, and the function
F represents the Moreau–Yosida regularization of the objective function
f associated with the metric
M, defined as follows:
Algorithm for deriving vector direction ,
,
is an index set at
k-th iteration.
and
,
is a vector that satisfies
.
Algorithm for deriving vector direction is the solution to the problem
The results presented in [
1] motivated the authors in [
2] to define an accelerated gradient version of the iterative rule (
1). In ref. [
2], the accelerated double-direction method, denoted as the ADD, is presented as follows:
In (
5),
is the current iterative point,
is the iterative step size,
is the gradient of the objective function,
is the accelerated parameter of the ADD method, and
is the second vector direction.
The step size parameter
of the iteration (
5) is calculated in the following way:
is the smallest integer from
such that
where
is a real number such that
.
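The step-size computation above (and the backtracking procedure discussed in Remark 1 below) follows an Armijo-type pattern that can be sketched as follows. This is an illustrative one-dimensional sketch; the parameter names sigma and beta and the default values are our own assumptions, not the paper's exact notation.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Armijo-type backtracking: starting from alpha = 1, keep multiplying by
// beta in (0, 1) until the sufficient-decrease condition holds. The names
// sigma and beta are assumed backtracking parameters.
double backtracking_step(const std::function<double(double)>& f,
                         const std::function<double(double)>& df,
                         double x, double dir,
                         double sigma = 1e-4, double beta = 0.5) {
    double alpha = 1.0;
    const double fx = f(x);
    const double slope = df(x) * dir;  // directional derivative along dir
    // Accept alpha once f(x + alpha*dir) <= f(x) + sigma * alpha * slope.
    while (f(x + alpha * dir) > fx + sigma * alpha * slope)
        alpha *= beta;  // reduce the trial step: alpha <- beta * alpha
    return alpha;
}
```

For example, for f(x) = x², starting at x = 1 with direction −f′(1) = −2, this search accepts α = 0.5, which jumps straight to the minimizer.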
Remark 1. An alternative way to derive the iterative step size is the backtracking line search procedure, originally presented in [3]. In this procedure, the iterative step length is found by starting from an initial value and, using the two backtracking parameters, checking a sufficient-decrease condition while successively reducing the step size. The optimal initial step length is obtained once the exit condition of the backtracking algorithm is fulfilled.
Remark 2. There are two main approaches for calculating the step length parameter in an optimization method:
1. exact line search;
2. inexact line search.
Using procedure 1, in each iteration, the step size value is derived as the solution to the following problem:
Clearly, the exact line search requires additional CPU time. Accordingly, the required numbers of iterations and function evaluations certainly increase. For this reason, in many contemporary optimization schemes, the step length parameter is derived using the second approach, i.e., through inexact line search procedures. We list some of the commonly applied inexact line search techniques as follows: The authors in [
2] modified the algorithms for deriving vector directions of the iteration (
5) in gradient terms. These modifications are illustrated in Algorithms 1 and 2.
Algorithm 1 Calculation of the vector
:
Algorithm 2 Calculation of the vector
:
where
is the solution of the transformed minimization problem (
4)
Remark 3. The vector of the search direction is one of the important elements of each gradient minimization scheme for solving unconstrained optimization problems. In solving minimization tasks, it is assumed that the iterative search direction satisfies the following inequality: where is the gradient of the objective function at point . Relation (8) is known as the descent condition.
We list only some of the approaches for generating search directions that fulfill Condition (8). The last suggestion in Remark 3 induces an idea that can serve as a basis for further studies, namely comparisons between a chosen conjugate gradient method and the hybrid accelerated method proposed in this paper.
The general form of the conjugate gradient method is given as follows:
where the iterative step length variable
is calculated via the exact line search or via one of the inexact line searches listed in Remark 2. The distinguishing feature of a conjugate gradient scheme is the method of generating the vector direction
, which is defined as follows:
i.e.,
In (
10),
denotes the scalar product of the gradient vectors.
For the suggested comparative studies, in relation to the research presented in this paper, it would be valuable to pay special attention to the set of quadratic functions. A quadratic function is defined by the following expression:
where
A is a symmetric, positive definite
matrix,
, and
. Starting with the initial condition
, after some calculations, an update for the vector
is obtained as follows:
where
The conjugate gradient method (
9) with a vector direction defined by (
12), where
is calculated using Relation (
13), is known as the Fletcher–Reeves formulation of the conjugate gradient method [
14].
We list several significant variants of the conjugate gradient method that differ with respect to the expressions that define
quotients [
15,
16,
17]:
For example, as a comparative minimization model, the conjugate method proposed in [
13] can be taken. Providing the suggested comparative analysis would certainly contribute to the optimization community in general.
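To make such a comparison concrete, the Fletcher–Reeves scheme (9), (12), (13) applied to a strictly convex quadratic (11) can be sketched as follows. For quadratics, the exact line search has a closed form; the dense-matrix helpers and function names here are our own illustrative choices, not the paper's notation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static Vec matvec(const Mat& A, const Vec& x) {
    Vec y(x.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Minimizes f(x) = 0.5 x^T A x - b^T x, i.e., solves A x = b, from x = 0.
Vec fletcher_reeves(const Mat& A, const Vec& b,
                    int max_iter = 100, double tol = 1e-10) {
    Vec x(b.size(), 0.0);
    Vec r = b;            // residual r = b - A x = -gradient
    Vec d = r;            // initial direction: steepest descent
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        Vec Ad = matvec(A, d);
        double t = rr / dot(d, Ad);          // exact line search step
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += t * d[i];
            r[i] -= t * Ad[i];
        }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;           // Fletcher-Reeves quotient
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = r[i] + beta * d[i];
        rr = rr_new;
    }
    return x;
}
```

For instance, with A = {{4, 1}, {1, 3}} and b = {1, 2}, the method recovers the minimizer x = (1/11, 7/11) in two iterations.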
One of the crucial variables of the ADD scheme (
5) is the acceleration factor, calculated using the second-order Taylor expansion of the objective function:
The most significant contribution achieved in ref. [
2] likely concerns the importance of the acceleration parameter. To substantiate this, the authors constructed and tested a non-accelerated version of the ADD model, called the NADD method. In these studies, the superior effectiveness of the ADD method was confirmed.
Recently, in [
18], the authors introduced a new hybrid approach for generating accelerated gradient optimization methods. This calculative technique is denoted as
s-hybridization. In developing this new approach to constructing an efficient minimization scheme, the authors were guided by research on nearly contraction mappings and nearly asymptotically nonexpansive mappings and on the existence of fixed points for these classes of mappings [
19,
20]. The main idea regarding the
s-
schemes arises from the study presented in [
20], where the following three-term s-iterative rule is presented:
In (
15),
and
are sequences of real numbers satisfying the following conditions:
The authors in [
18] simplified the s-iteration (
15) by applying Condition (
17)
which transforms the limits (16)–(18) as follows:
Therewith, the
s-iteration with one corrective parameter
is expressed as follows:
Guided by Iteration (
19), in connection with the SM method from [
21], the authors in [
18] proposed the SHSM optimization method (
20):
The authors proved that this model is well defined and established a comprehensive convergence analysis.
This paper is organized as follows: in
Section 2, we develop the
s-hybrid double-direction minimization method based on the obtained results from the relevant studies described in the first section. The convergence analysis is presented in
Section 3. Numerical investigations are illustrated in
Section 4.
2. S-Hybridization of the Accelerated Double-Direction Method
In this section, we generate the
s-hybrid model using the ADD iterative rule,
as a guiding operator in the three-term process (
19). Applying the previously stated facts regarding the
s-hybridization technique and the ADD method, we develop the
shADD process through the following three-term relations:
Before we state and prove that the three-term process (21), rewritten in a merged form, presents an accelerated gradient descent method, we recall the following two important statements.
Proposition 1 (Second-order necessary conditions—unconstrained case [
22]).
Let be an interior point of the set Ω,
and suppose that is a relative minimum point over Ω
of the function . Then, Proposition 2 (Second-order sufficient conditions—unconstrained case [
22]).
Let be a function defined in a region in which the point is an interior point. Suppose in addition that Then, is a strict relative minimum point of f.
Lemma 1. The accelerated gradient iterative form of the shADD process (21) is given by the following relation: Proof. The merged iterative rule of the
process (
21) can be derived by substituting the expression of
from (
21) into the previous relation of the same three-term method, i.e., the one that defines
:
which proves (
22).
Now, we show that Method (
22) fulfills the gradient descent property. For this purpose, let us rewrite relation (
22) as follows:
where
Knowing that
implies
. Further,
can be considered as a linear combination of the gradient vector, since the vector direction
is derived by Algorithm 2. Moreover, the parameter
, being an acceleration parameter, is a positive constant. Therefore, direction
is the gradient descent vector.
Now, we derive the iterative value of the acceleration parameter for Method (
22). To achieve this goal, we use the second-order Taylor series of the objective function
f:
where
satisfies the following:
Instead of the function’s Hessian
, we use the diagonal scalar matrix approximation in the previous Taylor expression, i.e., acceleration matrix
:
This gives us the expression of the acceleration parameter
of the
process:
We assume the positiveness of the derived acceleration parameter
. This fact confirms that the second-order necessary and sufficient conditions have been fulfilled. In the case of
, we assign
and derive the next iterative point as
Knowing that
induces
, together with the fact that
confirms that the previous scheme is the gradient descent method. □
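The scalar (diagonal-matrix) Hessian approximation described above can be illustrated as follows. This sketch simply solves the second-order Taylor model f(x_new) ≈ f(x) + gᵀs + ½γ‖s‖² for γ, where s = x_new − x; it is our own simplified reading of the derivation leading to (24), not the paper's exact expression.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Solve the second-order Taylor model for the scalar gamma that replaces
// the Hessian: gamma = 2 (f_new - f_old - g^T s) / ||s||^2.
double taylor_gamma(double f_new, double f_old,
                    const std::vector<double>& g,
                    const std::vector<double>& s) {
    double gTs = 0.0, ss = 0.0;
    for (std::size_t i = 0; i < g.size(); ++i) {
        gTs += g[i] * s[i];   // g^T s
        ss  += s[i] * s[i];   // ||s||^2
    }
    return 2.0 * (f_new - f_old - gTs) / ss;
}
```

For f(x) = x², moving from x = 1 to x = 0.5 recovers γ = 2, the exact second derivative, as one would hope for a quadratic.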
We end this section by presenting the algorithm of the shADD method, derived on the basis of the preceding analysis.
Taking the initial values , , , , the algorithm of the shADD method is given by the following steps:
Set , compute , , and take ;
If , then go to Step 9; else, continue to Step 3;
Apply the backtracking algorithm to calculate the iterative step length ;
Compute the first vector direction using Algorithm 1;
Compute the second vector direction using Algorithm 2;
Compute
using the iterative rule (
22);
Determine the acceleration parameter
using (
24);
If , then take ;
Set , go to Step 2;
Return and .
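A minimal one-dimensional loop mirroring the spirit of Steps 1–10 can be sketched as follows. The two vector directions of Algorithms 1 and 2 and the s-hybridization corrective parameter are deliberately omitted; only the gradient step, a backtracking step length, and the Taylor-based acceleration parameter γ (reset to 1 when non-positive, as in the steps above) are kept. All names and parameter values are our own assumptions, not the exact shADD rule (22).

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Simplified accelerated descent: x <- x - (alpha / gamma) * f'(x), with
// gamma updated from the second-order Taylor model each iteration.
double accelerated_descent(const std::function<double(double)>& f,
                           const std::function<double(double)>& df,
                           double x, double eps = 1e-8, int max_iter = 10000) {
    double gamma = 1.0;                          // initial acceleration parameter
    for (int k = 0; k < max_iter && std::fabs(df(x)) > eps; ++k) {
        const double g = df(x);
        // Backtracking on the step length alpha (sigma = 1e-4, beta = 0.5).
        double alpha = 1.0;
        while (f(x - alpha * g / gamma) > f(x) - 1e-4 * alpha * g * g / gamma)
            alpha *= 0.5;
        const double x_new = x - alpha * g / gamma;
        // Update gamma from the Taylor model; reset to 1 if non-positive.
        const double s = x_new - x;
        const double gamma_new = 2.0 * (f(x_new) - f(x) - g * s) / (s * s);
        gamma = (gamma_new > 0.0) ? gamma_new : 1.0;
        x = x_new;
    }
    return x;
}
```

Applied to f(x) = (x − 3)², starting from x = 0, the loop reaches the minimizer x = 3.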
3. Convergence Features of the shADD Method
We start this section with some relevant known statements that can be found in [
23,
24].
Proposition 3. If the function is twice continuously differentiable and uniformly convex on , then:
Lemma 2. Under the assumptions of Proposition 3, there exist real numbers m, M satisfying the following: such that has a unique minimizer and Depending on the degree of complexity that a particular non-linear problem may have, the examination of its convergence often resorts to establishing convergence on specific sets. Therefore, in this section, we present the convergence analysis of the derived
process on the set of strictly convex quadratic functions. The general expression of the strictly convex quadratics is given by (
30).
In (
30),
A is the real positive definite symmetric matrix, and
. Further on, we use the following notations regarding the relevant eigenvalues of matrix
A:
Previous research showed that, for strictly convex quadratics, an adequate relation, usually a connection between the smallest and the largest eigenvalues, must be fulfilled in order to establish the convergence of the optimization method [
2,
21,
25,
26]. In the next lemma, we define that connection by applying the
method.
Lemma 3. The relation between the smallest and largest eigenvalues of symmetric positive definite matrix that defines the strictly convex quadratic function (30) to which the method (22) is applied is given as follows:where β is the parameter defined in the backtracking procedure. Proof. To prove (
31), we start with the estimation of the difference of the function (
30) values in two successive points:
The expression above, which describes the difference between function’s values for two successive points, is determined based on the following facts:
Matrix
A is symmetric, so
The gradient of Function (
30) is
We now replace the derived difference in the acceleration parameter expression (
24):
The obtained expression
confirms that the acceleration parameter
can be written as the Rayleigh quotient of the real symmetric positive definite matrix evaluated at the vector
This fact results in the following conclusion:
According to findings revealed in [
21], the value of the iterative step length of the accelerated gradient method derived via the backtracking inexact algorithm satisfies the following:
where
L is the Lipschitz constant that figures in Proposition 3, so the following is valid:
Considering relation (
32), which defines the gradient of the strictly convex function, we have the following:
The previous relation confirms that the largest eigenvalue of the symmetric matrix
A fulfills the property of the Lipschitz constant
L in (
34). Additionally, according to the limitations of the backtracking parameters
and
, we derive the following estimations:
which confirms the right side of Estimation (
31). Based on (
33) and the fact that the iterative step size is less than 1, the first inequality of (
31) arises. □
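The Rayleigh-quotient bound used in the proof above can be checked numerically. The following illustrative 2×2 sketch (names are our own) evaluates vᵀAv / vᵀv, which for a symmetric positive definite A always lies between the smallest and largest eigenvalues of A.

```cpp
#include <cassert>
#include <cmath>

// Rayleigh quotient v^T A v / v^T v for a symmetric 2x2 matrix A.
double rayleigh_quotient(const double A[2][2], const double v[2]) {
    const double Av0 = A[0][0] * v[0] + A[0][1] * v[1];
    const double Av1 = A[1][0] * v[0] + A[1][1] * v[1];
    return (v[0] * Av0 + v[1] * Av1) / (v[0] * v[0] + v[1] * v[1]);
}
```

For A = diag(2, 5) and v = (1, 1), the quotient is 3.5, which indeed lies in [λ_min, λ_max] = [2, 5].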
On the basis of proven estimations (
31) relating to the acceleration parameter, backtracking parameter, and the lowest and the largest eigenvalues of the symmetric, positive definite matrix
A that figures in the expression (
30), using the following theorem, we establish the convergence of the
method on the set of strictly convex quadratics.
Theorem 1. For the strictly convex quadratic function (30), the process (22) is linearly convergent when . More precisely, the following relations are valid: for some real constants and , such that When the vectors represent the orthonormal set of eigenvectors of matrix A, the following inequations are fulfilled: For the gradient (32) of the function (30), the following is valid:
Proof. Taking the expression of Gradient (
32) of Function (
30) at the (
)-th iteration, we obtain the following:
Applying the orthonormal representations (
36) in the previous equation leads us to the following:
Knowing that
we will prove (
38) by showing that
For this purpose, let us first assume that
, i.e.,
Applying (
31), we obtain the following:
We rewrite the previous inequalities as follows:
We assume now the opposite case:
The last inequality gives
which directly implies
Estimations (
41) and (
42) prove (
38).
Finally, in order to prove (
39), we use the gradient representation from (
36),
which results in the following conclusion:
Applying the fact that
on inequalities (
37) directly proves (
39). □
Non-Convex Case Overview
In the previous section, we proved that the shADD method converges linearly on the set of strictly convex quadratics. Although it is not the main subject of this research, in this subsection, we analyze a possible application of the presented scheme when the objective function is non-convex. The importance of this introductory discussion arises from the vast array of contemporary non-convex problems, such as matrix completion, low-rank models, tensor decomposition, and deep neural networks.
Neural networks, considered universal function approximators, exhibit significant symmetry properties, which make the associated optimization problems non-convex. Some of the known techniques for solving machine learning and other non-convex problems are as follows:
Stochastic gradient descent methods,
Mini-batch approach,
Stochastic variance reduced gradient (SVRG) method,
Alternating minimization methods,
Branch and bound methods.
Confirming convergence properties in non-convex optimization is quite difficult. Unlike the convex case, there are no standard theoretical approaches to achieving this goal. Additionally, a non-convex objective function may have many local minima, saddle points, and flat regions.
Generally, when solving non-convex optimization tasks, the theoretical guarantees are very weak, and there is no tried-and-tested way of ending this process successfully.
Principal component analysis (PCA) is a technique for linear dimensionality reduction that is useful in proving global convergence of minimization methods applied to non-convex functions. We propose connecting this approach with the shADD method in further studies. The PCA process can be characterized by the following steps:
Standardizing the range of continuous initial variables;
Computing the covariance matrix to identify correlations;
Computing the eigenvectors and eigenvalues of the covariance matrix to identify the principal components;
Creating a feature vector to decide which principal components to keep;
Recasting the data along the principal components axes.
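Step 3 above, like the goal problem stated next, amounts to computing a dominant eigenpair. A classical route is power iteration, sketched below with a small dense matrix; the gradient-descent route analyzed afterwards is an alternative to this. The function names and loop bounds are our own assumptions; convergence presumes the start vector has a component along the dominant eigenvector.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Power iteration for a symmetric positive semidefinite matrix A:
// repeatedly apply A and normalize; the norm of A v (for unit v)
// converges to the dominant eigenvalue.
double dominant_eigenvalue(const std::vector<std::vector<double>>& A,
                           int iters = 200) {
    std::vector<double> v(A.size(), 1.0);  // arbitrary nonzero start vector
    double lambda = 0.0;
    for (int k = 0; k < iters; ++k) {
        std::vector<double> w(v.size(), 0.0);  // w = A v
        for (std::size_t i = 0; i < A.size(); ++i)
            for (std::size_t j = 0; j < v.size(); ++j)
                w[i] += A[i][j] * v[j];
        double norm = 0.0;
        for (double wi : w) norm += wi * wi;
        norm = std::sqrt(norm);
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = w[i] / norm;
        lambda = norm;                         // eigenvalue estimate
    }
    return lambda;
}
```

For A = diag(2, 5), the iteration converges to the dominant eigenvalue 5.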
We set the goal problem as follows: determine the dominant eigenvector and eigenvalue of a symmetric positive semidefinite matrix
A. We can write this problem as follows:
The equivalent of problem (
44) is (
45):
where
is the Frobenius norm, i.e.,
. Taking the objective function, defined in terms of the Frobenius norm,
we see that the gradient of this function is given by the following expression:
The classical gradient descent update step for Function (
46) can be written as follows:
where the adaptive step size parameter fulfills the following relation:
Applying (
49) in (
48) leads to the following:
Applying the previous relation inductively, we conclude that
The previous relation (
51) confirms that the gradient descent iteration (
48) converges linearly, since
A similar analysis can be applied to the
shADD iteration (
22) for Function (
46). According to the construction of the vector
(Algorithm 2) in (
22), we can modify this vector direction as a linear combination of the gradient vector, as explained in the proof of Lemma 1. Considering this fact allows us to rewrite iteration (
22) in a simpler form for which Property (
52) can be easily proved.
4. Numerical Test Results
In this section, we analyze the numerical performance of the
shADD method depending on the choice of parameter
(
18), which is aptly named the corrective parameter. For the selected values of this parameter, we track standard numerical metrics, including the number of iterations performed, CPU time, and the number of function evaluations.
As proposed in ref. [
27], which presents an extensive comparative analysis of several Khan-hybrid models, for a range of
values (
18), we take a specific numerical value
for all
. This further reinforces the corrective expression of the
shADD iteration (
22):
We have observed that for specific values of parameter
, we have the following values of the expression
All of this motivated us to test the
shADD method for these five specified values of
. For this purpose, we selected five test functions from [
28]. We tested these functions for the five given values of the corrective parameter
and for ten different values of the number of variables
. For each test function, we summarized the obtained outcomes for all selected numbers of variables. The results obtained from the measured performance metrics (number of iterations, CPU time, and number of evaluations) are presented in
Table 1,
Table 2, and
Table 3, respectively.
We can observe that, for the first test function, the Extended Penalty function, the output values regarding the number of iterations and the number of evaluations do not depend on changes in the corrective parameter . For the same function, the changes in CPU time for different values of the corrective parameter are also minimal. However, for the other test functions, all measured metrics differ depending on the choice of the corrective parameter value. Regarding the number of iterations, the best results are achieved for and . In the case of CPU time, tests for and take the least time. In terms of the number of evaluations, the smallest values are achieved for and . A general conclusion, based on a total of 250 tests, is that it is advisable to use a corrective parameter value of or .
The tests were conducted using a standard termination criterion:
The code was written in the C++ programming language.