Convergence Rates for Hestenes’ Gram–Schmidt Conjugate Direction Method without Derivatives in Numerical Optimization

In this work, we studied convergence rates, using the quotient convergence factors and root convergence factors described by Ortega and Rheinboldt, for Hestenes' Gram–Schmidt conjugate direction method without derivatives. We performed computations in order to compare this conjugate direction method, for minimizing a nonquadratic function f, with Newton's method, for solving ∇f = 0. Our primary purpose was to implement Hestenes' CGS method with no derivatives and determine convergence rates.


Introduction
The conjugate gradient (CG) and conjugate direction (CD) methods have been extended to the optimization of nonquadratic functions by several authors. Fletcher and Reeves [1] gave a direct extension of the conjugate gradient (CG) method. An approach to conjugate direction (CD) methods using only function values was developed by Powell [2]. Davidon [3] developed a variable metric algorithm, which was later modified by Fletcher and Powell [4]. According to Davidon [3], variable metric methods are considered to be very effective techniques for optimizing a nonquadratic function.
In 1952, Hestenes and Stiefel [5] developed conjugate direction (CD) methods for minimizing a quadratic function defined on a finite-dimensional space. One of their objectives was to find efficient computational methods for solving large systems of linear equations. In 1964, Fletcher and Reeves [1] extended the conjugate gradient (CG) method of Hestenes and Stiefel [5] to nonquadratic functions. The method presented here is related to those described by G.S. Smith [6], M.J.D. Powell [2] and W.I. Zangwill [7]. The method of Smith is also described by Fletcher [8] on pp. 9-10, Brent [9] on p. 124 and Hestenes [10] on p. 210. In addition, Nocedal [11] explored the possibility of nonlinear conjugate gradient methods converging without restarts and with practical line searches. A number of additional authors in numerical optimization, including Kelley [12] and Zang and Li [13], among others, investigated a wide range of approaches to conjugate direction methods.
The primary purpose of this work is to implement Hestenes' Gram-Schmidt conjugate direction method without derivatives, which uses function values with no line searches. We will refer to this method as the GSCD method; Hestenes refers to it as the CGS method. We illustrate the procedure numerically, computing asymptotic constants and the quotient convergence factors of Ortega and Rheinboldt [14]. Hestenes [10], p. 202, states that the CGS method has Newton's algorithm as its limit, and Russak [15] shows that n-step superlinear convergence is possible. We verify numerically that the GSCD procedure converges quadratically under appropriate conditions.
As for notation, we use capital letters, such as A, B, C, . . ., to denote matrices and lower case letters, such as a, b, c, . . ., for scalars. The value A* denotes the transpose of matrix A. If F is a real-valued differentiable function of n real variables, we denote its gradient at x by F′(x) and the Hessian of F at x by F″(x). We use subscripts to distinguish vectors and superscripts to denote components when these distinctions are made together, for example, x_k = (x_k^1, . . . , x_k^n). The method of steepest descent is due to Cauchy [16]. It is one of the oldest and most obvious ways to find a minimum point of a function f.
There are two versions of steepest descent. The one due to Cauchy, which we call the iterative method, uses line searches; the other, described by Eells [17] in Equation (10), p. 783, uses a differential equation of steepest descent. In Equation (4.3) we describe another version of the differential equation of steepest descent. Numerically, however, both have flaws. The iterative method is generally quite slow, as illustrated by Rosenbrock's banana valley function [18].
Newton's method applied to ∇f = 0, where f is the function to be minimized, is another approach for finding a minimum of f. Newton's method has rapid convergence, but it is costly because of derivative evaluations. Hestenes' CGS method without derivatives [10], p. 202, has Newton's method as its limit as σ → 0.
If the minimizing function is strongly convex quadratic and the line search is exact, then, in theory, all choices for the search direction in standard conjugate gradient algorithms are equivalent. However, for nonquadratic functions, each choice of the search direction leads to standard conjugate gradient algorithms with very different performances [19].
In this article, we investigate quotient convergence factors and root convergence factors, and we computationally compare the conjugate Gram-Schmidt direction method with Newton's method. There are other notions of convergence for the conjugate gradient, conjugate direction, gradient, Newton and steepest descent methods, such as superlinear convergence [20][21][22], Wall [23] root convergence and Ostrowski convergence factors [24], but, for this research, quotient convergence is the most appropriate for establishing quadratic convergence.
In this article, the well-known conjugate direction algorithm for minimizing a quadratic function is modified to become an algorithm for minimizing a nonquadratic function, in the manner described in Section 2. The algorithm uses the gradient estimates and Hessian matrix estimates described in Section 3. In Section 4, a test example of minimizing a nonquadratic function by the developed conjugate direction algorithm without derivatives is analyzed. The advantage of this approach compared to Newton's method is efficiency. The proposed approach is justified in detail. The results obtained are of theoretical and practical interest for specialists in the theory and methods of optimization.

Methodology
In this section, we present a class of CD algorithms for minimizing functions defined on an n-dimensional space. The reader is referred to Stein [25] and Hestenes [10], pp. 135-137 and pp. 199-202, respectively, for more details.

The Method of CD
Let A be a positive definite real symmetric n × n matrix, let k be a constant n-vector and let c be a fixed real scalar. Throughout this section, F denotes the quadratic function defined on Euclidean n-space E^n by

F(x) = (1/2) x*Ax − k*x + c,    (1)

where x is in E^n. Suppose 1 ≤ m ≤ n. Let S_m be the linear subspace spanned by a set {p_1, . . . , p_m} of m linearly independent and, hence, nonzero vectors. Let x_1 be any vector in E^n. Then, the m-dimensional plane P_m through x_1 obtained by translating the subspace S_m is defined by

P_m = {x : x = x_1 + a_1 p_1 + . . . + a_m p_m, a_i real}.    (2)

Two vectors, p and q, in E^n are said to be A-orthogonal or conjugate if p*Aq = 0. A set {p_1, . . . , p_m} of nonzero vectors in E^n is said to be mutually A-orthogonal or mutually conjugate if p_i*Ap_j = 0 for i ≠ j (i, j = 1, . . . , m).
Theorem 1 ([25]). Let S_m be a subspace of E^n, where {p_1, . . . , p_m} is a basis for S_m, 1 ≤ m ≤ n. Further assume that p_1, . . . , p_m is a mutually A-orthogonal set of vectors. Let x_1 be any vector in E^n and let x be in P_m. Then, the following conditions are equivalent:
1. x minimizes F on P_m.
2. F′(x), the gradient of F at x, is orthogonal to the subspace S_m.
3. x = x_1 + a_1 p_1 + . . . + a_m p_m, where a_i = c_i / d_i, c_i = p_i*r_1, d_i = p_i*Ap_i and r_1 = −F′(x_1), i = 1, . . . , m.    (3)
The quantity c_i defined in (3) above is also given by c_i = p_i*r_i, where r_i = −F′(x_i) and x_i = x_1 + a_1 p_1 + . . . + a_{i−1} p_{i−1}, i = 1, . . . , m. Then, there is a unique vector x_0 in the m-dimensional plane P_m through x_1 translating S_m such that x_0 minimizes the function F given by (1) on P_m.
Proof. First, we show that F has at least one minimizing vector in P_m. Let p be any vector in P_m and let M = F(p). Since A is positive definite, there is an R > 0 such that ||x|| > R implies F(x) > M. Hence, F(x) ≤ M implies ||x|| ≤ R. Since C = {x : ||x|| ≤ R} ∩ P_m is a compact set in E^n on which F is continuous, F has at least one minimizing vector p_0 in the compact set C. Outside this compact set C, F assumes only larger values. Thus, p_0 is a minimizing vector for F in P_m.
To show that (1) implies (2), suppose x minimizes F on P_m. Then, for each basis vector p_i, the function t → F(x + t p_i) has a minimum at t = 0, so its derivative F′(x)*p_i vanishes there. So F′(x) is orthogonal to every vector in the basis of S_m and, hence, is orthogonal to S_m.
To show (2) implies (1), suppose that F′(x) is orthogonal to S_m. Let v be any vector in P_m. We show that F(v) > F(x) unless v = x. By Taylor's theorem, since F is quadratic, we have

F(v) = F(x) + F′(x)*(v − x) + (1/2)(v − x)*A(v − x).

Since v − x lies in S_m, the middle term vanishes, and since A is positive definite, the last term is positive unless v = x. Hence, x is a unique absolute minimum for F in P_m. Now we can prove that there is a unique vector x_0 in P_m minimizing F on P_m. Earlier we established that there is at least one minimizing vector p_0 for F in P_m. Since (1) implies (2), F′(p_0) is orthogonal to S_m. From the proof of (2) implies (1), it now follows that p_0 is the unique absolute minimum for F in P_m.
To show that (2) implies (3), let x = x_1 + a_1 p_1 + . . . + a_m p_m, since x is in P_m, and assume that F′(x) is orthogonal to the subspace S_m. We show that a_i = c_i / d_i, where c_i = p_i*r_1, d_i = p_i*Ap_i and r_1 = −F′(x_1). Since F′(x) = F′(x_1) + A(x − x_1) = −r_1 + a_1 Ap_1 + . . . + a_m Ap_m, orthogonality and the conjugacy of the p_j give 0 = p_i*F′(x) = −p_i*r_1 + a_i p_i*Ap_i = −c_i + a_i d_i, so a_i = c_i / d_i. To show that (3) implies (2), one reverses this computation: if a_i = c_i / d_i for each i, then p_i*F′(x) = 0 for each i, so F′(x) is orthogonal to S_m.
Finally, we show that the quantity c_i = p_i*r_1 defined in (3) is also given by c_i = p_i*r_i, where r_i = −F′(x_i) and x_i = x_1 + a_1 p_1 + . . . + a_{i−1} p_{i−1}. Thus,

r_i = −F′(x_1) − A(x_i − x_1) = r_1 − a_1 Ap_1 − . . . − a_{i−1} Ap_{i−1},

and, by conjugacy of {p_1, . . . , p_m}, we have p_i*Ap_j = 0 for j < i. Hence, p_i*r_i = p_i*r_1 = c_i. This completes the proof of the theorem.

A Class of Minimization Algorithms
Now, we shall describe a class of minimization algorithms known as the method of CDs. The significance of the formulas given in (3) of Theorem 1 is indicated below.
Suppose {p_1, . . . , p_m}, 1 ≤ m ≤ n, is a conjugate set of nonzero vectors and that P_m is the m-dimensional plane through x_1 obtained by translating the subspace S_m generated by {p_1, . . . , p_m}. Then, the minimum of F given by (1) on P_m is attained at x_0, which we will call x_{m+1}, where x_{m+1} = x_1 + a_1 p_1 + . . . + a_m p_m; refer to Theorem 1. Now assume that p_{m+1} is a nonzero vector that has been constructed to be conjugate to p_i, i = 1, . . . , m, and let P_{m+1} denote the (m + 1)-dimensional plane through x_1 obtained by translating the subspace S_{m+1} generated by {p_1, . . . , p_m, p_{m+1}}. It turns out that it is not necessary to solve a new (m + 1)-dimensional minimization problem to determine the minimizing vector x_{m+2} on P_{m+1}.
The minimizing vector x_{m+2} on P_{m+1} is obtained by a one-dimensional minimization of F about the vector x_{m+1} in the direction p_{m+1}. This follows directly from the following formulas found in Theorem 1:

x_{m+2} = x_{m+1} + a_{m+1} p_{m+1}, a_{m+1} = c_{m+1} / d_{m+1}, c_{m+1} = p_{m+1}*r_{m+1}, d_{m+1} = p_{m+1}*Ap_{m+1}, r_{m+1} = −F′(x_{m+1}).

Note that a_{m+1} depends upon x_{m+1} and p_{m+1} and explicitly involves no other x or p terms. Thus, the minimizing vector x_{m+1} on P_m results from m consecutive one-dimensional minimizations starting at x_1 and proceeding along the CDs p_1, . . . , p_m successively. The ways of obtaining a mutually conjugate set {p_1, . . . , p_m} of vectors are not specified in general. Thus, the method of CDs is really a class of algorithms, where a specific algorithm depends upon the choice of {p_1, . . . , p_m}. In practice, the vector p_k needed for the (k + 1)th iteration in finding x_{k+1} is usually constructed from information obtained at the kth iteration. The following class of algorithms is referred to as the method of CDs: for k = 1, . . . , n, we find

x_{k+1} = x_k + a_k p_k, a_k = c_k / d_k, c_k = p_k*r_k, d_k = p_k*Ap_k, r_k = −F′(x_k).

Alternatively, c_k may be given by c_k = p_k*r_1. If F′(x_m) = 0 for 1 ≤ m ≤ n, then the algorithm terminates and x_m minimizes F on E^n. Furthermore, any such algorithm terminates in n steps or less since F is quadratic.
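The CD iteration above can be sketched in a few lines. This is a minimal illustration for a quadratic F(x) = (1/2)x*Ax − k*x, assuming a mutually conjugate set of directions is already available (for a diagonal A, the coordinate directions are conjugate); the helper names `dot`, `matvec` and `cd_minimize` are ours, not from the source.

```python
# CD method sketch for F(x) = (1/2) x*Ax - k*x with given conjugate directions.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def cd_minimize(A, k, x1, dirs):
    """m successive one-dimensional minimizations along conjugate directions."""
    x = list(x1)
    for p in dirs:
        r = [ki - gi for ki, gi in zip(k, matvec(A, x))]  # r = k - Ax = -F'(x)
        a = dot(p, r) / dot(p, matvec(A, p))              # a_k = c_k / d_k
        x = [xi + a * pi for xi, pi in zip(x, p)]
    return x

# Diagonal A, so the coordinate directions are A-orthogonal; minimizer is A^{-1}k.
A = [[2.0, 0.0], [0.0, 4.0]]
k = [2.0, 4.0]
x_min = cd_minimize(A, k, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

As Theorem 1 predicts, the n one-dimensional steps land exactly on the minimizer of the quadratic.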
Then, using the special inner product (x, y) = x*Ay, we apply the Gram-Schmidt process to the linearly independent vectors u_1, u_2, . . . , u_n to obtain a set of mutually A-orthogonal vectors p_1, p_2, . . . , p_n, where the property of A-orthogonality is relative to this special inner product, as in Hestenes and Stiefel [5], p. 425.

Results
A brief description of the CG method is given below for the quadratic function F given by (1). The CG method is the CD method described previously, with the first conjugate direction taken in the direction of the negative gradient of F, that is, p_1 = −F′(x_1). The remaining CDs can be determined in a variety of ways; the CG procedure described by Hestenes [10] is given below.

CG-Algorithms for Nonquadratic Approximations
One can apply the CG method to a quadratic approximation F(z) of f near a point x_1 to obtain a minimum of F(z). Let f be a function of n variables; then

F(z) = f(x_1) + ∇f(x_1)*z + (1/2) z*f″(x_1) z.

Assume that the Hessian matrix f″(x_1) is a positive definite symmetric matrix, which implies that F(z) has a unique minimum z_min. Then, applying Newton's method to ∇F(z) = 0, we get z_min = −(f″(x_1))^−1 ∇f(x_1). Remark 1. We solved ∇F(z) = 0 directly to obtain min F(z).
In general, Newton's method is used to solve f(z) = 0 for z. It is given by

z_{n+1} = z_n − J_n^−1 f(z_n),

where z_0 is an initial guess and J_n is the Jacobian matrix of f at z_n. Now, we apply Newton's method, taking f to be ∇F and assuming that F and its second partial derivatives are continuous. So, one can apply Newton's method to ∇F(z) = 0, with z_1 = 0 as the initial point, to obtain the minimum point z_min of F, where

z_{n+1} = z_n − J_n^−1 ∇F(z_n) = z_n − (f″(x_1))^−1 ∇F(z_n).

For convenience in exposition, we include formulas below from Hestenes [10], pp. 136-137 and pp. 199-202.
Here, the first step of Newton's method applied to ∇F(z) = 0 yields z_2, which also turns out to be the unique minimum of F(z) (a quadratic function with positive definite symmetric Hessian), i.e., z_2 satisfies ∇F(z_2) = 0. Therefore, Newton's method terminates in one iteration [10].
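The one-step termination is easy to check numerically. Below is a small sketch on a 2-D quadratic F(z) = (1/2)z*Az − k*z: a single Newton step from z_1 = 0 solves A d = ∇F(z_1) and lands on the point where the gradient vanishes. The helper names (`grad_F`, `solve2x2`) are illustrative, not from the source.

```python
# One Newton step z_2 = z_1 - A^{-1} grad F(z_1) minimizes a quadratic exactly.

def grad_F(A, k, z):
    # grad F(z) = Az - k for F(z) = (1/2) z*Az - k*z
    return [sum(aij * zj for aij, zj in zip(row, z)) - ki
            for row, ki in zip(A, k)]

def solve2x2(A, b):
    # Cramer's rule for the 2x2 Newton system A d = b
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

A = [[2.0, 1.0], [1.0, 3.0]]   # positive definite symmetric
k = [3.0, 4.0]
z1 = [0.0, 0.0]
d = solve2x2(A, grad_F(A, k, z1))        # Newton step solves A d = grad F(z1)
z2 = [zi - di for zi, di in zip(z1, d)]  # z2 satisfies grad F(z2) = 0
```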
The initial formulas for b_k and c_k given in Algorithm 1 imply the basic CG relations p_k*r_{k+1} = 0, s_k*p_{k+1} = 0.

end for
Step 6: When k = n, take the next estimate of the minimum point x_0 of f to be the point x̄_1 = x_1 + z_{n+1}.
The CG cycle in Step 1 can terminate prematurely at the mth step if r_{m+1} = 0. In this case, we replace x_1 by x̄_1 = x_1 + z_{m+1} and restart the algorithm.
If we take A = f″(x_1), where A is positive definite symmetric, then we obtain a formula for the inverse of f″(x_1).
Since Step 2 implies that s_k = f″(x_1) p_k, we obtain a difference quotient by rewriting the vector s_k in Algorithm 1 (see Hestenes [10]). Therefore, without computing the second derivative, we find, for small σ > 0,

s_k ≈ [∇f(x_1 + σ p_k) − ∇f(x_1)] / σ.

In view of the development of Algorithms 1 and 2, each cycle of n steps is clearly comparable to one Newton step.
Thus, we may replace c_k = p_k*r_k by c_k = p_k*r_1, since Theorem 1 gives the relation p_k*r_k = p_k*r_1.

Algorithm 2 CG algorithm without derivatives
Step 1: Initially select x_1 and choose a positive constant σ. Set z_1 = 0, r_1 = −∇f(x_1), p_1 = r_1. for k = 1, . . . , n do perform the following iteration:
Step 2: s_k = [∇f(x_1 + σ p_k) − ∇f(x_1)] / σ,
Step 3: a_k = c_k / d_k, d_k = p_k*s_k, c_k = p_k*r_k,
Step 4: z_{k+1} = z_k + a_k p_k, r_{k+1} = r_k − a_k s_k,
Step 5: p_{k+1} = r_{k+1} + b_k p_k, b_k = −(s_k*r_{k+1}) / d_k,

end for
Step 6: When k = n, then x̄_1 = x_1 + z_{n+1} is the next estimate of the minimum point x_0 of f. Accept x̄_1 as the final estimate of x_0 if |∇f(x̄_1)| is sufficiently small. Otherwise, reset x_1 = x̄_1 and repeat the CG cycle (Step 1)-(Step 5).
The new initial point x̄_1 = x_1 + z_{n+1} generated by one cycle of the modified Algorithm 2 is, therefore, given by a Newton-type formula, x̄_1 ≈ x_1 − (f″(x_1))^−1 ∇f(x_1). The above algorithm approximates Newton's method and has Newton's method as its limit as σ → 0. Therefore, Algorithm 2 will have nearly identical convergence features to Newton's algorithm if σ is replaced by σ² at the end of each cycle.
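One cycle of this scheme can be sketched as follows. This is our reading of Algorithm 2, a sketch rather than a faithful implementation: gradients are themselves approximated by central difference quotients of f (so no derivatives are evaluated), s_k uses the forward quotient above, and the Step 5 formula for b_k is the one implied by the relation s_k*p_{k+1} = 0. All helper names are ours.

```python
# One cycle of a derivative-free CG sketch: on a quadratic, the cycle should
# recover the exact minimizer up to difference-quotient error.

def approx_grad(f, x, h=1e-6):
    """Central-difference estimate of grad f(x)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def cg_cycle_no_derivatives(f, x1, sigma=1e-4):
    n = len(x1)
    g1 = approx_grad(f, x1)
    r = [-gi for gi in g1]          # r_1 = -grad f(x1)
    p = list(r)                     # p_1 = r_1
    z = [0.0] * n                   # z_1 = 0
    for _ in range(n):
        xs = [xi + sigma * pi for xi, pi in zip(x1, p)]
        s = [(gi - hi) / sigma for gi, hi in zip(approx_grad(f, xs), g1)]
        d = sum(pi * si for pi, si in zip(p, s))
        c = sum(pi * ri for pi, ri in zip(p, r))
        a = c / d                                        # a_k = c_k / d_k
        z = [zi + a * pi for zi, pi in zip(z, p)]        # z_{k+1}
        r = [ri - a * si for ri, si in zip(r, s)]        # r_{k+1}
        b = -sum(si * ri for si, ri in zip(s, r)) / d    # keeps s_k* p_{k+1} = 0
        p = [ri + b * pi for ri, pi in zip(r, p)]        # p_{k+1}
    return [xi + zi for xi, zi in zip(x1, z)]            # x1_bar = x1 + z_{n+1}

f = lambda v: (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 0.5) ** 2  # test quadratic
x_bar = cg_cycle_no_derivatives(f, [3.0, 2.0])             # near (1, -0.5)
```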

Conjugate Gram-Schmidt (CGS)-Algorithms for Nonquadratic Functions
With an appropriate initial point x_1, we can derive the algorithm that is described by Hestenes [10] on p. 135, which relates Newton's method to a CGS algorithm. Since [10]

lim_{σ→0} [∇f(x_1 + σ p) − ∇f(x_1)] / σ = f″(x_1) p,    (11)

we can approximate the vector f″(x_1) p_k by the vector [∇f(x_1 + σ_k p_k) − ∇f(x_1)] / σ_k with a small value of σ_k. Then, we obtain the following modification of Newton's algorithm, the CGS algorithm (see Hestenes [10]): in Step 2 of Algorithm 3, substitute s_k with this difference quotient and repeat the CGS algorithm. In the limit, we obtain Newton's algorithm.

end for
Step 7: When z_{n+1} has been computed, the cycle is terminated. Then choose x̄_1 as the final estimate if |∇f(x̄_1)| is sufficiently small. Otherwise, reset x_1 = x̄_1 and repeat the CGS cycle (Step 1)-(Step 6).
In view of (11), for small σ > 0, the CGS Algorithm 3 is a good approximation of Newton's algorithm, with Newton's algorithm as its limit as σ → 0.
A simple modification of Algorithm 3 is obtained by replacing the following formulas in Step 2 and Step 3, as described in Hestenes [10].
A CGS algorithm for nonquadratic functions is obtained from the relations above, in which the derivative quantities are replaced by difference-quotient ratios that converge to them as σ → 0 for any nonzero vector p and any given vector u ≠ 0. The details are as follows. Suppose p_1, p_2, . . . , p_n is an A-orthogonal basis spanning the same vector space as that spanned by the linearly independent vectors u_1, u_2, . . . , u_n. The inner product (x, y) is defined by x*Ay, where A is a positive definite symmetric matrix. Then, the Gram-Schmidt process works as follows:

p_1 = u_1, p_{k+1} = u_{k+1} − Σ_{i=1}^{k} [(u_{k+1}, p_i) / (p_i, p_i)] p_i, k = 1, . . . , n − 1.
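The Gram-Schmidt step under this special inner product can be sketched directly. The helper names `a_inner` and `gram_schmidt_A` are illustrative; the point is only that projecting with (x, y) = x*Ay produces mutually A-orthogonal (conjugate) vectors.

```python
# Gram-Schmidt under the inner product (x, y) = x*Ay, for a fixed positive
# definite symmetric A: turns independent u_1,...,u_n into conjugate p_1,...,p_n.

def a_inner(A, x, y):
    Ay = [sum(aij * yj for aij, yj in zip(row, y)) for row in A]
    return sum(xi * v for xi, v in zip(x, Ay))

def gram_schmidt_A(A, us):
    ps = []
    for u in us:
        p = list(u)
        for q in ps:
            coef = a_inner(A, u, q) / a_inner(A, q, q)
            p = [pi - coef * qi for pi, qi in zip(p, q)]  # subtract A-projection
        ps.append(p)
    return ps

A = [[2.0, 1.0], [1.0, 3.0]]
p1, p2 = gram_schmidt_A(A, [[1.0, 0.0], [0.0, 1.0]])
# p1 and p2 are conjugate: p1* A p2 = 0
```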
Important concepts in the study of iterative processes are the following: (a) when the iterations converge; and (b) how fast the convergence is. We introduce the idea of rates of convergence, as described by Ortega and Rheinboldt [14].

Rates of Convergence
A precise formulation of the asymptotic rate of convergence of a sequence x_k converging to x* is motivated by the fact that estimates of the form

||x_{k+1} − x*|| ≤ C ||x_k − x*||^p    (13)

for all k = 1, 2, . . ., often arise naturally in the study of certain iterative processes.
Definition 1. Let x_k be a sequence of points in R^n that converges to a point x*. Let 1 ≤ p < ∞. Ortega and Rheinboldt [14] define the quantities

Q_p{x_k} = limsup_{k→∞} ||x_{k+1} − x*|| / ||x_k − x*||^p

and refer to them as quotient convergence factors, or Q-factors for short.

Definition 2. Let C(I, x*) denote the set of all sequences having a limit of x* that are generated by an iterative process I. Then the quantities

Q_p(I, x*) = sup{ Q_p{x_k} : {x_k} ∈ C(I, x*) }

are the Q-factors of I at x* with respect to the norm in which the Q_p{x_k} are computed.
Note that if Q_p{x_k} < +∞ for some p where 1 ≤ p < ∞, then, for any ε > 0, there is some positive integer K such that (13) above holds for C = Q_p{x_k} + ε and all k ≥ K. If 0 < Q_p{x_k} < ∞, then we say that x_k converges to x* with Q-order of convergence p, and if Q_p{x_k} = 0, for some fixed p satisfying 1 ≤ p < ∞, then we say that x_k has superconvergence of Q-order p to x*. For example, if 0 < Q_1{x_k} < 1, so that we may take 0 < C < 1 in (13), we say that {x_k} converges to x* linearly. In addition, if Q_1{x_k} = 0, we say that {x_k} converges to x* superlinearly. Definition 3. One other method of describing convergence rate involves the root convergence factors; see [14].
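The Q-factor can be estimated numerically from the tail of a sequence. As a small sketch (our own example, not from the source), Newton's iteration for the root of x² − 2 converges with Q-order 2, and the successive ratios e_{k+1}/e_k² approach the asymptotic constant f″(x*)/(2f′(x*)) = 1/(2√2).

```python
# Estimating the quotient convergence factor Q_2{x_k} for Newton's iteration
# on x^2 - 2 = 0, which converges quadratically to sqrt(2).

def newton_sqrt2(x0, steps):
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - (x * x - 2.0) / (2.0 * x))  # Newton step
    return xs

xs = newton_sqrt2(1.5, 5)
x_star = 2.0 ** 0.5
errs = [abs(x - x_star) for x in xs]
# ratios e_{k+1} / e_k^2 estimate the Q_2 asymptotic constant
q2 = [errs[k + 1] / errs[k] ** 2 for k in range(3)]
```

The later ratios in `q2` settle near 1/(2√2) ≈ 0.354, the asymptotic constant for this iteration.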

Acceleration
One acceleration procedure is the following: first, apply n CD steps to an initial point x_1 to obtain a point x_{n+1} = y_1; then, take y_1 to be a new initial point and apply n CD steps again to obtain another point y_2; finally, check for acceleration by evaluating Q = F(y_2 + (y_2 − y_1)). If Q < F(y_2), then we accelerate by taking y_2 + (y_2 − y_1) as our initial point; if Q > F(y_2), then we take y_2 as the new initial point. After two more applications of the CD method, we check for acceleration again. The procedure continues in this manner [25].
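The acceptance test in this procedure can be sketched in a few lines. We assume the usual forward extrapolation along y_2 − y_1 (the sign in the source is garbled); the function name `accelerate` is ours.

```python
# Acceleration check: after two n-step CD cycles produce y1 and y2, try the
# extrapolated point y2 + (y2 - y1) and keep it only if F improves.

def accelerate(F, y1, y2):
    trial = [b + (b - a) for a, b in zip(y1, y2)]  # y2 + (y2 - y1)
    return trial if F(trial) < F(y2) else y2

# Example with F(v) = (v_0 - 1)^2: y1 = [0.5], y2 = [0.8] extrapolates to
# [1.1], whose F-value beats F(y2), so the accelerated point is accepted.
F = lambda v: (v[0] - 1.0) ** 2
accepted = accelerate(F, [0.5], [0.8])
```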

Rosenbrock's Banana Valley Function
We carry out the following computations for Rosenbrock's banana valley function (n = 2). This function possesses a steep-sided valley that is nearly parabolic in shape. First, we determine values in the domain of Rosenbrock's function for which its Hessian matrix is positive definite symmetric. Rosenbrock's banana valley function is the non-negative function

f(x, y) = 100(y − x²)² + (1 − x)²,

and its Hessian matrix is

f″(x, y) = [ 1200x² − 400y + 2    −400x
                    −400x                200 ].

Therefore, the Hessian matrix is positive definite symmetric if and only if Sylvester's criterion holds: 1200x² − 400y + 2 > 0 and det f″(x, y) = 80000(x² − y) + 400 > 0. The first condition, 1200x² + 2 > 400y, is equivalent to y < 3x² + 1/200, and the second to y < x² + 1/200. So, the Hessian matrix is positive definite symmetric if and only if y < x² + 1/200. Figure 1 shows the maximal convex level set on which the Hessian is positive definite symmetric in the interior for Rosenbrock's banana valley function.
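The region y < x² + 1/200 can be verified numerically with Sylvester's criterion. The helper names below are ours; the Hessian entries are the ones derived above.

```python
# Sylvester's criterion check for the Hessian of Rosenbrock's function
# f(x, y) = 100(y - x^2)^2 + (1 - x)^2.

def rosenbrock_hessian(x, y):
    return [[1200.0 * x * x - 400.0 * y + 2.0, -400.0 * x],
            [-400.0 * x, 200.0]]

def is_positive_definite_2x2(H):
    # both leading principal minors must be positive
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return H[0][0] > 0 and det > 0

inside = is_positive_definite_2x2(rosenbrock_hessian(1.0, 1.0))    # 1.0 < 1.005
outside = is_positive_definite_2x2(rosenbrock_hessian(0.0, 0.01))  # 0.01 > 0.005
```

The point (1, 1), the minimizer, satisfies y < x² + 1/200 and lies in the positive definite region, while (0, 0.01) does not.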
Minimizing this function is equivalent to solving the nonlinear system of equations ∇f = 0. Therefore, for the initial point (0.98, 0.32), we obtain the minimum point at (0.992779, 0.306440) [25].

Numerical Computation
The goal of this numerical computation is to provide a system of iterative approaches for finding extreme points [10]. A significant point is that a Newton step can be replaced by a CD sequence of n linear minimizations in n appropriately chosen directions.
It is important to keep in mind that a function acts like a quadratic function in the neighborhood of a nondegenerate minimum point. Conjugacy can be thought of as a generalization of the concept of orthogonality: conjugate direction methods substitute conjugate bases for orthogonal bases in the underlying structure. The formulas for determining the minimum point of a quadratic function can be reduced to their simplest forms by following the CD technique.
The conjugate direction algorithms for minimizing a quadratic function that are discussed in the current work were initially presented in Hestenes and Stiefel, 1952 [5]. Davidon [3] and Fletcher and Powell [4] are best known for the modifications and additions that they made to these methods, although numerous other authors contributed changes as well.
The iterative methods described above apply to many problems. They are used in least squares fitting, in solving linear and nonlinear systems of equations and in optimization problems with and without constraints [25]. The computing performances and numerical results of these techniques and comparisons have received a significant amount of attention in recent years. This interest has been focused on the solving of unconstrained optimization problems and large-scale applications [19,27].
The final estimate of (x_0, y_0) has more than 150-digit accuracy. For σ = 0.1 × 10^−120, ρ = 0.2 × 10^−120, ε = 0.1 × 10^−60 and the initial values, we obtained the reported computations for Rosenbrock's function f using the Gram-Schmidt conjugate direction method without derivatives (the CGS method, no derivatives) and Newton's method applied to ∇f = 0 (see [28]). For additional information regarding the programming, please refer to the supplementary material.

Differential Equations of Steepest Descent
The following equations are known as the differential equations of steepest descent:

x′(t) = −∇f(x(t))    (14)

and

x′(t) = −∇f(x(t)) / ||∇f(x(t))||².    (15)

The solution to either differential equation of steepest descent with initial condition x_1(0) = −1.2, x_2(0) = 1.0 is shown in Figure 2; one can refer to Equation (10), p. 783, in Eells [17]. For Equation (14), the solution will not reach the minimum for finite values of t. For Equation (15), the solution will approach the minimum, but the right-hand side will blow up at the minimum.
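The slow approach of the gradient flow (14) along the valley can be seen with a crude forward-Euler integration, started at (−1.2, 1.0) as in the figure. This is a sketch, not the authors' computation; the tiny step size is forced by the stiffness of the steep-sided valley, and after t = 2 the trajectory has reached the valley floor but is still well short of the minimizer (1, 1).

```python
# Forward-Euler integration of the steepest-descent ODE x'(t) = -grad f(x(t))
# for Rosenbrock's function f(x, y) = 100(y - x^2)^2 + (1 - x)^2.

def grad_rosenbrock(x, y):
    return [-400.0 * x * (y - x * x) - 2.0 * (1.0 - x),  # df/dx
            200.0 * (y - x * x)]                          # df/dy

def euler_descent(x0, dt=1e-4, steps=20000):
    x = list(x0)
    for _ in range(steps):
        g = grad_rosenbrock(x[0], x[1])
        x = [x[0] - dt * g[0], x[1] - dt * g[1]]
    return x

x_T = euler_descent([-1.2, 1.0])  # state at t = dt * steps = 2.0
```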
From a numerical point of view, the differential equation approach has to be used with caution. Rosenbrock [18] pointed out that the iterative method of steepest descent with line searches was not effective in steep valleys. The iterative method was introduced by Cauchy [16].
In summary, the method of steepest descent is not effective and does not compare with Hestenes' CGS method with no derivatives, which is almost numerically equivalent to Newton's method applied to grad( f ) = 0, where f is the function to be minimized.
Below are level curves of Rosenbrock's banana valley function. We used this function to compare Hestenes' CGS method, Newton's method and the steepest descent methods. In Figure 2, the level curves of Rosenbrock's banana valley function show that the minimizer is at (1, 1). Level curves are plotted for the function values 4.0, 4.1, 4.25 and 4.5 in Figure 3. For steepest descent, both the iterative method and the ODE approach are illustrated. The curve y = x² appears to parallel the valley floor in the graph.
We use the CGS method for computation. Rosenbrock's banana valley function attains its minimum at the point (1, 1). This example provided us with the geometric illustrations in Figure 2. For the specific algorithms, please refer to Section 3 for the Gram-Schmidt conjugate direction method and Newton's method, in order to compare the two methods alongside one another.
The outcomes of the numerical experiments performed on the standard test function using the CGS method are reported above. Based on these data, it is clear that this particular implementation of the CGS method is quite effective.

Conclusions
In this paper, we introduced a class of CD algorithms that, for small values of n, provided effective minimization methods. As n grew, however, the algorithms became more and more costly to run.
The computer program above showed that the CGS algorithm without derivatives could reproduce Newton's method. Since the Hessian matrix of Rosenbrock's function is positive definite symmetric on the region where Sylvester's criterion holds, the CGS method converged whenever we began anywhere in the closed convex set near a minimum.
Using quotient convergence factors, one can see that for Rosenbrock's function one sequence converged quadratically. In particular, the numerical computation on p. 21 revealed that the asymptotic constant oscillated between 0.20000 and 0.00307, so the quotient convergence factor of Ortega and Rheinboldt [14] was approximately Q_2{x_k} = 0.200002, which indicated quadratic convergence. The results agreed for Newton's method.
Moreover, the CGS algorithm uses function evaluations and difference quotients for gradient and Hessian evaluations; it requires neither accurate gradient evaluations nor exact function minimizations. This approach is the most efficient algorithm discussed in this study; yet, it is extremely sensitive to both the choice of σ used for the difference quotients and the choice of ρ used for scaling.
The Gram-Schmidt conjugate direction method without derivatives has been used quite successfully in a variety of applications, including radar designs by Norman Olsen [27] in developing corporate feed systems for antennas and aperture distributions for antenna arrays. He tuned the parameters σ and ρ in our GSCD computer programs to obtain successful radar designs.