A Family of Hybrid Stochastic Conjugate Gradient Algorithms for Local and Global Minimization Problems

This paper consists of two main parts, Part I and Part II, which address the local and global minimization problems, respectively. In Part I, a new conjugate gradient (CG) technique is proposed and combined with a line-search technique to obtain a globally convergent algorithm. Finite-difference approximations are used to compute approximate values of the first derivative of the function f. The convergence analysis of the proposed method is established. Comparisons between the performance of the new CG method and that of four other CG methods demonstrate that the proposed CG method is promising and competitive for finding a local optimum point. In Part II, three formulas are designed by which a set of candidate solutions is generated. This set of random formulas is hybridized with the globally convergent CG algorithm to obtain a hybrid stochastic conjugate gradient algorithm, denoted HSSZH. The HSSZH algorithm finds an approximate value of the global solution of a global optimization problem. Five hybrid stochastic conjugate gradient algorithms are constructed, and performance profiles are used to assess and compare the performance of this family of algorithms. The comparison between our proposed HSSZH algorithm and four other hybrid stochastic conjugate gradient techniques demonstrates that the suggested HSSZH method is competitive with, and in all cases superior to, the four other algorithms in terms of efficiency, reliability and effectiveness in finding an approximate solution of a global optimization problem containing a non-convex function.


Introduction
The major goal of this paper is to find the local and global minima of convex and non-convex functions. The local and global minimization problems are defined as follows.
Definition 1. A local minimizer x_lo ∈ S of the function f : S → R is a point with f(x_lo) ≤ f(x) for all x neighboring x_lo. If S ⊆ R^n, the problem is formulated as

min_{x ∈ S} f(x), with x restricted to a neighborhood of x_lo. (1)

The point x_gl ∈ S is called a global minimizer of the function f : S → R if f(x_gl) ≤ f(x) for all x ∈ S. When S ⊆ R^n, the problem can be formulated as

min_{x ∈ S} f(x). (2)

In both problems, S ⊆ R^n is the region over which we seek the minimizer of f(x), and f(x) is continuously differentiable.
Global optimization (GO) attempts to find an approximate solution of the objective function as stated in Problem (2).
However, this task can be difficult, since the knowledge about f is usually only local. On the other hand, the fastest algorithms, the local optimization (LO) methods, settle for a local point, since these algorithms are not capable of finding the global solution at each run.
The bottom line is that the core difference between the GO methods and the LO algorithms is as follows: the GO methods focus on solving Problem (2) over the given set, while the task of the LO methods is to solve (1). Consequently, solving Problem (1) is relatively simple by using deterministic (classical) local optimization methods. On the contrary, finding the global optimum of Problem (2) is an NP-hard problem.
Recently, many optimization algorithms have been proposed to deal with these problems. The ideas behind those methods rely on meta-heuristic (random search) strategies.
There are different classifications for meta-heuristic methods [12]. Mohamed et al. [7] presented a brief description of these classifications. In random algorithms, the minimization technique relies partly on probability.
In contrast, deterministic algorithms use no random element. Hence, deterministic techniques need an exhaustive examination of the search domain of the function f to find an approximate solution to Problem (2) at each run; otherwise, they fail at this task.
Therefore, for random techniques, finding an approximate solution to Problem (2) can be guaranteed by asymptotic convergence-in-probability results. See [13-15].
The most popular deterministic method is the CG method [18]. CG methods are widely used to find the local minimizer in Problem (1) [21].
However, CG algorithms have a numerical weakness: progress in subsequent iterations can be slow if a small step is generated far from the local point. Hence, to address this issue, a line-search technique is combined with the CG technique to create a globally convergent algorithm [22,23].
The CG method is an efficient and inexpensive technique for dealing with Problem (1). The CG method is an iterative algorithm; candidate solutions are generated by the recursive formula

x_{k+1} = x_k + α_k d_k, (3)

where the step size α_k > 0 and the directions d_k are generated by

d_0 = −g_0, d_{k+1} = −g_{k+1} + β_k d_k, (4)

where g_k denotes the gradient vector of the function f at the point x_k. Several versions of the CG method have been suggested; the core difference between these CG algorithms lies in the choice of the parameter β_k [18,27-29]. The main features of the CG method are as follows: it has low memory requirements, it is strongly local, and it has global convergence properties [30].
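As an illustration only (the paper's experiments use MATLAB; this Python sketch is not the authors' code), the recursion (3)-(4) with a simple Armijo backtracking line search and the Fletcher-Reeves choice of β_k might look as follows. The function names, tolerances and the backtracking cap are assumptions:

```python
import numpy as np

def backtracking(f, x, d, g, alpha=1.0, delta=1e-4, shrink=0.5):
    # Armijo backtracking: shrink alpha until sufficient decrease holds
    # (capped so a non-descent direction cannot loop forever).
    for _ in range(60):
        if f(x + alpha * d) <= f(x) + delta * alpha * (g @ d):
            break
        alpha *= shrink
    return alpha

def cg_fr(f, grad, x0, tol=1e-8, max_iter=1000):
    """One possible realization of (3)-(4): x_{k+1} = x_k + alpha_k d_k,
    d_{k+1} = -g_{k+1} + beta_k d_k, with the Fletcher-Reeves beta_k."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                # d_0 = -g_0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = backtracking(f, x, d, g)
        x = x + alpha * d                 # x_{k+1} = x_k + alpha_k d_k
        g_new = grad(x)
        beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves beta_k
        d = -g_new + beta * d             # d_{k+1} = -g_{k+1} + beta_k d_k
        g = g_new
    return x
```

For general nonlinear f, an inexact Wolfe-type line search is usually preferred over plain backtracking, as discussed later in this section.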
Many authors presented several studies to analyze the CG method; see, for example, Refs. [31,32].
In 1964, the authors of [33] applied the CG method to nonlinear problems, and they proposed the following parameter:

β_k^FR = ||g_{k+1}||² / ||g_k||². (5)
The authors of [34,35] established the global convergence of the scheme defined in (5); they used an exact line search and an inexact line search respectively.
However, the author of [36] showed that there are cases in which the method jams; these jamming occurrences happen when the search directions d_k are almost orthogonal to the gradient vector g_k [18].
The authors of [37,38] presented a modification of the parameter β_k^FR to treat the jamming phenomenon noted in [36]. Hence, they proposed the following parameter:

β_k^PRP = g_{k+1}^T y_k / ||g_k||², (6)
where y_k = g_{k+1} − g_k. When jamming occurs, g_{k+1} ≈ g_k, so β_k^PRP ≈ 0 and d_{k+1} ≈ −g_{k+1}; i.e., when jamming happens, the search direction is no longer nearly orthogonal to the gradient vector, but aligned with −g_{k+1}. This built-in restart feature of the β_k^PRP parameter usually yields faster convergence than the β_k^FR parameter [18]. The authors of [39] proposed an approach closely related to β_k^PRP, defined as follows:
β_k^HS = y_k^T g_{k+1} / (d_k^T y_k) (7)

in the case that the step size α_k is found by an exact line search algorithm. Under an exact line search, g_{k+1}^T d_k = 0; hence, by (4), d_k^T y_k = −g_k^T d_k = ||g_k||², and therefore β_k^HS = β_k^PRP when the step size α_k is calculated by an exact line search method. Other fundamental one-term formulas for the parameter β_k are listed as follows.
Formula (10), the choice β_k^DY = ||g_{k+1}||² / (d_k^T y_k), was proposed by Dai and Yuan [41]. It is noteworthy that when f is quadratic and the step size α_k is selected to minimize f along d_k, the choices of the parameter β_k mentioned above coincide; for a generic nonlinear function, they differ.
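For concreteness, the classical one-term choices of β_k mentioned above (FR, PRP, HS and DY) can be computed from g_k, g_{k+1} and d_k as in the following Python sketch (illustrative only, not the paper's code):

```python
import numpy as np

def betas(g_old, g_new, d):
    """Classical one-term choices of beta_k, computed from g_k (g_old),
    g_{k+1} (g_new) and the current direction d_k (d)."""
    y = g_new - g_old                                # y_k = g_{k+1} - g_k
    return {
        "FR":  (g_new @ g_new) / (g_old @ g_old),    # Fletcher-Reeves
        "PRP": (g_new @ y) / (g_old @ g_old),        # Polak-Ribiere-Polyak
        "HS":  (g_new @ y) / (d @ y),                # Hestenes-Stiefel
        "DY":  (g_new @ g_new) / (d @ y),            # Dai-Yuan
    }
```

Note how the PRP and HS numerators vanish when g_{k+1} ≈ g_k, which is exactly the restart behavior discussed above.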
Two further approaches, whose modification yields the new CG method of Section 2, are

β_k^HZ = ( y_k^T g_{k+1} (d_k^T y_k) − 2 ||y_k||² (d_k^T g_{k+1}) ) / (d_k^T y_k)² (11)

and

β_k^MHZ = ( y_k^T g_{k+1} (d_k^T y_k) − 2 ||y_k||² (d_k^T g_{k+1}) ) / max{ σ ||y_k||² ||d_k||², (d_k^T y_k)² }, (12)

where σ > 0.5 is a constant. Formula (12) was proposed by [49]. The denominator (d_k^T y_k)² in β_k^HZ is modified to max{σ ||y_k||² ||d_k||², (d_k^T y_k)²} in β_k^MHZ. This procedure may help d_k stay in a trust region automatically at each iteration [49]. Furthermore, when σ ||y_k||² ||d_k||² < (d_k^T y_k)², β_k^MHZ reduces to β_k^HZ with α_k calculated to satisfy an inexact line search. Moreover, β_k^HZ reduces to β_k^HS under an exact line search.
Consequently, by using a line-search method, the CG method can satisfy the following sufficient descent condition:

g_k^T d_k ≤ −C ||g_k||², (13)

where C > 0 is a constant. The sufficient descent condition (13) plays a core role in the convergence analysis of these algorithms. See [17,30-32,35,41,49,51,52].
However, the CG method has a numerical obstacle: progress in subsequent iterations can be slow if a small step is generated far from the intended point [49].
Recently, the authors of [48,49] proved that a CG algorithm possesses strong convergence features if it satisfies the trust-region property

||d_k|| ≤ C_v ||g_k||, (14)

where C_v > 0 is a constant. The trust-region property thus keeps the search direction d_k bounded within the trust radius [49]. Numerous researchers have proposed CG algorithms with excellent results and strong convergence properties. See [30,48,49,51]. The selection of the right step size α_k helps the CG algorithms achieve global convergence.
The exact line search is defined as follows:

f(x_k + α_k d_k) = min_{α ≥ 0} f(x_k + α d_k). (15)

Clearly, in large-scale problems, the exact line search (15) cannot be used, and many inexact techniques exist for this task. The weak Wolfe-Powell (WWP) line search, for example, is a popular and widely used technique. The WWP technique is designed to find a step size α_k satisfying the inequalities

f(x_k + α_k d_k) ≤ f(x_k) + δ α_k g_k^T d_k (16)

and

g(x_k + α_k d_k)^T d_k ≥ σ g_k^T d_k, (17)

where δ ∈ (0, 0.5) and σ ∈ (δ, 1) are constants. Inequality (16) is named the Armijo condition, and the WWP line search becomes the strong Wolfe-Powell (SWP) line search by replacing Inequality (17) with

|g(x_k + α_k d_k)^T d_k| ≤ σ |g_k^T d_k|. (18)

Generally, under the WWP line search, the convergence analysis assumes that the gradient g(x) is Lipschitz continuous; that is,

||g(x) − g(y)|| ≤ L ||x − y||, (19)

with L a constant, for all x, y ∈ R^n. In fact, the CG technique with line-search methods has proven notable in solving the local optimization problem [18,27,28]. However, when applied to Problem (2), the CG method fails at this task per run because it becomes trapped at a local point. To prevent sticking at a local point, random parameters are used [53].
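The WWP and SWP tests (16)-(18) are simple to state in code. The Python sketch below is illustrative only; the parameter defaults δ = 10⁻⁴ and σ = 0.9 are common choices, not taken from the paper. It checks whether a trial step size α satisfies the conditions:

```python
import numpy as np

def wolfe_conditions(f, grad, x, d, alpha, delta=1e-4, sigma=0.9):
    """Check the weak (WWP) and strong (SWP) Wolfe-Powell conditions
    for a trial step size alpha along direction d."""
    g0_d = grad(x) @ d
    x_new = x + alpha * d
    armijo = f(x_new) <= f(x) + delta * alpha * g0_d     # condition (16)
    g1_d = grad(x_new) @ d
    weak = armijo and g1_d >= sigma * g0_d               # condition (17)
    strong = armijo and abs(g1_d) <= sigma * abs(g0_d)   # condition (18)
    return weak, strong
```

A practical line search would bracket and bisect until these booleans become true; here only the test itself is shown.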
We can summarize the essence of the above discussions as follows.
Recently, many approaches have been proposed to improve the performance of deterministic methods, such as CG methods, gradient descent methods and Newton methods. These new approaches are designed to deal with local optimization problems. See, for example, Refs. [16-20].
On the other hand, a plentiful number of stochastic approaches have been suggested to deal with global optimization problems. See, for example, Refs. [1,2,4,5,7,54].
Therefore, to gain the features of both deterministic and stochastic methods, many studies have presented ideas and suggestions for combining deterministic and stochastic techniques into a new technique that is efficient and effective in solving Problem (2). Numerical outcomes demonstrate that this hybridization of classical and stochastic techniques has been hugely successful. See [55-59].
This work focuses on solving the local and global minimization problems. The first part of this study deals with Problem (1) by suggesting a new modified CG method, while the second part of this paper presents a new random approach comprising three formulae by which candidate solutions are generated randomly.
Therefore, the proposed stochastic approach is combined with the modified CG method proposed in the first part of this paper to obtain a new hybrid stochastic conjugate gradient algorithm that solves Problem (2). The new hybrid stochastic conjugate gradient algorithm has four formulae by which candidate solutions are created: one is purely deterministic, the second mixes deterministic and stochastic parameters, and the other two contain randomly generated parameters. The bottom line is that the main merit that makes the new hybrid algorithm capable of finding an approximate global minimum of a non-convex function comes from this hybridization of random and non-random parameters.
Consequently, the contribution of this paper is divided into two parts, and the remainder of the study is arranged as follows. Part I contains the following sections: Section 2 presents the new modified CG-SHZ technique with its convergence analysis.
In Section 3, the approximate value of the gradient vector is calculated by using the numerical differentiation. Section 4 presents the numerical investigations of the local minimization problem. Part II contains the following sections: Section 5 presents a random approach for unconstrained global optimization. Section 6 presents the hybridization of the conjugate gradient method with stochastic parameters. The numerical experiments of Problem (2) are presented in Section 7. Some concluding remarks are given in Section 8.

Part I: Local Minimization Problem
In this part, a new modified CG technique is presented; its convergence analysis is established; the numerical differentiation approach is used to calculate approximate values of the first derivative; five algorithms are designed to solve Problem (1); and their numerical experiments are analyzed using performance profiles.

Suggested CG Method
Recently, the authors of [49] suggested the new MHZ-CG method, building on the study proposed by the authors of [30]. The MHZ method possesses the sufficient descent and trust-region features independently of any line-search technique. The parameter of the MHZ method is defined by (12).
Therefore, the story in this section begins with the authors of [30], who proposed the CG-HZ method, whose parameter is defined by (11). The parameter β_k^HZ can ensure that d_k satisfies the inequality

g_k^T d_k ≤ −(7/8) ||g_k||², (20)

which is proved in [30]. If the step size α_k is calculated by an exact line search, then β_k^HZ reduces to the β_k^HS proposed by [39], because g_{k+1}^T d_k = 0 holds [49]. Hence, to obtain global convergence for a general function, Hager and Zhang [30] dynamically adjusted the lower bound of β_k^HZ by β_k^{HZ+} = max{β_k^HZ, r_k}, with r_k = −1 / (||d_{k−1}|| min{r, ||g_{k−1}||}), where r > 0 is a constant. Many researchers have suggested modifications and refinements to improve the performance of the CG-HZ algorithm. The latest version of the CG-HZ method was offered by [49]: Yuan et al. [49] presented some modifications to the HZ-CG method, and the result was the new CG-MHZ algorithm.
The CG-MHZ algorithm possesses a sufficient descent condition and the trust-region feature. The search direction of the MHZ-CG technique is designed as follows: where β_k^MHZ is defined by (12). In this paper, the MHZ method is extended and modified to obtain a new proposed method, called the SHZ method, such that the SHZ method possesses a sufficient descent condition and the trust-region feature. This method is defined as follows: where ϑ = max{ρ, R_k}, and ρ and R_k are defined as follows. The parameter ρ is changed randomly at each iteration, with its values taken from the range [0.8, 2). The values of f̄ and x̄ are calculated as follows, where Itr is the number of iterations: after Itr iterations, f_Itr and f̄ are computed; then we set f_0 = f_Itr, while x̄ is defined as in [49]. This procedure gives the advantages of the MHZ, HZ and HS methods to the proposed SHZ method. In other words, the SHZ algorithm inherits the characteristics of the three algorithms MHZ, HZ and HS. This is why the SHZ algorithm is superior to the four other methods MHZ, HZ, HS and FR.
Note: The authors of [49] required σ > 0.5 to be a constant, while here the parameter ϑ is adjusted dynamically at each iteration.

Convergence Analysis of Algorithm 1
In this section, we present the features of Algorithm 1 and its convergence analysis, and we show that the search direction d_k defined by Formula (23) satisfies the sufficient descent condition and the trust-region property, defined by Formulae (13) and (14), respectively.

4: Calculate a new point x_{k+1} = x_k + α_k d_k.
5: Compute f_k = f(x_{k+1}) and g_k = g(x_{k+1}).
6: Set k = k + 1.
7: Calculate the search direction d_k by (23).
8: end while
9: return x_ac, the local minimizer, and its function value f_ac.

Two sensible hypotheses are assumed as follows.
Hypothesis 1. We suppose that Problems (1) and (2) have an objective function f(x) that is continuous and differentiable.

Hypothesis 2.
In some neighborhood ℵ of the level set Ω = {x ∈ R^n : f(x) ≤ f(x_0)}, the gradient vector g(x) is Lipschitz continuous. This means that there is a finite constant L such that ||g(x) − g(y)|| ≤ L ||x − y|| for all x, y ∈ ℵ.
Merging (23) with (24) yields the following: The inequality u^T v ≤ ½(||u||² + ||v||²) is applied to the first term of the numerator of Inequality (29), with u = d_{k−1} g_k^T y_k and v = y_k g_k^T d_{k−1}, and it is clear that this holds. Therefore, the following inequality is obtained, where ϑ = max{ρ, R_k}. Since ϑ ≥ 8/10 and c = 1 − 7/(9ϑ) > 0, (27) is true. By using (30), the claim is obvious. The proof is complete.
Corollary 1. According to Formula (28) of Lemma 1, the following formula is met.
Now, the final expression is summed as k → ∞, and the result is the following inequality: Under the stated hypotheses, we give a helpful lemma that was originally proved by Zoutendijk [60] and Wolfe [61,62].

Lemma 2.
Assume that x_0 is a starting point for which Hypothesis 1 is satisfied. For any algorithm of the form (23), suppose that d_k is a descent direction and that α_k satisfies the standard Wolfe conditions (16) and (17). Then the following (Zoutendijk) condition is met:

Σ_{k ≥ 0} (g_k^T d_k)² / ||d_k||² < ∞. (32)

Proof. It follows from Formula (17) that: On the other hand, the Lipschitz condition (19) implies: The above two inequalities give: which, with (16), implies: where c = δ(1 − σ)/L. By summing (36), and observing that f is bounded below, we see that (32) holds, which concludes the proof.
Theorem 1. Suppose that Hypotheses 1 and 2 hold. Then, using the outcome of Corollary 1, the sequence {g_k} generated by Algorithm 1 satisfies

lim inf_{k→∞} ||g_k|| = 0. (37)

Proof. By contradiction, suppose that (37) is not true; then, for some ε > 0, the inequality ||g_k|| ≥ ε holds for all k. (38) Hence, with Inequality (38) and (27), we obtain: Then we have: and by summing the final expression, we obtain: The above leads to a contradiction with (32). So, (37) is met.
Note 1: The search direction d k that is defined by Formula (23) satisfies the sufficient descent condition which is defined by Formula (13).
Note 2: Lemma 1 guarantees that Algorithm 1 has a sufficient descent property and the trust-region feature automatically.
Note 3: Theorem 1 confirms that the gradient norms ||g_k|| produced by Algorithm 1 approach 0 (in the lim inf sense) as k → ∞.
In the next section, the numerical differentiation approach is discussed by which the first derivative is estimated and the step size α k is computed.

Numerical Differentiation
We now turn our attention to the numerical approximation of the gradient vector. In principle, it may be possible to find an analytic form of the first derivative of any continuous and differentiable function. However, in some cases the analytic form is very complicated, and a numerical approximation of the derivative may be sufficient for some purposes.
In this paper, the values of α_k, g_k and the direction d_k are computed using the numerical differentiation method. Moreover, we have another step size and search directions that are generated randomly.
The most common approaches for computing the first derivative are the finite-difference approximation methods. The first derivative f'(x) can therefore be estimated by the numerical differentiation formula

f'(x) ≈ (f(x + h) − f(x)) / h, (41)

where h is finite and small, but not necessarily infinitesimally small. Reasonably, as the value of h becomes smaller, the approximate value of the first derivative may improve. The forward difference and the central difference are the most familiar and commonly used methods; see, for example, [68-72].
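A minimal Python sketch of the forward-difference formula (41) (illustrative; the default h = 10⁻⁶ is an assumption, not a value from the paper):

```python
def forward_difference(f, x, h=1e-6):
    # formula (41): f'(x) ~ (f(x + h) - f(x)) / h
    return (f(x + h) - f(x)) / h
```

For smooth f, the error of this estimate is O(h) plus a round-off term, which motivates the error analysis below.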
The Taylor series can be used to derive these formulas; 3, 4 or 5 points can be used, but this is more costly than using 2 points. The central difference method is known to combine accuracy and precision [73], but it needs 2n function evaluations per iteration, against the n function evaluations needed by the forward-difference approximation. So, in this study, the forward-difference approximation approach is used, because it is cheap and has sensible precision [66,68].
The effectiveness of the finite-difference approximation approaches relies on choosing a suitable value of h.
The error of the first-derivative approximation is discussed in the next section. This error analysis guides us in defining an appropriate finite-difference interval for the forward-difference approximation, one that balances the truncation error, which grows from the error in the Taylor formula, against the magnitude error obtained from noise in the computed function values [66].

Error Analysis
Formula (41) is the forward-difference approximation used to estimate the first derivative of the function f. Its error is proportional to some power of h, so it appears that the error keeps decreasing as h is reduced. However, this is only part of the story, since it accounts only for the truncation error produced by truncating the higher-order terms of the Taylor series expansion and does not take into account the round-off error induced by quantization. The round-off error comes on top of the truncation error; both are discussed in this section as follows.
To this end, suppose that the computed function values carry round-off errors ε_0 and ε_1, both smaller in magnitude than some positive number ε; that is, |ε_j| ≤ ε for j = 0, 1.
Hence, the total error of the forward-difference approximation defined by (41) is bounded by

| (f(x + h) + ε_1 − f(x) − ε_0) / h − f'(x) | ≤ 2ε/h + (h/2) T_f, (43)

where T_f is a bound on |f''(x)|. The upper bound of the error is thus the right-hand side of Formula (43). This bound contains two terms: the first comes from the rounding error and is inversely proportional to the step size h, whilst the second comes from the truncation error and is directly proportional to h. These two parts can be balanced against each other. It can therefore be concluded that as we make h small, the round-off error may grow whilst the truncation error shrinks. This is called the "step-size dilemma".
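The step-size dilemma is easy to observe numerically: as h shrinks, the measured error of (41) first falls (truncation-dominated) and then rises again (round-off-dominated). The following Python sketch (illustrative, using exp at x = 1 so the exact derivative is known; the three h values are arbitrary probes) demonstrates this:

```python
import math

def fd_error(f, df_exact, x, h):
    """Observed error of the forward difference (41) for a given h."""
    return abs((f(x + h) - f(x)) / h - df_exact(x))

# Errors first shrink and then grow again as h decreases past the optimum:
errs = {h: fd_error(math.exp, math.exp, 1.0, h) for h in (1e-1, 1e-8, 1e-15)}
```

Here errs[1e-1] is dominated by truncation error, errs[1e-15] by round-off error, and errs[1e-8] sits near the balance point.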
Consequently, there must be some optimal value h* for the forward-difference approximation formula, derived analytically in (44) by balancing the two terms of (43), which gives h* = 2 √(ε / T_f). However, Formula (44) is only of theoretical value and cannot be used in practice to determine h*, because we have no information about the second derivative and therefore cannot estimate T_f. Many approaches have therefore been presented to deal with the step-size dilemma.
Recently, Shi et al. [66] proposed a bisection search for finding a finite-difference interval for a finite-difference method. Their approach balances the truncation error, which grows from the error in the Taylor formula, against the measurement error obtained from noise in the function evaluation. According to their numerical experience, the finite-difference intervals h* are bounded within the ranges [2 × 10⁻⁴, 6.32 × 10⁻¹], [2.72 × 10⁻⁴, 8.26 × 10⁰] and [8.44 × 10⁻³, 3.94 × 10⁰] when using the forward and central differences to estimate the first derivative of f.
Additionally, the authors of [68] gave a study of the theoretical and practical comparison of the approximate values of the gradient vector in derivative-free optimization. These authors analyzed some approaches for approximating gradients of noisy functions utilizing only function values; those techniques include a finite difference.
The values of the finite-difference interval obtained there satisfy 10⁻⁸ ≤ h* ≤ 1.
According to the earlier investigations, the core difference between all these approaches is how they determine the step size h; the reported values of the step size lie in the range h* ∈ [12 × 10⁻¹⁰, 1].
In this paper, h is designed so that its values are generated randomly. Additionally, the values of h are tied to the function values at each iteration in order to cover this domain; the feature here is that the value of h is modified randomly at each iteration.
Therefore, a fresh approach to define the h * is presented in the following section.

Selecting a Step-Size h
The forward-difference approach is cheap compared to the other techniques.
The forward difference approach has shown promising results for minimizing noisy black-box functions [66].
Depending on the hypotheses listed in Section 2, let x_0 be any starting point; then the function f satisfies f_0 ≥ f_1 ≥ … ≥ f_k for k = 0, 1, 2, …. The numerical outcomes given in past papers indicate that the values of the step size h belong to the range [10⁻¹⁰, 1].
Therefore, Algorithm 2 below is created to generate the values of h* randomly from the interval [10⁻⁸, 0.1].
Algorithm 2 Algorithm for calculating the values of h * .
Step 1: At each iteration k, we generate a set of random values between 10⁻⁷ and 10⁻², denoted L = {l_1, l_2, …, l_10}.
Step 2: The minimum and maximum of the set L are extracted, respectively, as follows. We then distinguish two cases according to the function value |f_k|. Finally, if f_0 = 10⁻¹, then h_1 = 2.67 × 10⁻³. The above example shows how Case 1 is implemented using Formula (45).
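Since parts of Algorithm 2 are only partially legible here, the following Python sketch is a loose, hypothetical reading of its spirit: draw a small pool of random candidates, extract its extremes, and scale the chosen h by the current function value, clamping the result to the interval [10⁻⁸, 0.1]. The exact scaling rule of the paper may differ; every name and constant below is an assumption.

```python
import random

def random_fd_interval(f_k, lo=1e-7, hi=1e-2, m=10):
    """Hypothetical sketch of Algorithm 2: a pool of m random candidates
    in [lo, hi] is drawn, its extremes are extracted, and h is scaled by
    |f_k| so that it is refreshed randomly at every iteration."""
    pool = [random.uniform(lo, hi) for _ in range(m)]   # the set L
    l_min, l_max = min(pool), max(pool)                 # Step 2 extremes
    # keep h in the target window [1e-8, 0.1] whatever |f_k| is
    h = min(max(l_min * max(abs(f_k), 1.0), 1e-8), 0.1)
    return h
```

The essential behavior, matching the text, is that h is random, iteration-dependent and confined to [10⁻⁸, 0.1].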

Estimating Gradient Vector
The forward finite difference (DFF) is used to compute the approximate value of the gradient vector of the function f at x ∈ R^n by

[DFF(x)]_i = (f(x + h e_i) − f(x)) / h, i = 1, …, n, (46)

where h > 0 is the finite-difference interval defined in Section 3.2, and e_i ∈ R^n is the i-th column of the identity matrix. Therefore, g(x) ≈ DFF(x) is the approximate value of the gradient vector of the function f at the point x.
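Formula (46) costs one extra function evaluation per coordinate (n in total, plus f(x) itself). A direct Python sketch (illustrative, not the authors' MATLAB code; h = 10⁻⁶ is an assumed default):

```python
import numpy as np

def dff_gradient(f, x, h=1e-6):
    """Formula (46): forward finite-difference (DFF) gradient estimate,
    one extra function evaluation per coordinate."""
    fx = f(x)
    g = np.empty(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = 1.0                       # i-th column of the identity
        g[i] = (f(x + h * e) - fx) / h
    return g
```

This is the n-evaluation cost that motivated choosing forward differences over the 2n-evaluation central differences in Section 3.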
The step size ϕ_k is then defined as follows. The function f(x) is approximated using Taylor's expansion up to the linear term around the point x_k at each iteration k. We then define the quadratic model of f(x) at x_k, where ϕ is the step size along −g(x_k). The optimal value of ϕ is picked by solving the subproblem of minimizing this model over ϕ, which gives (47), where g(x_k) ≈ DFF(x_k).

Convergence Analysis of DFF
The condition usually used in the convergence analysis of first-order methods with inexact gradient (DFF) vectors is

||DFF(x) − g(x)|| ≤ C ||g(x)||, (49)

for some 0 ≤ C < 1. This condition was introduced by [74,75] and is called the norm condition. It ensures that the approximation g(x) ≈ DFF(x) yields a descent direction for the function f [68]. However, condition (49) cannot be applied unless ||g(x)|| is known; therefore, this condition may be hard or impossible to verify.
Many authors have attempted to deal with this issue; see, for example, Refs. [68,76-79]. Byrd et al. [76] suggested a practical approach to estimate ||g(x_k)|| and used it to guarantee an approximation of (49). Cartis and Scheinberg [77] and Paquette and Scheinberg [79] replaced condition (49) by ||DFF(x_k) − g(x_k)|| ≤ ε_k, where ε_k > 0; convergence rate analyses were derived for a line-search method with access to deterministic function values in [77] and to stochastic function values (under additional assumptions) in [79]. Berahas et al. [68] established conditions under which (49) holds; for the forward finite-difference method (DFF), they set h* = 2 √(ε / M), where, analogously to T_f in (44), M bounds the second derivatives. We therefore present the following theorem. Theorem 2. Under Hypotheses 1 and 2 of Section 2, let DFF(x) denote the forward finite-difference approximation to the gradient g(x). Then, for all x ∈ R^n, the following inequality is true: where the value of ϕ_k is estimated by (47). Recall that ||X||_∞ and ||X|| denote the infinity norm and the 2-norm, respectively, defined by: and then: According to (46), which defines the gradient approximation by forward differences, the next inequality is true: By using (48), (51), (54) and (55), we obtain the required bound; therefore, the theorem holds.

Numerical Experiments of Part I
All experiments were run on a PC with an Intel(R) Core(TM) i5-3230M CPU @ 2.60 GHz and 4.00 GB of RAM under Windows 10. The five methods were coded in MATLAB version 8.5.0.197613 (R2015a), with machine epsilon about 10⁻¹⁶.
The model optimization test problems fall into two types: the first type contains convex functions, while the second type contains non-convex functions. Both kinds of test problems are listed in Tables 1-8, and the second type is marked by *. Columns 1-4 of Table 1 give the data of the test problems as follows: the abbreviation of the function f is given in Column 1, the number of variables n is listed in Column 2, the exact function value f(x*) at the global point x* is presented in Column 3, and the exact value of the norm of the gradient ||g(x*)|| is given in Column 4, where the mark "−" denotes that the value of ||g(x*)|| for the convex function satisfies the stopping criterion ||g(x*)|| < 10⁻⁶. Columns 5-8 repeat the pattern of Columns 1-4.
The data in Table 1 are taken from [56].
The numerical results for the local minimizers of all test problems are listed in Tables 2-8.
Note 1: The full name of each test function is given in Appendix A, together with the reference from which the test problem is taken.
Note 2: F denotes that the algorithm failed to find the local minimizer of the function f according to the stopping criteria of Algorithm 1, which are listed in Section 4.1 below. The stopping criteria of Algorithm 1 are as follows.

Stopping Criteria of Algorithm 1
Since this section focuses on finding a local minimizer of all test problems, the stopping criteria of Algorithm 1 can be defined as follows.
According to the convergence analysis discussed in the previous sections, the stopping criterion of Algorithm 1 is: if ||g(x_k)|| ≤ ε_1 is satisfied, Algorithm 1 stops, where ε_1 ∈ [10⁻⁸, 10⁻⁶]. However, the exact value of the gradient vector is unknown, since it is estimated by Formula (46); therefore, this condition is replaced by ||DFF_k|| ≤ ε_2 or FEs = n × 10⁴, i.e., if either condition is met, Algorithm 1 stops, where ε_2 ∈ [10⁻⁹, 10⁻⁷], FEs denotes the maximum number of function evaluations, and n is the number of variables of f.
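The combined stopping test can be expressed compactly. In this Python sketch (illustrative; ε_2 = 10⁻⁸ is one value from the stated range, and the names are assumptions), the algorithm stops when either the approximate gradient is small or the budget of n × 10⁴ function evaluations is exhausted:

```python
import numpy as np

def should_stop(dff, fes, n, eps2=1e-8):
    """Stopping test of Algorithm 1 as described in Section 4.1:
    stop when ||DFF_k|| <= eps2 or the budget of n * 10^4 function
    evaluations is exhausted."""
    return np.linalg.norm(dff) <= eps2 or fes >= n * 10**4
```

Testing the budget as well as the gradient norm is what allows the failure mark F in the tables to be assigned deterministically.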
In the following section, the performance profile is presented as a simple tool to compare the performance of our proposed method against the other methods in finding local minimizers of convex or non-convex functions, with respect to the worst and best numbers of iterations and function evaluations, the average CPU time, and the average numbers of iterations and function evaluations.

Performance Profiles
The performance profile is a standard tool for testing the performance of proposed algorithms [80-84].
In this paper, the five algorithms are evaluated by the following criteria: the worst and best numbers of iterations and function evaluations, and the averages of CPU time, iterations and function evaluations. They are abbreviated as itr.w, itr.be, FEs.w, FEs.be, time.a, itr.a and FEs.a, respectively. In the remainder of the paper, the set Fit denotes these seven criteria: Fit = {itr.w, itr.be, FEs.w, FEs.be, time.a, itr.a, FEs.a}.
Therefore, the numerical outcomes are presented in the form of performance profiles, as depicted in [82]. The most important characteristic of the performance profiles is that they can be shown in one figure by plotting for the different solvers a cumulative distribution function ρ s (τ).
The performance ratio is defined by first setting r_{p,s} = t_{p,s} / min{t_{p,s} : s ∈ S}, where p ∈ P, P is a set of test problems, S is the set of solvers, and t_{p,s} is the value obtained by solver s on test problem p.
Then, define ρ_s(τ) = (1/|P|) size{p ∈ P : r_{p,s} ≤ τ}, where |P| is the number of test problems. The value of ρ_s(1) is the probability that the solver will win over the remaining ones, i.e., that it will yield a value lower than those of the remaining solvers.
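The two definitions above translate directly into code. The sketch below builds the ratios r_{p,s} and the profile ρ_s(τ) from a matrix of measured values t_{p,s}; the matrix contents in the usage note are illustrative, not data from the paper.

```python
import numpy as np

def performance_profile(T, taus):
    """Compute Dolan-More style performance profiles.

    T    : |P| x |S| matrix, T[p, s] = measure t_{p,s} for solver s on problem p
           (e.g. iterations or function evaluations).
    taus : sequence of threshold values tau.
    Returns (R, rho) where R[p, s] = r_{p,s} = t_{p,s} / min_s t_{p,s}
    and rho[j, s] = rho_s(taus[j]) = fraction of problems with r_{p,s} <= tau.
    """
    best = T.min(axis=1, keepdims=True)   # best measure per problem, over solvers
    R = T / best                          # performance ratios r_{p,s}
    rho = np.array([[np.mean(R[:, s] <= tau) for s in range(T.shape[1])]
                    for tau in taus])
    return R, rho
```

For instance, with two solvers on three problems, T = [[2, 4], [3, 3], [10, 5]], each solver is best on two of the three problems, so ρ_s(1) = 2/3 for both, and both profiles reach 1 by τ = 2.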
In the following, the performance profiles are utilized to evaluate the performance of the five methods SHZ, MHZ, HZ, HS and FR. Therefore, in this paper, the term t_{p,s} indicates one element of the set Fit, and |P| = 46 is the number of test problems. We have 46 unconstrained test problems, 14 of which involve non-convex functions. The group of solvers S = {SHZ, MHZ, HZ, HS, FR} finds the local minimizers of the 46 test problems; therefore, the values of Fit are taken from the results of the 46 test problems as follows.
Each solver s of the set S is run 51 times for each of the 46 problems; at each run, every element of the set Fit receives a value. These values are analyzed in the following.
where fit_{p,s} is the element of Fit for test problem p obtained by solver s. Note: Formula (56) means that if the final result obtained by a solver s ∈ S satisfies Inequality (57), then the first branch of (56) is computed; otherwise, we set r_{p,s} = ∞.
where ε_2 ∈ [10^−5, 10^−9]. The performance profile for solver s is then given by the following function. As mentioned above, |P| = 46 and τ ∈ [1, 60]. By the definition of fit_{p,s}, ρ_s(1) denotes the fraction of test problems for which solver s performs best. In general, ρ_s(τ) can be interpreted as the probability, for solver s ∈ S, that the performance ratio r_{p,s} is within a factor τ of the best possible ratio. Additionally, an essential characteristic of performance profiles is that they present data on the relative performance of several solvers [82,83].
The numerical outcomes of the five methods are analyzed by using the performance profiles as follows. Figures 1-4 show the performance profiles of the set solvers S, for each element of the set Fit, respectively.
The performance profile depicted on the left of Figure 1 (for the term itr.w) compares the five techniques on the set of 46 test problems.
The SHZ method has the best performance for the 46 test problems; this means that our suggested approach is capable of finding a local minimizer to the 46 test problems as fast as, or faster than, the other four approaches.
For instance, if τ = 1, the SHZ technique is capable of finding the local minimizer for 65% of the problems, versus 33%, 20%, 20% and 13% of the test problems solved by the MHZ, HS, FR and HZ methods, respectively.
In general, for the term itr.w, τ = 60 shows that all test problems are solved by SHZ, against 96% of the test problems solved by each of the MHZ, HZ and FR methods, and 93% solved by the HS method. At τ ≥ 400, all test problems are solved by the MHZ, HZ and FR methods, while 98% of the test problems are solved by HS.
The right graph of Figure 1 shows that the method SHZ is capable of finding the local minimum of all test problems regarding term FEs.w.
The rest of Figures 2-4 show that the SHZ algorithm is superior to the four algorithms regarding the rest of the terms of the set Fit.
Therefore, the SHZ technique includes the characteristics of efficiency, reliability and effectiveness in solving Problem (1) compared to the other four methods.
Note: The power of the SHZ technique comes from the fact that the SHZ method inherits the features of the four methods MHZ, HZ, HS and FR, as we mentioned in Section 2.

Part II: Global Minimization Problem
It is worth mentioning that the final results of Part I for the second set of test problems contain some global minimizers at some runs for some non-convex functions. This means that the pure CG technique, being a local method, could not find the global minimizer of the second type of test problems at every run.
Therefore, to make this method capable of solving Problem (2) at every run, a random technique is proposed and added to the CG approach to obtain a new hybrid SP–CG technique that solves Problem (2). In many studies, the numerical outcomes have indicated that hybridizing a classical method with a random technique is very successful in overcoming the weaknesses of both methods; see [55][56][57][58][59]. Consequently, this part of the paper seeks to solve Problem (2). Therefore, each of the five CG methods mentioned in Part I is hybridized with the stochastic technique to obtain five algorithms for solving Problem (2).
In the next section, a stochastic technique is presented.

Random Technique
In this section, a new random parameter technique, "SP", is presented. This stochastic technique contains three different formulas by which three different points are generated. This set of formulas is combined with the CG method to obtain a new algorithm that solves Problem (2).

Random Parameters (SP Technique)
Step 1: The first point is computed as follows. Generate a random vector V_k ∼ [−1, 1]^n and set γ_k = 10^{ψ_k}, ψ_k ∈ [0.01, 1), where the interval [0.01, 1) is divided into Itr fractions and, at every iteration k, the parameter ψ_k takes one of these Itr values. Then SV_i is computed as a search direction with the step lengths, where i = 1, 2, . . . , n, n is the number of variables, Itr is the number of iterations, and SV_i denotes the sign of V_i and is defined as follows. Thus, a point is calculated as follows, where x_ac is the best point obtained so far, and then we compute f_1 = f(x_1).
Step 2: The second point is defined as follows, where B_k = ϕ_k d_k, ϕ_k is defined by (47), η_k ∈ (0, 2) is a random number, and d_k is defined by (23). Then, we compute f_2 = f(x_2).
Step 3: This point is defined as follows, where Dx = [(1 + µ_k)|V_i| − 1]/(µ_k + 0.1) · SV_i, µ_k = |f_ac|^2, f_ac is the function value at the accepted point x_ac, and X_w is a stochastic variable picked from the feasible range of the objective function. That is, X_w ∼ [a, b]^n, where a and b are the lower and upper bounds of the feasible range, respectively, and the random vector V with its signs SV_i is defined in the first step.
Therefore, we calculate f_3 = f(x_3). The above stochastic technique is used for finding the global minimizer of a non-convex function, since Algorithm 1 is not capable of finding the global solution at each run. In other words, in some runs, Algorithm 1 fails to find the global solution of such a function because it gets stuck at a local point.
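The three-step pattern above can be sketched in code. Formulas (61)–(63) are not reproduced in this excerpt, so the three update rules below are simplified placeholders that only mirror the described structure (a signed random direction, a perturbed step from the accepted point, and a point tied to the feasible range [a, b]^n); they are not the paper's exact formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

def sp_candidates(f, x_ac, bounds, step=0.1):
    """Schematic sketch of the SP technique: generate three candidate
    points around the accepted point x_ac and evaluate f at each.
    The update rules marked "placeholder" stand in for Formulas (61)-(63)."""
    a, b = bounds
    n = x_ac.size
    V = rng.uniform(-1.0, 1.0, n)                   # Step 1: V ~ [-1, 1]^n
    SV = np.sign(V)                                 # signs SV_i of V
    x1 = x_ac + step * SV * np.abs(V)               # placeholder for Formula (61)
    x2 = x_ac + rng.uniform(0.0, 2.0) * step * V    # placeholder for Formula (62)
    Xw = rng.uniform(a, b, n)                       # X_w from the feasible range
    x3 = Xw + step * SV                             # placeholder for Formula (63)
    cands = [np.clip(x, a, b) for x in (x1, x2, x3)]
    fs = [f(x) for x in cands]
    return cands, fs
```

A hybrid algorithm would then keep whichever of these candidates (or the CG point) gives the smallest function value.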
The following example shows how the three steps of the SP algorithm are implemented.
We use the first test problem of the list of the test problems that are listed in Appendix A.
In the following, we explain how the candidate solution is generated by Formula (62). Let M = 1.2 × 10^−6. By using Formula (45), we obtain h_3 = 4.381 × 10^−5 as the step size h (chosen from a random interval) for the difference approximations method, and then we have the following. The values of the function at the three points x_ac, x_{h_1} and x_{h_2} are listed below. We note that R_2(x_2) = 530.66 < R_2(x_ac) = 2501, i.e., the function value is reduced at the point x_2.
In the following, we explain how the candidate solution is generated by Formula (63). We note that R_2(x_3) = 31.193 < R_2(x_ac) = 2501. Therefore, the point x_3 reduces the function value.
According to the above example, which illustrates the mechanism of Formulas (61)–(63), we deduce the following results. Formulas (3), (61) and (62) are the main formulas used in the new hybrid algorithm described in Section 6. However, Formula (63) is used when ∆f = 0, where ∆f is defined by Formula (25); in this case, Algorithm 3 has reached a critical point. Thus, if this point is the approximate value of the global minimizer of f, then Algorithm 3 stops according to the condition in Line 4 or Line 1 of Algorithm 3; otherwise, the candidate solution is generated by Formula (63); see Section 6. Consequently, in this example, at iteration k = 3, the result obtained by Formula (63) cannot be taken into account due to ∆f = 0.

Hybridization of the CG Method with Stochastic Parameters
When a stochastic method, acting as a global search technique, is combined with a globally convergent (deterministic) method, the result is a global optimization algorithm [55,56].
Therefore, the SP technique is hybridized with each of the five conjugate gradient methods SHZ, MHZ, HZ, HS and FR to obtain five techniques.
Our proposed algorithm, called the hybrid stochastic CG method and abbreviated HSSHZ, solves Problem (2). Algorithm 3 in fact represents five alternative algorithms: when the SHZ method is hybridized with the SP technique, we obtain the algorithm abbreviated HSSHZ, and when we instead combine the MHZ, HZ, HS or FR method, we obtain four other algorithms abbreviated HSMHZ, HSHZ, HSHS and HSFR, respectively.
In general, the outputs of this paper are five algorithms that solve Problem (2), where the best one is the HSSHZ algorithm as illustrated by the numerical experiments section of Part II.
In the following, Algorithm 1 is combined with the SP technique to obtain Algorithm 3. The SP method permits conducting an exhaustive sweep of the search range to guarantee that the global minimizer point is visited at least once per run.

Algorithm 3 Hybrid stochastic CG method.
Input: f : R^n → R, f ∈ C^1, f_ac = f_cg obtained by Algorithm 1, and ε > 0. Output: x_gl = x_ac, the global minimizer of f, and f(x_gl), the value of f at x_gl.
1: while the stopping criteria are not satisfied do
2:   compute x_1, x_2 and f_1 = f(x_1), f_2 = f(x_2) by Formulas (61) and (62), where f_cg is the function value obtained by Algorithm 1.
3:   f_ac = min{f_cg, f_1, f_2}, and x_ac is the best point, which gives f_ac.
4:   if |f_ac − f*| ≤ ε then
5:     Stop.
6:   end if
7:   if ∆f == 0 then
8:     calculate x_3 and f_3 = f(x_3) by Formula (63).
9:     if f_3 < f_ac then
10:      x_3 is accepted; set x_ac ← x_3, f_ac ← f_3, and go to Line 1.
11:    else
12:      generate another point x_3 by Formula (63).
13:    end if
14:  else
15:    go to Line 1.
16:  end if
17: end while
18: return x_ac, the best point, and its function value f_ac.

A Mechanism for Running Algorithm 3

As we mentioned above, Algorithm 3 is a combination of two methods; the first is a CG method from the five techniques CG = {SHZ, MHZ, HZ, HS, FR} discussed in Part I, and the second is the random method described in Section 5. The point x_cg is obtained by Algorithm 1 and is an input to Algorithm 3. Algorithm 3 begins with Line 1, which is the stopping standard of the algorithm. Therefore, Algorithm 3 ends if one of the following standards is satisfied: the first standard is |f_ac − f*| ≤ ε, and the second standard is FEs ≥ n10^4, where f_ac is the best value of the function f obtained so far, f* is the true solution, ε = 10^−6, FEs is the number of function evaluations, and FEs = n10^4 is a stopping standard indicated by [85,86].
In Line 3, the best value of f is selected from the three values f_cg, f_1 and f_2 and denoted by f_ac; these three values of the function are calculated by Algorithm 1 and Formulas (61) and (62), respectively, and x_ac denotes the corresponding best point.
In Line 4, if |f_ac − f*| ≤ ε is fulfilled, the algorithm ends. The criterion in Line 7 gives the algorithm an opportunity to escape from local points. Consequently, if ∆f = 0, then the algorithm has reached a critical point: the norm of the gradient vector is 0 or ≈ 0, so this point is either a local point or the global point. According to the above actions, the hybrid algorithm is granted successive opportunities to escape from a trap (a local point). Thus, the procedures in Lines 8–12 help the algorithm to escape this trap, especially since the second stopping criterion guarantees that most of the search domain is scanned.
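The control flow of Algorithm 3 can be sketched as follows. Here `cg_step` stands in for one run of Algorithm 1 (the CG local solver) and `sp_candidates` for the SP technique of Section 5; both are assumed callables supplied by the caller, since their internals are defined elsewhere in the paper, and the escape logic is condensed into a single "keep the best point" selection.

```python
import numpy as np

def hybrid_sp_cg(f, f_star, cg_step, sp_candidates, x0, n,
                 eps=1e-6, max_fes_factor=10**4):
    """Schematic of Algorithm 3's loop: alternate a CG phase with
    stochastic SP candidates, keep the best point, and stop when
    |f_ac - f*| <= eps or the function-evaluation budget n * 10**4 is hit."""
    fes = 0
    x_ac, f_ac = x0, f(x0)
    while abs(f_ac - f_star) > eps and fes < n * max_fes_factor:
        x_cg, f_cg = cg_step(x_ac)        # local CG phase (Algorithm 1)
        cands, fs = sp_candidates(x_ac)   # stochastic candidates (SP technique)
        pts = [x_cg] + list(cands)
        vals = [f_cg] + list(fs)
        fes += len(vals)
        k = int(np.argmin(vals))          # Line 3: select the best of all points
        if vals[k] < f_ac:
            x_ac, f_ac = pts[k], vals[k]  # accept the improving point
    return x_ac, f_ac
```

With a toy `cg_step` that halves the current point and random SP candidates, the loop drives f(x) = xᵀx down to the tolerance around its minimizer at the origin.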
The numerical outcomes of the five methods are given in the next section.

Numerical Experiments of Part II
The numerical results for the second set of test problems (non-convex functions), obtained by Algorithm 3, are presented.
The performance profiles tool described in Part I is used here for assessing the performance of Algorithm 3, which comprises five alternative algorithms, as mentioned above in Section 6.
The numerical results for the second type of test problems are listed in Tables 9–15. Note: F denotes that the algorithm has failed to find the global minimizer of the function f according to the stopping criteria of Algorithm 3 listed in Section 6. The performance profiles for the five algorithms are analyzed as follows. Figures 5–8 show the performance profiles of the set of solvers S with respect to the set of criteria Fit mentioned in Section 4.2.
The performance profile drawn on the left of Figure 5 (for the term itr.w) compares the five methods on the 14 test problems.
The HSSHZ technique achieves good performance (for the term itr.w) on all test problems, which indicates that the HSSHZ technique is capable of solving Problem (2) as fast as, or faster than, the other four techniques.
In general, for the term itr.w, τ ≥ 60 shows that all of the second type of test problems are solved by HSSHZ, while 64%, 71%, 43% and 50% of the test problems are solved by the HSMHZ, HSHZ, HSHS and HSFR algorithms, respectively.
Figures 5-8 demonstrate that the performance of the HSSHZ technique is better than the performance of the four techniques regarding the seven standards listed in the set Fit, respectively.
Therefore, the HSSHZ technique includes the characteristics of efficiency, reliability and effectiveness in finding the global minimizer of the non-convex function f compared to the other four methods.
It is worth observing that the power of the HSSHZ algorithm comes from the fact that the SHZ method inherits the features of the four methods MHZ, HZ, HS and FR, as mentioned in Section 2.
Note 1: In Algorithm 3, a run is considered successful if Inequality (64) is met.
where f* is the exact global solution listed in Columns 3 and 7 of Table 1, and f_ac is the final result obtained by Algorithm 3. Note 2: Formula (56) means that if the final result f_ac obtained by Algorithm 3 satisfies Inequality (64), then the first branch of (56) is computed; otherwise, we set r_{p,s} = ∞.

Conclusions and Future Work
A new modified CG algorithm, named SHZ, is suggested. SHZ finds the local minimizers of unconstrained optimization problems. Although the updated formulae of the SHZ algorithm are more complicated than those of previous approaches, its numerical performance is very strong. The convergence analysis of the SHZ algorithm is established. We also analyzed the gradient approximation g(x) ≈ DFF constructed by finite differences (the forward-difference method). This method includes a new approach for selecting a suitable value of h according to the value of the objective function, updated dynamically at each iteration. The numerical results demonstrate that the performance of the SHZ method is positively competitive with the other four conjugate gradient methods based on performance profiles.
Comparing the final values of the gradient vector obtained by the DFF method with the exact values of the gradient vector demonstrates that the new technique succeeded in picking the right value of h. The proposed random approach plays a critical role in making the SHZ method capable of finding the global minimizers of unconstrained optimization test problems, especially when the objective function is non-convex.
It can be worth observing that the power of the HSSHZ algorithm comes from the fact that the SHZ method gains the characteristics of the four methods, MHZ, HZ, HS and FR.
The suggested approach can be improved and modified to deal with constrained and multi-objective optimization problems, and it will be used for image restoration.

14. Ras*: Rastrigin function [93]: min_x { x_1^2 + x_2^2 − cos(18x_1) − cos(18x_2) }.
Number of local minima: many local minima.