Fractional Gradient Methods via ψ-Hilfer Derivative

Abstract: Motivated by the increase in practical applications of fractional calculus, we study the classical gradient method under the perspective of the ψ-Hilfer derivative. This allows us to cover several definitions of fractional derivatives found in the literature in our study. The convergence of the ψ-Hilfer continuous fractional gradient method is studied for both the strongly and non-strongly convex cases. Using a series representation of the target function, we develop an algorithm for the ψ-Hilfer fractional-order gradient method. The numerical method obtained by truncating the higher-order terms is tested and analyzed using benchmark functions. With variable-order differentiation and step-size optimization, the ψ-Hilfer fractional gradient method shows better results in terms of speed and accuracy. Our results generalize previous works in the literature.


Introduction
The gradient descent method is a classical convex optimization method. It is widely used in many areas of computer science, such as image processing [1,2], machine learning [3][4][5], and control systems [6]. Its use on a large scale is essentially due to its intuitive structure, ease of implementation, and accuracy. In recent years, there has been an increase in interest in the application of fractional calculus techniques to develop and implement fractional gradient methods (FGM). The first works dealing with such methods are [2,7], which address problems in the fields of signal processing and adaptive learning. The design of fractional least mean squares algorithms is another example of the application of FGM [8][9][10]. Recently, some applications of FGM have focused on artificial intelligence subjects such as machine learning, deep learning, and neural networks (see [11][12][13] and references therein).
Replacing the first-order integer derivative with a fractional derivative in the gradient method can improve its convergence, because long-term information can be included. However, there are convergence issues in the numerical implementation of the FGM, because the real extreme value of the target function is not always the same as the fractional extreme value.
In [14], the authors propose a new FGM to overcome this problem, considering an iterative update of the lower limit of integration in the fractional derivative to shorten the memory characteristic of the fractional derivative and truncating the higher-order terms of the series expansion associated with the target function. Afterward, Wei et al. [15] designed another method involving variable fractional order to solve the convergence problem.
In the field of fractional calculus, several definitions of fractional integrals and derivatives, which vary in their kernels, can be found. This diversity allows certain problems to be tackled with specific fractional operators. To establish a general operator, a ψ-fractional integral operator with respect to a function ψ was proposed in [16,17], where the kernel depends on a function ψ with specific properties. To incorporate as many fractional derivative definitions as possible into a single formulation, the concept of a fractional derivative of a function with respect to the function ψ was introduced. In 2017, Almeida [18] proposed the ψ-Caputo fractional derivative and studied its main properties. A similar approach can be used to define the ψ-Riemann-Liouville fractional derivative. In 2018, Sousa and Oliveira [19] unified both definitions using Hilfer's concept and introduced the ψ-Hilfer fractional derivative. This approach offers the flexibility of choosing the differentiation type, as Hilfer's definition interpolates smoothly between fractional derivatives of Caputo and Riemann-Liouville types. Additionally, by choosing the function ψ, we obtain well-known fractional derivatives, such as Caputo, Riemann-Liouville, Hadamard, Katugampola, Chen, Jumarie, Prabhakar, Erdélyi-Kober, and Weyl, among others (see Section 5 in [19]).
The aim of this work is to propose an FGM with a ψ-Hilfer fractional derivative. Using this type of general derivative allows us to deal with several fractional derivatives from the literature at the same time. It also allows us to study cases where the target function is a composition of functions. In the first part, we show some auxiliary results concerning the chain rule and solutions of some fractional partial differential equations, in order to study the convergence of the continuous ψ-fractional gradient method for strongly and non-strongly convex target functions. In the second part, we introduce and implement numerical algorithms for the ψ-Hilfer FGM in the one- and two-dimensional cases, generalizing the ideas presented in [14,15]. The proposed algorithms were tested using benchmark functions. The numerical results show better performance compared with the classical gradient method in terms of accuracy and number of iterations.
In summary, this paper is organized as follows: in Section 2, we recall some basic concepts about the ψ-Hilfer derivative and the two-parameter Mittag-Leffler function. We present some auxiliary results in Section 3, which are then used to analyze the continuous gradient method for strongly and non-strongly convex target functions in Section 4. In the last section of the paper, we design and implement numerical algorithms for the ψ-Hilfer FGM by replacing the lower limit of the fractional integral with the last iterate and by using a variable order of differentiation with optimization of the step size in each iteration. The convergence, accuracy, and speed of the algorithms are analyzed using different examples.

General Fractional Derivatives and Special Functions
In this section, we recall some concepts related to fractional integrals and derivatives of a function with respect to another function ψ (for more details, see [16,18,19]).

Definition 1 (cf. [19], Def. 4). Let [a, b] be a finite or infinite interval on the real line R and α > 0. In addition, let ψ be an increasing and positive monotone function on (a, b). The left Riemann-Liouville fractional integral of a function f with respect to another function ψ on [a, b] is defined by

I^{α;ψ}_{a+} f(t) = (1/Γ(α)) ∫_a^t ψ'(s) (ψ(t) − ψ(s))^{α−1} f(s) ds.

Now, we introduce the definition of the so-called ψ-Hilfer fractional derivative of a function f with respect to another function.

Definition 2 (cf. [19], Def. 7). Let α > 0, m = ⌊α⌋ + 1, and let I = [a, b] be a finite or infinite interval on the real line, and f, ψ ∈ C^m[a, b] two functions such that ψ is a positive monotone increasing function and ψ'(t) ≠ 0 for all t ∈ I. The ψ-Hilfer left fractional derivative H D^{α,µ;ψ}_{a+} of order α and type µ ∈ [0, 1] is defined by

H D^{α,µ;ψ}_{a+} f(t) = I^{µ(m−α);ψ}_{a+} ((1/ψ'(t)) (d/dt))^m (I^{(1−µ)(m−α);ψ}_{a+} f)(t).

We observe that when µ = 0, we recover the left ψ-Riemann-Liouville fractional derivative (see Definition 5 in [19]), and when µ = 1, we obtain the left ψ-Caputo fractional derivative (see Definition 6 in [19]). For specific choices of the function ψ and the parameter µ, Definition 2 encompasses several fractional derivatives from the literature; for a more complete list, please refer to Section 5 in [19]. By considering partial fractional integrals and derivatives, the previous definitions can be extended to higher dimensions (see Chapter 5 in [16]); in particular, the ψ-Hilfer fractional derivative of an n-dimensional vector function is defined componentwise. Next, we present some technical results related to the previously introduced operators.
For f ∈ C^m[a, b], the following composition rule holds (see [19]):

I^{α;ψ}_{a+} (H D^{α,µ;ψ}_{a+} f)(t) = f(t) − Σ_{k=1}^{m} ((ψ(t) − ψ(a))^{γ−k} / Γ(γ − k + 1)) f_ψ^{[m−k]} (I^{(1−µ)(m−α);ψ}_{a+} f)(a),

where γ = α + µ(m − α) and f_ψ^{[m−k]} f(t) = ((1/ψ'(t)) (d/dt))^{m−k} f(t). Some results of the paper are given in terms of the two-parameter Mittag-Leffler function, which is defined by the following power series (see [20]):

E_{β₁,β₂}(z) = Σ_{k=0}^{∞} z^k / Γ(β₁ k + β₂), z ∈ C.

For z = −x, with x ∈ R⁺, 0 < β₁ < 2, and β₂ ∈ C, the two-parameter Mittag-Leffler function has the following asymptotic expansion (see Equation (4.7.36) in [20]):

E_{β₁,β₂}(−x) = Σ_{k=1}^{N} ((−1)^{k−1} x^{−k} / Γ(β₂ − β₁ k)) + O(x^{−N−1}), as x → +∞.
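Because the convergence rates below are stated in terms of the two-parameter Mittag-Leffler function, it can help to evaluate it numerically. The following sketch is our own illustration (the function name and truncation level are not from the paper): it sums the defining power series, which is adequate for moderate arguments.

```python
import math

def mittag_leffler(z, beta1, beta2, n_terms=50):
    """Two-parameter Mittag-Leffler function E_{beta1,beta2}(z),
    approximated by truncating its defining power series."""
    return sum(z**k / math.gamma(beta1 * k + beta2) for k in range(n_terms))

# Sanity checks against classical special cases:
# E_{1,1}(z) = exp(z) and E_{1,2}(z) = (exp(z) - 1)/z.
print(mittag_leffler(1.0, 1.0, 1.0))   # ~ e = 2.71828...
print(mittag_leffler(1.0, 1.0, 2.0))   # ~ e - 1 = 1.71828...
```

For large arguments the truncated series loses accuracy, which is precisely where the asymptotic expansion above becomes the appropriate tool.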

Auxiliary Results
In this section, we present some auxiliary results needed for our work. These extend some results presented in [21] to the ψ-Hilfer derivative of arbitrary type µ ∈ [0, 1].
We start by presenting a representation formula for the solution of a Cauchy problem involving the ψ-Hilfer derivative. Let us consider h : [0, +∞) × R^n → R^n and f : R → R^n. We obtain the following result.
if and only if f is given by the corresponding representation formula.

Proof. Applying I^{α;ψ}_{a+} to both sides of the fractional differential equation in (7), and using (4) with m = 1, yields an equivalent integral equation; from (1) and (7), the result follows.

Now, we present some results concerning the fractional derivative of a composite function. Let us consider g : R^n → R, f : R → R^n, ∇ the classical gradient operator, and the ψ-Hilfer fractional derivative of an n-dimensional vector function as in (3).
, a ≥ 0, and the function ψ in the conditions of Definition 2. For t > a, let us define the function ζ_t as indicated. The following identity holds.

Proof. By the Newton-Leibniz formula, for each component of the function f, one has the corresponding integral representation. From (3), for i = 1, ..., n, we obtain expression (11). Taking (1) into account, the first term in (11) is (12), and for the second term in (11), taking (1) and the Leibniz rule for differentiation under the integral sign into account, we obtain (13). From (12) and (13), expression (11) simplifies, and therefore we can write identity (16), which implies, by integrating by parts, expression (17). As α ∈ [0, 1], we have by L'Hôpital's rule that the boundary term vanishes. Finally, from (17) and (9), we obtain the claimed identity from (16).

Proof. From (18), it follows that for x = 0 the first expression holds. On the other hand, based on Theorem 2, we have a second expression for x = 0. Combining the two previous expressions, we obtain the result.
If we consider µ = 0, which corresponds to the ψ-Riemann-Liouville case, the previous results reduce to Proposition 3.3 and Corollary 3.4 in [21], respectively. Moreover, the corresponding results for the ψ-Caputo case (µ = 1) are presented in Proposition 3.1 of [21]. Now, we present an auxiliary result involving the two-parameter Mittag-Leffler function.
, and a ≥ 0. Moreover, let ψ be in the conditions of Definition 2, such that sup{ψ(t) : t ≥ a} = +∞. Then, the following limit holds.

Proof. Taking into account Theorem 5.1 in [22] for the case of a homogeneous equation, the solution of the initial value problem is given in terms of the two-parameter Mittag-Leffler function. Hence, by Proposition 1, we have the corresponding representation. Taking the limit as t → +∞ on both sides and considering the asymptotic expansion (6), we conclude that the left-hand side tends to zero and that the first term on the right-hand side also tends to zero. This leads to the result.

Continuous Gradient Method via the ψ-Hilfer Derivative
Assume that we aim to determine the minimum of a function f : R^n → R. To achieve this, the gradient descent method is used, starting with an initial prediction x₀ of the local minimum and producing a sequence x₀, x₁, x₂, ... based on the recurrence relation (20),

x_{k+1} = x_k − θ ∇f(x_k),

where the step size θ > 0 is either constant or varies at each iteration k. The sequence {x_k}_{k≥0} generated by the gradient descent method is monotonic, i.e., f(x₀) > f(x₁) > f(x₂) > ..., and is expected to converge to a local minimum of f. Typically, the stopping criterion has the form ‖∇f(x)‖ ≤ ε, where ε > 0. By rewriting (20) in the form (21), we can interpret (21) as the discretization of the initial value problem (22) using the explicit Euler scheme with step size θ_k. The system (22) is known as the continuous gradient method (see [21]). Assuming that f is both strongly convex and smooth, the solutions of (20) and (22) converge to the unique stationary point at an exponential rate. In general, if a convergence result is shown for a continuous method, then we can construct various finite difference schemes for the solution of the associated Cauchy problem. Let us now consider the ψ-fractional version (23) of (22), where the initial condition is evaluated in the limit t → a⁺. For y* ∈ S(f) = {z ∈ R^n : ∇f(z) = 0}, let us define the sum-of-squares error function (24).
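For concreteness, the recurrence (20) can be sketched as follows for a one-dimensional target, using the paper's benchmark f(x) = (x − 1)². This is a minimal illustration with our own function names, not code from the paper.

```python
def gradient_descent(grad_f, x0, theta=0.1, eps=1e-9, max_iter=10_000):
    """Classical gradient descent with constant step size theta.
    Stops when the gradient magnitude drops below eps."""
    x = x0
    for _ in range(max_iter):
        g = grad_f(x)
        if abs(g) < eps:
            break
        x -= theta * g
    return x

# f(x) = (x - 1)^2 is strongly convex with unique minimizer x* = 1.
x_star = gradient_descent(lambda x: 2 * (x - 1), x0=2.0)
print(x_star)  # ~ 1.0
```

The fractional versions studied below keep exactly this iteration skeleton and only change how the descent direction is computed.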

The Convex Case
Here, we investigate (23) under the assumption that f is convex but not strongly convex.
and convex, i.e., f satisfies (18). For the ψ-fractional differential Equation (23), with constant step size θ, the solution z(·) converges to y* with the stated upper bound.

Proof. Based on Corollary 1 applied to (24) and the fact that f is of class C¹ and convex, we obtain a first differential inequality. By the properties of ψ (see Definition 2), this inequality can be rewritten in an equivalent form, and using the composition rule (4), with m = 1, it can be integrated. From (24), and setting C = ϕ(a⁺), we obtain the result.
If we consider µ = 0 in the previous result, we recover Theorem 4.2 in [21].

The Strongly Convex Case
Here, we show that under the assumption of strong convexity of the function f, the solution of (23) admits Mittag-Leffler convergence, which is a general type of exponential convergence to the stationary point. Recall the definition of a strongly convex function.
where ∇ stands for the gradient operator.

Proof. In Definition 3, if we consider y = z(t) and x = y*, we have for t ≥ a and m_f > 0 the inequality (26), which is equivalent to (27), where the last inequality holds because y* ∈ arg min_{x∈R^n} f(x). From (23), Corollary 1, and (26), we obtain a fractional differential inequality. Setting h as in (28), we have from (27) that h(t) ≥ 0 for all t ≥ a; moreover, from (28) and recalling (24), the previous expression is equivalent to (29). By Theorem 5.1 in [22], the solution of (29) is given by (30), and the bound follows because the integrand function in (30) is positive and 0 ≤ a < t.

Convergence at an Exponential Rate
Theorem 4 establishes the Mittag-Leffler convergence rate for the solution of (23) to a stationary point. Specifically, when α = 1, the exponential rate O(e^{−θ m_f ψ(t)}) of the continuous gradient method (22) is recovered for any µ ∈ [0, 1].
and ψ satisfy the conditions of Definition 2 with sup{ψ(t) : t ≥ a} = +∞. Let f : R^n → R be a function that is C¹, convex, and Lipschitz smooth with constant L_f, that is, f satisfies (32). If the solution z(·) of (23) converges to y* at the exponential rate O(e^{−ωψ(t)}), then ∇f(y*) = 0.
Proof. Let z(·) be a solution of (23) converging to the stationary point y* at the rate O(e^{−ωψ(t)}). Then, there exists t₁ ≥ a such that the bound (36) holds. By contradiction, let us assume that ∇f(y*) ≠ 0, and define the corresponding constant. From Formula (4.11.4b) in [20], we can find t₂ ≥ t₁ with property (37). By Proposition 1, z(·) admits an integral representation. Setting u as indicated, then, by (37), we obtain a lower bound for u, while assumption (32) gives an upper bound. Now, denoting Q = sup{‖z(t) − y*‖ : t ∈ [a, t₂]} and using (36), we can estimate the remaining integral term. By the mean value theorem, this estimate implies, as α ∈ [0, 1] and sup{ψ(t) : t ≥ a} = +∞, that the corresponding term vanishes at infinity. Hence, taking the limit of u(t) as t → +∞, we conclude that ∇f(y*) = 0, which is a contradiction.

ψ-Hilfer Fractional Gradient Method
The aim of this section is to construct and implement a numerical method for the ψ-Hilfer FGM in the one- and two-dimensional cases. For both cases, we perform numerical simulations using benchmark functions.

Design of the Numerical Method
The gradient descent method typically takes steps proportional to the negative gradient (or an approximate gradient) of a function at the current iteration; that is, x_{k+1} is updated by the law

x_{k+1} = x_k − θ f'(x_k),

where θ > 0 is the step size or learning rate, and f'(x_k) is the first derivative of f evaluated at x = x_k. We assume that f : R → R admits a local minimum at the point x* in D_ρ(x*) = {x ∈ R : |x − x*| < ρ}, for some ρ > 0, and that f admits a Taylor series expansion (41), centered at an appropriate point, with domain of convergence D ⊆ R such that x* ∈ D. As we want to consider the fractional gradient in the ψ-Hilfer sense, our first (and natural) attempt is to consider the iterative method (42), where H D^{α,µ;ψ}_{a+} is the ψ-Hilfer derivative of order α ∈ ]0, 1[ and type µ ∈ [0, 1], given by (2), and the function ψ is in the conditions of Definition 2. However, a simple example shows that (42) is not the correct approach. In fact, let us consider the quadratic function f(x) = (x − h)², with minimum at x* = h. For this function, the ψ-Hilfer derivative can be computed explicitly, where p₀ = 0 if µ ∈ [0, 1[ and p₀ = 1 if µ = 1. As H D^{α,µ;ψ}_{a+} f(h) ≠ 0, the iterative method (42) does not converge to the real minimum point. This example shows that the ψ-Hilfer FGM with a fixed lower limit of integration does not converge to the minimum point, due to the influence of long-time memory terms, which are an intrinsic feature of fractional derivatives. In order to address this problem, and inspired by the ideas presented in [14,15], we replace the starting point a in the fractional derivative by the term x_{k−1} of the previous iteration, that is,

x_{k+1} = x_k − θ H D^{α,µ;ψ}_{x_{k−1}+} f(ψ(x_k)),     (44)

where α ∈ ]0, 1[, µ ∈ [0, 1], and θ > 0. This eliminates the long-time memory effect during the iteration procedure. In this sense, taking into account the series representation (41) and the differentiation rule (5), we obtain the series form (45) of the fractional gradient. Thus, the representation formula (45) depends only on whether µ = 1 or µ ∈ [0, 1[. With this modification of the ψ-Hilfer FGM, we obtain the following convergence result.
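The failure of the fixed-lower-limit iteration (42) can be checked numerically. For ψ(x) = x and µ = 1 (the Caputo case), term-wise differentiation of f(x) = (x − h)² = (x − a)² + 2(a − h)(x − a) + (a − h)² gives a closed form for the fractional derivative, which does not vanish at the minimizer x = h when α ≠ 1. The sketch below is our own illustration of this closed form, not code from the paper.

```python
import math

def caputo_grad_quadratic(x, a, h, alpha):
    """Caputo derivative of order alpha (psi(x) = x, mu = 1) of f(x) = (x - h)^2
    with lower limit a, obtained by term-wise differentiation of the
    expansion of f in powers of (x - a)."""
    return (2 * (x - a) ** (2 - alpha) / math.gamma(3 - alpha)
            - 2 * (h - a) * (x - a) ** (1 - alpha) / math.gamma(2 - alpha))

# At the true minimizer x = h = 1 (with a = 0), the fractional derivative
# is nonzero for alpha = 0.5, so iteration (42) keeps moving past the minimum.
print(caputo_grad_quadratic(1.0, 0.0, 1.0, 0.5))   # nonzero (about -0.75)
# For alpha = 1 the classical derivative 2*(x - h) is recovered.
print(caputo_grad_quadratic(2.0, 0.0, 1.0, 1.0))   # 2.0
```

This is exactly the phenomenon that motivates moving the lower limit to the previous iterate in (44).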
Theorem 6. If the algorithm (44) is convergent, with the fractional gradient given by (45), then it converges to the minimum point of f(ψ(·)).
Proof. Let x* be the minimum point of f(ψ(·)). We prove by contradiction that the sequence (x_k)_{k∈N} converges to x*. Assume that x_k converges to some x̄ ≠ x* with f'(ψ(x̄)) ≠ 0. As the algorithm is convergent, we have that lim_{k→+∞} |x_k − x̄| = 0. Moreover, for any small positive ε, there exists a sufficiently large N ∈ N such that |x_k − x̄| < ε for any k > N (inequality (46)). Thus, from the previous expression, we obtain an estimate involving a geometric series, which is convergent for sufficiently large k. Hence, we obtain (48), where C is a suitable constant. One can always find ε sufficiently small such that (50) holds, because the function g(ε) = 2^α θε + 2Cε^{1−α} is positive and increasing for α ∈ [0, 1], θ ∈ [0, 1], and ε ∈ [0, 1]. Hence, from (48) and taking into account (50), we obtain (51). On the other hand, from assumption (46), we have an estimate which contradicts (51). This completes the proof.
Sometimes, the function f is not smooth enough to admit a series representation in the form (41), and therefore, the implementation of (44) using the series (45) is not possible. For practical implementation, we need to truncate the series. In our first approach, we consider only the term of the series containing f'(ψ(x_k)), as it is the most relevant one for the gradient method. Thus, the ψ-Hilfer FGM (44) simplifies to

x_{k+1} = x_k − θ (f'(ψ(x_k)) / Γ(2 − α)) (ψ(x_k) − ψ(x_{k−1}))^{1−α}.     (52)

Furthermore, in order to avoid the appearance of complex numbers, we introduce the modulus in expression (52), that is,

x_{k+1} = x_k − θ (f'(ψ(x_k)) / Γ(2 − α)) |ψ(x_k) − ψ(x_{k−1})|^{1−α}.     (53)

The pseudocode associated with (53) is presented in Algorithm 1. As (53) is independent of the parameter µ, from now on we call the method ψ-FGM. Following the same arguments as in the proof of Theorem 6, we obtain the following result. Theorem 7. If the algorithm (44) is convergent with the fractional gradient given by (53), then it converges to the minimum point of f(ψ(·)).
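The truncated update is straightforward to implement. The sketch below follows the one-term rule (53) as reconstructed above, so it should be read as our interpretation rather than verbatim paper code: the fractional gradient is f'(ψ(x_k)) |ψ(x_k) − ψ(x_{k−1})|^{1−α} / Γ(2 − α), with the lower limit moved to the previous iterate.

```python
import math

def psi_fgm(df, psi, x0, x1, alpha=0.75, theta=0.1, eps=1e-9, max_iter=10_000):
    """1D psi-FGM, one-term truncation of the fractional gradient:
    df(psi(x_k)) * |psi(x_k) - psi(x_{k-1})|**(1 - alpha) / Gamma(2 - alpha),
    with the lower integration limit moved to the previous iterate."""
    x_prev, x = x0, x1
    c = 1.0 / math.gamma(2.0 - alpha)
    for _ in range(max_iter):
        step = theta * df(psi(x)) * abs(psi(x) - psi(x_prev)) ** (1.0 - alpha) * c
        if abs(step) < eps:
            break
        x_prev, x = x, x - step
    return x

# Benchmark from the paper: f(x) = (x - 1)^2, psi(x) = x, minimum at x* = 1.
x_star = psi_fgm(lambda u: 2 * (u - 1), lambda x: x, x0=2.2, x1=2.0)
```

Note that the factor |ψ(x_k) − ψ(x_{k−1})|^{1−α} shrinks together with the step, which is why the fractional method converges more slowly than classical gradient descent unless the order or the step size is adapted, as done later.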
In the following pseudocode, we describe the implementation of the algorithm (53).

Inputs: functions ψ(x) and f(ψ(x)); fixed parameters α, a, θ, ε; initial guess x₀.

As we have seen, it is possible to construct a ψ-FGM that converges to the minimum point of a function. To improve the convergence of the proposed method, we can consider a variable order of differentiation α(x) in each iteration. Some examples of α(x) are given by (54)-(58) (see [15]), where β > 0 and we consider the loss function J(x) = (f(ψ(x)))² to be minimized in each iteration. The square in the loss function guarantees its non-negativity. All the examples given satisfy the two limit conditions, where the second limit results from the fact that J(x) → 0 as x → x*. Variable-order differentiation turns the ψ-FGM into a learning method, because as x gradually approaches x*, α(x) ≈ 1. The ψ-FGM with variable order is obtained by replacing α with α(x_k) in (53). Theorem 6 remains valid for this variation of Algorithm 1.
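The full family (54)-(58) is not reproduced here, but the arctan form used later in (74) illustrates the idea: the order approaches 1 as the loss J vanishes. A minimal sketch (with our own naming), using the squared loss J(x) = (f(ψ(x)))²:

```python
import math

def alpha_arctan(x, f, psi, beta=0.1):
    """Variable differentiation order of the form
    alpha(x) = 1 - (2/pi) * arctan(beta * J(x)),
    with loss J(x) = f(psi(x))**2 (non-negative by construction)."""
    J = f(psi(x)) ** 2
    return 1.0 - (2.0 / math.pi) * math.atan(beta * J)

f = lambda u: (u - 1) ** 2   # benchmark target, minimum at u = 1
psi = lambda x: x
print(alpha_arctan(1.0, f, psi))  # 1.0 at the minimizer (J = 0)
print(alpha_arctan(3.0, f, psi))  # strictly below 1 away from it
```

As the iterates approach the minimizer, α(x) → 1 and the update smoothly degenerates into the classical gradient step, which is the "learning" behavior described above.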

Numerical Simulations
Now, we provide some examples that show the validity of the previous algorithms applied to the quadratic function f(x) = (x − 1)², which is one of the simplest benchmark functions. This function is convex, with a unique global minimum at x* = 1. For the ψ-derivative, we consider several cases for the function ψ. Analyzing the plots in Figure 1, we see that the convergence of the ψ-FGM in the non-integer case is, in general, slower than in the integer case (α = 1.0). In Figure 2, we consider x₀ = 2, α = 0.75, ε = 10⁻⁹, and different step sizes θ = 0.05, 0.1, 0.2, 0.3.
The plots in Figure 2 show that increasing the step size makes convergence faster. Optimizing the step size in each iteration would lead to optimal convergence of the method in a few iterations. In Figure 3, we consider different initial approximations x₀ and the values α = 0.75, θ = 0.1, and ε = 10⁻⁹. As expected, convergence becomes faster as x₀ approaches the minimum point. Now, we show the numerical simulations of Algorithm 1 with a variable order of differentiation α(x). In Figure 4, we consider θ = 0.1, β = 0.1, x₀ = 2, ε = 10⁻⁹, and the variable-order functions (54)-(58). In Figure 5, we exhibit the behaviour of the algorithm for α(x) given by (58) and different values of β. From these plots, we conclude that in the one-dimensional case, variable-order differentiation can speed up convergence, but it is, in general, slower than the classical gradient descent method with integer derivative. A further improvement of the algorithm could be made by considering a variable step size optimized in each iteration. This idea is implemented in the next section, where we consider optimization problems in R².

Untrained Approach
Motivated by the ideas presented in [14,15], we extend the results of Section 5.1 to the two-dimensional case. We consider a function ψ in the conditions of Definition 2 and the vector-valued function Ψ : R² → R² given by Ψ(X) = (ψ(x), ψ(y)), with X = (x, y). Moreover, let f : R² → R be a function that admits a local minimum at the point X* in D_ρ(X*) = {X ∈ R² : ‖X − X*‖ < ρ}, for some ρ > 0. We want to find a local minimum point of the function f(Ψ(X)) = f(ψ(x), ψ(y)) through the iterative method (60). We assume that f admits a Taylor series (61) centered at the point (a₁, a₂) ∈ D_ρ(X*), with domain of convergence D ⊆ R², such that X* ∈ D. Then, the ψ-Hilfer fractional gradient in (60) is given by (62), whose components denote the partial ψ-Hilfer derivatives of f with respect to x and y, of order α ∈ [0, 1] and type µ ∈ [0, 1], with the lower limit of integration replaced by x_{k−1} and y_{k−1}, respectively. Taking into account (61) and (5), we have the expressions (63) and (64) for the components of (62). The iterative method proposed in (60) takes into account the short-memory characteristics of the fractional derivatives and, as in the one-dimensional case, we can see from (63) and (64) that the method does not depend on the type of derivative µ. Furthermore, due to the freedom of choice of the parameter µ (µ = 1 or µ ∈ [0, 1[) and of the function ψ, we can deal with several fractional derivatives (see Section 5 in [19]). We obtain the following convergence result. Theorem 8. If the algorithm (60) is convergent, where the fractional gradient is given by (62)-(64), then it converges to the minimum point of f(Ψ(·)).
Proof. Let X* be the minimum point of f(Ψ(·)). We prove by contradiction that the sequence (X_k)_{k∈N} converges to X*. Assume that X_k converges to some X̄ ≠ X* with f_x(Ψ(X̄)) ≠ 0 ≠ f_y(Ψ(X̄)). As the algorithm is convergent, we have that lim_{k→+∞} ‖X_k − X̄‖ = 0. Moreover, for any small positive ε, there exists a sufficiently large N ∈ N such that ‖X_k − X̄‖ < ε for any k > N. From (62)-(64), we can bound the components of the fractional gradient. The double series appearing in this bound are of geometric type with positive ratio less than 1; hence, by the sum of a geometric series, we obtain (69), where the constants are as indicated. One can always find ε sufficiently small such that (70) and (71) hold, because the functions involved are positive and increasing for ε ∈ [0, 1]. Hence, from (69) and taking into account (70) and (71), we obtain (72). On the other hand, from assumption (71), we have an estimate which contradicts (72). This completes the proof.
Despite the convergence of (60), it is important to point out that sometimes the function f(Ψ(·)) is not smooth enough. In these cases, the algorithm involving the double series (63) and (64) cannot be implemented. Moreover, in the same way as was done for the one-dimensional case, assuming that f is at least of class C¹, we only keep the following terms of (63) and (64):

f_x(Ψ(X_k)) (ψ(x_k) − ψ(x_{k−1}))^{1−α} / Γ(2 − α)   (term p = 1 and q = 0 in (63)),
f_y(Ψ(X_k)) (ψ(y_k) − ψ(y_{k−1}))^{1−α} / Γ(2 − α)   (term p = 0 and q = 1 in (64)).
Thus, the higher-order terms are eliminated, and we have the following update of (62). To avoid the appearance of complex numbers, we also introduce the modulus, obtaining (73). The implementation of (73) is presented in Algorithm 2. In a similar way as in Theorem 8, we can state the following result. Theorem 9. If the algorithm (60) is convergent, where the fractional gradient is given by (73), then it converges to the minimum point of f(Ψ(·)).

The Trained Approach
In this section, we refine Algorithm 2 in two ways that train the algorithm in each iteration to find the most accurate X_k. First, we consider a variable step size θ_k > 0 that is updated in each iteration by minimizing the target function along the current descent direction. In the second refinement, we adjust the order of differentiation α with X_k. More precisely, if f is a non-negative function with a unique minimum point X*, we can consider any of the functions (54)-(58). In our approach, we only consider the variable fractional order

α(X) = 1 − (2/π) arctan(β J(X)),     (74)

where the loss function is J(X) = ‖∇f(Ψ(X))‖. From (74), we infer that when J(X) ≈ 0, one has α(X) ≈ 1, and when J(X) ≫ 0, one has α(X) ≈ 0. As a consequence of the previous refinements, we have the iterative method (75), where the fractional gradient is given by (76). The implementation of (76) is presented in Algorithm 3. Likewise, with a variable fractional order α(X), the following theorem holds.
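Putting the two refinements together, Algorithm 3 can be sketched as follows. This is a hedged reconstruction: the variable order follows (74) with J(X) = ‖∇f(Ψ(X))‖, but the step-size optimization is simplified here to a coarse grid search over candidate values of θ, and the unit fallback factor for a zero previous increment is our own practical safeguard, not part of the paper's algorithm.

```python
import math

def psi_fgm_2d_trained(f, grad_f, psi, X0, X1, beta=0.1,
                       thetas=(0.001, 0.01, 0.05, 0.1, 0.2, 0.5),
                       eps=1e-9, max_iter=5000):
    """2D psi-FGM with variable order alpha(X) = 1 - (2/pi)*arctan(beta*||grad||)
    and a step size 'trained' each iteration by a coarse grid search."""
    Xp, X = list(X0), list(X1)
    for _ in range(max_iter):
        gx, gy = grad_f(psi(X[0]), psi(X[1]))
        J = math.hypot(gx, gy)          # loss J(X) = norm of the gradient
        if J < eps:
            break
        alpha = 1.0 - (2.0 / math.pi) * math.atan(beta * J)
        c = 1.0 / math.gamma(2.0 - alpha)
        dx = abs(psi(X[0]) - psi(Xp[0]))
        dy = abs(psi(X[1]) - psi(Xp[1]))
        # Unit fallback when a coordinate did not move (our safeguard).
        Gx = gx * (dx ** (1.0 - alpha) if dx > 0 else 1.0) * c
        Gy = gy * (dy ** (1.0 - alpha) if dy > 0 else 1.0) * c
        # Step-size training: pick the candidate theta minimizing the target.
        theta = min(thetas, key=lambda t: f(psi(X[0] - t * Gx), psi(X[1] - t * Gy)))
        Xp, X = X, [X[0] - theta * Gx, X[1] - theta * Gy]
    return X

# f1(x, y) = 4x^2 - 4xy + 2y^2, psi = identity, minimum at (0, 0).
f1 = lambda x, y: 4 * x ** 2 - 4 * x * y + 2 * y ** 2
grad_f1 = lambda x, y: (8 * x - 4 * y, -4 * x + 4 * y)
X_star = psi_fgm_2d_trained(f1, grad_f1, lambda t: t, X0=(2.2, 1.2), X1=(2.0, 1.0))
```

Near the minimizer J → 0, so α → 1 and the increment factors tend to 1, and the scheme behaves like classical gradient descent with a line search, which matches the convergence behavior reported in the simulations.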
Theorem 10. If the algorithm (75) is convergent, where the fractional gradient is given by (76), then it converges to the minimum point of f(Ψ(·)).
The proof of this result follows the same reasoning as the proof of Theorem 8 and is therefore omitted.
Algorithm 3: 2D ψ-FGM with variable fractional order and optimized step size

Numerical Simulations
In this section, we implement Algorithms 2 and 3 for finding the local minimum point of the function f(Ψ(·)) for particular choices of f and ψ. For the function f, we consider the following cases:
• f₁(x, y) = 4x² − 4xy + 2y², with minimum point at (0, 0);
• Matyas function: f₂(x, y) = 0.26(x² + y²) − 0.48xy, with minimum point at (0, 0);
• Wayburn and Seader No. 1 function: f₃(x, y) = (x⁶ + y⁴ − 17)² + (2x + y − 4)², with minimum point at (1, 2).
The function f₁ is a classic convex quadratic function in R² and can be considered an academic example for implementing our algorithms. The functions f₂ and f₃ were chosen because they are benchmark functions used to test and evaluate several characteristics of optimization algorithms, such as convergence rate, precision, robustness, and general performance. More precisely, the Matyas function has a plate shape and the Wayburn and Seader No. 1 function has a valley shape, which implies slow convergence to the minimum point of the corresponding function. For the functions ψ in the vector function Ψ, we consider the choices ψ₁(x) = x and ψ₂(x) = x², together with a third function ψ₃. For the numerical simulations, we consider some combinations of the functions f_i, i = 1, 2, 3, and ψ_j, j = 1, 2, 3, and compare the results in the following scenarios:
• Algorithm 3 with α(X) = 1, which corresponds to the classical 2D gradient descent method;
• Algorithm 2 with α = 0.8 and step size θ = 0.1;
• Algorithm 3 with α(X) = 1 − (2/π) arctan(β J(X)), with β = 0.1.
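The benchmark targets (composed with the identity ψ₁) can be written down directly; the f₃ expression is read off the composed form given later for Ψ₁. A quick sanity check, which is our own addition, confirms that each function vanishes at its stated minimum point:

```python
def f1(x, y):                 # convex quadratic
    return 4 * x**2 - 4 * x * y + 2 * y**2

def matyas(x, y):             # plate-shaped benchmark
    return 0.26 * (x**2 + y**2) - 0.48 * x * y

def wayburn_seader_1(x, y):   # valley-shaped benchmark
    return (x**6 + y**4 - 17) ** 2 + (2 * x + y - 4) ** 2

# Each benchmark attains the value 0 at its stated minimum point.
print(f1(0, 0), matyas(0, 0), wayburn_seader_1(1, 2))
```

For the Wayburn and Seader function, both residuals vanish at (1, 2): 1 + 16 − 17 = 0 and 2 + 2 − 4 = 0, which is what makes that point the global minimizer.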
Figure 6 shows the target functions f₁(Ψ₁(X)) = 4x² − 4xy + 2y² and f₁(Ψ₂(X)) = 4x⁴ − 4x²y² + 2y⁴, both with a local minimum point at X* = (0, 0). Figures 7 and 8 show the X_k iterates on the corresponding contour plots of the functions. The plots on the right show a magnification close to the minimum point. The results of the numerical simulations are summarized in Table 1. The stopping criterion used was ε = 10⁻⁹. In Table 1, we present the information concerning the X_k iterates of the implemented algorithms. When we consider Ψ₁, we reach the global minimum point in all three cases; however, it is clear that Algorithm 2 leads to the worst results in terms of speed, whereas the classical case and Algorithm 3 have similar results. If we consider Ψ₂ and restrict our analysis to the two fastest algorithms, we conclude that, in this case, Algorithm 3 provides a more accurate approximation in fewer iterations. We point out that the objective function f₁(Ψ₂(·)) is less convex near the minimum point than f₁(Ψ₁(·)), which leads to an optimization problem that is more challenging from the numerical point of view.
Figure 9 shows the functions f₂(Ψ₁(·)) and f₂(Ψ₃(·)), both with a plate-like shape. Considering the same stopping criterion, we have the following results. From the analysis of Table 2, we see that the three methods converge in the case of f₂(Ψ₁), but Algorithm 2 is the worst in terms of iterations. The classical case and Algorithm 3 have similar results in terms of precision; however, Algorithm 3 presents better performance in terms of the number of iterations. In the case of f₂(Ψ₃), Algorithm 2 diverges and the other two methods are convergent. Algorithm 3 required half of the iterations compared with the classical gradient method.
The plots in Figure 12 show that both functions are valley-shaped.Figures 13 and 14 and Table 3 show the obtained numerical results.
In this last case, we see that the classical gradient method and Algorithm 3 provide very good approximations, with Algorithm 3 performing better in terms of the number of iterations. For instance, in the case of f₃(Ψ₁), the number of iterations decreased by around 97% compared with the classical gradient method.

Conclusions
In this work, we studied the classical gradient method from the perspective of the ψ-Hilfer fractional derivative. In the first part of the article, we considered the continuous gradient method and performed the convergence analysis for the strongly and non-strongly convex cases. The identification of Lyapunov-type functions, together with the auxiliary results established, allowed us to prove the convergence of the generated trajectories in the ψ-Hilfer case.
In the second part of the paper, we first showed that the ψ-Hilfer FGM, with the ψ-Hilfer gradient given as a power series, can converge to a point different from the extreme point. To overcome this problem, we proposed an algorithm with a variable lower limit of integration, reducing the influence of long-time memory terms. By truncating the higher-order terms, we obtained the ψ-FGM, which allows easy implementation in practice. Furthermore, we optimized the step size in each iteration and considered a variable order of differentiation to increase the precision and speed of the method. These two tunable parameters improved the performance of the method in terms of speed of convergence.
Our numerical simulations showed that the proposed FGM achieved approximations with equal or better precision, but in far fewer iterations, compared with the classical gradient method with optimized step size. We emphasize that in our 2D numerical simulations, the Matyas and the Wayburn and Seader No. 1 functions are well-known benchmark functions used to test optimization methods. These functions are plate- and valley-shaped, respectively, representing an extra challenge in numerical simulations.
In future works, it would be interesting to further develop this theory to see its application in the field of convolutional neural networks.

Theorem 4. Let α ∈ [0, 1] and µ ∈ [0, 1]. Suppose that f is of class C² and strongly convex. Considering the ψ-fractional differential Equation (23), where the step size θ is constant, the solution z(·) converges to y* with the stated upper bound.

Figure 4. Algorithm 1 for different variable orders of differentiation.

Figure 11. Iterates for f₂(Ψ₃(X)).

In the final set of figures and tables, the function f₃ is composed with Ψ₁ and Ψ₂. Taking into account the results obtained previously for the Matyas function, where it is clear that Algorithm 2 leads to worse results in terms of speed and accuracy, we only implemented the classical gradient method and Algorithm 3. Figure 12 shows the graphical representation of f₃(Ψ₁(X)) = (x⁶ + y⁴ − 17)² + (2x + y − 4)², with local minimum at the point X* = (1, 2), and of f₃(Ψ₂(X)) = (x¹² + y⁸ − 17)² + (2x² + y² − 4)², with local maximum at X* = (0, 0) (or a local minimum of the function −f₃(Ψ₂(X))).

Table 1. Information about the k-iterations associated with Figures 7 and 8.

Table 3. Information about the k-iterations associated with Figures 13 and 14.