The gradient descent method typically takes steps proportional to the negative gradient (or approximate gradient) of the function at the current iterate; that is, $x_k$ is updated by the following law:
$$x_{k+1} = x_k - \eta\, f'(x_k),$$
where $\eta > 0$ is the step size or learning rate, and $f'(x_k)$ is the first derivative of $f$ evaluated at $x_k$. We assume that $f$ admits a local minimum at the point $x^{*}$ in $\left]x^{*}-r,\, x^{*}+r\right[$, for some $r > 0$, and that $f$ admits a Taylor series expansion centered at $x^{*}$,
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x^{*})}{n!}\,(x - x^{*})^{n}, \qquad (41)$$
with domain of convergence containing $\left]x^{*}-r,\, x^{*}+r\right[$. As we want to consider the fractional gradient in the $\psi$-Hilfer sense, our first (and natural) attempt is to consider the iterative method
$$x_{k+1} = x_k - \eta\, {}^{H}\mathbb{D}_{a+}^{\alpha,\beta;\psi} f(x_k), \qquad (42)$$
where ${}^{H}\mathbb{D}_{a+}^{\alpha,\beta;\psi}$ is the $\psi$-Hilfer derivative of order $\alpha$ and type $\beta$, given by (2), and the function $\psi$ satisfies the conditions of Definition 2.
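For later comparison with the fractional schemes, the classical update at the start of this section can be sketched numerically; the quadratic test function $f(x) = (x-1)^2$, the starting point, and the step size below are illustrative choices of ours, not taken from the text:

```python
def grad_descent(df, x0, eta=0.1, iters=200):
    """Classical gradient descent: x_{k+1} = x_k - eta * f'(x_k)."""
    x = x0
    for _ in range(iters):
        x = x - eta * df(x)
    return x

# Illustrative quadratic f(x) = (x - 1)^2, with minimum at x* = 1.
df = lambda x: 2.0 * (x - 1.0)
x_min = grad_descent(df, x0=0.5)
```

For this convex quadratic the iteration contracts the error by a constant factor per step, so it reaches the minimizer to high accuracy within a few hundred iterations.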
However, a simple example shows that (42) is not the correct approach. In fact, let us consider the quadratic function $f(x) = (x - x^{*})^{2}$, with a minimum at $x^{*}$. For this function, the $\psi$-Hilfer derivative ${}^{H}\mathbb{D}_{a+}^{\alpha,\beta;\psi} f$ can be computed in closed form, and the point where it vanishes depends on $a$, $\alpha$, $\beta$, and $\psi$. As this stationary point of the iteration differs from $x^{*}$, the iterative method (42) does not converge to the real minimum point. This example shows that the $\psi$-Hilfer FGM with a fixed lower limit of integration does not converge to the minimum point. This is due to the influence of long-time memory terms, which is an intrinsic feature of fractional derivatives.
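This failure can be checked numerically in the simplest setting, taking $\psi(x) = x$ and type $\beta = 1$, for which the $\psi$-Hilfer derivative reduces to a Caputo derivative with lower limit $a$; the closed-form derivative below follows from the fractional power rule, and the values $a = 0$, minimum at $1$, $\alpha = 0.5$, and $\eta = 0.1$ are illustrative choices of ours:

```python
import math

def caputo_grad_quadratic(x, a, c, alpha):
    """Caputo derivative (order alpha, fixed lower limit a) of f(x) = (x - c)^2,
    obtained term by term from the power rule; valid for x > a."""
    return (2.0 * (x - a) ** (2.0 - alpha) / math.gamma(3.0 - alpha)
            + 2.0 * (a - c) * (x - a) ** (1.0 - alpha) / math.gamma(2.0 - alpha))

a, c, alpha, eta = 0.0, 1.0, 0.5, 0.1
x = 0.5                      # starting point inside ]a, +infinity[
for _ in range(2000):
    x = x - eta * caputo_grad_quadratic(x, a, c, alpha)

# The iteration stalls where the fractional derivative vanishes, namely at
# a + (c - a) * (2 - alpha) = 1.5, and not at the true minimum c = 1.
```

Setting the closed-form derivative to zero gives the stationary point $a + (c-a)(2-\alpha)$, which coincides with the true minimizer $c$ only in the classical limit $\alpha \to 1$.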
In order to address this problem, and inspired by the ideas presented in [14,15], we replace the starting point $a$ in the fractional derivative by the point $x_{k-1}$ of the previous iteration, that is,
$$x_{k+1} = x_k - \eta\, {}^{H}\mathbb{D}_{x_{k-1}+}^{\alpha,\beta;\psi} f(x_k), \qquad (44)$$
where $\alpha \in \left]0,1\right[$, $\beta \in [0,1]$, and $k \geq 1$. This eliminates the long-time memory effect during the iteration procedure. In this sense, and taking into account the series representation (41) and the differentiation rule (5), we obtain a series representation (45) of ${}^{H}\mathbb{D}_{x_{k-1}+}^{\alpha,\beta;\psi} f(x_k)$ in powers of $\psi(x_k) - \psi(x_{k-1})$, whose coefficients involve the derivatives of $f$ at the previous iterate and take different forms according to the values of $\alpha$ and $\beta$. Thus, the representation formula (45) depends only on the current iterate $x_k$ and the previous iterate $x_{k-1}$. With this modification in the $\psi$-Hilfer FGM, we obtain the following convergence results.
Proof. Let $x^{*}$ be the minimum point of $f$. We prove that the sequence $(x_k)_{k \in \mathbb{N}}$ converges to $x^{*}$ by contradiction. Assume that $(x_k)$ converges to a different point $\widetilde{x}$, with $\widetilde{x} \neq x^{*}$. As the algorithm is convergent, we have that $x_{k+1} - x_k \to 0$ as $k \to +\infty$. Moreover, for any small positive $\varepsilon$, there exists a sufficiently large number $N$ such that $|x_k - \widetilde{x}| < \varepsilon$ for any $k > N$. Thus, $|x_{k+1} - x_k| < 2\varepsilon$ must hold for any $k > N$. From (45) we obtain an explicit series expression for the increment $x_{k+1} - x_k$ in powers of $\psi(x_k) - \psi(x_{k-1})$. Considering $k > N$, we have, from the previous expression, a lower bound for $|x_{k+1} - x_k|$. The geometric series in the previous expression is convergent for sufficiently large $k$. Hence, we obtain the estimate (48), which is equivalent to a positive lower bound for $|x_{k+1} - x_k|$, where the constant is the one given in (50). One can always find $\varepsilon$ sufficiently small such that this constant is positive, because the function appearing in (50) is positively increasing for the admissible values of $\alpha$, $\beta$, and $\psi$. Hence, from (48), and taking into account (50), we obtain a contradiction with $x_{k+1} - x_k \to 0$. Therefore, $(x_k)$ converges to $x^{*}$. □
Sometimes, the function $f$ is not smooth enough to admit a series representation of the form (41), and therefore the implementation of (44) using the series (45) is not possible. For implementation in practice, we need to truncate the series. In our first approach, we consider only the term of the series containing $f'(x_k)$, as it is the most relevant one for the gradient method. Thus, the $\psi$-Hilfer FGM (44) simplifies to
$$x_{k+1} = x_k - \frac{\eta\, f'(x_k)}{\Gamma(2-\alpha)}\, \left|\psi(x_k) - \psi(x_{k-1})\right|^{1-\alpha}.$$
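A minimal numerical sketch of the truncated scheme, under our reading of the one-term truncation $x_{k+1} = x_k - \eta\, f'(x_k)\,|\psi(x_k) - \psi(x_{k-1})|^{1-\alpha}/\Gamma(2-\alpha)$; here $\psi(x) = x$, and the test function, starting points, and parameter values are illustrative choices of ours:

```python
import math

def psi_hilfer_fgm(df, psi, x0, x1, alpha=0.5, eta=0.5, iters=5000):
    """Truncated psi-Hilfer fractional gradient method: the lower limit of the
    fractional derivative is the previous iterate, so only the short-memory
    term |psi(x_k) - psi(x_{k-1})|^(1 - alpha) enters each step."""
    x_prev, x = x0, x1
    for _ in range(iters):
        step = (eta * df(x) / math.gamma(2.0 - alpha)
                * abs(psi(x) - psi(x_prev)) ** (1.0 - alpha))
        x_prev, x = x, x - step
    return x

# Same quadratic f(x) = (x - 1)^2 as before; with the moving lower limit
# the iteration now approaches the true minimum x* = 1.
x_min = psi_hilfer_fgm(df=lambda x: 2.0 * (x - 1.0), psi=lambda x: x,
                       x0=0.5, x1=0.6)
```

In contrast with the fixed-lower-limit experiment above, the iterates no longer stall at a spurious stationary point, although the final approach to the minimizer is slower than in the classical method.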
As we have seen, it is possible to construct a $\psi$-Hilfer FGM that converges to the minimum point of a function. To improve the convergence of the proposed method, we can consider a variable order of differentiation $\alpha_k$ in each iteration. Some examples of $\alpha_k$, depending on a loss function to be minimized in each iteration, can be found in [15]. The loss function is defined as a square, which guarantees its non-negativity. All the examples given satisfy $\alpha_k \to 1$ as $x_k \to x^{*}$, which results from the fact that the loss tends to zero as $x_k$ approaches $x^{*}$. Variable order differentiation turns the $\psi$-Hilfer FGM into a learning method: as $x_k$ gradually approaches $x^{*}$, the order $\alpha_k$ approaches $1$ and the update approaches the classical gradient step. The $\psi$-Hilfer FGM with variable order is given by
$$x_{k+1} = x_k - \frac{\eta\, f'(x_k)}{\Gamma(2-\alpha_k)}\, \left|\psi(x_k) - \psi(x_{k-1})\right|^{1-\alpha_k}.$$
Theorem 6 remains valid for this variation of Algorithm 1.
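The variable-order variant can be sketched in the same setting. The schedule $\alpha_k = \exp(-L_k)$ with squared loss $L_k = (f'(x_k))^2$ is a hypothetical example of ours (not claimed to be one of the schedules of [15]) that satisfies $\alpha_k \to 1$ as $x_k \to x^{*}$; $\psi(x) = x$ and the remaining parameters are illustrative:

```python
import math

def variable_order_fgm(df, psi, x0, x1, eta=0.1, iters=1000):
    """Truncated psi-Hilfer FGM with a variable differentiation order alpha_k.
    As the squared loss shrinks, alpha_k -> 1 and the update tends to the
    classical gradient step x - eta * f'(x)."""
    x_prev, x = x0, x1
    for _ in range(iters):
        loss = df(x) ** 2            # hypothetical squared loss, always >= 0
        alpha = math.exp(-loss)      # in ]0, 1]; tends to 1 near the minimum
        step = (eta * df(x) / math.gamma(2.0 - alpha)
                * abs(psi(x) - psi(x_prev)) ** (1.0 - alpha))
        x_prev, x = x, x - step
    return x

x_min = variable_order_fgm(df=lambda x: 2.0 * (x - 1.0), psi=lambda x: x,
                           x0=0.5, x1=0.6)
```

Because the order approaches $1$ near the minimizer, the late iterations behave like classical gradient descent and recover its geometric rate of convergence.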