1. Introduction
The gradient descent methodology is not computationally efficient in all applications. Optimization algorithms sometimes become stuck in flat regions of the loss manifold, and in those cases the algorithm requires a long time to escape. This is the challenge of a vanishing gradient, where, for instance, the gradient of the cost function is almost zero (see Section 4). The method of stochastic gradient descent (SGD) generally overcomes this problem.
Recent advancements in the field of fractional stochastic processes exhibit the theoretical benefits of modeling complex systems [1,2,3,4]. Yet, so far, no literature exists on fractional stochastic gradient descent (fSGD) or fractional stochastic networks. This paper sketches the potential of such a new literature for modeling complex systems. Moreover, we argue that fractional stochastic processes are an advancement for machine learning (ML) and artificial intelligence (AI).
The methodology of fractional stochastic gradient descent and the role of stochastic neural networks are based on a generalized assumption of randomness. Mandelbrot and Van Ness defined a fractional Brownian motion (fBM), $B^H_t$, together with a Hurst parameter $H$ in 1968 [5]. For $H = \frac{1}{2}$, we obtain a standard Brownian motion. Yet, for $H \neq \frac{1}{2}$, we obtain new forms of randomness or stochastic processes that match real-world phenomena.
The new feature of a fractional Brownian motion (fBM) is that its increments are interdependent. A closely related property in the literature is self-similarity: a self-similar stochastic process reveals invariance with respect to the time scale (scaling invariance). A standard Brownian motion or a Lévy process displays different properties: they have independent increments and belong to the famous class of Markov processes.
However, in science, there is ubiquitous evidence that fractional stochastic processes are relevant. For instance, we frequently observe probability densities with sharp peaks, which is related to the phenomenon of long-range dependence. In many real-world observations and applications, we likewise find the presence of interdependence. This pattern can be captured by fractional stochastic processes.
Nonetheless, some phenomena are even more complicated and require a further generalization towards sub-fractional stochastic processes. The literature on sub-fBMs demonstrates that those stochastic processes are useful in scientific applications [6]. A sub-fractional Brownian motion provides a nexus between a Brownian motion and a fractional stochastic process. These processes were introduced by Tudor et al. [7,8] and Bojdecki et al. [9]. Note that, as sub-fractional stochastic processes are not martingales, the basic tools of stochastic analysis are insufficient. However, researchers have developed new machinery to handle fractional stochastic processes, such as [10] or [11,12,13,14,15].
In this paper, our purpose is to develop and study the idea of fractional stochastic gradient descent algorithms. Our approach generalizes the existing literature on stochastic gradient descent (SGD) and stochastic neural networks. For instance, Hopfield [16] developed neural networks consisting of several perceptrons with randomness. Similarly, a Boltzmann network is a type of stochastic neural network wherein the output of the activation function is interpreted as a probability.
Studies already exist on stochastic gradient descent and its challenges in machine learning [17,18]. Recent developments in the theory and applications of stochastic gradient descent are discussed in the following papers: Schmidt et al. [19], Haochen and Sra [20], Gotmare et al. [21], Curtis and Scheinberg [22], and de Roos et al. [23]. The focus of our research is the motivation of fractional stochastic gradient descent (fSGD) algorithms. Thus, our research goes beyond the scope of the current literature and focuses on the theoretical possibility of fractional stochastic gradient descent; we neglect potential computational limitations in machine learning.
The paper is organized as follows. Section 2 provides preliminary definitions. Subsequently, we introduce the foundations of fractional stochastic processes in Section 3. Section 4 introduces the idea of fractional stochastic search algorithms and derives the convergence results in general. Finally, in Section 5, we apply the method to two different cases. Section 6 concludes the paper.
2. Preliminaries
Machine learning is mainly based on neural networks and efficient optimization algorithms. The most primitive neural model is inspired by the work of Rosenblatt [24]. In the following section, we define the major elements from a machine learning perspective.
Definition 1. A stochastic neuron is defined by n inputs $x = (x_1, \dots, x_n)$, n weighting factors $w = (w_1, \dots, w_n)$, and an n-dimensional vector of biases $b$, together with a sigmoid activation function $\sigma(a) = \frac{1}{1+e^{-a}}$ with activation potential $a = \sum_{i=1}^{n} (w_i x_i + b_i)$ and a stochastic output $y \in \{0,1\}$. Hence, we define the output for $y = 1$ by the probability $P(y=1) = \sigma(a)$ and for $y = 0$ by the inverse probability $P(y=0) = 1 - \sigma(a)$.
Note that, if the activation potential is greater than zero, such as $a > 0$, then this neural network is not necessarily activated according to Definition 1. An activation value of one only occurs with the probability of the activation function.
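To make Definition 1 concrete, the following minimal Python sketch samples the output of a stochastic neuron; the variable names, the NumPy implementation, and the example inputs are our own illustrative assumptions rather than part of the original definition.

```python
import numpy as np

def stochastic_neuron(x, w, b, rng=np.random.default_rng()):
    """Stochastic neuron in the sense of Definition 1.

    The activation potential is a = sum_i (w_i * x_i + b_i); the sigmoid
    value sigma(a) is interpreted as the probability of emitting 1.
    """
    a = np.dot(w, x) + np.sum(b)          # activation potential
    p = 1.0 / (1.0 + np.exp(-a))          # sigmoid activation
    return int(rng.random() < p)          # 1 with probability sigma(a), else 0

# Even with a > 0 the neuron fires only with probability sigma(a).
x = np.array([0.2, 0.4, 0.1])
w = np.array([0.5, -0.3, 0.8])
b = np.array([0.05, 0.05, 0.05])
samples = [stochastic_neuron(x, w, b) for _ in range(1000)]
print("empirical firing rate:", np.mean(samples))
```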
In machine learning, the gradient descent algorithm is omnipresent in optimization problems. Yet, it does not provide robust solutions in every case. There are computational obstacles, such as when the algorithm becomes stuck in a local minimum or wanders across a plateau from which it takes a long time to escape. A plateau is defined as a flat surface region where the gradient is very small (or almost zero).
The optimization algorithm of a neural network always has the goal of finding the optimal weighting parameters $w$. The standard approach is to reformulate the training problem as the minimization of a cost function and to update the parameters iteratively along the negative gradient. This is called the gradient descent method and is closely related to Newton's algorithm in numerical computing. The following definition summarizes the algorithm from a machine learning vantage point.
Definition 2. The gradient descent algorithm is defined by
$$w_{k+1} = w_k - \eta\, M\, \nabla z(w_k), \qquad (1)$$
where $\nabla z$ is the gradient of a cost function $z$, $M$ is an optional conditioning matrix, and $\eta > 0$ is the learning rate.
The stochastic gradient descent (SGD) method overcomes the obstacle of a gradient that is close to zero. Indeed, SGD reaches a minimum along a non-linear stochastic process. In the following sections, we first discuss the literature and then generalize the approach to fractional stochastic processes.
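As a minimal illustration of the update rule in Definition 2, the following Python sketch applies gradient descent to a simple quadratic cost; the cost function, the choice of the conditioning matrix $M$ (here the identity), and the learning rate are hypothetical choices made only for this example.

```python
import numpy as np

def gradient_descent(grad_z, w0, eta=0.1, M=None, n_steps=100):
    """Plain gradient descent: w_{k+1} = w_k - eta * M * grad_z(w_k)."""
    w = np.asarray(w0, dtype=float)
    M = np.eye(w.size) if M is None else M   # optional conditioning matrix
    for _ in range(n_steps):
        w = w - eta * M @ grad_z(w)
    return w

# Hypothetical quadratic cost z(w) = ||w - 1||^2 with gradient 2 (w - 1).
grad_z = lambda w: 2.0 * (w - 1.0)
print(gradient_descent(grad_z, w0=[5.0, -3.0]))   # converges towards [1, 1]
```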
3. Fractional Stochastic Processes
3.1. General Definitions
Consider a stochastic process $B^H_t$ with a Hurst parameter H. Subsequently, we define the elementary tools of fractional calculus.
Definition 3. Let $f \in L^1([a,b])$ and $\alpha > 0$, where $a, b \in \mathbb{R}$ and $a < b$. The left- and right-sided fractional integrals of f of order α are defined for almost all $x \in (a,b)$, respectively, as
$$\left( I^{\alpha}_{a+} f \right)(x) = \frac{1}{\Gamma(\alpha)} \int_a^x (x-t)^{\alpha-1} f(t)\, dt$$
and
$$\left( I^{\alpha}_{b-} f \right)(x) = \frac{1}{\Gamma(\alpha)} \int_x^b (t-x)^{\alpha-1} f(t)\, dt.$$
This is the fractional integral of the Riemann–Liouville type. In the same vein, we define fractional derivatives, where we distinguish between left- and right-sided derivatives.
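For intuition, the following sketch approximates the left-sided Riemann–Liouville integral of Definition 3 with a simple midpoint quadrature; the grid size and the test function are our own illustrative choices, not part of the paper.

```python
import numpy as np
from math import gamma

def rl_left_integral(f, a, x, alpha, n=10_000):
    """Midpoint approximation of (I^alpha_{a+} f)(x) =
    1/Gamma(alpha) * integral_a^x (x - t)^(alpha - 1) f(t) dt."""
    t = np.linspace(a, x, n + 1)
    mid = 0.5 * (t[:-1] + t[1:])     # midpoints avoid the kernel singularity at t = x
    dt = (x - a) / n
    kernel = (x - mid) ** (alpha - 1.0)
    return np.sum(kernel * f(mid)) * dt / gamma(alpha)

# Check against the known result I^alpha_{0+} 1 = x^alpha / Gamma(alpha + 1).
alpha, x = 0.5, 2.0
print(rl_left_integral(lambda t: np.ones_like(t), 0.0, x, alpha))
print(x ** alpha / gamma(alpha + 1.0))
```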
Definition 4. The fractional left- and right-sided derivatives, for $0 < \alpha < 1$ and $x \in (a,b)$, are defined by
$$\left( D^{\alpha}_{a+} f \right)(x) = \frac{1}{\Gamma(1-\alpha)}\, \frac{d}{dx} \int_a^x \frac{f(t)}{(x-t)^{\alpha}}\, dt$$
and
$$\left( D^{\alpha}_{b-} f \right)(x) = \frac{-1}{\Gamma(1-\alpha)}\, \frac{d}{dx} \int_x^b \frac{f(t)}{(t-x)^{\alpha}}\, dt$$
for all $f \in I^{\alpha}_{a+}(L^1)$ and $f \in I^{\alpha}_{b-}(L^1)$, respectively, where $I^{\alpha}_{a+}(L^1)$ is the image of $L^1([a,b])$ under the fractional integral of Definition 3. Let us assume $f = I^{\alpha}_{a+}\varphi$ for some $\varphi \in L^1$; then we obtain $D^{\alpha}_{a+} I^{\alpha}_{a+} \varphi = \varphi$, i.e., the fractional derivative inverts the fractional integral.
Notably, $\left( D^{\alpha}_{a+} f \right)(x)$ exists for almost all $x$ if $f \in I^{\alpha}_{a+}(L^1)$. Given those definitions, we are ready to define a fractional Brownian motion:
Definition 5. Let H be in $(0,1)$, and let $b_0$ be an arbitrary real number. We call $B^H_t$ a fractional Brownian motion (fBM) with Hurst parameter H and starting value $b_0$ at time 0, such that
- 1. $B^H_0 = b_0$, and;
- 2. $B^H_t - B^H_0 = \frac{1}{\Gamma\left(H + \frac{1}{2}\right)} \left( \int_{-\infty}^{0} \left[ (t-s)^{H-\frac{1}{2}} - (-s)^{H-\frac{1}{2}} \right] dB_s + \int_0^t (t-s)^{H-\frac{1}{2}}\, dB_s \right)$ [Weyl fractional integral];
- 3. Equivalent to the Riemann–Liouville integral:
$B^H_t = \frac{1}{\Gamma\left(H + \frac{1}{2}\right)} \int_0^t (t-s)^{H-\frac{1}{2}}\, dB_s$.
Next, let us consider the following corollary:
Corollary 1. Consider $H = \frac{1}{2}$ and $b_0 = 0$. Then the fractional Brownian motion is a standard Brownian motion, $B^{1/2}_t = B_t$.
Proof. Let $H = \frac{1}{2}$; then $\Gamma\left(H + \frac{1}{2}\right) = \Gamma(1) = 1$, and from the Riemann–Liouville representation we find $B^{1/2}_t = \int_0^t (t-s)^0\, dB_s = B_t$. □
In the literature, there exists an alternative, yet useful, definition:
Definition 6. A fractional Brownian motion is a centered Gaussian process $B^H_t$, $t \geq 0$, defined by the following covariance function
$$\mathbb{E}\left[ B^H_t B^H_s \right] = \frac{1}{2}\left( |t|^{2H} + |s|^{2H} - |t-s|^{2H} \right),$$
where the Hurst index is denoted by $H \in (0,1)$.
Since the covariance of a Brownian motion is given in the literature as $\mathbb{E}\left[ B_t B_s \right] = \min(s,t)$, it is easy to extend the definition to an fBM with Hurst index H, as in Definition 6, where we obtain the definition of a Brownian motion for $H = \frac{1}{2}$. Following Herzog [15], we derive the covariance step-by-step:
Corollary 2. Consider a fractional Brownian motion. The expectation values of non-overlapping increments are zero, $\mathbb{E}\left[ B^H_t - B^H_s \right] = 0$, and the variance is of $\mathbb{E}\left[ \left( B^H_t - B^H_s \right)^2 \right] = |t-s|^{2H}$ for all $s \leq t$.
3.2. Properties
Next, we consider the properties of the fBM over time for different Hurst parameters. Suppose $H < \frac{1}{2}$ or $H > \frac{1}{2}$. If we assume that the Hurst parameter is of $H < \frac{1}{2}$, we say the fractional stochastic process has a short memory. Conversely, if $H > \frac{1}{2}$, we obtain the property of long-range dependence. Figure 1 illustrates sample processes for the three ranges of the Hurst parameter, $H < \frac{1}{2}$, $H = \frac{1}{2}$, and $H > \frac{1}{2}$.
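As an illustration of these regimes (and of the kind of sample paths shown in Figure 1), the following Python sketch draws fBM paths by a Cholesky factorization of the covariance in Definition 6; the grid, horizon, and Hurst values are our own illustrative choices.

```python
import numpy as np

def sample_fbm(H, n=500, T=1.0, rng=np.random.default_rng(0)):
    """Sample one fBM path on [0, T] via Cholesky of the Definition 6 covariance."""
    t = np.linspace(T / n, T, n)                       # strictly positive grid points
    s, u = np.meshgrid(t, t)
    cov = 0.5 * (s**(2*H) + u**(2*H) - np.abs(s - u)**(2*H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))    # small jitter for numerical stability
    path = L @ rng.standard_normal(n)
    return np.concatenate(([0.0], t)), np.concatenate(([0.0], path))

for H in (0.3, 0.5, 0.7):     # short memory, standard BM, long-range dependence
    t, x = sample_fbm(H)
    print(f"H = {H}: one sample path simulated, terminal value = {x[-1]:.3f}")
```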
Proposition 1. Given a fractional Brownian motion, we obtain the following properties:
- 1. The fBM has stationary increments: $B^H_{t+s} - B^H_s \overset{d}{=} B^H_t - B^H_0$;
- 2. The fBM is H-self-similar, such as $B^H_{at} \overset{d}{=} a^H B^H_t$ for all $a > 0$;
- 3. The variance of the increments is $\mathbb{E}\left[ \left( B^H_t - B^H_s \right)^2 \right] = |t-s|^{2H}$.
Proof. The proof follows Herzog [15]. In order to prove the stationarity of the increments, we set $Y_t := B^H_{t+s} - B^H_s$. The equality of the covariance implies $\mathbb{E}\left[ Y_t Y_u \right] = \frac{1}{2}\left( t^{2H} + u^{2H} - |t-u|^{2H} \right) = \mathbb{E}\left[ B^H_t B^H_u \right]$. Moreover, it has the same distribution, such as $Y_t \overset{d}{=} B^H_t$, because a centered Gaussian process is determined by its covariance. Subsequently, we find that the covariance of the increments depends only on the time lag and not on the starting point $s$. This demonstrates that the increments and the time evolution of the increments are the same at any given point. Consequently, we obtain stationary increments.
The second property of Proposition 1 is self-similarity. Consider the covariance of the rescaled process: $\mathbb{E}\left[ B^H_{at} B^H_{as} \right] = \frac{1}{2}\left( (at)^{2H} + (as)^{2H} - |a(t-s)|^{2H} \right) = a^{2H}\, \mathbb{E}\left[ B^H_t B^H_s \right]$. Here, we find that $B^H_{at}$ and $a^H B^H_t$ have the same covariance and hence the same distribution. Part (3) is already given in Corollary 2. □
3.3. Definition of Sub-Fractional Processes
In a recent paper, Herzog [15] described a sub-fractional Brownian motion (sub-fBM) as an intermediate between a Brownian motion and a fractional Brownian motion. Without loss of generality, a sub-fBM is a self-similar Gaussian process. Note that both the fBM and the sub-fBM have the properties of self-similarity and long-range dependence, yet a sub-fBM does not have stationary increments [9].
A Brownian motion, like any centered Gaussian process, is uniquely defined by its covariance. For the sub-fBM, we denote the covariance by $C(s,t)$.
Definition 7. Consider a sub-fractional Brownian motion $S^H_t$ with Hurst parameter H, a centered mean-zero Gaussian process with the following covariance function
$$C(s,t) = s^{2H} + t^{2H} - \frac{1}{2}\left[ (s+t)^{2H} + |t-s|^{2H} \right],$$
where $s, t \geq 0$ and $H \in (0,1)$.
Note that a fractional Brownian motion coincides with a Brownian motion if the Hurst parameter is $H = \frac{1}{2}$. Thus, a Brownian motion on the real line has a covariance of $\mathbb{E}\left[ B_t B_s \right] = \frac{1}{2}\left( |t| + |s| - |t-s| \right)$. The process $S^H_t$ has the following representation for a fractional Brownian motion $B^H$ defined on the whole real line (see [25]):
$$S^H_t = \frac{1}{\sqrt{2}}\left( B^H_t + B^H_{-t} \right).$$
The kernel function of the corresponding Wiener-integral representation of a sub-fractional Brownian motion is given in [25].
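A minimal sketch of how one might sample a sub-fBM directly from the covariance function of Definition 7 follows; the grid and the Hurst value are illustrative assumptions.

```python
import numpy as np

def sub_fbm_cov(s, t, H):
    """Covariance C(s, t) of a sub-fractional Brownian motion (Definition 7)."""
    return s**(2*H) + t**(2*H) - 0.5 * ((s + t)**(2*H) + np.abs(t - s)**(2*H))

def sample_sub_fbm(H, n=500, T=1.0, rng=np.random.default_rng(1)):
    t = np.linspace(T / n, T, n)
    s, u = np.meshgrid(t, t)
    cov = sub_fbm_cov(s, u, H)
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))    # jitter for numerical stability
    return t, L @ rng.standard_normal(n)

t, x = sample_sub_fbm(H=0.7)
print("terminal value of one sub-fBM path:", x[-1])
```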
3.4. Properties of Sub-Fractional Processes
In this subsection, we reiterate useful properties of sub-fractional Brownian motions, such as those described in Herzog [15].
Lemma 1. Consider $S^H_t$ to be a sub-fBM for all $t \geq 0$. The properties of the sub-fBM are:
- 1. $\mathbb{E}\left[ S^H_t \right] = 0$.
- 2. $\mathbb{E}\left[ \left( S^H_t \right)^2 \right] = \left( 2 - 2^{2H-1} \right) t^{2H}$.
- 3. If $0 < s < t$, then $\mathbb{E}\left[ \left( S^H_t - S^H_s \right)^2 \right] \neq (t-s)^{2H}$, i.e., the increments are non-stationary.
Finally, we follow Herzog [15] and prove the following proposition:
Proposition 2. Let $B^H_t$ be a fractional Brownian motion and $S^H_t$ be a sub-fractional Brownian motion. For $H \geq \frac{1}{2}$, the following holds:
- 1. $\mathbb{E}\left[ \left( S^H_t \right)^2 \right] \leq \mathbb{E}\left[ \left( B^H_t \right)^2 \right]$;
- 2. $\mathbb{E}\left[ \left( S^H_t - S^H_s \right)^2 \right] \leq \mathbb{E}\left[ \left( B^H_t - B^H_s \right)^2 \right]$ for $0 \leq s \leq t$.
Proof. Obviously, an fBM has the following variance: $\mathbb{E}\left[ \left( B^H_t \right)^2 \right] = t^{2H}$. Similarly, we obtain the variance of $\left( 2 - 2^{2H-1} \right) t^{2H}$ for a sub-fBM. Subsequently, we have $\left( 2 - 2^{2H-1} \right) t^{2H} \leq t^{2H}$ if $H \geq \frac{1}{2}$.
The second part follows for $0 \leq s \leq t$:
$$\mathbb{E}\left[ \left( S^H_t - S^H_s \right)^2 \right] = (t+s)^{2H} + (t-s)^{2H} - 2^{2H-1}\left( t^{2H} + s^{2H} \right) \leq (t-s)^{2H} = \mathbb{E}\left[ \left( B^H_t - B^H_s \right)^2 \right],$$
since $(t+s)^{2H} \leq 2^{2H-1}\left( t^{2H} + s^{2H} \right)$ by the convexity of $x \mapsto x^{2H}$ for $H \geq \frac{1}{2}$. In the case of $H = \frac{1}{2}$ or $s = t$, we have equality. □
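A quick numeric sanity check of the variance comparison in Proposition 2; the Hurst value and the time grid are illustrative assumptions.

```python
import numpy as np

H = 0.7
t = np.linspace(0.01, 2.0, 200)
var_fbm = t**(2*H)                                 # Var(B^H_t) = t^{2H}
var_sub = (2.0 - 2.0**(2*H - 1.0)) * t**(2*H)      # Var(S^H_t) = (2 - 2^{2H-1}) t^{2H}
print("sub-fBM variance never exceeds fBM variance:", np.all(var_sub <= var_fbm))
```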
4. Fractional Stochastic Search
Let $X_t$ be an m-dimensional stochastic process driven by a fractional Brownian motion $B^H_t$, where $H \in (0,1)$ is the Hurst parameter. The respective stochastic process is as follows:
$$dX_t = b(X_t)\, dt + \sigma(X_t)\, dB^H_t, \qquad (10)$$
where $X_0 = x_0$ is the initial value and $B^H_t$ is an m-dimensional fractional Brownian motion. Next, consider a cost function $z: \mathbb{R}^m \to \mathbb{R}$ which needs to be optimized. Hence, we study the vector field for which the auxiliary function $t \mapsto \mathbb{E}\left[ z(X_t) \right]$ is decreasing. This requires us to find the expectation value $\mathbb{E}\left[ z(X_t) \right]$.
Thus, the function $z(X_t)$ is stochastic and dependent on time t. In general, an optimization algorithm of a neural network minimizes the expectation value of this function. Utilizing the machinery of stochastic analysis, Dynkin's formula, among others, and following the approach described in [26], we obtain
$$\mathbb{E}\left[ z(X_t) \right] = z(x_0) + \mathbb{E}\left[ \int_0^t (A z)(X_s)\, ds \right], \qquad (12)$$
where the operator $A = \sum_{i=1}^m b_i \frac{\partial}{\partial x_i} + \frac{1}{2} \sum_{i,j=1}^m \left( \sigma \sigma^\top \right)_{ij} \frac{\partial^2}{\partial x_i \partial x_j}$. The usage of a Taylor-series approximation and the differencing of Equation (12) yields
$$\frac{d}{dt}\, \mathbb{E}\left[ z(X_t) \right] = \mathbb{E}\left[ \left( \nabla z \right)^\top b + \frac{1}{2} \operatorname{tr}\left( \sigma \sigma^\top \nabla^2 z \right) \right].$$
The method of steepest descent computes the gradient of $z$ such that the time derivative of the process, $\frac{d}{dt}\, z(X_t)$, is as negative as possible. However, if $X_t$ is a stochastic process, we need to study the expectation of the gradient, particularly where the value is as negative as possible, such as $\frac{d}{dt}\, \mathbb{E}\left[ z(X_t) \right] < 0$.
In order to construct a stochastic process $X_t$ with this property, we specify the drift $b(X_t)$ and the diffusion $\sigma(X_t)$ in Equation (10), respectively. Next, we specify the diffusion term in Equation (10), $\sigma$, or the product $\sigma \sigma^\top$, which is an $m \times m$ matrix, such that the algorithm in Equation (1) converges efficiently. Indeed, if we set the term of $\sigma \sigma^\top$ as being inversely proportional to the Hessian matrix, $\nabla^2 z$, then the second-order term in the expansion above reduces to a constant, with $\sigma \sigma^\top = c \left( \nabla^2 z \right)^{-1}$ for a constant $c > 0$. Through this process, we can show the convergence of the algorithm and the existence of the solution.
Given that the function z is of class $C^2$ and strictly convex, then, according to [27], the Hessian matrix $\nabla^2 z$ is symmetric, real, positive definite, and non-degenerate. This guarantees that the Hessian matrix $\nabla^2 z$ has an inverse, which is also positive definite. Efficient computation can be achieved by utilizing the Cholesky decomposition. One can show that the diffusion term $\sigma$ can be chosen as a lower triangular matrix satisfying $\sigma \sigma^\top = c \left( \nabla^2 z \right)^{-1}$. Under those conditions, we compute
$$\frac{d}{dt}\, \mathbb{E}\left[ z(X_t) \right] = \mathbb{E}\left[ \left( \nabla z(X_t) \right)^\top b(X_t) \right] + \frac{c\, m}{2}.$$
In order to minimize the gradient of the expected cost, $\frac{d}{dt}\, \mathbb{E}\left[ z(X_t) \right]$, we have to minimize the first term, because the second term is a constant. Choosing the drift as $b = -M \nabla z$ with a positive definite matrix $M$ and assuming $\nabla z \neq 0$ obtains the following condition:
$$\mathbb{E}\left[ \left( \nabla z(X_t) \right)^\top b(X_t) \right] = -\,\mathbb{E}\left[ \left( \nabla z(X_t) \right)^\top M\, \nabla z(X_t) \right] < 0.$$
Using the square vector norm and the assumption of $M$ being the identity matrix, it is sufficient to set $b = -\nabla z$ and $\sigma \sigma^\top = c \left( \nabla^2 z \right)^{-1}$ as the main parameters in the SGD algorithm, such that $\mathbb{E}\left[ \left( \nabla z \right)^\top b \right] = -\,\mathbb{E}\left[ \left\| \nabla z \right\|^2 \right] < 0$. In the sequel, we apply this algorithm to fractional stochastic search problems.
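The construction above suggests a discrete-time update driven by fBM increments, with drift $-\nabla z$ and a diffusion factor obtained from a Cholesky factor of $c\,(\nabla^2 z)^{-1}$. The following Python sketch implements one possible Euler-type discretization under these assumptions; the cost function, step size, constant $c$, and the fBM sampler are illustrative choices on our part, not the paper's prescription.

```python
import numpy as np

def fbm_increments(H, n_steps, dt, rng):
    """Increments of a one-dimensional fBM on a regular grid (Cholesky method)."""
    t = dt * np.arange(1, n_steps + 1)
    s, u = np.meshgrid(t, t)
    cov = 0.5 * (s**(2*H) + u**(2*H) - np.abs(s - u)**(2*H))
    path = np.linalg.cholesky(cov + 1e-12 * np.eye(n_steps)) @ rng.standard_normal(n_steps)
    return np.diff(np.concatenate(([0.0], path)))

def fractional_stochastic_search(grad_z, hess_z, x0, H=0.7, c=0.01,
                                 dt=0.01, n_steps=1000, seed=0):
    """Euler-type fSGD sketch: dX = -grad z(X) dt + sigma dB^H with
    sigma sigma^T = c * (Hessian of z)^{-1}, sigma lower triangular."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = x.size
    dB = np.column_stack([fbm_increments(H, n_steps, dt, rng) for _ in range(m)])
    for k in range(n_steps):
        sigma = np.linalg.cholesky(c * np.linalg.inv(hess_z(x)))   # lower triangular diffusion
        x = x - grad_z(x) * dt + sigma @ dB[k]
    return x

# Hypothetical strictly convex cost z(x) = ||x - 1||^2.
grad_z = lambda x: 2.0 * (x - 1.0)
hess_z = lambda x: 2.0 * np.eye(x.size)
print(fractional_stochastic_search(grad_z, hess_z, x0=[4.0, -2.0]))
```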
5. Application of Fractional Stochastic Search
In this section, we demonstrate how a fractional stochastic search works and exhibit its convergence within neural networks.
5.1. Stochastic Search: Case I
Suppose that we have a neural network with a strictly convex quadratic cost function $z(x)$ for $x \in \mathbb{R}$. The stochastic gradient descent method searches for the minimum of this cost function.
Mathematically, the solution is obvious for this problem: setting the first derivative $z'(x)$ to zero yields the unique minimizer $x^*$, and inserting $x^*$ into $z$ gives the minimum value. Next, we show that we can obtain the same value under a fractional stochastic search algorithm in a neural network.
In step one, we establish an adequate stochastic differential equation according to Equation (10) for $H = \frac{1}{2}$. The gradient of the cost function is equal to the first derivative, $z'(x) = z''\,(x - x^*)$, and the Hessian of the cost function is the second derivative, $z'' > 0$, which is constant. Both quantities enable us to compute the Lipschitz-continuous coefficient functions: for the drift, $b(x) = -z'(x)$; for the diffusion, $\sigma^2 = c / z''$. Hence, we obtain a linear drift and a constant diffusion coefficient. The stochastic differential equation for $X_t$ has the form:
$$dX_t = -z''\,(X_t - x^*)\, dt + \sigma\, dB_t. \qquad (14)$$
The SDE in Equation (14) is an Ornstein–Uhlenbeck process driven by a Brownian motion $B_t$ with the Hurst parameter $H = \frac{1}{2}$ [28].
The solution is divided into two parts: In part one, we solve the non-stochastic problem $\dot{x}_t = -z''\,(x_t - x^*)$. This is an ordinary differential equation and has the solution $x_t = x^* + (x_0 - x^*)\, e^{-z'' t}$. In part two, we define an auxiliary function $f(t,x) = e^{z'' t}\,(x - x^*)$ and apply the Itô–Doeblin lemma:
$$df(t, X_t) = z''\, e^{z'' t}\,(X_t - x^*)\, dt + e^{z'' t}\, dX_t = \sigma\, e^{z'' t}\, dB_t.$$
Note that, in this case, the derivation coincides with that for a standard Brownian motion. Next, integrating the last line yields $e^{z'' t}\,(X_t - x^*) = x_0 - x^* + \sigma \int_0^t e^{z'' s}\, dB_s$. Hence, we obtain the explicit solution of the Ornstein–Uhlenbeck process, which is
$$X_t = x^* + (x_0 - x^*)\, e^{-z'' t} + \sigma \int_0^t e^{-z''\,(t-s)}\, dB_s. \qquad (15)$$
Based on Equation (15), we find the expectation of $X_t$. Note that the expected value of a stochastic integral with respect to a Brownian motion is zero. For $t \to \infty$, the expected value is $\lim_{t \to \infty} \mathbb{E}\left[ X_t \right] = x^*$, i.e., the minimizer of the cost function. Next, utilizing the general condition of Section 4 for the drift, $b = -z'$, and the diffusion, $\sigma^2 = c / z''$, we obtain that the expected value of the cost function decreases along the process.
Finally, it remains to show that the SDE in Equation (14) converges to the minimum value. Hence, we study the convergence sequence $\mathbb{E}\left[ z(X_t) \right]$, where, for $X_t$, we have substituted Equation (15). Next, we use the property that the expected stochastic integral is zero and that the variance of the Brownian motion is $t$. In order to show the convergence, we compute the limit of the sequence for time to infinity. Indeed, we find that the (fractional) stochastic algorithm converges to the same minimum value of our function for $t \to \infty$.
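A simulation sketch of Case I under our own illustrative assumptions (a hypothetical quadratic cost with minimizer at 1, unit curvature, and a small diffusion constant); it shows the Ornstein–Uhlenbeck search settling near the minimizer.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical quadratic cost z(x) = 0.5 * (x - 1)^2 with minimizer x* = 1.
z = lambda x: 0.5 * (x - 1.0) ** 2
dz = lambda x: x - 1.0

dt, n_steps, sigma = 0.01, 5000, 0.05
x = 4.0                                   # initial value x_0
for _ in range(n_steps):
    # Euler-Maruyama step of the OU search dX = -z'(X) dt + sigma dB
    x += -dz(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()

print("final iterate:", x, "cost:", z(x))   # close to the minimizer 1 and minimum 0
```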
5.2. Stochastic Search: Case II
Conversely, suppose a fractional stochastic differential equation with a Hurst index $H$ of the form given in Equation (16), where we define the drift and the diffusion coefficients analogously to Section 4. We search for the minimum of the cost function $z$, where $X_t$ is the solution of the SDE in Equation (16). This equation can be rewritten in the fractional Hida space as Equation (17), where ⋄ is defined as the Wick product. Using Wick calculus, we find the solution as a Wick exponential of the driving fractional Brownian motion, where we have used the definition of the Wick exponential. By applying the corresponding definitions for the fractional Brownian motion $B^H_t$, we obtain the final solution, stated in Equation (18).
The solution of Equation (18) has an expectation governed by its deterministic part alone, since the Wick-exponential noise term has expectation one. Hence, for $t \to \infty$, the expected value is zero: $\lim_{t \to \infty} \mathbb{E}\left[ X_t \right] = 0$. It remains to show the convergence of the fractional SDE in machine learning. In order to show convergence, we compute the limit of the sequence for time to infinity. We equally obtain convergence to the minimum value of the cost function.
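To illustrate the flavour of Case II, the following sketch simulates a Wick-type linear SDE driven by an fBM through its closed-form solution $X_t = x_0 \exp(-\mu t + \sigma B^H_t - \tfrac{\sigma^2}{2} t^{2H})$ and checks that the empirical mean decays to zero. The SDE, its parameters, and the closed form are illustrative assumptions on our part, not the paper's Equation (16).

```python
import numpy as np

rng = np.random.default_rng(7)
H, mu, sigma, x0 = 0.7, 1.0, 0.3, 2.0
t = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

# B^H_t is Gaussian with mean 0 and variance t^{2H}; sample the marginals directly.
n_paths = 200_000
BH = rng.standard_normal((n_paths, t.size)) * t**H

# Closed-form solution of the assumed Wick-type SDE dX = -mu X dt + sigma X ◇ dB^H.
X = x0 * np.exp(-mu * t + sigma * BH - 0.5 * sigma**2 * t**(2 * H))

print("empirical E[X_t]:        ", X.mean(axis=0))
print("theoretical x0*exp(-mu t):", x0 * np.exp(-mu * t))
```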
There are notable limitations of fractional stochastic gradient descent in general. Fractional calculus is built around the Riemann–Liouville integral, which is a non-local operator, lacks uniqueness, and depends on the initial conditions. Given that a fractional process is not a martingale, the common stochastic tools are not applicable. Whether those properties constrain fractional stochastic gradient descent remains an open research question. Computational aspects might also be a limiting factor. However, for the first time, this research studies the idea of a fractional search analogous to stochastic gradient descent in machine learning.
6. Conclusions
This article introduces fractional stochastic gradient descent algorithms for the optimization of neural networks. In the standard case, the fractional stochastic approach follows the well-known stochastic gradient descent method in machine learning. We discuss two special cases. First, we exhibit that fractional stochastic algorithms find the minima. This result might enhance algorithmic optimization in machine learning. Second, we describe the generalized patterns and properties of fractional stochastic processes. These insights may lead to a universal optimization approach in machine learning and AI in the future. We highlight the need for further research in that direction, particularly regarding the computational issues.