The Improved Stochastic Fractional Order Gradient Descent Algorithm

This paper proposes improved stochastic gradient descent (SGD) algorithms with a fractional order gradient for the online optimization problem. For three scenarios, namely the standard learning rate, the adaptive gradient learning rate, and the momentum learning rate, three new SGD algorithms combining a fractional order gradient are designed, and it is shown that the corresponding regret functions converge at a sub-linear rate. We then discuss the impact of the fractional order on convergence and monotonicity and prove that better performance can be obtained by adjusting the order of the fractional gradient. Finally, several practical examples are given to verify the superiority and validity of the proposed algorithms.


Introduction
In the field of machine learning, the gradient descent algorithm is one of the most fundamental methods for optimization problems [1,2]. Due to the continuously expanding scale of data, traditional gradient descent algorithms cannot effectively solve the optimization problems in large-scale machine learning [3-5]. To deal with this situation, the SGD algorithm was introduced; it is an online optimization algorithm that reduces the computational complexity by randomly selecting one or several sample gradients to replace the overall gradient in the iteration process [6]. Recently, to improve performance, the traditional SGD algorithm has been modified in different respects, such as the adaptive gradient algorithm (Adagrad) [7], the Adadelta method [8], and adaptive moment estimation (Adam) [9], where the objective functions were assumed to be convex or strongly convex, which may not be satisfied in applications. Hence, nonconvex learning algorithms were introduced [10].
In some practical problems, the bound of the regret needs to be minimized as far as possible. Hence, the fractional order gradient was introduced due to its good properties, such as long memory and multiple parameters [11,12]. However, the fractional order algorithm may fail to find the optimum [13]. Hence, some improved fractional order gradient algorithms were proposed that truncate the higher order terms [14-16]. By using a variable initial value strategy and truncating the higher order terms, the fractional order gradient descent method is transformed into an adaptive learning rate method whose learning rate corresponds to a power of the previous terms; its long memory property is transformed into a tolerable short memory property. By combining the SGD algorithm with the fractional order gradient, some composite algorithms were introduced in machine learning [17-21]. However, the literature above either applied new algorithms with a fractional order gradient without proving their convergence, or proved convergence only for a rather specific algorithm, such as an RBF neural network; to the best of our knowledge, there are few theoretical works on general fractional online SGD algorithms.
In this paper, we mainly propose three new SGD algorithms with a fractional order gradient for the online optimization problem, under which the bounds of the regret can be lowered by adjusting the fractional order. The main contributions of this paper include the following three aspects.
(1) Three improved stochastic gradient algorithms combining fractional order gradients are proposed for the online optimization problem.
(2) Compared with the results in [20,21], where only simulation results were presented, this paper provides mathematical proofs of the convergence of the proposed algorithms. In addition, it is shown that the fractional order gradient can relieve the gradient exploding phenomenon, which often occurs in deep learning.
(3) The proposed algorithms are applied to the parameter identification problem and the classification problem to examine their effectiveness.

Notation 1. Let $\mathbb{R}^n$ be the Euclidean space of dimension $n$ and $\mathbb{N}$ the set of natural numbers. Given $A \in \mathbb{R}^{n \times m}$ and $x \in \mathbb{R}^{n}$, $\|A\|$ and $\|x\|$ denote the 2-norm, and $|x|$ is the vector of the absolute values of the components of $x$. $\nabla f(x)$ denotes the gradient of $f(x)$.

Materials and Methods
Many researchers have given different definitions of the fractional derivative from different aspects; the Caputo type is the most common in applications due to its integer order initial value conditions, and the Riemann-Liouville type also works in some situations [14]. For a smooth function $f(t)$, the Caputo derivative is defined by
$$ {}_{t_0}^{C}D_t^{\alpha} f(t) = \frac{1}{\Gamma(n-\alpha)}\int_{t_0}^{t}\frac{f^{(n)}(\tau)}{(t-\tau)^{\alpha-n+1}}\,\mathrm{d}\tau, \tag{1} $$
where $n-1<\alpha<n$, $n\in\mathbb{N}$. Formula (1) can be rewritten as a Taylor series as follows:
$$ {}_{t_0}^{C}D_t^{\alpha} f(t) = \sum_{i=n}^{\infty}\frac{f^{(i)}(t_0)}{\Gamma(i+1-\alpha)}(t-t_0)^{i-\alpha}. \tag{2} $$
Let $f(\theta)$ be a convex objective function; the traditional gradient descent algorithm is
$$ \theta_{t+1} = \theta_t - \mu_t g_t, \tag{3} $$
where $g_t=\nabla f(\theta_t)$ and $\mu_t$ is the learning rate. Substituting the gradient in (3) by the fractional order gradient, we get the following iteration algorithm:
$$ \theta_{t+1} = \theta_t - \mu_t\, {}_{t_0}^{C}D_{\theta_t}^{\alpha} f(\theta_t). \tag{4} $$
To eliminate the effect of the global character of the fractional differential operator, we choose the following variable initial value strategy, i.e.,
$$ \theta_{t+1} = \theta_t - \mu_t\, {}_{\theta_{t-1}}^{C}D_{\theta_t}^{\alpha} f(\theta_t). \tag{5} $$
Adopting the Taylor series (2) of the fractional order derivative, we have
$$ \theta_{t+1} = \theta_t - \mu_t \sum_{i=1}^{\infty}\frac{f^{(i)}(\theta_{t-1})}{\Gamma(i+1-\alpha)}(\theta_t-\theta_{t-1})^{i-\alpha}. \tag{6} $$
Reserving the first term and omitting the other terms in the summation, we have
$$ \theta_{t+1} = \theta_t - \frac{\mu_t}{\Gamma(2-\alpha)}\, \nabla f(\theta_{t-1})\,(\theta_t-\theta_{t-1})^{1-\alpha}. \tag{7} $$
In fact, we can extend algorithm (7) to the case $1<\alpha<2$ directly. In the following, we assume $0<\alpha<2$. To guarantee the feasibility of $(\theta_t-\theta_{t-1})^{1-\alpha}$, we modify (7) into the following form, with the gradient evaluated at the current iterate:
$$ \theta_{t+1} = \theta_t - \frac{\mu_t}{\Gamma(2-\alpha)}\, g_t\, |\theta_t-\theta_{t-1}|^{1-\alpha}. \tag{8} $$

Remark 1. Notice the properties of the gamma function: singularities occur when the order of the fractional gradient is taken as the other integers $\alpha = 3, 4, \cdots$, and the gamma function behaves poorly when the order is close to these singularities.
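To make the truncated update (8) concrete, the following sketch implements a single fractional gradient step in Python; the function name is a hypothetical helper (not from the paper), and the optional `delta` argument anticipates the safeguard introduced later for the case $\theta_t = \theta_{t-1}$.

```python
import numpy as np
from math import gamma

def frac_gd_step(theta, theta_prev, grad, mu, alpha, delta=0.0):
    """One truncated fractional-order gradient step, a sketch of (8).

    The integer-order step mu*grad is scaled element-wise by
    |theta - theta_prev|^(1 - alpha) / Gamma(2 - alpha) for 0 < alpha < 2;
    delta is an optional safeguard against theta == theta_prev.
    """
    scale = (np.abs(theta - theta_prev) + delta) ** (1.0 - alpha) / gamma(2.0 - alpha)
    return theta - mu * grad * scale
```

Note that at $\alpha = 1$ the scale factor is identically 1, so the step reduces to the ordinary gradient descent update (3).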
For the SGD algorithm, each batch of samples has its own objective function $f_t(\theta)$. The task of this paper is to propose SGD algorithms with a fractional order and analyze their convergence. Let
$$ R(T) = \sum_{t=1}^{T} f_t(\theta_t) - \sum_{t=1}^{T} f_t(\theta^*), \tag{9} $$
where $R(T)$ is the empirical regret function of the optimization problem, $f_t(\theta)$ is the objective sub-function of the current batch of samples, and $\theta^* = \arg\min_{\theta}\sum_{t=1}^{T} f_t(\theta)$. If $R(T)/T \to 0$ as $T \to \infty$, then we call the algorithm convergent. Next, we will analyze the asymptotic property of $R(T)$. We need the following assumptions on the parameter space and the objective functions in the following analysis [22].

Assumption 1. Suppose all $f_t(\theta)$ are convex. By using the first-order property of a convex function, we have $f_t(\theta_t) - f_t(\theta^*) \le g_t^{T}(\theta_t - \theta^*)$.

Assumption 2.
Suppose that the parameters satisfy $\theta^* \in \Theta_d$ and $\|\theta_i - \theta_j\| \le D$, $\forall \theta_i, \theta_j \in \Theta_d$, where $\Theta_d \subset \mathbb{R}^n$ is a bounded and closed set and $D > 0$ is a positive scalar.

Assumption 3. Suppose the gradient of the objective function is bounded, i.e., $\|g_t\| \le G$, $\forall t$, where $g_t$ is the gradient of the function $f_t$ on the $t$th batch of data.
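The regret function $R(T)$ defined above can be evaluated directly once the comparator $\theta^*$ is known; the sketch below uses hypothetical helper names (not from the paper) and leaves computing $\theta^*$ to the caller, since the definition fixes the best single point in hindsight.

```python
def empirical_regret(f_list, theta_seq, theta_star):
    """R(T) = sum_t f_t(theta_t) - sum_t f_t(theta*).

    f_list: per-batch objective functions f_t; theta_seq: the iterates
    theta_t produced by the algorithm; theta_star: the best fixed point.
    """
    return (sum(f(th) for f, th in zip(f_list, theta_seq))
            - sum(f(theta_star) for f in f_list))
```

An algorithm is then called convergent when `empirical_regret` grows sub-linearly in $T$, e.g. $R(T) = O(\sqrt{T})$.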

Remark 2.
In Assumption 1, we assume all objective functions $f_t(\theta)$ are convex. Hence, there must exist a global optimal parameter for the objective function. Assumption 2 implies that the Euclidean distance between any two temporary parameters cannot be too large. Most results on online optimization require this assumption, because the optimal points of different objective functions can be far apart. Assumption 3 guarantees that the gradient of the objective function is bounded, which is very important for minimizing the empirical regret function [23]. When the parameters between iterations are bounded, it can be proved that the gradient is also bounded, since the objective function cannot vary too severely over a bounded interval; an example is the Gaussian kernel function $f(x) = \exp(-x^2)$. In addition, in machine learning, Assumption 3 can also be satisfied by clipping the gradient [24].
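As noted above, Assumption 3 can be enforced in practice by gradient clipping; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def clip_gradient(g, G):
    """Rescale g so that its 2-norm is at most G, enforcing Assumption 3."""
    norm = np.linalg.norm(g)
    if norm > G:
        return g * (G / norm)
    return g
```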

Main Results
In this section, we will analyze the convergence of the proposed three online algorithms with fractional order α.

Standard SGD with Fractional Order Gradient
To avoid the phenomenon that successive iteration values coincide, we add a small positive parameter $\delta$ to the algorithm to improve its performance:
$$ \theta_{t+1} = \theta_t - \frac{\mu_t}{\Gamma(2-\alpha)}\, g_t\, \big(|\theta_t-\theta_{t-1}|+\delta\big)^{1-\alpha}. \tag{10} $$

Theorem 1. Under Assumptions 1-3, Algorithm 1 is convergent.

Proof. By Formula (10), we have

Then, by applying the first-order property of the convex objective function,

the first summation term of (12) can be bounded by rearranging the order of the terms as follows:

By adopting Assumption 3, we can obtain the bounds of the regret function in the different situations.
When the fractional order satisfies $0 < \alpha \le 1$:

When the fractional order satisfies $1 < \alpha < 2$:

The bounds of the regret function for $0 < \alpha < 2$ can be summarized as

Letting $T \to \infty$, the convergence of Algorithm 1 is determined by this bound when we take a polynomial decay rate for $\mu_t$.

Algorithm 1: SGD with fractional order.
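Algorithm 1 can be sketched as a runnable loop under assumed choices (polynomial decay $\mu_t = \mu_0/\sqrt{t}$ and $\delta = 10^{-3}$); the helper names and the quadratic test objective are illustrative, not the paper's experiments.

```python
import numpy as np
from math import gamma

def fractional_sgd(grad_fns, theta0, alpha=0.9, mu0=0.1, delta=1e-3):
    """Algorithm 1 sketch: standard SGD with a truncated fractional gradient.

    grad_fns yields the per-batch gradient oracles g_t(theta);
    mu_t = mu0 / sqrt(t) is an assumed polynomial decay schedule.
    """
    theta_prev = np.array(theta0, dtype=float)
    theta = theta_prev.copy()
    for t, grad_fn in enumerate(grad_fns, start=1):
        g = grad_fn(theta)
        mu_t = mu0 / np.sqrt(t)
        # element-wise fractional scaling of the integer-order step, as in (10)
        scale = (np.abs(theta - theta_prev) + delta) ** (1.0 - alpha) / gamma(2.0 - alpha)
        theta_prev, theta = theta, theta - mu_t * g * scale
    return theta
```

On a fixed convex quadratic $f_t(\theta) = \tfrac{1}{2}\|\theta - \mathbf{1}\|^2$, the iterates approach the optimum, illustrating the sub-linear regret behavior.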

Remark 3.
Compared with the integer order algorithm, the fractional term can accelerate the convergence of the algorithm. When $0 < \alpha \le 1$, a larger fractional order leads to an increased learning rate when $\theta_{t+1}$ and $\theta_t$ differ significantly and their difference is close to $D$ at the beginning of the iterations; the situation is the opposite when $1 < \alpha < 2$.
In particular, when D < 0.5,

Remark 4. When learning rate
). And if we take $\mu_t$ as a constant, $R(T)/T = O(1)$; the algorithm still works in some other situations in [20,25].

Remark 5.
The bound of the regret function mainly depends on the summation part of Formula (16). When we take the parameters $\delta = 10^{-3}$ and $D$ in the algorithm for different orders, we find that the monotonicity of the regret function varies with the fractional order. It is shown that a larger fractional order brings a smaller loss when $0 < \alpha \le 1$, while the result is the opposite when $1 < \alpha < 2$ until the order is close to 2. We usually normalize the training data to reduce the effect of the data magnitude; hence, the value of $D$ used in Figure 1 is meaningful. The monotonicity of the coefficient is shown in Figure 1.

Theorem 2.
Under Assumptions 1-3, Algorithm 2 is convergent, where $\cdot$ denotes the Hadamard (element-wise) product of vectors and $\mu$ is a constant.
Proof. We split the objective function according to its dimensions:

We get the same form of $R(T)$ as in Theorem 1 when $0 < \alpha \le 1$; hence,
The first term of Formula (19) can be bounded by

For the last term, we have

Finally, combining Formula (21) with the estimate of the first term of $R(T)$, we have

Similarly, we get the bound of $R(T)$ when $1 < \alpha < 2$:

Similar to the proof of Theorem 1, letting $T \to \infty$, $R(T)/T \to 0$, which means that Adagrad with a fractional order is convergent.
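A minimal sketch of Algorithm 2 as described (the Adagrad per-coordinate step size multiplied by the fractional factor, without $\delta$); the handling of the first step and the small guard `eps` are our assumptions, since some initialization must separate $\theta_0$ from $\theta_{-1}$.

```python
import numpy as np
from math import gamma

def fractional_adagrad(grad_fns, theta0, alpha=0.9, mu=0.1, eps=1e-8):
    """Algorithm 2 sketch: Adagrad learning rate with a fractional gradient.

    The per-coordinate rate is mu / sqrt(sum of squared past gradients);
    the first step uses factor 1 (an assumed initialization) because
    theta and theta_prev coincide before any update has been made.
    """
    theta = np.array(theta0, dtype=float)
    theta_prev = theta.copy()
    accum = np.zeros_like(theta)
    first = True
    for grad_fn in grad_fns:
        g = grad_fn(theta)
        accum += g * g
        rate = mu / (np.sqrt(accum) + eps)
        if first:
            factor = 1.0
            first = False
        else:
            factor = np.abs(theta - theta_prev) ** (1.0 - alpha) / gamma(2.0 - alpha)
        theta_prev, theta = theta, theta - rate * g * factor
    return theta
```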

Remark 6.
The learning rate of Algorithm 2 is related to its history gradient information. As the number of iterations increases, the accumulated sum of the past gradients grows; meanwhile, the step size decreases, so new samples become less important. Once we introduce the fractional order $\alpha$ into the algorithm, the history gradient accumulation can be relieved by selecting a larger order, but the extra cost of computing the fractional gradient increases.

Remark 7.
Different from Algorithm 1, the fractional gradient part of the adaptive step size $\mu_t$ does not need the parameter $\delta$ to avoid singularity, due to the accumulation of the squares of the history gradients. Actually, the parameter $\delta$ might damage the convergence of the algorithm, especially when $1 < \alpha < 2$, as shown in Figure 2.
Theorem 3. Under Assumptions 1-3, the SGD algorithm with momentum and a fractional order gradient is convergent.

Proof. For simplicity, denote $h_{t,i} = $ . The proposed algorithm can be expressed component-wise.
Similar to [9], we set $\beta_i \le \sqrt{c} < 1$ as a decreasing sequence, which means the momentum $m_i$ stays close to $h_i$. Let . So the gradient term can be separated as below:

The first term of Formula (25) can be changed into

The third inequality in (26) holds since $\beta_t < \cdots < \beta_1$, and the second term of Formula (25) can be changed into:

As for the momentum term, we have

Therefore, the second term of Formula (25) can be bounded as:

As for the third term, we have

Finally, we get the bound of the regret function of SGD with momentum and a fractional order gradient:
where $\beta_t \le \sqrt{c} < 1$. The result is similar when $1 < \alpha < 2$. It is shown that the bound is connected with the fractional order $\alpha$. If $\beta_t$ is taken as a suitable decreasing sequence, the regret bound is sub-linear. But in practical engineering, SGD with momentum has the fastest convergence speed owing to its history gradient information, which will be shown in the simulation. For the fractional gradient setting, we can choose the order by comparing the effects of $\beta_t$ and $\mu_t$.
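The momentum variant can be sketched as follows; the exponential moving average form $m_t = \beta m_{t-1} + (1-\beta) h_t$ with a constant $\beta$ is an illustrative simplification of the decreasing sequence $\beta_t$ used in the proof, and $\mu_t = \mu_0/\sqrt{t}$ is an assumed decaying learning rate.

```python
import numpy as np
from math import gamma

def fractional_momentum_sgd(grad_fns, theta0, alpha=0.9, mu0=0.1,
                            beta=0.9, delta=1e-3):
    """Sketch of SGD with momentum and a fractional order gradient.

    h_t is the fractional gradient term from (10); the momentum m_t is an
    exponential moving average of h_t (a simplifying assumption).
    """
    theta = np.array(theta0, dtype=float)
    theta_prev = theta.copy()
    m = np.zeros_like(theta)
    for t, grad_fn in enumerate(grad_fns, start=1):
        g = grad_fn(theta)
        h = g * (np.abs(theta - theta_prev) + delta) ** (1.0 - alpha) / gamma(2.0 - alpha)
        m = beta * m + (1.0 - beta) * h
        theta_prev, theta = theta, theta - (mu0 / np.sqrt(t)) * m
    return theta
```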

Simulations
In this section, we will solve two practical problems using the proposed stochastic fractional order gradient descent method to demonstrate the convergence and the relationship between convergence speed and fractional order. The neural network training and all other experiments are conducted on a computer with a GeForce RTX 4060 Laptop GPU and Intel i7 CPU @ 2.6 GHz.

Example 1
In this example, we solve a system identification problem to examine the effectiveness of the proposed algorithm. The target model is an auto-regressive (AR) model whose coefficients are to be identified:
$$ y(k) = \sum_{i=1}^{p} a_i\, y(k-i) + \xi(k), \tag{33} $$
where $y(k-i)$ is the output of the system at time $k-i$, $\xi(k)$ is a stochastic noise sequence, and the $a_i$ are the parameters to be estimated. Our goal is to determine the coefficients of the model. The loss function of the system for one sample is
$$ f_k(\hat{\theta}(k)) = \frac{1}{2}\big(y(k) - \hat{\theta}(k)^{T}\varphi(k)\big)^{2}, \tag{34} $$
where $\hat{\theta}(k) = (\hat{a}_1(k), \cdots, \hat{a}_p(k))^{T}$ and $\varphi(k) = (y(k-1), \cdots, y(k-p))^{T}$. For each sample $\{y(k), \hat{\theta}(k), \varphi(k)\}$, the objective functions can be seen as an online optimization problem. Therefore, applying Algorithm 1 to the system, the algorithm degrades into an LMS-type algorithm. The iteration formula of the algorithm can be written as
$$ \hat{\theta}(k+1) = \hat{\theta}(k) + \frac{\mu_k}{\Gamma(2-\alpha)}\, \varphi(k)\big(y(k) - \hat{\theta}(k)^{T}\varphi(k)\big)\big(|\hat{\theta}(k) - \hat{\theta}(k-1)| + \delta\big)^{1-\alpha}. \tag{35} $$
In particular, when $\alpha = 1$, Formula (35) turns into
$$ \hat{\theta}(k+1) = \hat{\theta}(k) + \mu_k\, \varphi(k)\big(y(k) - \hat{\theta}(k)^{T}\varphi(k)\big), \tag{36} $$
which is the ordinary LMS algorithm. We take $\mu_k = 0.1/\sqrt{k}$ as the learning rate and analyze the convergence of the parameters under noise. We consider an AR model of order $p = 2$, where the noise sequence is Gaussian white noise with mean 0 and variance 0.5. The numerical simulation results are shown in Figures 3-5, where the abscissa is the number of iterations. The results show that the parameters of the AR model can be identified by the algorithms proposed in this article; the figures reflect the effectiveness of the algorithm and show that a larger fractional order brings a faster convergence speed. Meanwhile, the gradient accumulation problem of the Adagrad method can be relieved by a larger fractional order, as Figure 6 shows.
As Figures 4 and 5 show, compared with existing algorithms with an integer order gradient, namely standard SGD, Adagrad (where $\mu = 0.1$), and the momentum method (where $\beta_t = 0.1/\sqrt{k}$), the fractional counterparts perform better in terms of convergence, although the accuracy is not outstanding because of the noise disturbance. In particular, SGD with momentum and a fractional order achieves a better convergence speed than the other algorithms. That is to say, the fractional gradient SGD method can be combined with other mature algorithms and is likely to outperform them.
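The Example 1 setup can be reproduced in outline as follows; the AR(2) coefficients, noise level, sample size, and step schedule below are illustrative stand-ins, not the paper's exact experiment.

```python
import numpy as np
from math import gamma

def fractional_lms(y, p, alpha=1.0, mu0=0.1, delta=1e-3):
    """Fractional-order LMS sketch for identifying an AR(p) model, as in (35).

    At alpha = 1 the scaling factor is 1 and the update reduces to the
    ordinary LMS algorithm (36); mu_k decays as mu0/sqrt(k).
    """
    theta = np.zeros(p)
    theta_prev = np.zeros(p)
    for k in range(p, len(y)):
        phi = y[k - p:k][::-1]  # regressor (y(k-1), ..., y(k-p))
        err = y[k] - theta @ phi
        mu_k = mu0 / np.sqrt(k - p + 1)
        scale = (np.abs(theta - theta_prev) + delta) ** (1.0 - alpha) / gamma(2.0 - alpha)
        theta_prev, theta = theta, theta + mu_k * err * phi * scale
    return theta

# Illustrative data: a stable AR(2) model with assumed coefficients (0.5, -0.3).
rng = np.random.default_rng(0)
a1, a2 = 0.5, -0.3
y = np.zeros(5000)
for k in range(2, len(y)):
    y[k] = a1 * y[k - 1] + a2 * y[k - 2] + rng.standard_normal()
theta_hat = fractional_lms(y, p=2, alpha=1.0)
```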

Example 2
A deep BP neural network with a fractional order gradient is built to test the validity of the proposed algorithm. The data are from the well-known MNIST dataset, which contains 60,000 handwritten digit images for training [20]. Each sample of the dataset can be seen as a $28 \times 28$ matrix, and the label of each sample is a $10 \times 1$ vector whose entry equal to 1 indicates the classification result. The main application of the fractional order gradient is in the error back propagation process, where each layer of the network updates its weights from the error between the samples and the previously trained network weights; this can be seen as a typical online optimization problem. The batch size is set to 200 and the maximum number of iterations is 300, which can be checked in Figure 7. The designed network consists of 5 layers, with 64 nodes in each hidden layer and 10 nodes in the output layer. Under the fractional order gradient, the update of the weight matrix between layers follows the truncated fractional gradient form, applied element-wise to each weight matrix. The results are shown in Figures 7-9, which reveal the effectiveness of SGD with a fractional order gradient. The accuracies on the test set are 90.55%, 93.3%, and 94.4%, respectively. The neural network with order 1.2 has a faster convergence speed and higher accuracy on the test set, which is consistent with Theorem 1. Figure 8 shows that order 1.1 has the highest accuracy and that other orders larger than 1 achieve better results than the integer order.
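The per-layer weight update of Example 2 can be sketched as below; this generic element-wise form is our reading of the method (layer shapes and names are illustrative), not the paper's exact network code.

```python
import numpy as np
from math import gamma

def update_network(weights, weights_prev, grads, mu, alpha, delta=1e-3):
    """Apply the element-wise fractional update to every layer's weight matrix.

    weights / weights_prev: current and previous weight matrices per layer;
    grads: the back-propagated gradients of the batch loss w.r.t. each matrix.
    """
    new_weights = []
    for W, W_prev, G in zip(weights, weights_prev, grads):
        scale = (np.abs(W - W_prev) + delta) ** (1.0 - alpha) / gamma(2.0 - alpha)
        new_weights.append(W - mu * G * scale)
    return new_weights
```

At $\alpha = 1$ this reduces layer by layer to the plain back propagation step $W \leftarrow W - \mu\, \partial L/\partial W$.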

Conclusions
In this article, we proposed improved stochastic fractional order gradient descent algorithms for the online optimization problem and established their convergence under the assumptions that the objective functions are convex and the gradients are bounded. The bounds of the empirical regret functions of the improved SGD algorithms with a fractional order were derived under these assumptions, and the proposed algorithms can relieve the gradient exploding problem. Finally, it was shown how the fractional order affects the convergence of the algorithms by way of two practical applications, namely system identification and classification.
Future work will include studying the variable fractional order stochastic gradient descent algorithm and developing a decentralized version of FOSGD to accommodate larger-scale datasets. Since the setting in this article considers a fixed fractional order in a centralized algorithm, which cannot fully exploit the advantages of the fractional approach, developing a decentralized version with a variable fractional order is on our future agenda.