A Deep Learning Optimizer Based on Grünwald–Letnikov Fractional Order Definition

Abstract: In this paper, a deep learning optimization algorithm based on the Grünwald–Letnikov (G-L) fractional order definition is proposed. An optimizer, fractional calculus gradient descent based on the G-L fractional order definition (FCGD_G-L), is designed. Using the short-memory effect of the G-L fractional order definition, the derivation only needs 10 time steps. At the same time, via the transforming formula of the G-L fractional order definition, the Gamma function is eliminated. Thereby, the fractional order and the integer order are unified in FCGD_G-L. To prevent the parameters from falling into a local optimum, a small disturbance is added in the unfolding process. Based on stochastic gradient descent (SGD) and Adam, two optimizers are obtained: fractional calculus stochastic gradient descent based on the G-L definition (FCSGD_G-L) and fractional calculus Adam based on the G-L definition (FCAdam_G-L). These optimizers are validated on two time series prediction tasks. Analysis of the train loss shows that FCGD_G-L has a faster convergence speed and better convergence accuracy than the conventional integer order optimizer. Because of the fractional order property, the optimizer exhibits stronger robustness and generalization ability. When the saved optimal model is evaluated on the test sets, FCGD_G-L also shows a better evaluation effect than the conventional integer order optimizer.


Introduction
In the field of deep learning, research on optimization algorithms has been an important direction. Among them, optimization algorithms based on the gradient descent method have become the mainstream. Gradient descent (GD) is the most basic optimization algorithm, and the various improved optimization algorithms currently used in deep learning are based on it. It has a fast convergence speed, and its update rule can be written simply as θ_{n+1} = θ_n − η∇J(θ_n), where η is the learning rate and ∇J(θ_n) is the gradient of J(θ) at θ_n [1]. However, GD is hardly practical in deep learning. It evaluates the entire dataset at each iteration, and because datasets keep getting larger, this can easily exhaust video memory and main memory. SGD accelerates convergence and solves the problem of excessive video memory and main memory usage. However, it causes the direction of the gradient to fluctuate too much at each iteration. Because of these disadvantages of GD and SGD, mini-batch gradient descent (MBGD) was proposed, which does not cause video memory or main memory overflow and overcomes the problem of gradient direction fluctuation [2]. It is a compromise between GD and SGD: it degenerates to SGD when the batch size is 1 and becomes GD when the batch size is the total sample size. In deep learning, a moderate batch size can speed up training [3]. Thus, a batch size is generally set, and in application scenarios SGD is equated with MBGD. The same is true for the fractional order gradient descent optimizer in this paper (FCGD_G-L).

Polyak introduced the concept of momentum [4], which was discussed in detail theoretically by Nesterov in the context of convex optimization [5]. The introduction of momentum in deep learning has long been shown to be beneficial for the convergence of parameters [6]. It speeds up convergence and prevents the parameters from falling into local optimal solutions. In addition, scholars have proposed algorithms with adaptive learning rates, for example, AdaGrad proposed by Duchi [7], RMSProp proposed by Tieleman [8], and Adadelta proposed by Zeiler [9]. All of them use the current gradient state to change the learning rate or to calibrate its changes. Kingma proposed Adam [10] by combining the momentum method and the adaptive learning rate algorithm. Although the above algorithms and their improvements have their own characteristics, their gradients are all based on the first-order derivative. Therefore, further development is limited.
As research on fractional order gradient descent and deep learning optimization algorithms has intensified, more and more scholars have introduced fractional order calculus into deep learning optimization algorithms. Thus, it is possible for deep learning optimization algorithms to rely on fractional order derivation. Some scholars have achieved good results by exploiting the fractional order property. Under the convexity condition and using Caputo, Li studied the convergence rate of GD for different orders; by jointly using the integer order and the fractional order, the parameters finally converge to the integer order extreme value points [11]. By using Riemann-Liouville (R-L) and Caputo, Chen studied GD under convexity conditions and proposed a modified formula that converges quickly to the integer order extreme value points [12]. By transforming R-L and looking for special initial parameters, Wang designed a deep learning optimization algorithm that guarantees the same convergence result as the integer order under the convexity condition, and the related optimizer was validated experimentally on the MNIST dataset [13]. Yu designed a deep learning optimizer using G-L by setting the step size to two. Its current gradient is determined by the gradients in a past time window of fixed size according to specific weights [14]. Kan studied the deep learning optimization algorithm using G-L and validated the relevant optimizers, with momentum included, on the MNIST and CIFAR-10 datasets, discussing the effect of different step sizes on the results [15]. Khan designed a deep learning optimizer using a fractional order power series, which was applied to a recommender system with good results [16][17][18]. Due to the constraints of the fractional order power series, it is limited in the kinds of loss functions it supports. In addition, the parameter updates can only be kept in the positive range [19]. Constrained optimization problems have been studied by Yaghooti [20] using Caputo, and by Viola [21] using R-L.
There are various fractional order definitions. The commonly used ones are the G-L fractional order definition, the R-L fractional order definition, and the Caputo fractional order definition [22]. SGD is rarely used directly in deep learning, but it is the basis for improved optimization algorithms. SGDM is an optimization algorithm with momentum [23]. AdaGrad, RMSProp and Adadelta are a class of optimization algorithms with an adaptive learning rate [24]. Adam is an optimization algorithm that combines momentum with an adaptive learning rate [25,26].
Based on the above discussion, FCGD_G-L is designed in this paper using the G-L fractional order definition. Its current α order gradient is obtained by summing the current first order gradient and the first order gradients of the past 10 time steps according to the fractional order property. Compared with the integer order, which can only add momentum and disturbances to the gradient descent, FCGD_G-L can add perturbations to its own derivation process to accelerate the descent and prevent falling into a local optimal solution. At the same time, the integer order needs additional momentum in each iteration, which increases the computational workload. Because of the fractional order property, the fractional order is equivalent to self-contained momentum. Thus, this computational workload is eliminated in FCGD_G-L. The optimizer designed in this paper adds small perturbations to the fractional order derivation process; this maximizes the ability to find the global optimal solution while preserving the fractional order properties. The major contributions of this paper are as follows:

1. A novel deep learning optimizer is designed. It is written according to the PyTorch documentation specification and is invoked in the same way as the existing PyTorch optimizers, enriching the variety of available optimizers.

2. Compared to other fractional order deep learning optimizers, the use of the G-L fractional order definition removes the work of adding momentum and perturbation to the gradient at each iteration, thus reducing the computational workload and improving the efficiency of the fractional order deep learning optimizer.

3. The new G-L fractional order definition uses improved Grünwald coefficients, avoiding the use of the Gamma function. In addition, it solves the problem that previous fractional order optimizers are not perfectly compatible with the integer order.

4. In this paper, obtaining the global optimal solution is the best result. However, it is easy to fall into local optimal solutions during training. Thus, a constant factor c_j is added before each term of the G-L fractional order definition, and a small internal perturbation is added to the current time step by fine-tuning c_j, which helps prevent the parameters from falling into local optimal solutions.

5. The deep learning algorithm in this paper provides a new way of thinking: by introducing the fractional order, the optimizer gains a hyperparameter α. By adjusting α, the optimizer can be adapted to different application scenarios, and a faster convergence speed and higher convergence accuracy can be obtained than with the integer order.
The remainder of this paper is organized as follows. Section 2 introduces the G-L fractional order definition, SGD, and its related improved algorithms. Section 3 introduces the definition of fractional order gradient descent in this paper and gives the corresponding fractional order optimizer algorithms for SGD and Adam. Section 4 validates the optimizers of this paper on deep neural network models using two time series datasets, compares them with the corresponding integer order optimizers, analyzes the train loss of each optimizer, and evaluates the effectiveness of the resulting models on test sets. Section 5 summarizes FCGD_G-L, pointing out its shortcomings and future improvement directions.

Derivation of Fractional Calculus Gradient Descent Based on G-L Fractional Order
To better illustrate the optimizer in this paper, this section focuses on some of the basic concepts mentioned above.

The Definition of G-L Fractional Order
Definition 1. The G-L fractional order definition is given as follows [22,27]:

where h is the step size, t_0 and t are the lower and upper bounds of the interval [t_0, t] that determines the number of steps, and α is the order of the G-L fractional order definition.

Theorem 1. The limit operation in Equation (1) can be neglected if the chosen computational step is small enough. The G-L fractional order definition can then be written as follows [22]:

Proof. We begin by proving Theorem 1. With Equation (2), obviously, w_0 = 1. Thus, Equation (4) is obtained.

Namely, Theorem 1 is proved. Through Equation (4), the calculation of Equation (2) by the Gamma function can be avoided. Since Equation (4) does not require double-precision floating-point computation [20], the computational efficiency and robustness of the algorithm are improved.
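As a numerical illustration of how a recursion can replace the Gamma function, the following minimal sketch assumes the standard Grünwald weights, w_0 = 1 and w_j = (1 − (α + 1)/j) w_{j−1}, and compares them with the Gamma-function form (−1)^j Γ(α+1) / (Γ(j+1) Γ(α−j+1)). Both formulas are the textbook ones and are only assumed to correspond to Equations (2) and (4); the function names are hypothetical.

```python
import math

def grunwald_weights(alpha, n_terms):
    """Return [w_0, ..., w_{n_terms}] from the recursion
    w_0 = 1, w_j = (1 - (alpha + 1) / j) * w_{j-1} (assumed standard form)."""
    weights = [1.0]
    for j in range(1, n_terms + 1):
        weights.append((1.0 - (alpha + 1.0) / j) * weights[-1])
    return weights

def gamma_weight(alpha, j):
    """Same coefficient from the Gamma-function form:
    (-1)^j * Gamma(alpha + 1) / (Gamma(j + 1) * Gamma(alpha - j + 1))."""
    return (-1) ** j * math.gamma(alpha + 1) / (
        math.gamma(j + 1) * math.gamma(alpha - j + 1))

alpha = 0.5
recursive = grunwald_weights(alpha, 10)
direct = [gamma_weight(alpha, j) for j in range(11)]
# Largest absolute difference between the two ways of computing w_0..w_10.
max_gap = max(abs(r - d) for r, d in zip(recursive, direct))
```

For integer α the Gamma-function form hits poles of Γ (Python's math.gamma raises ValueError there), while the recursion remains well defined, which is one practical reason to prefer it.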

Preliminaries Algorithms
Let n be the number of samples in the data set, let f_i(x) be the loss function of the training sample with index i, and let t be the time variable. This means that the parameter x is iterated at time node t, with i ∈ {1, …, n} and x_0 = 0. Accordingly, the following definitions are obtained.

Definition 2. SGD's parameter update equation is as follows [1]:

where η is the learning rate (lr) and g_t is the gradient of f_i at the current parameter.

The stochastic gradient method with momentum (SGDM) [3] adds the accumulated past gradients to the current gradient in order to achieve faster convergence and prevent falling into local optima. SGDM is defined as follows.

Definition 3. SGDM's parameter update equation is as follows [4]:

where β is the momentum coefficient and v_t is the momentum, which accumulates past gradients into the current gradient to improve the descent speed and reduce the fluctuation of parameter updates.

Adaptive learning rate algorithms are a class of deep learning optimization algorithms that adjust the learning rate and the gradient according to their current state [2]. AdaGrad, RMSProp and Adadelta are the three main representatives, which are defined as follows.

Definition 4. AdaGrad's parameter update equation is as follows [7]:

where s_t accumulates the variance of past gradients and is then used to construct a different learning rate for each iteration, in order to optimize the iteration process. By default, s_0 = 0 and ε = 1e−8, where ε prevents the denominator from being 0.

Definition 5. RMSProp's parameter update equation is as follows [8]:

where γ is the weight coefficient of s.

Definition 6. Adadelta's parameter update equation is as follows [9]:

where u_t is the leaky average of g_t with rescaled gradients. Together with s_t, it constructs a learning rate for g_t. By default, u_0 = 0.
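The update rules in Definitions 2 to 5 can be sketched for a single scalar parameter as follows. This is a minimal sketch using the commonly cited textbook forms of SGD, SGDM, AdaGrad and RMSProp, which are assumed (not confirmed) to match the equations of this paper; the function names and default values are illustrative, and Adadelta is omitted for brevity.

```python
import math

def sgd_step(x, g, lr):
    """Plain SGD: move against the gradient."""
    return x - lr * g

def sgdm_step(x, v, g, lr, beta=0.9):
    """SGD with momentum: v accumulates past gradients (assumed form v = beta*v + g)."""
    v = beta * v + g
    return x - lr * v, v

def adagrad_step(x, s, g, lr, eps=1e-8):
    """AdaGrad: s accumulates squared gradients and scales the step."""
    s = s + g * g
    return x - lr * g / math.sqrt(s + eps), s

def rmsprop_step(x, s, g, lr, gamma=0.9, eps=1e-8):
    """RMSProp: s is a leaky average of squared gradients with weight gamma."""
    s = gamma * s + (1.0 - gamma) * g * g
    return x - lr * g / math.sqrt(s + eps), s

# One step of each rule on f(x) = x^2 (gradient 2x), starting from x = 1.
x0, g0, lr = 1.0, 2.0, 0.1
x_sgd = sgd_step(x0, g0, lr)
x_sgdm, v1 = sgdm_step(x0, 0.0, g0, lr)
x_ada, s_ada = adagrad_step(x0, 0.0, g0, lr)
x_rms, s_rms = rmsprop_step(x0, 0.0, g0, lr)
```

Each function returns the updated parameter together with the updated state, so the state variables (v, s) are passed back in on the next iteration.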
Adam's algorithm combines the advantages of an adaptive learning rate algorithm and momentum to achieve fast convergence. However, its gradient oscillates easily and its convergence accuracy is slightly poor. It is defined as follows.
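For orientation, the widely used form of the Adam update can be sketched for a scalar parameter as below. This is the standard formulation with bias correction, assumed here to be the one the paper refers to; the function name and the defaults are illustrative.

```python
import math

def adam_step(x, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1.0 - beta1) * g        # leaky average of gradients (momentum)
    v = beta2 * v + (1.0 - beta2) * g * g    # leaky average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

# First step on f(x) = x^2 from x = 1: the step size is close to lr itself.
x1, m1, v1 = adam_step(1.0, 0.0, 0.0, 2.0, t=1, lr=0.1)
```

On the first iteration the bias-corrected ratio is approximately the sign of the gradient, so the parameter moves by roughly lr regardless of the gradient magnitude.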

G-L Fractional Order Definition
The fractional order derivative formula of this paper is first given to illustrate its rationality. Its SGD algorithm and Adam algorithm are also given in the paper.

G-L Fractional Order Definition of the Model
From the G-L fractional order definition of Equation (1), it is clear that, due to the characteristics of computers, the algorithm cannot be expanded infinitely in practical calculations; therefore, a finite expansion is required. Some scholars have shown that, in neural networks, the expansion to the 10th term already characterizes the properties of fractional order derivatives well [15,28,29]. Let q(j) = Γ(α+1) / (Γ(j+1) Γ(α−j+1)) and plot the variation of q(j) with j ∈ Q+ for Equation (2), as shown in Figure 1.
Figure 1 shows the curves for α = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5. When Equation (2) is expanded to the 10th term, the effect of the coefficients on the overall fractional order derivation becomes small. Therefore, the fractional order derivation in this paper only accumulates the past 10 time steps.

On the other hand, the step size in Equation (1) is not a continuous value in the parameter update of the neural network [14,15], namely h ∈ N*. In this paper, let the step size be the minimum value, namely h = 1. According to Equations (3) and (4), a derivative equation is obtained that can be used for updating the parameters:

Equation (11) eliminates the computation of the Gamma function and achieves the unification of the fractional order and integer order optimizers. In order to improve the ability of the algorithm to find the global optimal solution, a coefficient c_j is added before each accumulated term, where each c_j takes the value 1 with probability 0.9 and the value 0 with probability 0.1, to obtain Equation (12):

According to Equation (12), when α = 0, the algorithm degenerates to SGD without momentum. In order to make the case α = 1 equal to SGD, Equation (4) is improved further in the paper, which gives Equation (13):

By using Equation (13), when the order α = 1, the fractional order derivative becomes the integer order derivative and corresponds to SGD. Thus, the unification of the fractional order gradient descent and the integer order gradient descent is achieved. In the original formula, α < 0 denotes integration and α > 0 denotes differentiation. Because of the transformation of Equation (4), α ≤ 0 in this paper also has good gradient descent capability, while retaining the fractional order property.
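The truncated, perturbed fractional gradient described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact formula: the shifted recursion w_0 = 1, w_j = (1 − α/j) w_{j−1} is assumed for Equation (13) because it makes all past-gradient weights vanish at α = 1; each c_j is drawn as 1 with probability 0.9 and 0 otherwise; and c_0 is fixed to 1 so that the current gradient is never dropped. All names are hypothetical.

```python
import random

def fractional_weights(alpha, n_terms=10):
    """w_0 = 1, w_j = (1 - alpha / j) * w_{j-1} (assumed form of the improved recursion)."""
    weights = [1.0]
    for j in range(1, n_terms + 1):
        weights.append((1.0 - alpha / j) * weights[-1])
    return weights

def fractional_gradient(grad_history, alpha, keep_prob=0.9, rng=random):
    """Weighted sum of the current and past first order gradients.

    grad_history[0] is the current gradient and grad_history[j] the gradient
    j steps back; at most the last 10 past gradients should be supplied.
    """
    weights = fractional_weights(alpha, len(grad_history) - 1)
    total = grad_history[0]  # c_0 = 1 (assumption): always keep the current gradient
    for w, g in zip(weights[1:], grad_history[1:]):
        c = 1.0 if rng.random() < keep_prob else 0.0  # random disturbance c_j
        total += c * w * g
    return total

history = [2.0, 5.0, 7.0]
g_integer = fractional_gradient(history, alpha=1.0)               # reduces to the current gradient
g_half = fractional_gradient(history, alpha=0.5, keep_prob=1.0)   # no term dropped
```

With α = 1 every weight beyond w_0 is zero, so the result is the plain first order gradient; with 0 < α < 1 the past gradients enter with positive, slowly decaying weights, which is the momentum-like behaviour discussed in the text.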

FCGD_G-L Algorithm
From Equations (12) and (13), an SGD based on FCGD_G-L and an Adam based on FCGD_G-L are proposed. In these two algorithms, the integer order derivation process becomes a fractional order derivation process. In addition, the extra momentum is no longer needed, because the long memory property of the fractional order derivation is exploited.
The FCSGD_G-L algorithm, which combines FCGD_G-L and SGD, is given as Algorithm 1.

Algorithm 1: The SGD optimization algorithm based on FCGD_G-L
Input: η (lr), x_0 (params), f(x) (objective), λ (weight decay), α (order), c (disturbance coefficient), w (fractional coefficient)

In Algorithm 1, because fractional order derivatives are used, the algorithm adds a hyperparameter α that adjusts the order. In addition, the momentum and Nesterov options are removed from the algorithm.
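A framework-free sketch of an optimizer with the inputs listed for Algorithm 1 is given below. It is not the authors' implementation: the class name, the 10-step memory, the weight recursion w_j = (1 − α/j) w_{j−1}, the Bernoulli disturbance with keep probability 0.9, and the L2-style weight decay are all assumptions made for illustration.

```python
from collections import deque
import random

class FCSGDSketch:
    """Hypothetical fractional-order SGD over a list of scalar parameters."""

    def __init__(self, params, lr=0.01, alpha=0.5, weight_decay=0.0,
                 memory=10, keep_prob=0.9, seed=0):
        self.params = list(params)
        self.lr = lr
        self.weight_decay = weight_decay
        self.keep_prob = keep_prob
        # Fractional coefficients w_0..w_memory (assumed recursion).
        self.weights = [1.0]
        for j in range(1, memory + 1):
            self.weights.append((1.0 - alpha / j) * self.weights[-1])
        # Newest gradient first; the oldest entry is discarded automatically.
        self.history = deque(maxlen=memory + 1)
        self.rng = random.Random(seed)

    def step(self, grads):
        """Apply one update given the first order gradients of all parameters."""
        grads = [g + self.weight_decay * p for g, p in zip(grads, self.params)]
        self.history.appendleft(grads)
        for i in range(len(self.params)):
            frac = 0.0
            for j, past in enumerate(self.history):
                # c_0 = 1; older terms are kept with probability keep_prob.
                c = 1.0 if j == 0 or self.rng.random() < self.keep_prob else 0.0
                frac += c * self.weights[j] * past[i]
            self.params[i] -= self.lr * frac
        return self.params

# With alpha = 1 the sketch reproduces plain SGD on f(x) = x^2.
opt = FCSGDSketch([1.0], lr=0.1, alpha=1.0)
for _ in range(3):
    opt.step([2.0 * p for p in opt.params])
```

Keeping the gradient history in a bounded deque means the memory cost is a fixed multiple (here 11 copies) of the gradient size, independent of the number of iterations.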
The FCAdam_G-L algorithm, which combines FCGD_G-L and Adam, is given as Algorithm 2. In Algorithm 2, a hyperparameter α that adjusts the order is added.
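One plausible way to combine the fractional gradient with Adam is to feed the weighted sum of recent gradients into the usual Adam moment estimates. The sketch below takes that route for a scalar parameter; it is an assumption about the structure of Algorithm 2, not a transcription of it. The disturbance coefficients c_j are fixed to 1 here to keep the example deterministic, and the names and the weight recursion w_j = (1 − α/j) w_{j−1} are hypothetical.

```python
from collections import deque
import math

def make_state(alpha, memory=10):
    """Create optimizer state: fractional weights, gradient history, Adam moments."""
    weights = [1.0]
    for j in range(1, memory + 1):
        weights.append((1.0 - alpha / j) * weights[-1])
    return {"w": weights, "hist": deque(maxlen=memory + 1),
            "m": 0.0, "v": 0.0, "t": 0}

def fcadam_step(x, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update: fractional gradient first, then the standard Adam moments."""
    state["hist"].appendleft(grad)  # newest gradient first
    g = sum(w * past for w, past in zip(state["w"], state["hist"]))
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * g
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * g * g
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return x - lr * m_hat / (math.sqrt(v_hat) + eps)

state = make_state(alpha=0.3)
x1 = fcadam_step(1.0, 2.0, state, lr=0.1)  # first step on f(x) = x^2 from x = 1
```

Because only the gradient fed into the moment estimates changes, the per-step cost stays proportional to that of Adam plus a fixed-length weighted sum.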
The above two algorithms are based on FCGD_G-L and two classical gradient descent algorithms. They have the same time complexity as the original algorithms. Because of FCGD_G-L, the deep learning optimization algorithm becomes more flexible.

Experiment
In this section, two time series datasets are used to validate FCSGD_G-L and FCAdam_G-L. The first dataset is the Dow Jones Industrial Average (DJIA). After preprocessing, it has 24,298 rows of data spanning from 3 February 1930 to 13 October 2022, with five dimensions: Open, High, Low, Volume and Close. The prediction target is Close, and the data are split into training and test sets at a ratio of 8:2 [30]. The second dataset is the Electricity Transformer dataset (ETTh1), with 17,420 rows of data and seven dimensions: HUFL, HULL, MUFL, MULL, LUFL, LULL and OT. The prediction target is OT, and the data are split into training and test sets at a ratio of 8:2 [31].
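The data preparation can be illustrated with a minimal sketch that builds sliding-window samples and applies an 8:2 split. Two details are assumptions, since the text does not state them: the target is taken to be the value of the predicted column at the step right after each window, and the split is chronological (no shuffling). The function names and the toy data are hypothetical.

```python
def make_windows(rows, target_index, window):
    """rows: list of feature lists ordered in time.

    Returns (inputs, targets), where each input is `window` consecutive rows
    and each target is the target column of the row that follows the window.
    """
    inputs, targets = [], []
    for start in range(len(rows) - window):
        inputs.append(rows[start:start + window])
        targets.append(rows[start + window][target_index])
    return inputs, targets

def chronological_split(items, train_ratio=0.8):
    """Split a time-ordered list into (train, test) without shuffling."""
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

# Toy series with two features per row; the second feature is the target.
rows = [[i, 10 * i] for i in range(10)]
inputs, targets = make_windows(rows, target_index=1, window=3)
train_x, test_x = chronological_split(inputs)
train_y, test_y = chronological_split(targets)
```

For DJIA the target column would be Close and for ETTh1 it would be OT, with the window sizes given in the training subsections.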
The computer configuration for the experiment is as follows: the CPU is an AMD Ryzen 7 5800H with Radeon Graphics at 3.20 GHz, and the GPU is an RTX 3060 Laptop.

The order is selected according to the fastest convergence speed and the highest accuracy during training.

The neural network of the whole experiment consists of a three-layer LSTM [32] and two Linear layers; its structure is shown in Figure 2. In Figure 2, let X be each input sample; it is a matrix whose rows correspond to the feature size and whose columns correspond to the sliding window size. y′ is a 64 × 1 vector, y″ is a 32 × 1 vector, and y is the predicted value, which is a scalar. As can be seen from Figure 2, X passes through the three-layer LSTM, after which h_{t+1} is obtained on the last layer; it is equal to y′. y′ is processed by the first Linear layer to obtain y″. Finally, y″ is processed by the second Linear layer to obtain y.

Metrics
In this paper, the Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) are used as evaluation metrics [32]. They are defined as follows. Let n be the total number of samples, y_i the true value, f(x_i) the predicted value, and ȳ the sample mean. The following equations are obtained:
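The four metrics can be computed as in the sketch below, which uses their standard definitions. MAPE is returned as a percentage here, which is one common convention and an assumption about the paper's scaling; the function names are illustrative.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error; requires every true value to be nonzero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 4.0]
y_pred = [1.0, 3.0, 2.0]
scores = {
    "MSE": mse(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
}
```

MAPE is undefined when a true value is zero, so it is mainly informative for strictly positive targets such as index prices.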

Training for DJIA
In the training of DJIA, FCSGD_G-L and FCAdam_G-L are used as optimizers, respectively. In addition, the train loss and convergence accuracy of the different optimizers are recorded. The hyperparameters are set as: epoch = 250; weight_decay = 0. The sliding window size is 30 and the batch size is 256. MSE is used as the loss function, and the lr of FCSGD_G-L during the training is set as in Equation (18):

After 250 epochs, the train loss of different orders of FCSGD_G-L is shown in Figure 3.
Figure 3 shows the decreasing trend of train loss with epoch, where the FCSGD_G-L fractional order is −0.1, 0.0, 0.1, 0.2, 0.3, 0.4. From Figure 3, it can be seen that the train loss on DJIA converges the fastest, and the convergence accuracy is the highest, when the hyperparameter of FCSGD_G-L is α = 0.1. Therefore, FCSGD_G-L with α = 0.1 is compared with SGD and with SGD with momentum = 0.9 (SGDM); the other hyperparameters are default, as shown in Figure 4. It can be seen from Figure 4 that, on DJIA, the train loss of FCSGD_G-L converges faster than that of SGD and SGDM, and when α = 0.1, FCSGD_G-L has a higher convergence accuracy.

When using FCAdam_G-L to train DJIA, due to the characteristics of Adam, an lr smaller than that of SGD is selected in this paper to avoid divergence, and the other hyperparameters remain unchanged. The lr setting is as shown in Equation (19):

After 250 epochs, Figure 5 shows the decreasing trend of train loss with epoch, where the FCAdam_G-L fractional order is −0.1, 0.0, 0.1, 0.2, 0.3, 0.4.

As can be seen from Figure 5, when the hyperparameter of FCAdam_G-L is α = 0.3, DJIA has the highest precision of train loss convergence, and the convergence rates of all orders are roughly the same. Therefore, FCAdam_G-L with α = 0.3 is compared with Adam, with the other hyperparameters at their defaults, resulting in Figure 6. As can be seen from Figure 6, the train loss of FCAdam_G-L with α = 0.3 demonstrates a higher convergence accuracy than Adam on DJIA. In terms of convergence speed, Adam and FCAdam_G-L with α = 0.3 are the same; combined with Figure 5, it can be seen that FCAdam_G-L and Adam converge at the same speed on DJIA.

Training for ETTh1
In the training of ETTh1, FCSGD_G-L and FCAdam_G-L are used as optimizers, respectively, and the train loss and convergence accuracy of the different optimizers are recorded. The hyperparameters are set as: epoch = 250; weight_decay = 0. The sliding window size is 72 and the batch size is 256. MSE is used as the loss function. The lr of FCSGD_G-L during the training is set as in Equation (18), and the lr of FCAdam_G-L during the training is set as in Equation (19). After 250 epochs, Figure 7 shows the decreasing trend of train loss with epoch, where the FCSGD_G-L fractional order is −0.7, −0.6, −0.5, −0.4, −0.3, −0.2, −0.1, 0.0, 0.1, 0.2, 0.3, 0.4. Figure 8 shows the decreasing trend of train loss with epoch, where the FCAdam_G-L fractional order is −0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5.

In Figure 7, for ETTh1, FCSGD_G-L performs best when α = −0.6, where its convergence speed and convergence accuracy are the highest. In Figure 8, FCAdam_G-L performs best when α = 0.4, where its convergence accuracy is the highest. The two figures show roughly the same range of train loss descent. So, on ETTh1, SGD, SGDM, Adam, FCSGD_G-L and FCAdam_G-L are compared together in Figure 9.

In Figure 9, FCSGD_G-L with α = −0.6 has the fastest decrease in train loss and the highest convergence accuracy. SGDM with momentum = 0.8 also performs well, except that its convergence accuracy is worse than that of FCSGD_G-L; at the default momentum = 0.9, however, SGDM diverges when the other hyperparameters are left unchanged. FCGD_G-L in this paper is also essentially an algorithm with momentum, but as can be seen from Figures 7 and 8, FCGD_G-L on ETTh1 not only converges quickly and with high convergence accuracy, but is also robust and less likely to diverge. On ETTh1, both Adam and FCAdam_G-L perform poorly; however, FCAdam_G-L is better than Adam. Among the optimizers that achieve convergence, SGD is the least effective: it converges slowly and has the lowest accuracy.

Evaluation of DJIA and ETTh1
Four evaluation metrics, MSE, RMSE, MAE, and MAPE, are obtained by Equations (14)-(17). They are used in this paper to evaluate the results on the two test sets, which are recorded in Tables 1 and 2. These metrics are used to compare the FCGD_G-L related optimizers with each other, without considering the existing optimal network model. The main hyperparameters are set as at the beginning of Section 4, with the order and momentum settings based on the best results discussed above. Namely, on DJIA, the lr is as in Equation (18), momentum = 0.9, α = 0.1 for FCSGD_G-L and α = 0.3 for FCAdam_G-L; on ETTh1, the lr is as in Equation (19), momentum = 0.8, α = −0.6 for FCSGD_G-L and α = 0.4 for FCAdam_G-L. At the end of each epoch, the train loss is compared, the model with the smallest train loss is saved, and this model is then evaluated on the test sets.

In Table 1, due to the high volatility of DJIA, using the full test set is not effective, and it is difficult to show the advantages and disadvantages of each optimizer. Therefore, only the first half of the DJIA test set is used in this paper. It can be seen from Table 1 that the four metrics of FCAdam_G-L are the best among the five optimizers. For FCSGD_G-L, the results of the four metrics are all better than those of SGD and SGDM. This indicates that FCGD_G-L has obvious advantages on DJIA.
In Table 2, FCSGD_G-L's MSE, RMSE and MAE have the best results.For FCAdam_G-L, the results of the four metrics are all better than Adam.This indicates that FCGD_G-L has obvious advantages in ETTh1.

Conclusions
On DJIA and ETTh1, for the train loss of FCGD_G-L, its convergence speed and convergence accuracy exceed the corresponding integer order optimizer.In addition, the evaluation effect on test sets is also better than the corresponding integer order optimizer.
Taking advantage of the fractional order long memory property, FCGD_G-L does not need additional momentum, because it is equivalent to containing momentum inside.In addition, because of the properties of the G-L fractional order definition, the addition of perturbations becomes flexible in the iteration process.By using the transforming formula of the G-L fractional order definition, the Gamma function is removed in the paper.In addition, FCGD_G-L includes the integer order and the fractional order, thus achieving the unification of both.
Algorithm 1 and Algorithm 2 make full use of the Autograd package of PyTorch to avoid the complicated derivation process in complex neural networks. The optimizers designed according to Algorithm 1 and Algorithm 2 are highly compatible with PyTorch, where they can be used just like any other existing optimizer. By adjusting the fractional order, the optimizer can be fine-tuned to obtain better convergence speed and convergence results. In Tables 1 and 2, the evaluation results on the test sets are also better than those of the integer order once the order is adjusted.
In the foreseeable future, we will further explore the influence of fractional calculus gradient descent on deep neural networks, how to select an appropriate order quickly, and how to reduce the number of hyperparameters. Applying FCGD_G-L in other fields is also a significant research direction.

Figure 1. The variation of q(j) with j.

Figure 2. The neural network structure of this paper.

Figure 4. Train loss comparison at DJIA by SGD.

Figure 6. Train loss comparison at DJIA by Adam.

Table 1. The best results for DJIA are bolded at MSE, RMSE, MAE, and MAPE.

Table 2. The best results for ETTh1 are bolded at MSE, RMSE, MAE, and MAPE.