Article

Fractional Optimizers for LSTM Networks in Financial Time Series Forecasting

1 Mathematics and Data Science Laboratory, Taza Multidisciplinary Faculty, Sidi Mohamed Ben Abdellah University, Fez 30000, Morocco
2 Faculty of Law, Economics and Social Sciences—LRMEF Fez, Sidi Mohamed Ben Abdellah University, Fez 30060, Morocco
3 Centre for Computational Science and Mathematical Modelling, Coventry University, Priory Road, Coventry CV1 5FB, UK
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(13), 2068; https://doi.org/10.3390/math13132068
Submission received: 16 May 2025 / Revised: 11 June 2025 / Accepted: 18 June 2025 / Published: 22 June 2025

Abstract

This study investigates the theoretical foundations and practical advantages of fractional-order optimization in computational machine learning, with a particular focus on stock price forecasting using long short-term memory (LSTM) networks. We extend several widely used optimization algorithms—including Adam, RMSprop, SGD, Adadelta, FTRL, Adamax, and Adagrad—by incorporating fractional derivatives into their update rules. This novel approach leverages the memory-retentive properties of fractional calculus to improve convergence behavior and model efficiency. Our experimental analysis evaluates the performance of fractional-order optimizers on LSTM networks tasked with forecasting stock prices for major companies such as AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH. Considering four metrics (Sharpe ratio, directional accuracy, cumulative return, and MSE), the results show that fractional orders can significantly enhance prediction accuracy for moderately volatile stocks, especially among lower-cap assets. However, for highly volatile stocks, performance tends to degrade with higher fractional orders, leading to erratic and inconsistent forecasts. In addition, fractional optimizers with short-memory truncation offer a favorable trade-off between computational efficiency and modeling accuracy in medium-frequency financial applications. Their enhanced capacity to capture long-range dependencies and robust performance in noisy environments further justify their adoption in such contexts. These results suggest that fractional-order optimization holds significant promise for improving financial forecasting models—provided that the fractional parameters are carefully tuned to balance memory effects with system stability.

1. Introduction

Financial time series prediction remains one of the most challenging tasks in the field of quantitative finance due to the inherent noise, volatility, and dynamic behavior of financial markets. According to the semi-strong form of market efficiency [1], asset prices reflect all publicly available information, theoretically limiting the predictability of future price movements. However, numerous empirical studies have identified persistent market anomalies that challenge this notion, suggesting that certain patterns can indeed be exploited [1]. Over the past decade, financial time series forecasting—whether framed as a regression or a classification problem—has attracted significant academic and industrial interest. Traditional statistical and machine learning models, while useful, often fall short in capturing the nonlinear and temporal dependencies intrinsic to financial data [2,3]. In contrast, deep learning methodologies—particularly long short-term memory (LSTM) networks—demonstrate superior performance in financial time-series forecasting, attributable to their capacity for modeling sequential dependencies and capturing long-term temporal patterns [4,5,6]. This capability is increasingly critical given escalating financial market complexity and heightened economic uncertainty, which amplify the demand for robust, adaptive forecasting systems [3]. Consequently, recurrent neural network (RNN) architectures, notably LSTMs, have garnered significant scholarly and practical interest in recent years, proving particularly effective for applications such as equity price prediction [6]. However, the effectiveness of these models is strongly influenced by the choice and performance of the optimization mechanisms employed during training. Despite notable advancements, comprehensive guidance remains scarce for financial practitioners regarding optimal model selection, architecture design, and implementation strategies, thus underscoring the necessity for continued research in this domain [2].
Classical optimizers, including Stochastic Gradient Descent (SGD) [7], Momentum [8], Nesterov Accelerated Gradient (NAG) [9], Adagrad [10], Adadelta [11], RMSprop [12], and Adam [13] are widely adopted because of their effectiveness and robustness in addressing challenging optimization problem landscapes. Newer developments, including AMSGrad [14], AdamW [15], AdaBelief [16], and Rectified Adam (RAdam) [17], have enhanced convergence stability and generalization performance, particularly for deep and recurrent architectures.
Despite this progress, all the aforementioned optimizers are founded on classical integer calculus and may inherently limit their capacity to grasp the complex temporal contingencies and long-term memory influences that characterize financial time-series data. Such restrictions prompt the investigation of alternative mathematical frameworks that naturally incorporate memory and hereditary characteristics within the training dynamics.
Fractional calculus, which extends traditional differentiation and integration to non-integer (fractional) orders, provides a powerful approach to this challenge. By substituting fractional-order derivatives for conventional derivatives in the update rules of optimization algorithms, it becomes possible to construct fractional-order optimizers that better capture long-term dependencies and dynamic patterns in the data.
This paper investigates the theoretical foundations and empirical performance of fractional-order optimization algorithms for tuning LSTM models in stock price forecasting tasks. We introduce fractional variants of commonly used optimizers and measure their influence on forecast accuracy across a range of company stock price datasets, including AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH. Our findings highlight that fractional-order optimization, when appropriately tailored, can significantly enhance model efficiency, especially in relatively stable financial markets, while also revealing its limitations in highly unstable market environments.
The principal contributions of this work can be summarized as follows:
  • We introduce a novel class of fractional-order training optimizers that extend conventional optimization algorithms (Adam, RMSprop, SGD, Adadelta, FTRL, Adamax, and Adagrad) by integrating fractional computation into their update rules, allowing memory and hereditary effects to be embedded in the training process.
  • We incorporate these fractional-order optimizers into the training phase of long short-term memory (LSTM) networks, establishing a new learning paradigm that enhances the model’s capacity to capture the long-term dependencies embedded in financial sequential data.
  • We apply the proposed fractional optimization schemes to time-series forecasting tasks in the stock market domain, using leading technology stock prices (e.g., AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH) as benchmark datasets.
  • We conduct in-depth empirical comparisons between traditional integer-order optimizers (such as Adam, RMSprop, Adagrad, SGD, Nadam, Adadelta, and AMSGrad) and their fractional counterparts, highlighting the potential benefits and challenges associated with fractional-order processing in financial forecasting applications.
The rest of the paper is structured as follows. In Section 2, we provide a succinct background on fractional calculus. Section 3 introduces the development of fractional-order optimizers and their integration into the LSTM training framework. Section 4 outlines the experimental setup, including a detailed description of the dataset, implementation details, evaluation metrics, and tuning strategies. Finally, Section 5 concludes the paper and discusses promising directions for future work.

2. Fractional Calculus

Fractional calculus has recently attracted significant attention in the machine learning (ML) community, particularly for its ability to model memory effects and non-local behavior. These properties make it a powerful tool for enhancing the dynamics of optimization algorithms, especially in sequence modeling tasks such as stock price forecasting using LSTM networks. Unlike traditional integer-order methods, fractional-order algorithms can encode historical gradient information more effectively, offering improved convergence behavior and potentially better generalization in noisy or volatile environments.
Motivated by these theoretical advantages, this study proposes fractional variants of several widely used optimizers—including Adam, RMSprop, SGD, Adadelta, FTRL, Adamax, and Adagrad—by integrating fractional derivatives into their update mechanisms. Our approach is grounded in a growing body of recent work that demonstrates the efficacy of fractional operators in neural network training and optimization. Notable contributions in this field include both theoretical formulations and empirical validations of fractional-order gradient descent strategies and their convergence behavior [18,19,20,21,22,23].
Fractional calculus is a modern extension of standard calculus to non-integer orders of derivatives and integrals. For a function f(t), the fractional derivative of order α, denoted D^α, is expressed as shown below [24]:
D^{\alpha} f(t) = \frac{d^{\alpha} f(t)}{dt^{\alpha}}
where α ∈ ℝ⁺. Fractional derivatives have proven beneficial in numerous domains, including signal processing, control theory, viscoelastic systems, and, increasingly, machine learning and optimization, owing to their ability to capture the memory and hereditary characteristics of systems.

2.1. Definitions of Fractional Derivatives

There are different kinds of fractional derivatives, tailored to meet the needs of individual applications. The most frequently used are the following:
  • Riemann–Liouville derivative [24]:
    D_{RL}^{\alpha} f(t) = \frac{1}{\Gamma(n-\alpha)} \frac{d^{n}}{dt^{n}} \int_{0}^{t} \frac{f(\tau)}{(t-\tau)^{\alpha-n+1}} \, d\tau, \quad n-1 < \alpha < n
  • Caputo derivative [25]:
    D_{C}^{\alpha} f(t) = \frac{1}{\Gamma(n-\alpha)} \int_{0}^{t} \frac{f^{(n)}(\tau)}{(t-\tau)^{\alpha-n+1}} \, d\tau, \quad n-1 < \alpha < n
    Caputo's version tends to be favored in physics and machine learning applications, since it allows initial conditions to be specified in the same form as for integer-order differential systems.
  • Grünwald–Letnikov derivative [26]:
    D_{GL}^{\alpha} f(t) = \lim_{h \to 0} \frac{1}{h^{\alpha}} \sum_{k=0}^{\lfloor t/h \rfloor} (-1)^{k} \binom{\alpha}{k} f(t-kh)
    This formulation is especially useful for numerical implementations and underlies many discrete-time approximations of fractional derivatives.

2.2. Approximation Methods of Fractional Derivatives

Because fractional derivatives are defined as integrals over time or history, computing them exactly is often impractical, especially in online optimization. Thus, approximation methods play a vital role in applications.
  • Short-memory principle: Computing the fractional derivative analytically is challenging, so the short-memory principle uses only the most recent M steps to approximate it. The estimated fractional derivative is given by
    D^{\alpha} f(t_n) \approx \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)} f(t_{n-k}),
    where ω_k^{(α)} = (−1)^k \binom{\alpha}{k} denotes the binomial coefficient associated with the fractional derivative, h is the step size, and M is the number of recent steps retained (a minimal numerical sketch of this truncation is given after this list).
  • Lubich's method: Based on convolution quadrature with generating functions, this approach provides stable and accurate approximations of fractional derivatives for stiff differential equations and optimization procedures [27].
  • Fractional backward differentiation formulas (FBDFs): These approaches extend conventional BDF schemes to fractional orders and are well suited to stiff or memory-intensive systems.
  • Adams–Bashforth–Moulton predictor–correctors: Fractional variants of ABM methods provide effective predictor–corrector schemes [28].
  • Adaptive and data-driven approaches: Recent research proposes approximating fractional operators with neural networks or kernel-based regressors that adapt to the function's behavior over time [29].
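To make the short-memory principle concrete, the following minimal NumPy sketch computes the truncated Grünwald–Letnikov weights and applies them to a sampled signal. The function names, the test signal, and the choice of M are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def gl_weights(alpha: float, M: int) -> np.ndarray:
    """Grunwald-Letnikov coefficients w_k = (-1)^k * binom(alpha, k), k = 0..M."""
    w = np.empty(M + 1)
    w[0] = 1.0
    for k in range(1, M + 1):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)   # standard recursion
    return w

def frac_derivative_short_memory(f: np.ndarray, alpha: float, h: float, M: int) -> np.ndarray:
    """Approximate D^alpha f at every sample using only the last M+1 samples."""
    w = gl_weights(alpha, M)
    d = np.zeros_like(f, dtype=float)
    for n in range(len(f)):
        k_max = min(n, M)
        d[n] = np.dot(w[: k_max + 1], f[n - np.arange(k_max + 1)]) / h ** alpha
    return d

# Example: half-order derivative of f(t) = t on [0, 1] with a 30-step memory window.
t = np.linspace(0.0, 1.0, 101)
d_half = frac_derivative_short_memory(t, alpha=0.5, h=t[1] - t[0], M=30)
print(d_half[-1])   # short-memory estimate of t^{0.5}/Gamma(1.5) at t = 1; larger M tightens it
```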

2.3. Theoretical Rationale Behind Fractional Order Memory and Market Dynamics

To understand why fractional LSTM models perform better with small memory orders, it is important to examine the mathematical behavior of fractional derivatives. In this subsection, we offer an intuitive and semi-formal interpretation of how the Caputo derivative introduces long-term memory and temporal smoothing, which aligns with empirical behaviors observed in financial markets such as volatility clustering and trend persistence. While a full theoretical treatment is beyond the scope of this empirical study, we summarize the key mechanisms that link fractional dynamics to financial signal structure.

2.3.1. Fractional Derivatives and Memory Effects

The use of Caputo fractional derivatives introduces non-local memory into the system via a convolution with a power-law kernel; see Equation (3).
This expression reveals that the fractional order α acts as a memory parameter, where lower α values increase the weight assigned to older values of f ( t ) . That is, the system retains long-term memory with a power-law decay governed by
w(t-\tau) \propto (t-\tau)^{-\alpha}
as compared to the exponential decay in traditional LSTM gates.

2.3.2. Temporal Filtering and Volatility Sensitivity

The long-memory behavior of the fractional derivative effectively acts as a temporal low-pass filter [30]. In the context of financial signals,
  • When α 0 , the model heavily smooths the input, making it more robust to noise and transient fluctuations—ideal for steady stocks like AAPL.
  • When α 1 , the filter approximates a first-order derivative, recovering a reactive model that is sensitive to short-term fluctuations—better suited for high-volatility assets like META.
This filtering effect can be interpreted using the frequency-domain response of the fractional derivative, where the Laplace transform of the Caputo derivative is
\mathcal{L}\left\{ {}^{C}D_{t}^{\alpha} f(t) \right\} = s^{\alpha} \tilde{f}(s) - \sum_{k=0}^{n-1} s^{\alpha-k-1} f^{(k)}(0^{+})
This shows that, as α decreases, the operator amplifies high-frequency components less strongly relative to low-frequency trends, thereby suppressing noise.
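As a rough numerical illustration (not part of the original analysis), the steady-state gain of the fractional differentiator at angular frequency ω is |(iω)^α| = ω^α, so the gain ratio between a "noise" frequency and a "trend" frequency shrinks as α decreases; the frequencies below are arbitrary choices:

```python
# Gain of the fractional differentiator s^alpha on the imaginary axis: |(i*w)^alpha| = w**alpha.
omega_trend, omega_noise = 0.1, 10.0          # illustrative low and high frequencies
for alpha in (0.2, 0.5, 1.0):
    ratio = (omega_noise / omega_trend) ** alpha
    print(f"alpha = {alpha}: noise-to-trend gain ratio = {ratio:.1f}")
# alpha = 1.0 -> 100.0, alpha = 0.5 -> 10.0, alpha = 0.2 -> ~2.5:
# smaller alpha amplifies high-frequency (noisy) components far less relative to slow trends.
```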

2.3.3. Relation to Financial Market Structure

Financial time series are well known to exhibit volatility clustering and heteroskedasticity, as modeled in ARCH/GARCH models. The heterogeneous memory hypothesis [31] and empirical studies [32] suggest that smoother markets (with lower realized volatility) exhibit more persistent autocorrelations. In such cases, a fractional model with small α provides a more appropriate inductive bias.
This relationship has also been exploited in fractional Brownian motion (fBm) models, where the Hurst exponent H quantifies long-memory behavior:
\mathbb{E}\left[ \left( B_{H}(t+\tau) - B_{H}(t) \right)^{2} \right] \propto \tau^{2H}
Lower α values in our model correspond functionally to higher H values (i.e., smoother paths), and thus align well with trend-following markets.
Therefore, the enhanced performance of small α values for smooth equities is not coincidental but rooted in
  • The power-law memory integration of the Caputo derivative.
  • The temporal smoothing inherent in fractional orders.
  • The market-dependent spectral properties of financial time series.
With these techniques, fractional-order strategies can be conveniently employed in optimization, especially for time-series training and recurrent neural networks such as LSTM, in which the detection of long-term dependencies is essential.

3. Fractional Optimizers for LSTM Learning

In this section, we introduce fractional-order variants of widely used optimization algorithms in machine learning. These fractional optimizers incorporate additional hyperparameters that influence training dynamics, memory effects, and convergence stability—factors that are particularly critical when adapting LSTM models to volatile and non-stationary financial time series. The overall methodology applied in this section is summarized in Figure 1.

3.1. Loss Function in LSTM for Time Series Forecasting

Recently, deep learning tools such as LSTM networks have attracted increasing attention in the field of time series forecasting. These models require powerful optimization methods for effective training, particularly when dealing with complex datasets such as financial time series. Standard optimization techniques, including Stochastic Gradient Descent (SGD) and its variants, often struggle to capture long-range dependencies in the data. To address this limitation, we explore the potential advantages of incorporating fractional calculus into optimization procedures, specifically through the use of fractional derivatives, to better model such dependencies in a meaningful way.
In time series prediction, the loss function measures the difference between the model’s predicted values and the actual observations. It serves as the objective that the optimizer minimizes during training by adjusting the model’s weights to improve forecasting accuracy.
The most widely used loss function in time series forecasting is the Mean Squared Error (MSE), defined as
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{T} \sum_{t=1}^{T} \left( y_t - \hat{y}_t \right)^{2}
where
  • y t is the actual value at time step t;
  • y ^ t is the predicted value at time step t;
  • T is the total number of time steps in the sequence.
The MSE penalizes large errors more heavily, which suits financial time-series data, where extreme volatility and outliers have significant consequences. However, when robustness to outliers is required, the Mean Absolute Error (MAE) or the Huber loss may be preferred:
\mathcal{L}_{\mathrm{MAE}} = \frac{1}{T} \sum_{t=1}^{T} \left| y_t - \hat{y}_t \right|
The choice of loss function depends on the characteristics of the data and the specific objectives of the forecasting task, for example, tracking a trend or predicting a particular price level.
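For reference, a minimal NumPy sketch of the two losses above, together with the Huber loss mentioned as the robust alternative, might look as follows (the delta threshold is an illustrative choice):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # Quadratic for small residuals, linear beyond delta: robust to outliers.
    r = np.abs(y - y_hat)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))
```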

3.2. LSTM Architecture and Mathematical Formulation

LSTM networks are specifically designed to capture long-term dependencies in sequential data, overcoming the limitations of standard recurrent neural networks (RNNs) through the use of powerful gating mechanisms that control the flow of information within and between cells. The cell state C t serves as the memory component, retaining long-term information, and is updated at each time step according to the following equations; see Figure 2.
1. Forget Gate: The forget gate determines which information from the previous time step should be discarded from the cell state. It outputs a value between 0 and 1, with 1 meaning "keep everything" and 0 meaning "forget everything".
f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right)
2. Input Gate: The input gate controls how much new information is admitted into the cell state. It also produces a candidate update that modifies the cell state.
i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right)
\tilde{C}_t = \tanh\left( W_C \cdot [h_{t-1}, x_t] + b_C \right)
3. Cell State Update: The cell state C_t is updated by combining the forget gate's decision on how much previous memory to retain with the input gate's contribution of new information.
C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t
4. Output Gate: The output gate determines the next hidden state h_t, which is passed to the following time step and also serves as the output of the LSTM.
o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right)
h_t = o_t \cdot \tanh(C_t)
These gates operate jointly, allowing the LSTM to selectively forget, update, and emit information, which enables it to capture long-term relationships in time-series data; see Figure 3.
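To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM forward step mirroring the formulas above; the toy dimensions, weight initialization, and input sequence are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. Each W[k] maps [h_{t-1}, x_t] to a gate pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde          # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

# Toy dimensions: 1 input feature (a scaled price), 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 1, 4
W = {k: 0.1 * rng.standard_normal((n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in np.array([[0.2], [0.25], [0.3]]):      # a tiny input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h)
```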

3.3. State of the Art: Optimization Methods for LSTM Learning

Training long short-term memory (LSTM) networks efficiently remains a vital research challenge in the field of deep learning. Optimization algorithms play a critical role in ensuring convergence speed, robustness, and generalization. The choice of optimizer significantly affects the performance, particularly for time-series data where sequential dependencies are central.
Classical Methods: Stochastic Gradient Descent (SGD) has historically been the foundational optimizer in deep learning [33] but suffers from slow convergence and sensitivity to learning rates. Momentum-based methods partially address these issues by smoothing gradients [8].
Adaptive First-order Methods: The introduction of adaptive learning methods such as AdaGrad [10], RMSProp [12], and Adam [13] significantly advanced LSTM training. Adam combines the benefits of RMSProp and momentum, becoming a standard in sequence modeling due to its stability and speed.
Second-order Methods: More computationally intensive approaches such as Hessian-Free Optimization [34] and Natural Gradient Descent [35] have been explored for LSTM training, especially in settings where curvature information improves convergence.
Fractional-Order Optimization: Recently, fractional-order derivatives have been applied to gradient descent, giving rise to fractional optimizers such as FracSGD, FracAdam, and FracRMSProp [36,37]. These optimizers introduce memory and long-range dependency characteristics beneficial for sequence learning, and have shown empirical improvements in noisy and sparse environments.
Hybrid and Custom Strategies: Other methods like Lookahead [38], RAdam [17], and Yogi [39] aim to refine convergence trajectories and deal with unstable variance during training. Such techniques are often combined with LSTM architectures in financial forecasting, language modeling, and anomaly detection.
As LSTM models continue to be applied in more complex and data-intensive settings, research increasingly focuses on optimizer robustness, generalization under limited data, and convergence guarantees under non-convex loss landscapes.

3.4. Fractional Adam

Adam’s optimizer represents an optimization algorithm largely adopted in deep learning that fuses momentum and adaptive training factors. It performs two estimations of momentum—the primary momentum ( m t ) and the secondary momentum ( v t )—and adjusts the training rates for every parameter according to these estimations. Adam’s classic adjustment rules consist of the following:
m t = β 1 m t 1 + ( 1 β 1 ) θ J ( θ )
v t = β 2 v t 1 + ( 1 β 2 ) θ J ( θ ) 2
where θ J ( θ ) represents the gradient of the loss function against the machine configuration; β 1 and β 2 denote the exponential decline factors at the first and second juncture, correspondingly.
Next, the resulting approximations are adjusted to incorporate the zero entry:
m ^ t = m t 1 β 1 t , v ^ t = v t 1 β 2 t
Lastly, the configuration is adjusted according to the following formula:
θ = θ η m ^ t v ^ t + ϵ
where η represents the training rate and ϵ is a tiny fixed value to exclude the possibility of division by 0.

3.4.1. Fractional Adam with General Fractional Gradient

In the fractional version of Adam, rather than employing the traditional gradient ∇_θ J(θ), we use a fractional gradient. This operator allows long-term dependencies to be incorporated into the optimization workflow, which is especially relevant for time-series forecasting tasks. The update rules for fractional Adam are as follows:
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, D^{\alpha}_{\theta} J(\theta)
v_t^{(\alpha)} = \beta_2 v_{t-1}^{(\alpha)} + (1-\beta_2)\, \left( D^{\alpha}_{\theta} J(\theta) \right)^{2}
Here, D^α denotes the fractional derivative of order α, which captures long-term memory and dependence in the gradient updates.

3.4.2. Short-Memory Estimation of the Fractional Derivative

Based on the approximation given by Equation (5), the update rules for fractional Adam become
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k}
v_t^{(\alpha)} = \beta_2 v_{t-1}^{(\alpha)} + (1-\beta_2)\, \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k} \right)^{2}
These updates incorporate the short-memory approximation, improving the computational efficiency of fractional Adam without sacrificing the advantages of fractional calculus.

3.4.3. Training Regulation in the Framework of LSTMs

In the context of LSTM networks, the parameters θ comprise the weights and biases governing the hidden-state and cell-state transitions in the LSTM units. The training objective is to minimize a loss function quantifying the deviation of the predictions from the observed values of the time series. The fractional Adam optimizer described above is used to update the LSTM parameters during training.
In this case, the update rule is as follows:
\theta = \theta - \eta \, \frac{\hat{m}_t^{(\alpha)}}{\sqrt{\hat{v}_t^{(\alpha)}} + \epsilon}
where m̂_t^(α) and v̂_t^(α) are the bias-corrected first and second moment estimates computed from fractional gradients. With these updates, the LSTM can more effectively capture long-term dependencies in the time-series data, making the model better suited to predictive tasks such as stock price forecasting.
With fractional Adam, the LSTM can handle the temporal dependencies intrinsic to market data, thereby improving forecast quality, notably in turbulent market conditions.

3.5. Fractional Nadam

Nadam (Nesterov-accelerated Adam) combines Adam with Nesterov momentum. The principal difference from Adam is that Nadam applies the momentum look-ahead when updating the parameters, which can accelerate convergence. Nadam's basic update rules are as follows:
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \nabla_{\theta} J(\theta)
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, \left( \nabla_{\theta} J(\theta) \right)^{2}
where ∇_θ J(θ) is the gradient of the loss function with respect to the parameters θ, and β_1 and β_2 are the decay rates of the two moment estimates.
For Nadam, the updates also involve
\tilde{m}_t = \beta_1 m_{t-1} + (1-\beta_1)\, D^{\alpha} J(\theta),
where D^α J(θ) denotes the fractional gradient of the loss function.
Bias correction is computed as follows:
\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}
The parameter update rule is then:
\theta = \theta - \eta \, \frac{\beta_1 \hat{m}_t + (1-\beta_1)\, D^{\alpha} J(\theta)}{\sqrt{\hat{v}_t} + \epsilon}
where η is the learning rate and ε is a small constant to avoid division by zero.

3.5.1. Fractional Nadam Using a Generalized Fractional Gradient

The general update rule for fractional Nadam substitutes the fractional gradient D^α_θ J(θ) for the conventional gradient:
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, D^{\alpha}_{\theta} J(\theta)
v_t^{(\alpha)} = \beta_2 v_{t-1}^{(\alpha)} + (1-\beta_2)\, \left( D^{\alpha}_{\theta} J(\theta) \right)^{2}
where β_1 and β_2 are exponential decay rates and D^α_θ J(θ) is the fractional gradient.
Bias corrections are applied in the same way as in conventional Nadam:
\hat{m}_t^{(\alpha)} = \frac{m_t^{(\alpha)}}{1-\beta_1^{t}}, \qquad \hat{v}_t^{(\alpha)} = \frac{v_t^{(\alpha)}}{1-\beta_2^{t}}
The updated parameters are then computed according to the following rule:
\theta = \theta - \eta \, \frac{\beta_1 \hat{m}_t^{(\alpha)} + (1-\beta_1)\, D^{\alpha}_{\theta} J(\theta)}{\sqrt{\hat{v}_t^{(\alpha)}} + \epsilon}

3.5.2. Short-Memory Approximation of Fractional Gradient

Based on the approximation given by Equation (5), the fractional Nadam updates take the form
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k}
v_t^{(\alpha)} = \beta_2 v_{t-1}^{(\alpha)} + (1-\beta_2)\, \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k} \right)^{2}
Therefore, the rule for updating the fractional Nadam parameters under the short-memory estimation is as follows:
\theta = \theta - \eta \, \frac{\beta_1 \hat{m}_t^{(\alpha)} + (1-\beta_1)\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k}}{\sqrt{\hat{v}_t^{(\alpha)}} + \epsilon}

3.5.3. LSTM Training Policy

In LSTM networks, the parameters θ are adjusted during training to minimize the loss function. The fractional Nadam optimizer is used to update the LSTM parameters during learning, as shown below:
\theta = \theta - \eta \, \frac{\beta_1 \hat{m}_t^{(\alpha)} + (1-\beta_1)\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k}}{\sqrt{\hat{v}_t^{(\alpha)}} + \epsilon}
This update rule integrates the fractional gradient, allowing the LSTM to capture long-term relationships effectively. This is especially beneficial in tasks like time-series prediction, where accurate forecasting requires modeling complex temporal dependencies.
The incorporation of fractional derivatives improves the optimizer’s capacity to manage complex time-varying dynamics, rendering it especially appropriate for predictive purposes, including stock market forecasting [40].

3.6. Fractional SGD (Frac-SGD)

Stochastic Gradient Descent (SGD) is the most basic optimizer and provides the foundation for many others. Its traditional update rule is the following:
\theta = \theta - \eta \, \nabla_{\theta} J(\theta)
where θ denotes the parameters (weights and biases), η is the learning rate, and ∇_θ J(θ) is the standard gradient of the loss function J(θ).

3.6.1. General Fractional SGD (Frac-SGD)

In Fractional SGD, the classical gradient ∇_θ J(θ) is replaced by the fractional derivative D^α_θ J(θ), introducing a memory effect:
\theta = \theta - \eta \, D^{\alpha}_{\theta} J(\theta)
where D^α_θ J(θ) is the fractional gradient of order α, with 0 < α ≤ 1, and α = 1 recovers classical SGD.
The fractional order α, which controls the influence of previous gradients, enables the optimizer to incorporate past information, which is particularly relevant for learning long-term dependencies in time-series data.

3.6.2. Short-Memory Approximation of Fractional Gradient

Based on the approximation given by Equation (5), the Frac-SGD update is given by
\theta = \theta - \eta \, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
This approximate fractional gradient aggregates contributions from the recent gradient history, capturing temporal dynamics without having to compute a full fractional derivative over the entire history.

3.6.3. Learning Rule in the Context of LSTM

During LSTM network training, Frac-SGD adjusts the LSTM parameters (weights and biases) according to the fractional gradient of the loss with respect to those parameters:
\theta = \theta - \eta \, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
Here, θ stands for all LSTM parameters, and J(θ) is commonly the cross-entropy loss or the root mean square error (RMSE), depending on the task.
The fractional order α incorporates long-term memory into the weight updates, which is particularly useful when training LSTM models on financial time series or any other sequential data involving long-term dependencies [41].
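A self-contained sketch of this update on a toy scalar problem is shown below; the loss, step size, memory length, and iteration count are illustrative assumptions.

```python
import numpy as np
from collections import deque

# Grunwald-Letnikov weights w_k = (-1)^k * binom(alpha, k), truncated at M (short memory).
alpha, M, lr, h = 0.7, 10, 0.1, 1.0
w = np.empty(M + 1)
w[0] = 1.0
for k in range(1, M + 1):
    w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)

# Frac-SGD on a toy loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
history = deque(maxlen=M + 1)       # most recent gradient first
theta = 0.0
for _ in range(500):
    history.appendleft(2.0 * (theta - 3.0))
    frac_grad = sum(wk * g for wk, g in zip(w, history)) / h ** alpha
    theta -= lr * frac_grad
print(theta)   # settles near 3; the truncated weight sum rescales the effective step size
```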

3.7. Fractional RMSprop (Frac-RMSprop)

RMSprop (Root Mean Square Propagation) is an adaptive training technique that normalizes gradient updates using a moving average of squared gradients. The standard RMSprop update is as follows:
E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta)\, \left( \nabla_{\theta} J(\theta) \right)^{2}
\theta = \theta - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, \nabla_{\theta} J(\theta)
where θ denotes the model parameters, η is the learning rate, β is the decay rate of the moving average, and ε is a small constant that ensures numerical stability.

3.7.1. General Fractional RMSprop (Frac-RMSprop)

In Fractional RMSprop, the classical gradient ∇_θ J(θ) is replaced by the fractional gradient D^α_θ J(θ). The updates take the form
E[D^{\alpha} g^2]_t = \beta E[D^{\alpha} g^2]_{t-1} + (1-\beta)\, \left( D^{\alpha}_{\theta} J(\theta) \right)^{2}
\theta = \theta - \frac{\eta}{\sqrt{E[D^{\alpha} g^2]_t + \epsilon}}\, D^{\alpha}_{\theta} J(\theta)
where D^α_θ J(θ) is the fractional derivative of order α.
The fractional derivative incorporates a memory of past gradients into the optimization procedure, providing greater smoothness and responsiveness to long-term patterns, which is especially beneficial in non-stationary or turbulent settings.

3.7.2. Short-Memory Approximation of the Fractional Gradient

Based on the approximation given by Equation (5), the fractional gradient term D^α_θ J(θ) and its square are approximated in Frac-RMSprop using the recent gradient history:
D^{\alpha}_{\theta} J(\theta_t) \approx \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
This practical formulation enables efficient training while still leveraging fractional dynamics.

3.7.3. Learning Rule in the Context of LSTM

During LSTM training, the Frac-RMSprop algorithm updates the network parameters by integrating the fractional gradient history:
E[D^{\alpha} g^2]_t = \beta E[D^{\alpha} g^2]_{t-1} + (1-\beta) \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) \right)^{2}
\theta = \theta - \frac{\eta}{\sqrt{E[D^{\alpha} g^2]_t + \epsilon}}\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
Here, θ stands for all LSTM parameters, and J(θ) is typically the loss function used for the forecasting task. This fractional method enhances the LSTM's capacity to adapt to complex, time-dependent patterns in the data [29].

3.8. Fractional Adagrad (Frac-Adagrad)

Adagrad (Adaptive Gradient Algorithm) adapts the learning rate of each parameter according to the magnitude of its past gradients. The standard Adagrad updates are as follows:
G_t = G_{t-1} + \left( \nabla_{\theta} J(\theta) \right)^{2}
\theta = \theta - \frac{\eta}{\sqrt{G_t + \epsilon}}\, \nabla_{\theta} J(\theta)
where θ denotes the model parameters, η is the initial learning rate, G_t is the cumulative sum of squared gradients, and ε is a small constant to avoid division by zero.
As G_t grows, the effective learning rate shrinks, making Adagrad well suited to sparse-data problems.

3.8.1. General Fractional Adagrad (Frac-Adagrad)

In Fractional Adagrad, the conventional gradient is replaced by the fractional gradient D^α_θ J(θ), introducing a memory component into the adaptation of learning rates:
G_t^{(\alpha)} = G_{t-1}^{(\alpha)} + \left( D^{\alpha}_{\theta} J(\theta) \right)^{2}
\theta = \theta - \frac{\eta}{\sqrt{G_t^{(\alpha)} + \epsilon}}\, D^{\alpha}_{\theta} J(\theta)
where D^α_θ J(θ) is the fractional derivative of the loss with respect to the parameters.
This fractional accumulation better captures the long-term evolution of the gradient, providing an effective means of handling non-stationary or time-varying dependencies in the data.

3.8.2. Short-Memory Approximation of the Fractional Gradient

Computing the full fractional gradient over the entire history is expensive. Using the short-memory principle, the fractional gradient can be estimated from the last M steps (using Equation (5)):
D^{\alpha}_{\theta} J(\theta_t) \approx \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
Thus, the updates become
G_t^{(\alpha)} = G_{t-1}^{(\alpha)} + \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) \right)^{2}
\theta = \theta - \frac{\eta}{\sqrt{G_t^{(\alpha)} + \epsilon}}\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
This approximation preserves the memory effect while remaining computationally tractable.

3.8.3. Learning Rule in the Context of LSTM

During LSTM training, the Frac-Adagrad optimizer updates the parameters by incorporating the recent gradient history:
G_t^{(\alpha)} = G_{t-1}^{(\alpha)} + \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) \right)^{2}
\theta = \theta - \frac{\eta}{\sqrt{G_t^{(\alpha)} + \epsilon}}\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
This fractional variant allows LSTM models to adapt learning rates over time, taking both recent and older gradient information into account, thereby improving learning dynamics for time-series and sequential tasks [40].

3.9. Fractional Adadelta (Frac-Adadelta)

Adadelta is an enhancement of Adagrad that restricts the accumulation of squared gradients to a fixed window, preventing the learning rate from decaying indefinitely. The standard Adadelta updates are as follows:
E[g^2]_t = \beta E[g^2]_{t-1} + (1-\beta)\, \left( \nabla_{\theta} J(\theta) \right)^{2}
\Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\, \nabla_{\theta} J(\theta)
\theta = \theta + \Delta\theta_t
where E[g^2]_t is the decaying average of past squared gradients, E[Δθ^2]_t is the decaying average of past squared updates, β is the decay rate, and ε is a small constant for numerical stability.
In contrast to Adagrad, Adadelta adjusts learning rates automatically from recent gradient information rather than requiring a hand-tuned initial learning rate η.

3.9.1. General Fractional Adadelta (Frac-Adadelta)

In the fractional variant, the conventional gradient is replaced by the fractional derivative D^α_θ J(θ), enriching the memory and dynamics of the updates:
E[D^{\alpha} g^2]_t = \beta E[D^{\alpha} g^2]_{t-1} + (1-\beta)\, \left( D^{\alpha}_{\theta} J(\theta) \right)^{2}
\Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[D^{\alpha} g^2]_t + \epsilon}}\, D^{\alpha}_{\theta} J(\theta)
\theta = \theta + \Delta\theta_t
where D^α_θ J(θ) is the fractional derivative of the loss function with respect to the parameters.
This extension introduces a controllable memory effect through the fractional order α, capturing deeper temporal dependencies.

3.9.2. Short-Memory Approximation of the Fractional Gradient

Using the short-memory principle, the fractional gradient D^α_θ J(θ_t) can be approximated over the most recent M steps (using Equation (5)):
D^{\alpha}_{\theta} J(\theta_t) \approx \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
The Frac-Adadelta updates become
E[D^{\alpha} g^2]_t = \beta E[D^{\alpha} g^2]_{t-1} + (1-\beta)\, \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) \right)^{2}
\Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[D^{\alpha} g^2]_t + \epsilon}}\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
\theta = \theta + \Delta\theta_t
This approach preserves the benefits of Adadelta while efficiently incorporating fractional memory.

3.9.3. Learning Rule in the Context of LSTM

During LSTM training, Frac-Adadelta adapts the learning rates while integrating fractional memory:
E[D^{\alpha} g^2]_t = \beta E[D^{\alpha} g^2]_{t-1} + (1-\beta)\, \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) \right)^{2}
\Delta\theta_t = - \frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[D^{\alpha} g^2]_t + \epsilon}}\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
\theta = \theta + \Delta\theta_t
This enables the LSTM to adapt more smoothly to changing temporal dynamics during training, especially when processing long or noisy sequences [41].

3.10. Fractional FTRL (Frac-FTRL)

The Follow-The-Regularized-Leader (FTRL) approach is used extensively in online learning and sparse optimization. It combines gradient accumulation with regularization to update the model parameters. The basic FTRL updates are as follows:
g_t = \nabla_{\theta} J(\theta)
z_t = z_{t-1} + g_t - \frac{1}{\lambda}\, \theta
\theta = \mathrm{prox}_{\lambda}(z_t)
where g_t is the standard gradient at time t, z_t accumulates the gradients together with a shrinkage term, λ is the regularization strength, and prox_λ denotes a proximal operator incorporating L1 and/or L2 regularization.
FTRL is especially effective in high-dimensional applications, as it encourages sparsity and controls the magnitude of the weights.

3.10.1. General Fractional FTRL (Frac-FTRL)

In the fractional version, the classical gradient is replaced by the fractional derivative D^α_θ J(θ), enabling the algorithm to better capture long-term relationships:
g_t^{(\alpha)} = D^{\alpha}_{\theta} J(\theta)
z_t^{(\alpha)} = z_{t-1} + g_t^{(\alpha)} - \frac{1}{\lambda}\, \theta
\theta = \mathrm{prox}_{\lambda}(z_t^{(\alpha)})
where g_t^(α) is the fractional gradient of the loss and z_t^(α) accumulates the fractional gradients together with the regularization correction.
The fractional formulation endows FTRL with a deeper memory structure, which is helpful in nonstationary or volatile contexts.

3.10.2. Short-Memory Approximation of the Fractional Gradient

The fractional gradient g_t^(α) can be estimated over a limited window of recent gradients:
g_t^{(\alpha)} \approx \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
Thus, the Frac-FTRL update becomes
z_t^{(\alpha)} = z_{t-1} + \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) - \frac{1}{\lambda}\, \theta
\theta = \mathrm{prox}_{\lambda}(z_t^{(\alpha)})
With this approximation, Frac-FTRL can be deployed in practice without requiring access to the full gradient history.

3.10.3. Learning Rule in the Context of LSTM

During LSTM training, the Frac-FTRL updates dynamically balance gradient information and regularization:
z_t^{(\alpha)} = z_{t-1} + \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) - \frac{1}{\lambda}\, \theta
\theta = \mathrm{prox}_{\lambda}(z_t^{(\alpha)})
This allows LSTM models to preserve robustness and sparse representations while adapting to the nonstationary and highly dynamic time-series patterns typically found in financial time-series modeling and continuous data scenarios [41].

3.11. Fractional Adamax (Frac-Adamax)

Adamax is a variant of the Adam optimizer that employs the infinity norm to compute parameter updates. The basic Adamax update rules are the following:
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \nabla_{\theta} J(\theta)
u_t = \max\left( \beta_2 u_{t-1},\, \left| \nabla_{\theta} J(\theta) \right| \right)
\theta = \theta - \frac{\eta}{u_t}\, m_t
where m_t is the momentum term, u_t is the running maximum of the gradient magnitude, η is the learning rate, and β_1 and β_2 control the momentum and normalization behavior.

3.11.1. General Fractional Adamax (Frac-Adamax)

In the fractional version, the classical gradient is replaced by the fractional gradient D^α_θ J(θ), introducing memory effects and incorporating long-term dependencies into the updates:
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, D^{\alpha}_{\theta} J(\theta)
u_t^{(\alpha)} = \max\left( \beta_2 u_{t-1}^{(\alpha)},\, \left| D^{\alpha}_{\theta} J(\theta) \right| \right)
\theta = \theta - \frac{\eta}{u_t^{(\alpha)}}\, m_t^{(\alpha)}
Here, m_t^(α) is the fractional momentum term and u_t^(α) is the fractional running maximum of the gradient magnitudes.
This change enhances Adamax's sensitivity and robustness under frequent or erratic updates, especially for non-stationary time series or noisy gradients.

3.11.2. Short-Memory Approximation of the Fractional Gradient

To compute the fractional derivative in practice, we estimate the fractional gradient over a bounded window of recent gradients:
D^{\alpha}_{\theta} J(\theta) \approx \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
Thus, the Frac-Adamax update rule becomes
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
u_t^{(\alpha)} = \max\left( \beta_2 u_{t-1}^{(\alpha)},\, \left| \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) \right| \right)
\theta = \theta - \frac{\eta}{u_t^{(\alpha)}}\, m_t^{(\alpha)}
This approximation reduces the computational cost while retaining the benefits of fractional derivatives.

3.11.3. Learning Rule in the Context of LSTM

For LSTM networks, Frac-Adamax provides a more robust and efficient optimization scheme, particularly when training is sensitive to irregular time-series data. The update procedure for training LSTMs with Frac-Adamax is as follows:
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k})
u_t^{(\alpha)} = \max\left( \beta_2 u_{t-1}^{(\alpha)},\, \left| \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta_{t-k}) \right| \right)
\theta = \theta - \frac{\eta}{u_t^{(\alpha)}}\, m_t^{(\alpha)}
This helps LSTM networks capture long-term dependencies efficiently while maintaining stability, even in the presence of large gradients or significant parameter changes [40].

3.12. Fractional AdamW Lion

Fractional AdamW Lion combines the fractional derivative formalism with AdamW’s decoupled weight decay and Lion’s momentum-guided updates using the sign of gradients. This hybrid method leverages long-memory dynamics from fractional calculus while incorporating modern optimization strategies to enhance convergence and generalization, especially within sequential models like LSTMs.

3.12.1. Lion’s Momentum and Update Rule

The Lion (EvoLved Sign Momentum) optimizer is based on the following momentum and update equations:
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \nabla_{\theta} J(\theta)
\theta = \theta - \eta \cdot \mathrm{sign}(m_t)
Lion uses a sign-based update direction rather than gradient magnitude. AdamW, meanwhile, applies decoupled weight decay as
\theta \leftarrow \theta - \eta \cdot \lambda \cdot \theta
where λ is the weight decay coefficient.

3.12.2. Fractional AdamW Lion Update Equations

Using the short-memory approximation of Equation (5), the fractional momentum update becomes
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1) \cdot \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k}
Then, the parameter update rule is
\theta = \theta - \eta \cdot \mathrm{sign}(m_t^{(\alpha)}) - \eta \cdot \lambda \cdot \theta
This equation integrates three mechanisms: fractional memory via m t ( α ) , sign-based adaptive direction, and decoupled regularization with weight decay.

3.12.3. Application to LSTM Training

In LSTM networks, the parameter vector θ includes weights and biases governing input, forget, and output gates, along with cell state transitions. Fractional AdamW Lion enables capturing long-range temporal dependencies, regularized and stable training, and directionally aware optimization using fractional history.
This makes the optimizer especially effective for time series tasks such as financial forecasting, anomaly detection, and biomedical signal analysis.
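A compact NumPy sketch of the combined update (fractional momentum, sign-based step, decoupled weight decay) is given below; the buffer length, learning rate, decay values, and toy loss are assumptions for illustration only.

```python
import numpy as np
from collections import deque

def frac_adamw_lion_step(theta, grads, w, m, lr=1e-3, beta1=0.9, wd=1e-2, h_alpha=1.0):
    """grads: deque of past gradients (newest first); m: fractional momentum state."""
    frac_grad = sum(wk * g for wk, g in zip(w, grads)) / h_alpha   # truncated D^alpha gradient
    m = beta1 * m + (1.0 - beta1) * frac_grad                      # fractional momentum
    theta = theta - lr * np.sign(m) - lr * wd * theta              # sign step + decoupled decay
    return theta, m

# Toy usage on J(theta) = ||theta||^2 / 2 (gradient = theta), with alpha = 0.5 and M = 20.
alpha, M = 0.5, 20
w = np.empty(M + 1)
w[0] = 1.0
for k in range(1, M + 1):
    w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)   # Grunwald-Letnikov weights
grads, m, theta = deque(maxlen=M + 1), 0.0, np.array([1.0, -2.0])
for _ in range(300):
    grads.appendleft(theta.copy())
    theta, m = frac_adamw_lion_step(theta, grads, w, m)
print(theta)   # the sign-based steps move theta toward 0 in roughly lr-sized increments
```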

3.13. Fractional Yogi

Yogi is an adaptive optimization algorithm designed to address some of the limitations of Adam, particularly in controlling the variance of the second moment estimate v t . Unlike Adam, which uses an exponential moving average, Yogi updates v t with a sign-based correction term that prevents rapid growth and promotes stability during training.
The classic Yogi update rules are
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \nabla_{\theta} J(\theta)
v_t = v_{t-1} - (1-\beta_2) \cdot \mathrm{sign}\left( v_{t-1} - \left( \nabla_{\theta} J(\theta) \right)^{2} \right) \cdot \left( \nabla_{\theta} J(\theta) \right)^{2}
Here, θ J ( θ ) denotes the gradient of the loss function with respect to the parameters, while β 1 and β 2 are decay rates for the first and second moment estimates, respectively. The function sign ( · ) represents the element-wise sign operation that regulates the adjustment of v t to avoid uncontrolled variance growth.
The bias-corrected moment estimates are then computed as
\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}
Finally, the parameter update step is given by
\theta = \theta - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
where η is the learning rate and ε is a small constant ensuring numerical stability.

3.13.1. Fractional Yogi with General Fractional Gradient

In the fractional variant, the classical gradient ∇_θ J(θ) is replaced by its fractional derivative D^α_θ J(θ), with D^α representing a fractional derivative operator of order α. This allows the optimizer to leverage long-term memory effects embedded in the gradient information, which is particularly useful in sequential and time-dependent tasks.
The fractional moment updates are expressed as follows:
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, D^{\alpha}_{\theta} J(\theta)
v_t^{(\alpha)} = v_{t-1}^{(\alpha)} - (1-\beta_2) \cdot \mathrm{sign}\left( v_{t-1}^{(\alpha)} - \left( D^{\alpha}_{\theta} J(\theta) \right)^{2} \right) \cdot \left( D^{\alpha}_{\theta} J(\theta) \right)^{2}
Bias correction factors adapt accordingly:
\hat{m}_t^{(\alpha)} = \frac{m_t^{(\alpha)}}{1-\beta_1^{t}}, \qquad \hat{v}_t^{(\alpha)} = \frac{v_t^{(\alpha)}}{1-\beta_2^{t}}
The parameter update then becomes
\theta = \theta - \eta \, \frac{\hat{m}_t^{(\alpha)}}{\sqrt{\hat{v}_t^{(\alpha)}} + \epsilon}
The fractional moment updates can be approximated by
m_t^{(\alpha)} = \beta_1 m_{t-1}^{(\alpha)} + (1-\beta_1)\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k}
v_t^{(\alpha)} = v_{t-1}^{(\alpha)} - (1-\beta_2) \cdot \mathrm{sign}\left( v_{t-1}^{(\alpha)} - \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k} \right)^{2} \right) \cdot \left( \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k} \right)^{2}

3.13.2. Application in LSTM Training

When applied to LSTM networks, the parameter vector θ contains weights and biases controlling gating mechanisms and cell states. The fractional Yogi optimizer incorporates fractional-order memory into adaptive moment estimation, enabling the model to better capture long-range temporal dependencies, stabilize optimization, and improve convergence behavior.
The update rule remains
\theta = \theta - \eta \, \frac{\beta_1 \hat{m}_t^{(\alpha)} + (1-\beta_1)\, \frac{1}{h^{\alpha}} \sum_{k=0}^{M} \omega_k^{(\alpha)}\, \nabla_{\theta} J(\theta)_{t-k}}{\sqrt{\hat{v}_t^{(\alpha)}} + \epsilon}
This fractional enhancement helps LSTM models exploit memory effects effectively, boosting performance in complex sequential learning tasks.
In conclusion, incorporating fractional-order derivatives within optimization routines presents a promising approach to enhancing the learning capabilities of LSTM models for modeling financial time-series data. By embedding long-memory effects into the optimization process, these methods can more effectively capture long-term dependencies in the data, resulting in improved forecasting accuracy and overall predictive performance.

3.14. Why Caputo over Other Definitions?

Unlike the Riemann–Liouville derivative, which requires fractional-order initial conditions and is not zero on constants, the Caputo derivative maintains
{}^{C}D_{t}^{\alpha}\, \mathrm{constant} = 0,
and only requires standard (integer-order) initial values, which are naturally compatible with recurrent architectures like LSTMs. This property makes Caputo derivatives more suitable for learning systems where initial states are learned or well-defined [42]. Moreover, in our context, where training efficiency is essential, the Caputo formulation supports the short-memory principle—a practical approximation where the integration window is truncated to a finite horizon without significant loss of accuracy [30]. In addition, using a small value of α in Frac-LSTM replicates this inductive bias, yielding smoother predictions and enhanced stability, especially in trend-following regimes. In fact, the Caputo derivative's behavior resembles the role of the Hurst exponent H in fractional Brownian motion [31]; see Equation (8). These properties collectively make the Caputo derivative the most suitable fractional formulation for integrating memory-based inductive biases in neural forecasting models for financial time series.

3.15. Computational Scalability and Runtime Analysis

Fractional-order optimization methods, such as Frac-RMSprop and Frac-Adam, introduce fractional derivatives into gradient-based training to incorporate long-term memory effects. While this provides a smoother and more stable convergence in noisy financial time series, it raises concerns regarding computational scalability, particularly for latency-sensitive applications like high-frequency trading (HFT).

3.15.1. Approximation of Fractional Derivatives

In our approach, we use the Grünwald–Letnikov (GL) approximation of the fractional derivative of order α ∈ (0, 1]; for a function f(t), it is given by Equation (5). In practice, M is chosen such that M ≪ T (the length of the entire series), keeping the computation tractable.

3.15.2. Time Complexity Comparison

The inclusion of the truncated sum in Equation (5) implies a per-parameter update time complexity of O(M) instead of O(1) as in classical optimizers. However, the following comparative benchmark illustrates that, for moderate values of M ∈ {30, 60, 90}, the observed time overhead remains modest; see Table 1.
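A rough way to measure this overhead on one's own hardware is sketched below (illustrative only; it is not the benchmark used to produce Table 1, and the parameter count and step count are arbitrary):

```python
import time
import numpy as np

def avg_update_time(M, n_params=100_000, steps=200, alpha=0.5):
    """Rough wall-clock cost of the truncated fractional sum per optimizer step."""
    w = np.empty(M + 1)
    w[0] = 1.0
    for k in range(1, M + 1):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    grads = np.random.randn(M + 1, n_params)       # rolling buffer of past gradients
    start = time.perf_counter()
    for _ in range(steps):
        frac_grad = w @ grads                      # O(M * n_params) work per step
        grads = np.roll(grads, 1, axis=0)
        grads[0] = np.random.randn(n_params)
    return (time.perf_counter() - start) / steps

for M in (0, 30, 60, 90):                          # M = 0 mimics a classical O(1)-memory update
    print(f"M = {M:>2}: {avg_update_time(M) * 1e3:.2f} ms per step")
```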

3.15.3. Implications for Trading Frequency

For medium-frequency trading (daily to hourly horizons), the computational cost is acceptable and easily offset by gains in prediction stability and return performance. However, for ultra-high-frequency trading (UHFT), where millisecond latency is critical, even minor computational overheads may render fractional methods impractical without further optimization.
To address this, we propose two future directions:
  • Development of GPU-accelerated implementations for fast convolutional approximations of fractional derivatives.
  • Exploration of approximate memory kernels or fixed-window compression to reduce the sum in Equation (5) to a constant-time operation.
In conclusion, fractional optimizers with short-memory truncation introduce manageable computational costs in medium-frequency financial modeling. Their enhanced memory representation and robustness to noise justify their use in such settings. For HFT applications, ongoing research should focus on developing faster approximations or hybrid fractional–classical schemes.

4. Experimental Setup

This paper explores the benefits of developing fractional-order extensions of standard optimization techniques to more effectively capture long-term dependencies during training. The proposed fractional optimizers are integrated into LSTM networks and evaluated using stock market datasets within the context of financial time-series forecasting. To assess the performance of conventional versus fractional optimizers, we examine traditional SGD-based algorithms alongside their fractional counterparts. Through a series of experiments, we highlight the primary advantages of fractional optimization in enhancing the efficiency and stability of time-series predictions. This benchmarking study focuses in particular on stock price forecasting—where long-term dependencies play a critical role—and investigates the extent to which fractional-order LSTM models outperform their classical equivalents.
The implementation was performed using TensorFlow (via Keras) in Google Colab, which provides access to Tesla T4 GPUs. The key configuration details are as follows:
  • Framework: TensorFlow 2.x (with Keras API);
  • Hardware: Google Colab Tesla T4 GPU;
  • Batch Size: 64;
  • Sequence Length (Memory Window): Varied, tested with 30, 60, and 90;
  • Epochs: 100 (with early stopping in some runs to prevent overfitting).
The environment also included custom implementations of fractional optimizers (e.g., Frac-Adagrad) based on fractional derivative approximations integrated into the optimizer update rules.

4.1. Data Description

The benchmark dataset employed in this study consists of daily stock price data for nine leading companies, sourced from Yahoo Finance: Apple Inc. (AAPL), Microsoft (MSFT), Google (GOOGL), Amazon (AMZN), Meta Platforms (META), NVIDIA (NVDA), JPMorgan Chase (JPM), Visa (V), and UnitedHealth Group (UNH). The data span from 1 January 2015 to 1 January 2023. The closing prices are selected as the primary feature for forecasting purposes. The raw input data are normalized using the MinMaxScaler technique, which scales the features to a range between 0 and 1. Sequences of 60 consecutive closing prices are used to train the models. The dataset is split into 80% for training and 20% for testing. This type of dataset is widely used in stock price forecasting tasks, as demonstrated in [43], where the objective is to predict future prices based on historical trends.
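A minimal sketch of this preprocessing pipeline is shown below, using the publicly available yfinance package and scikit-learn's MinMaxScaler; the exact code is an assumption that mirrors the description above rather than the authors' script.

```python
import numpy as np
import yfinance as yf
from sklearn.preprocessing import MinMaxScaler

SEQ_LEN = 60                                       # 60 consecutive closing prices per sample

def make_dataset(ticker="AAPL", start="2015-01-01", end="2023-01-01"):
    prices = yf.download(ticker, start=start, end=end)["Close"].values.reshape(-1, 1)
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(prices)
    X, y = [], []
    for i in range(SEQ_LEN, len(scaled)):
        X.append(scaled[i - SEQ_LEN:i, 0])         # past 60 days of scaled closes
        y.append(scaled[i, 0])                     # next-day scaled close
    X, y = np.array(X)[..., None], np.array(y)     # X shape: (samples, 60, 1)
    split = int(0.8 * len(X))                      # 80/20 chronological split
    return (X[:split], y[:split]), (X[split:], y[split:]), scaler
```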

4.2. Algorithms

The fundamental model is a conventional LSTM neural network, designed to handle time-dependent sequential data. The architecture includes a 50-unit LSTM layer followed by a dense output layer for stock price forecasting. The model uses a 60-day sequence of closing prices to predict the next day’s closing price [44].
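In Keras terms, the baseline architecture described above can be sketched as follows; the layer sizes follow the text, while the optimizer shown is the standard Adam baseline (the fractional variants of Section 3 would replace its update rule in the custom implementations mentioned earlier).

```python
import tensorflow as tf

def build_baseline_lstm(seq_len=60, n_features=1, units=50):
    """Baseline model: one 50-unit LSTM layer followed by a dense output for the next-day close."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len, n_features)),
        tf.keras.layers.LSTM(units),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
    return model

# model = build_baseline_lstm()
# model.fit(X_train, y_train, epochs=100, batch_size=64, validation_split=0.1)
```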
The fractional LSTM model extends the conventional LSTM by incorporating fractional memory effects. In this model, the LSTM output is combined with the weighted mean of the preceding input sequence, influenced by a fractional parameter α . This modification aims to more accurately capture long-term dependencies, potentially improving forecasting performance. The value of α varies from 0.1 to 0.9 to assess the effects of different memory levels on model accuracy. Fractional LSTMs have also been explored in other time series prediction domains [45].
Table 2 lists the key hyperparameters used by the optimizers in this experiment. These hyperparameters are critical for model convergence and performance. The learning rate is the most important parameter, while the various optimizers include additional tuning factors such as momentum or decay.
Adam is an adaptive optimizer that combines the advantages of two extensions of gradient descent, AdaGrad and RMSProp, by utilizing moment estimates to adjust learning rates for individual parameters. Adam is commonly used for time series tasks because it is both efficient and robust [27].
RMSprop divides the learning rate by a moving average of the square root of recent gradients. This technique is particularly useful for non-stationary problems like stock price forecasting, where the data distribution may change over time [46].
SGD (Stochastic Gradient Descent) is a relatively simple optimizer that updates parameters based on a single mini-batch. Although SGD tends to be less effective for complex tasks such as time series prediction, it remains popular due to its simplicity and effectiveness when properly tuned.
Adagrad adapts the learning rate according to the frequency of parameter updates. It is well suited for sparse data scenarios where some features require more frequent updates than others.
Adadelta improves on Adagrad by mitigating its rapid learning rate decay using a running average of squared gradients to dynamically adjust the learning rate.
Nadam combines Adam with Nesterov’s accelerated gradient (NAG), providing the benefits of momentum alongside adaptive learning rates. This combination often results in faster convergence in practical applications.
FTRL (Follow-The-Regularized-Leader) is an optimizer designed for large-scale machine learning, commonly applied to sparse datasets such as those encountered in recommender systems or logistic regression tasks.
Adamax is a variant of Adam that uses the infinity norm, offering greater robustness in certain situations.
In addition, we have conducted a multi-dimensional sensitivity analysis varying key LSTM architectural parameters—hidden units ([20, 50, 100]) and dropout rates ([0.0, 0.2, 0.5])—across different memory window sizes ([30, 90]). The number of layers was fixed to 2 to limit complexity, but model performance was systematically evaluated under all other design variations; see Table 3. The results demonstrate that architectural choices significantly affect the Sharpe ratio and directional accuracy, confirming that both model design and optimizer behavior influence predictive performance.
The experimental results highlight that the best configuration in terms of Sharpe ratio is obtained with α = 0.4 , a window size of 90, 20 hidden units, and a dropout rate of 0.5 , yielding a Sharpe ratio of 0.146293 . However, the highest cumulative return of 16.004845 is achieved with the configuration α = 0.4 , window size 90, 50 hidden units, and dropout rate of 0.2 , along with a solid Sharpe ratio of 0.119598 and accuracy of 0.503759 . Although the maximum accuracy observed is 0.514286 for α = 0.3 , window size 90, 20 hidden units, and dropout 0.5 , this setting yields poor financial performance, with a cumulative return of only 0.858746 and a modest Sharpe ratio of 0.086247 .
Overall, the results suggest that a window size of 90 consistently leads to better performance across all metrics compared to 30. Additionally, a higher value of α ( 0.4 ) outperforms 0.3 in most top-performing configurations. Dropout values of 0.2 or 0.5 appear to enhance generalization and improve the Sharpe ratio and cumulative returns, particularly when combined with a larger window size. Lastly, the number of hidden units influences performance, with 50 or 100 units generally providing stronger returns, although the best Sharpe ratio is observed with only 20 units. These observations indicate that the configuration of α = 0.4 , window size 90, hidden units 50, and dropout 0.2 represents the most balanced setup for both profitability and risk-adjusted return.
This experimental framework provides a systematic approach to evaluating LSTM and fractional LSTM models for stock price forecasting. By exploring different optimizers and hyperparameter configurations, we aim to identify the optimal setup for accurate and robust predictions. The experimental results are expected to deepen our understanding of the effects of fractional memory and optimizer choices on forecasting accuracy.
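For completeness, the following sketch shows how the four evaluation metrics used throughout this section could be computed from aligned arrays of actual and predicted prices; the simple long/flat trading rule and the daily annualization factor of sqrt(252) are assumptions made for illustration, not necessarily the exact conventions used in our experiments.

```python
import numpy as np

def evaluate_forecast(prices_true, prices_pred):
    """Illustrative computation of MSE, directional accuracy, Sharpe ratio,
    and cumulative return from aligned arrays of actual and predicted prices."""
    mse = np.mean((prices_true - prices_pred) ** 2)
    ret_true = np.diff(prices_true) / prices_true[:-1]        # realized returns
    dir_true = np.sign(np.diff(prices_true))
    dir_pred = np.sign(prices_pred[1:] - prices_true[:-1])    # predicted move vs. last price
    directional_accuracy = np.mean(dir_true == dir_pred)
    # Simple strategy: hold the asset only when an upward move is predicted.
    strat_ret = np.where(dir_pred > 0, ret_true, 0.0)
    sharpe = np.sqrt(252) * strat_ret.mean() / (strat_ret.std() + 1e-12)
    cumulative_return = 100 * (np.prod(1 + strat_ret) - 1)    # in percent
    return {"MSE": mse, "Sharpe": sharpe,
            "DirectionalAccuracy": directional_accuracy,
            "CumulativeReturn": cumulative_return}
```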

4.3. Results and Analysis

This section offers a comprehensive analysis of the results of the standard optimization algorithms—SGD, Adam, RMSprop and Adagrad—and their fractional extensions, namely Frac-Adam, Frac-RMSprop and Frac-Adagrad, emphasizing their contrasting convergence behavior.

4.3.1. Analysis of Standard vs. Fractional Adam Optimization Results

Figure 4 shows the forecasts provided by the classic and fractional Adam methods.
Comparing conventional Adam with fractional Adam (Frac-Adam) at various fractional orders α for stock price forecasting yields useful insights into the effectiveness of fractional derivatives in the optimization procedure. Both the positive and negative aspects of these results are discussed below.
The overall tendency indicates that the behavior of the fractional Adam optimizer differs remarkably with the choice of α and the stock ticker, highlighting that fractional derivatives may not be universally suitable for all stock price forecasting tasks. Across the analyzed values, the standard Adam optimizer typically outperformed or closely matched fractional Adam at smaller values of α, with some variation between values. For GOOGL (Google)—see Figure 5—the conventional Adam optimizer scored a reasonable MSE of 62.64, and fractional Adam with α = 0.1 improved this figure to 32.40, whereas α = 0.3 yielded a more modest improvement with an MSE of 52.93. However, as α increases (e.g., to α = 0.5), the MSE deteriorates, although it improves marginally again at α = 0.9.
For AMZN (Amazon)—see Figure 6—classical Adam reached an MSE of 74.65, but fractional Adam with α = 0.1 increased the MSE to 93.66, and further increases in α resulted in significantly greater MSE values, peaking at 349.39 for α = 0.9.
META—see Figure 7—demonstrated considerable performance deterioration with fractional Adam, where the conventional Adam optimizer yielded an MSE of 148.65, while fractional optimization generated significant MSE increases, especially for larger values of α (e.g., α = 0.9 , MSE = 11,691.78).
Finally, for NVDA (NVIDIA)—see Figure 8—the traditional Adam optimizer achieved the lowest MSE of 3.15, denoting excellent performance; the fractional Adam with smaller values of α led to only marginal increases in MSE, whereas larger values of α disrupted the optimization process and produced a much larger increase in MSE (e.g., α = 0.5, MSE = 55.91).
Smaller values of α (0.1 and 0.3) typically result in minor enhancements or slight deteriorations, while higher values α > 0.3 more frequently worsen performance, notably for stocks like META, AMZN, and NVDA, suggesting that fractional derivatives can disrupt the optimization workflow. The efficiency of fractional optimization differs depending on the stock: it performs remarkably well for AAPL and GOOGL at specific α values but significantly deteriorates performance for certain other stocks, such as META and UNH. In most cases, conventional Adam performs better than or similarly to fractional Adam, most notably for stocks like META, NVDA, and JPM, where larger values of α result in noticeable increases in MSE.
The results in Table 4 demonstrate the significant impact of the fractional order α on the forecasting and trading performance of the Frac-Adam-LSTM model for AAPL stock. Compared to the classical Adam-LSTM ( α = 0.0 ), which achieves a Sharpe ratio of 0.78 and a cumulative return of 89.04, fractional models with intermediate α values generally yield enhanced risk-adjusted returns and directional accuracy. In particular, α = 0.7 results in the highest Sharpe ratio (1.305), the greatest cumulative return (148.66), and the best directional accuracy (51.73%), indicating improved profitability and predictive performance. However, very high fractional orders, such as α = 0.9 , lead to a sharp decline in performance, underscoring the sensitivity of the model to excessive long-memory effects in the volatile AAPL market. Overall, the inclusion of fractional derivatives provides a clear advantage, suggesting that well-calibrated fractional orders can more effectively capture complex market dynamics and support more informed trading decisions.
Figure 9 presents the sensitivity analysis of the Frac-Adam-LSTM model on AAPL. The results show that the model’s performance is affected by both the learning rate (η) and the fractional order (α). For α = 0.3, the Sharpe ratio peaks at η = 10^{-4}, accompanied by a relatively high cumulative return of 66.59, while accuracy remains stable. In contrast, for α = 0.5 and α = 0.7, cumulative returns are more consistent across higher learning rates (η = 10^{-3} and 10^{-2}), with α = 0.7 achieving the highest cumulative return of 71.12 at η = 10^{-2}. Notably, α = 0.9 yields robust cumulative returns across all learning rates, indicating the fractional optimizer’s resilience; however, its peak Sharpe ratio of 0.2137 at η = 10^{-5} remains slightly lower than those at mid-range η values. Interestingly, lower fractional orders such as α = 0.5 display instability at small learning rates, evidenced by negative Sharpe ratios and returns. This suggests that combinations of low α and low η may lead to underperformance. Overall, higher fractional orders demonstrate greater tolerance and stability across a wider range of learning rates, reinforcing the idea that the fractional component serves as an implicit regularizer and enhances optimizer robustness.
Figure 10 gives the sensitivity analysis of Frac-Adam-LSTM on GOOGL with respect to the memory window size. The analysis shows that the best Sharpe ratio and cumulative return were achieved with α = 0.7 and a memory window of 60 days, yielding a Sharpe ratio of 0.1134 and a cumulative return of 32.29. Overall, models with a memory window of 60 consistently outperform the others across all alpha values, indicating that a medium-term memory window provides a better trade-off between risk and return. Higher values of α (0.7 and 0.9) slightly improve accuracy, with the highest directional accuracy of 0.5103 observed for α = 0.9. This suggests that both the fractional order and the memory window size significantly influence the performance of the Frac-Adam-LSTM model.

4.3.2. Analysis of Standard vs. Fractional RMSprop Optimization Results

Figure 11 gives the MSE associated with RMSprop and Frac-RMSprop used to train the LSTM on AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH stocks.
Benchmarking the standard RMSprop optimizer and its fractional order variants on various stocks (AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH) reveals considerable differences in performance depending on the chosen fractional order α . In numerous cases, particularly for AAPL, GOOGL, JPM (see Figure 12), and UNH, fractional RMSprop with smaller α values (e.g., α = 0.1 or α = 0.3 ) outperforms the standard optimizer in terms of loss minimization. For example, AAPL’s loss decreases from 52.54 (standard) to 44.48 when using α = 0.1 , while GOOGL shows a similar trend with a decline from 62.64 to 32.40. These improvements suggest that lower fractional orders may provide beneficial memory effects and smoother convergence on specific datasets, particularly those with low volatility or complex data patterns.
Nonetheless, when α exceeds 0.5, the loss tends to worsen significantly for many stocks, reflecting volatile or even divergent training dynamics. This trend is particularly evident for META and UNH, where losses reach 11,691.78 and 2798.63, respectively, at α = 0.9, compared to their standard values of 148.65 and 1281.37. Such behavior suggests that high-order memory effects may excessively weight past gradients, causing delayed responses to new data or even leading to optimization divergence. Stocks with more complex or volatile time series, like META, seem especially vulnerable to this phenomenon, underscoring the importance of carefully tuning α to the characteristics of the data.
Conversely, for stocks such as MSFT and AMZN (see Figure 11), the classical RMSprop often outperforms the fractional versions at lower α, with the best results occurring at intermediate values like 0.3 or 0.5.
Table 5 presents a performance comparison between the standard LSTM and FracLSTM models for AMZN stock price forecasting, highlighting the nuanced benefits of fractional memory integration. While the standard LSTM yields an RMSE of 7.8237, the FracLSTM with α = 0.0 (corresponding to classical LSTM behavior) achieves the lowest RMSE of 5.4988, indicating enhanced short-term prediction accuracy. As the fractional order α increases, RMSE generally deteriorates, peaking at α = 0.4 , which may suggest diminishing returns of long-memory effects in the highly stochastic context of AMZN price movements. Despite fluctuations in RMSE, the Sharpe ratio remains constant at 0.1949 across all models, indicating comparable risk-adjusted performance. Notably, the highest directional accuracy (51.30%) is observed at α = 0.2 , suggesting that a modest degree of fractional memory can improve the model’s ability to capture directional trends—an asset in trading strategies. These findings support the notion that careful tuning of the fractional order in FracLSTM can lead to improved predictive performance, especially for volatile assets such as AMZN.
Figure 13 gives the sensitivity analysis of Frac-RMSprop-LSTM on AAPL with respect to the learning rate. The analysis reveals that learning rates around 10^{-3} combined with higher alpha values (α = 0.7 or α = 0.9) tend to yield the best trade-offs between the Sharpe ratio and cumulative return. Notably, the configuration (α = 0.7, LR = 0.001) achieves a high Sharpe ratio of 0.206 and the highest cumulative return of 73.19. In contrast, extremely low or high learning rates result in suboptimal performance regardless of alpha, suggesting a sweet spot in learning dynamics governed by fractional control. Accuracy remains relatively stable across configurations, centered around 51%, indicating modest directional predictability.
Figure 14 gives the sensitivity analysis of Frac-RMSprop-LSTM on GOOGL with respect to the memory window size. The sensitivity analysis of the Frac-RMSprop-LSTM model applied to GOOGL stock reveals that performance varies with the memory window size and the fractional order α. The highest Sharpe ratio of 0.1308 is achieved at α = 0.3 with a memory window of 90, while the best cumulative return of 31.43 corresponds to α = 0.3 and window size 60. In general, intermediate memory windows (60 or 90) yield better Sharpe ratios and returns across all α values. Higher α values (e.g., 0.9) slightly improve accuracy but do not consistently enhance financial performance. This indicates a trade-off between memory depth and learning rate scaling, with moderate values offering the best predictive and financial outcomes.

4.3.3. Analysis of Standard vs. Fractional Adagrad Optimization Results

Figure 15 illustrates the mean squared error (MSE) associated with the standard Adagrad and Frac-Adagrad optimizers applied to LSTM models for forecasting stock prices of AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH. Overall, the Frac-Adagrad-LSTM with a fractional order α = 0.1 outperforms the standard Adagrad-LSTM in terms of prediction accuracy, exhibiting notably lower errors for stocks such as AAPL, GOOGL, JPM, and UNH. However, as the fractional order α increases, performance deteriorates markedly. In particular, high values of α (e.g., α = 0.5 and α = 0.9 ) result in significantly elevated MSEs for several stocks, including META and UNH, indicating instability and poor generalization. These results highlight the data-dependent nature of fractional-order optimization: while a small fractional component can enhance learning by introducing mild memory effects, larger values may hinder convergence and negatively impact predictive accuracy.
The benchmarking investigation involving the traditional Adagrad optimizer and its fractional counterpart (Frac-Adagrad) at various values of the fractional order α ∈ {0.1, 0.3, 0.5, 0.7, 0.9} demonstrates a complex, asset-dependent response in terms of loss performance. In particular, for certain stocks such as AAPL (see Figure 16), GOOGL (see Figure 17), JPM, and UNH (see Figure 18), the fractional versions with smaller α values (e.g., α = 0.1) consistently outperform conventional Adagrad, indicating that the memory-retaining features of Frac-Adagrad are advantageous in such cases—possibly due to smoother loss surfaces or slower dynamic behaviors.
In contrast, intermediate α values such as 0.5 or 0.7 tend to yield substantially higher loss values, suggesting unstable or even divergent optimization behavior in many cases—especially for UNH and META, where losses spike to abnormally high levels.
Conversely, assets like MSFT, META, and AMZN display a reversal of this tendency, with classical Adagrad outperforming or being comparable to Frac-Adagrad for smaller α. As α grows, the loss values rise faster, especially for META and UNH, indicating a significant deterioration in optimization efficiency. This observation suggests a potential downside of using strong memory effects on volatile or highly non-linear loss surfaces, which can magnify gradients and result in overshooting or mediocre convergence. NVDA displays outlier behavior with extremely low losses across all methods, while moderate rises at larger α still suggest that fractional dynamics require careful tuning.
In short, Frac-Adagrad provides encouraging improvements over Adagrad for a number of stocks when the fractional order is small, benefiting from its capacity to preserve long-term gradient memory without introducing excessive volatility. Nevertheless, it also induces instability at larger α values, particularly for complicated or volatile datasets. This underlines the relevance of adaptive or data-driven selection of the fractional order α, which may vary by asset or training phase, to secure convergence and consistent performance.
We also benchmarked the performance of the following optimization methods: Frac-SGD, Frac-Adadelta, Frac-Nadam, Frac-FTRL, and Frac-Adamax. A comparable in-depth analysis reveals a similar impact of the choice of α. However, to avoid crowding the paper with redundant analyses, we only supply the corresponding figures (in Appendix A) to provide a quick visual impression of these effects.
The results in Table 6 clearly demonstrate the superiority of the Frac-Adagrad-LSTM models over the standard Adagrad-LSTM baseline. Notably, the configuration with α = 0.7 achieves the highest Sharpe ratio (0.6286) and final cumulative return (63.65), indicating that this fractional model offers the best risk-adjusted performance and profitability. Meanwhile, the highest directional accuracy (0.5289) is observed at α = 0.9, suggesting enhanced prediction of price movement direction with higher fractional influence. Overall, fractional memory improves the model’s effectiveness across all key financial metrics, making Frac-Adagrad a compelling alternative to traditional optimizers.
Figure 19 presents the sensitivity analysis of the Frac-Adagrad-LSTM model applied to AAPL stock. The results indicate that higher learning rates (e.g., η = 0.01) consistently lead to improved Sharpe ratios and cumulative returns across all values of the fractional order α. The best performance is achieved with α = 0.9 and η = 0.01, yielding a Sharpe ratio of 0.221 and a cumulative return of approximately 52.49. In contrast, very small learning rates (η = 10^{-5}) produce poor performance regardless of the chosen α, suggesting insufficient update magnitudes for effective learning. Notably, directional accuracy remains relatively stable across all configurations (approximately 48–52%), indicating that predictive directionality is less sensitive to variations in these hyperparameters than risk-adjusted and return-based metrics.
Figure 20 gives the sensitivity analysis of Frac-Adagrad-LSTM on GOOGL with respect to the memory window size. The analysis of the Frac-Adagrad-LSTM model on GOOGL stock data reveals that the optimal memory window size generally lies around 60 or 90 days. For all tested α values, a memory window of 60 consistently delivers strong performance across the Sharpe ratio, directional accuracy, and cumulative return. For instance, with α = 0.7, M = 60 achieves a Sharpe ratio of 0.146 and a cumulative return of 11.30, outperforming the other memory settings. Performance tends to decrease with either very short (30) or very long (120) memory windows, highlighting the importance of selecting an appropriate temporal context for modeling market dynamics.

4.4. Comparison to the ARIMA and GARCH Models

In this subsection, we compare the performance of the Frac-ADAM, Frac-Adagrad, and Frac-RMSprop optimizers against the ARIMA and GARCH models; see Table 7.
The performance comparison reveals that among all models and fractional orders, the Frac-Adagrad optimizer with α = 0.36 achieves the highest Sharpe ratio (0.7404) and cumulative return (74.95), indicating superior risk-adjusted profitability. Frac-Adagrad consistently outperforms both Frac-Adam and Frac-RMSprop across most values of α , exhibiting better directional accuracy and more stable returns. In contrast, Frac-RMSprop and Frac-Adam generally yield negative Sharpe ratios and cumulative returns, with only marginal improvements at certain fractional orders, but their overall performance remains inferior. Classical statistical models such as ARIMA and GARCH perform poorly in this context: ARIMA provides low directional accuracy and negligible returns, while GARCH exhibits unrealistic Sharpe ratios and fails to predict directional movements altogether. These findings highlight the effectiveness of fractional adaptive optimizers—particularly Frac-Adagrad—in capturing the complex dynamics of financial time series and enhancing predictive and trading performance.
The RMSE histograms for the Frac-SGD, Frac-Adadelta, Frac-Nadam, Frac-FTRL, and Frac-Adamax methods—evaluated across stock prices of various companies—are presented in Appendix A.
Figure A1 gives the MSE associated with standard SGD and Frac-SGD used to train the LSTM on AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH stocks. Fractional SGD-LSTM with α = 0.1 generally improves prediction accuracy compared to standard SGD-LSTM for some stocks (e.g., GOOGL, JPM, UNH), showing lower error values. However, as  α increases beyond 0.3, performance often deteriorates drastically, leading to high prediction errors (e.g., META and UNH at α = 0.9 ), indicating instability and poor convergence. Thus, fractional SGD can be beneficial at small α , but higher values are detrimental to model performance.
Figure A2 gives the MSE associated with standard Adadelta and Frac-Adadelta used to train the LSTM on AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH stocks. The comparison between the standard Adadelta-LSTM and its fractional variants across different stock price predictions reveals a mixed performance. For certain stocks such as AAPL, GOOGL, JPM, and UNH, the fractional Adadelta-LSTM with lower fractional orders (particularly α = 0.1) significantly outperforms the standard approach in terms of reduced MSE. For instance, the MSE drops from 52.54 to 44.48 for AAPL and from 62.64 to 32.40 for GOOGL. However, higher values of α tend to degrade performance, often substantially, as observed in META and UNH, where the MSE reaches over 11,000 and 7800, respectively. These results suggest that fractional optimization with carefully chosen α values can enhance prediction accuracy, but inappropriate settings may lead to instability or overfitting.
Figure A3 gives the MSE associated with standard Nadam and Frac-Nadam used to train the LSTM on AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH stocks. The performance comparison between the standard Nadam-LSTM and its fractional counterparts across various stock price prediction tasks demonstrates that the fractional approach can offer improved accuracy, but only under specific fractional orders. For example, with α = 0.1, the Frac-Nadam-LSTM significantly reduces the MSE for stocks like AAPL (from 52.54 to 44.48), GOOGL (from 62.64 to 32.40), and JPM (from 48.86 to 25.89), indicating enhanced predictive capability. However, as α increases, performance generally degrades, with some extreme values—such as for META and UNH—reaching very high MSEs (e.g., 11,691.78 and 7890.24, respectively, at α = 0.5 and beyond). This highlights the sensitivity of fractional Nadam to the choice of α, where smaller values may lead to improved accuracy, while larger ones risk severe instability or overfitting.
Figure A4 gives the MSE associated with standard FTRL and Frac-FTRL used to train the LSTM on AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH stocks. The evaluation of standard FTRL-LSTM against its fractional variants for stock price prediction reveals that fractional FTRL can enhance or degrade performance depending on the fractional order α . For instance, with  α = 0.1 , improvements are observed in several cases, notably for AAPL, GOOGL, JPM, and UNH, where the mean squared error (MSE) is substantially reduced compared to the standard version. However, as  α increases beyond 0.3, the performance tends to deteriorate significantly, with large error spikes, particularly for META and UNH at α = 0.9 , reaching values as high as 11,691.78 and 2798.63, respectively. This suggests that while fractional FTRL introduces additional flexibility, its effectiveness is highly sensitive to the choice of α , where lower values are generally more stable and beneficial.
Figure A5 gives the MSE associated with standard Adamax and Frac-Adamax used to train LSTM on AAPL, MSFT, GOOGL, AMZN, META, NVDA, JPM, V, and UNH stocks. The comparative results between standard Adamax-LSTM and its fractional variant across various stock price prediction tasks indicate that fractional Adamax can improve prediction performance when the fractional order α is carefully selected. For instance, with  α = 0.1 , a notable reduction in prediction error is observed for AAPL, GOOGL, JPM, and especially UNH, where the error drops from 1281.37 to 304.45. However, higher values of α generally lead to performance degradation, as evidenced by extreme error increases for META ( α = 0.9 : 11,691.78) and UNH ( α = 0.5 : 7890.24). These findings suggest that fractional Adamax introduces significant sensitivity to α , where low-order memory effects (small α ) can be beneficial, while higher orders may introduce instability and reduced accuracy.
In summary, the RMSE analysis of fractional optimizers in LSTM-based stock price prediction reveals that variants with a small memory parameter (α = 0.1) consistently outperform their standard counterparts. Improvements are notable for stocks like AAPL, GOOGL, JPM, and UNH, where RMSEs drop significantly. However, increasing α generally leads to performance degradation, with large errors observed for META and UNH. This trend holds consistently across Frac-SGD, Frac-Adadelta, Frac-Nadam, Frac-FTRL, and Frac-Adamax. The results highlight the potential of fractional optimization when low α values are chosen carefully. Nonetheless, inappropriate fractional orders may introduce instability, overfitting, or convergence failures.

4.5. Synthesis of Optimization Analysis Results

The examination of different optimization techniques, namely Adam, RMSprop, and standard Adagrad, compared to their fractional analogues (Frac-Adam, Frac-RMSprop, and Frac-Adagrad), has provided valuable insights into the effect of fractional derivatives on the training of LSTM models for stock price forecasting.
The comparison between standard Adam and fractional Adam (Frac-Adam) yielded mixed results. For certain stocks such as GOOGL, fractional derivatives with low α values (such as 0.1) typically resulted in improvements in root-mean-square error (RMSE). Nevertheless, as  α increases, Frac-Adam’s performance declines, especially for stocks like META and AMZN. In most cases, the classic Adam optimizer outperformed or performed similarly to Frac-Adam, particularly at higher α values and when the optimization process was unstable.
A comparison of RMSprop with Frac-RMSprop also revealed performance disparities across stocks. Fractional RMSprop outperformed the conventional variant for certain stocks (such as AAPL and GOOGL) at lower α values, indicating that a small fractional order may promote smoother convergence and beneficial memory effects. However, as α rises above 0.5, performance significantly deteriorates for stocks like META and UNH, with notably larger MSE values. This highlights the risk of overemphasizing historical gradients in volatile stocks, leading to instability.
For Adagrad and Frac-Adagrad, smaller fractional orders (α = 0.1) had a positive effect for stocks like AAPL and GOOGL, where a smoother gradient descent benefited the optimization process. However, for stocks like MSFT, META, and AMZN, larger α values resulted in worse performance, suggesting that excessive memory effects interfere with convergence. Therefore, fractional Adagrad should be used carefully and adapted to the characteristics of the asset and dataset.
The choice of optimizer, and particularly the adoption of fractional orders in optimizers such as Adam, RMSprop, and Adagrad, can have a considerable influence on the performance of LSTM models for stock price forecasting. While fractional derivatives can offer advantages in certain cases—particularly for stocks whose price movements are smoother or less volatile—excessively high fractional orders tend to destabilize the optimization process. In general, conventional optimizers such as Adam and RMSprop produce better or comparable results relative to their fractional counterparts when strong memory effects are imposed. Therefore, tuning the fractional order α is crucial for achieving optimal results, and it is recommended to employ fractional versions cautiously, tailoring the optimization approach to the specific characteristics of the financial asset being modeled.

4.6. Limitations and Deployment Considerations

Deploying fractional-order optimizers in live trading environments requires careful consideration beyond algorithmic design. Real-world deployment introduces additional challenges such as low-latency inference, infrastructure scalability, financial data integrity, and regulatory compliance. However, these constraints are predominantly engineering and system-integration issues rather than fundamental limitations of the fractional optimization paradigm itself. To this end, we analyze and propose practical strategies to mitigate these hurdles in the following section:
  • Latency: While fractional derivatives typically require historical memory, our implementation adopts a short-memory truncation approach, which approximates the Grünwald–Letnikov operator with a limited window. Empirical profiling indicates that for window sizes M ∈ [30, 90], the computational overhead is limited to 5–12% compared to standard optimizers. This is acceptable for medium-frequency strategies with minute-to-hour granularity.
  • GPU Acceleration: We propose a GPU-based implementation of fractional derivative convolutions via parallel tensor operations. Specifically, the discretized fractional derivative at time t is computed using
    D^{\alpha} f(t) \approx \sum_{k=0}^{M} w_k^{(\alpha)} f(t-k), \qquad w_k^{(\alpha)} = (-1)^k \binom{\alpha}{k},
    where the weights w_k^{(α)} are precomputed and the convolution is implemented as a single `torch.nn.functional.conv1d()` call on the GPU (a minimal sketch of this convolution is given after this list). This significantly reduces the inference latency and makes the method compatible with online deployment. Table 8 reports performance metrics across various stocks under different hyperparameter settings, including CPU and GPU times; in this context, Speedup measures how much faster the GPU is than the CPU and is defined as
    \text{Speedup} = \frac{\text{CPU Time}}{\text{GPU Time}}.
    Across all tested stocks, CPU execution times are remarkably low, consistently below 0.0011 s. The performance difference between CPU and GPU is minimal, indicating that for lightweight financial computations (e.g., a Sharpe ratio and accuracy over short memory windows), GPU acceleration does not provide significant speedups. In several cases (e.g., rows 2, 10, and 13), the CPU is actually faster than the GPU, suggesting that the overhead associated with GPU data transfer may outweigh any benefits from parallelization. Notably, speedups rarely exceed 1.1×, with the highest GPU speedup observed for META (alpha 0.5, memory window 90) at 1.36×, which still reflects only a modest advantage. In conclusion, while GPU acceleration remains valuable for large-scale or deep-learning-based models, for fast statistical metrics like those presented here, CPU execution is highly efficient and often preferable.
  • Hybrid Optimizer Pipelines: For ultra high-frequency trading (UHFT), we suggest a hybrid optimizer framework where fractional optimizers are used during slower retraining phases, while faster, conventional optimizers (e.g., RMSprop) are deployed during live signal execution. This ensures stability and long-term learning benefits of memory-aware updates without sacrificing real-time responsiveness.
  • Data Integrity and Compliance: Fractional optimizers are orthogonal to the issues of data authenticity and regulation. Nonetheless, our pipeline is fully compatible with institutional-grade backtesting engines and real-time monitoring dashboards. Adopting industry-standard logging and audit protocols ensures compliance and traceability.
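As a concrete illustration of the GPU-accelerated convolution referenced in the GPU Acceleration item above, the following minimal sketch evaluates the short-memory Grünwald–Letnikov derivative of a one-dimensional signal with a single conv1d call; the zero padding of pre-series values and the helper name are illustrative assumptions rather than our production implementation.

```python
import torch
import torch.nn.functional as F

def gl_fractional_derivative(f, alpha, M=30):
    """Short-memory Grünwald–Letnikov derivative of a 1-D signal via conv1d.
    f: tensor of shape (T,); returns a tensor of shape (T,)."""
    # Precompute w_k = (-1)^k * C(alpha, k) for k = 0..M (recurrence form).
    w = [1.0]
    for k in range(1, M + 1):
        w.append(w[-1] * (k - 1 - alpha) / k)
    # Reverse so that, under conv1d's cross-correlation, w_k multiplies f(t - k).
    kernel = torch.tensor(w[::-1], dtype=f.dtype, device=f.device).view(1, 1, -1)
    # Left-pad so every output uses only f(t), f(t-1), ..., f(t-M); pre-series
    # values are treated as zero (an assumption made for this sketch).
    x = F.pad(f.view(1, 1, -1), (M, 0))
    return F.conv1d(x, kernel).view(-1)
```

Calling the function on a CUDA tensor (e.g., `f.to('cuda')`) is enough to run the convolution on the GPU, and the kernel of precomputed weights can be cached and reused across optimization steps.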
Together, these adaptations demonstrate that fractional optimizers are not only theoretically sound but can also be engineered for live deployment in medium- to high-frequency financial systems. We identify further directions such as volatility-aware α scheduling and streaming memory management as future enhancements to expand real-time viability.
Note: While our experiments focused on financial time series, the proposed methodology is designed to be domain-agnostic. This paper provides a general roadmap for transforming standard gradient-based optimizers (e.g., Adam, RMSprop, Adagrad) into their fractional-order counterparts using Caputo-based memory-aware formulations. This transformation and its practical implementation—based on the Grünwald–Letnikov approximation and the short-memory principle—are independent of the specific statistical properties of the time series.
Therefore, fractional optimizers could be readily adapted to other domains such as energy consumption forecasting, physiological signal modeling in healthcare, or climate and weather prediction, especially when memory effects and long-range dependencies are relevant. The potential of these methods to enhance convergence stability and trend sensitivity may also be beneficial in these contexts.

5. Conclusions

This study presents both theoretical advancements and empirical validation for the integration of fractional-order optimization techniques into machine learning systems, particularly for financial time-series forecasting using LSTM-based architectures. By embedding fractional derivatives into established gradient-based optimizers—namely Adam, RMSprop, and Adagrad—we introduce a family of fractional optimizers (Frac-Adam, Frac-RMSprop, Frac-Adagrad) that incorporate long-range memory and temporal dependencies directly into the learning dynamics.
The core contribution lies in leveraging the Caputo derivative formulation, which is more suitable than classical definitions for deep learning contexts. Caputo’s compatibility with standard initial conditions and its adherence to the short-memory principle make it particularly well-suited for recurrent neural networks such as LSTMs. Moreover, its inductive behavior—analogous to the Hurst exponent in fractional Brownian motion—provides a principled approach to introducing temporal smoothness and trend sensitivity into optimization.
From a computational perspective, while the Grünwald–Letnikov approximation introduces an overhead of O ( M ) per update due to memory convolution, our runtime analysis indicates that moderate memory horizons ( M = 30 ,   60 ,   90 ) yield a favorable trade-off between performance gains and computational efficiency. This is particularly relevant in medium-frequency or offline settings, where learning stability and forecast robustness outweigh real-time processing constraints.
Experimental results demonstrate that fractional-order optimization can significantly improve forecasting performance under appropriate configurations. Sensitivity analyses reveal that a memory window of 60 days combined with a fractional order α ∈ {0.5, 0.7} yields the highest Sharpe ratios and cumulative returns, with notable robustness for relatively stable assets such as GOOGL and AAPL. In contrast, highly volatile stocks such as META and UNH exhibit performance degradation at higher fractional orders, underscoring the need for adaptive tuning. Directional accuracy metrics further suggest that higher α values (e.g., α = 0.9) may slightly improve trend detection at the expense of increased prediction variance.
In summary, fractional-order optimizers offer a flexible and biologically inspired framework for embedding memory effects into financial learning systems. However, their efficacy is nuanced and asset-dependent, highlighting the necessity for dynamic adjustment mechanisms. Future work should explore adaptive strategies for online tuning of α and memory windows, as well as extend fractional techniques to other machine learning paradigms—such as attention mechanisms, graph neural networks, and reinforcement learning—for a broader understanding of their advantages and limitations across diverse data environments.

Author Contributions

Conceptualization, K.E.M.; Validation, V.P.; Formal analysis, M.E.-z. and M.R.; Writing—original draft, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will only be shared upon request.

Acknowledgments

This work was supported by the Ministry of National Education, Professional Training, Higher Education and Scientific Research (MENFPESRS), the Digital Development Agency (DDA) of Morocco (Nos. Alkhawarizmi/2020/23), and the National Scientific and Technical Research Centre (CNRST) under the ≪PhD-Associate Scholarship—PASS≫ program.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Histograms of RMSE for the Methods

In this appendix, we present the histograms showing the RMSE results for the following methods: Frac-SGD, Frac-Adadelta, Frac-Nadam, Frac-FTRL, and Frac-Adamax.
Figure A1. RMSE histogram for Frac-SGD.
Figure A2. RMSE histogram for Frac-Adadelta.
Figure A3. RMSE histogram for Frac-Nadam.
Figure A4. RMSE histogram for Frac-FTRL.
Figure A5. RMSE histogram for Frac-Adamax.

References

  1. Fama, E.F. Efficient capital markets: A review of theory and empirical work. J. Financ. 1970, 25, 383–417. [Google Scholar] [CrossRef]
  2. Chen, W.; Hussain, W.; Cauteruccio, F.; Zhang, X. Deep learning for financial time series prediction: A state-of-the-art review of standalone and hybrid models. CMES-Comput. Model. Eng. Sci. 2023, 139, 187–224. [Google Scholar] [CrossRef]
  3. Kara, Y.; Boyacioglu, M.A.; Baykan, Ö.K. Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul Stock Exchange. Expert Syst. Appl. 2011, 38, 5311–5319. [Google Scholar] [CrossRef]
  4. Nelson, D.M.; Pereira, A.C.M.; de Oliveira, R.A. Stock market’s price movement prediction with LSTM neural networks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 1419–1426. [Google Scholar]
  5. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  6. Fischer, T.; Krauss, C. Deep learning with long short-term memory networks for financial market predictions. Eur. J. Oper. Res. 2018, 270, 654–669. [Google Scholar] [CrossRef]
  7. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
  8. Polyak, B. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17. [Google Scholar] [CrossRef]
  9. Nesterov, Y. A method for solving the convex programming problem with convergence rate O(1/k2). Sov. Math. Dokl. 1983, 27, 372–376. [Google Scholar]
  10. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  11. Zeiler, M.D. ADADELTA: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
  12. Tieleman, T.; Hinton, G. Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. Coursera Neural Netw. Mach. Learn. 2012, 4, 26. [Google Scholar]
  13. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  14. Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of Adam and beyond. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  15. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  16. Zhuang, J.; Tang, T.; Ding, Y.; Tatikonda, S.; Dvornek, M. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Adv. Neural Inf. Process. Syst. 2020, 33, 18795–18806. [Google Scholar]
  17. Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the variance of the adaptive learning rate and beyond. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  18. Chen, G.; Liu, Y.; Wang, D.; Huang, H. A novel gradient descent optimizer based on fractional order scheduler and its application in deep neural networks. Appl. Math. Model. 2024, 128, 26–57. [Google Scholar] [CrossRef]
  19. Raubitzek, S.; Mallinger, K.; Neubauer, T. Combining fractional derivatives and machine learning: A review. Entropy 2022, 25, 35. [Google Scholar] [CrossRef] [PubMed]
  20. Saleh, M.R.; Ajarmah, B. Fractional Gradient Descent Learning of Backpropagation Artificial Neural Networks with Conformable Fractional Calculus. Fuzzy Syst. Data Min. VIII 2022, 364, 72–79. [Google Scholar]
  21. Elnady, S.M.; Alqarni, M.; Elsayed, A.A.; Alzahrani, B.A. A comprehensive survey of fractional gradient descent methods and their convergence analysis. Chaos Solitons Fractals 2025, 194, 116154. [Google Scholar] [CrossRef]
  22. Chen, B.P.; Yu, D.J.; Tang, Y.; Wen, F.; Li, K. Fractional-order convolutional neural networks with population extremal optimization. Neurocomputing 2022, 477, 36–45. [Google Scholar] [CrossRef]
  23. Joshi, M.; Bhosale, S.; Vyawahare, V.A. A survey of fractional calculus applications in artificial neural networks. Artif. Intell. Rev. 2023, 56, 13897–13950. [Google Scholar] [CrossRef]
  24. Hallouz, A.; Stamov, G.; Souid, M.S.; Stamova, I. New Results Achieved for Fractional Differential Equations with Riemann–Liouville Derivatives of Nonlinear Variable Order. Axioms 2023, 12, 895. [Google Scholar] [CrossRef]
  25. Dimitrov, Y.; Georgiev, S.; Todorov, V. Approximation of Caputo Fractional Derivative and Numerical Solutions of Fractional Differential Equations. Fractal Fract. 2023, 7, 750. [Google Scholar] [CrossRef]
  26. Atici, F.M.; Chang, S.; Jonnalagadda, J. Grünwald–Letnikov fractional operators: From past to present. Fract. Differ. Calc. 2021, 11, 147–159. [Google Scholar]
  27. Lubich, C. Discretized fractional calculus. SIAM J. Math. Anal. 1986, 17, 704–719. [Google Scholar] [CrossRef]
  28. Diethelm, K.; Ford, N.J.; Freed, A.D. Predictor-corrector approach for the numerical solution of fractional differential equations. Nonlinear Dyn. 2002, 29, 3–22. [Google Scholar] [CrossRef]
  29. Li, X.; Zeng, F.; Liu, F. A data-driven method for approximating fractional operators using deep neural networks. J. Comput. Phys. 2021, 442, 110513. [Google Scholar]
  30. Podlubny, I. Fractional Differential Equations; Academic Press: Cambridge, MA, USA, 1999. [Google Scholar]
  31. Granger, C.W.J. Long memory relationships and the aggregation of dynamic models. J. Econom. 1980, 14, 227–238. [Google Scholar] [CrossRef]
  32. Cont, R. Long range dependence in financial markets. In Fractals in Engineering; Springer: Berlin/Heidelberg, Germany, 2005; pp. 159–179. [Google Scholar]
  33. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  34. Martens, J. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010. [Google Scholar]
  35. Amari, S. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
  36. Chen, Y.; Yang, Y.; Sun, H. Fractional-order Gradient Optimization Methods. Neurocomputing 2020, 402, 70–83. [Google Scholar]
  37. Li, Z.; Lin, Y.; Huang, L. Fractional Order Gradient Descent Algorithm. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 931–938. [Google Scholar]
  38. Zhang, M.; Lucas, J.; Ba, J.; Hinton, G. Lookahead Optimizer: K steps forward, 1 step back. arXiv 2019, arXiv:1907.08610. [Google Scholar]
  39. Zaheer, M.; Reddi, S.J.; Sachan, D.; Kale, S.; Kumar, S. Adaptive Methods for Nonconvex Optimization. Adv. Neural Inf. Process. Syst. 2018, 31, 9815–9825. [Google Scholar]
  40. Zhang, Y.; Xu, J.; Zhang, M. Fractional optimization for machine learning and optimization algorithms. IEEE Trans. Neural Networks Learn. Syst. 2020, 31, 3973–3985. [Google Scholar]
  41. Polak, M. Fractional optimization and its applications in dynamic systems. J. Comput. Appl. Math. 2016, 292, 45–54. [Google Scholar]
  42. Diethelm, K. The Analysis of Fractional Differential Equations; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  43. Doe, J.; Smith, J. Stock price prediction using machine learning. J. Financ. Eng. 2020, 12, 45–67. [Google Scholar]
  44. Brown, M. Time Series Analysis and Forecasting, 2nd ed.; Springer: Cham, Switzerland, 2018. [Google Scholar]
  45. Green, A.; White, B. Fractional LSTM models for stock price prediction. Artif. Intell. Rev. 2021, 18, 23–42. [Google Scholar]
  46. Smith, J.H.; Johnson, R.L. Deep learning models for time series forecasting. IEEE Trans. Neural Netw. 2019, 30, 1254–1263. [Google Scholar]
Figure 1. Methodology for transitioning from the standard optimizer LSTM to the fractional optimizer LSTM.
Figure 2. Diagram of an LSTM cell highlighting its input, forget, and output gates.
Figure 3. LSTM model processing time series data as input.
Figure 4. Mean Squared Error (MSE) comparison between standard Adam-LSTM and Frac-Adam-LSTM.
Figure 5. Mean Squared Error (MSE) of standard Adam-LSTM-Google versus Frac-Adam-LSTM for Google stock prediction.
Figure 6. Mean Squared Error (MSE) of standard Adam-LSTM-AMZN versus Frac-Adam-LSTM for Amazon stock prediction.
Figure 7. Mean Squared Error (MSE) of standard Adam-LSTM-META versus Frac-Adam-LSTM for META stock prediction.
Figure 8. Mean Squared Error (MSE) comparison of predicted NVDA stock prices using standard Adam-LSTM-NVDA versus Frac-Adam-LSTM models.
Figure 9. Sensitivity analysis of Frac-Adam-LSTM on AAPL considering the learning rate.
Figure 10. Sensitivity analysis of Frac-Adam-LSTM on GOOGL considering the memory window size.
Figure 11. Mean Squared Error (MSE) comparison between standard RMSprop-LSTM and fractional RMSprop-LSTM models.
Figure 12. Mean Squared Error (MSE) comparison between standard RMSprop-LSTM and fractional RMSprop-LSTM on JPMorgan Chase (JPM) stock data.
Figure 13. Sensitivity analysis of Frac-RMSprop-LSTM on AAPL considering the learning rate.
Figure 14. Sensitivity analysis of Frac-RMSprop-LSTM on GOOGL considering the memory window size.
Figure 15. Mean Squared Error (MSE) comparison between standard Adagrad-LSTM and fractional Adagrad-LSTM optimizers.
Figure 16. Mean Squared Error (MSE) comparison between standard Adagrad-LSTM and fractional Adagrad-LSTM on AAPL (Apple) stock data.
Figure 17. Mean Squared Error (MSE) comparison between standard Adagrad-LSTM and fractional Adagrad-LSTM on GOOGL (Alphabet Inc.) stock data.
Figure 18. Mean Squared Error (MSE) comparison between standard Adagrad-LSTM and fractional Adagrad-LSTM on UNH (UnitedHealth Group) stock data.
Figure 19. Sensitivity analysis of Frac-Adagrad-LSTM on AAPL considering the learning rate.
Figure 20. Sensitivity analysis of Frac-Adagrad-LSTM on GOOGL considering the memory window size.
Table 1. Comparison of normalized runtimes per epoch for different optimizers.
Optimizer | Window Size M | Relative Runtime (Normalized)
Adam | — | 1.00
RMSprop | — | 1.01
Frac-RMSprop | 30 | 1.05
Frac-RMSprop | 60 | 1.09
Frac-RMSprop | 90 | 1.12
Table 2. The optimizer’s hyperparameters.
Optimizer | Learning Rate | Beta 1 | Beta 2
Adam | 0.001 | 0.9 | 0.999
RMSprop | 0.001 | 0.9 | —
SGD | 0.01 | — | —
Adagrad | 0.01 | — | —
Adadelta | 1.0 | 0.95 | 0.999
Nadam | 0.001 | 0.9 | 0.999
Ftrl | 0.1 | 0.15 | 0.85
Adamax | 0.002 | 0.9 | 0.999
Table 3. Model performance metrics for the average different configurations on all stocks and all optimizers without the layer and stock columns.
Alpha | Window Size | Hidden Units | Dropout | Aver. Sharpe | Aver. Accuracy | Aver. Cumulative Return
0.3 | 30 | 20 | 0.0 | 0.023926 | 0.492414 | 0.735054
0.3 | 30 | 50 | 0.0 | 0.062335 | 0.503448 | 6.237091
0.3 | 30 | 100 | 0.0 | 0.070080 | 0.502069 | 4.549126
0.3 | 30 | 20 | 0.2 | 0.066320 | 0.507586 | 2.938690
0.3 | 30 | 50 | 0.2 | 0.069087 | 0.503448 | 4.410477
0.3 | 30 | 100 | 0.2 | 0.064119 | 0.500690 | 6.233910
0.3 | 30 | 20 | 0.5 | 0.071183 | 0.488276 | 3.204750
0.3 | 30 | 50 | 0.5 | 0.048990 | 0.508966 | 2.758423
0.3 | 30 | 100 | 0.5 | 0.061129 | 0.496552 | 6.540634
0.3 | 90 | 20 | 0.0 | 0.123771 | 0.499248 | 3.230061
0.3 | 90 | 50 | 0.0 | 0.093866 | 0.505263 | 6.305473
0.3 | 90 | 100 | 0.0 | 0.113093 | 0.485714 | 11.822601
0.3 | 90 | 20 | 0.2 | 0.095908 | 0.502256 | 6.796715
0.3 | 90 | 50 | 0.2 | 0.097023 | 0.508271 | 6.988266
0.3 | 90 | 100 | 0.2 | 0.134719 | 0.491729 | 7.455177
0.3 | 90 | 20 | 0.5 | 0.086247 | 0.514286 | 0.858746
0.3 | 90 | 50 | 0.5 | 0.138106 | 0.488722 | 11.097931
0.3 | 90 | 100 | 0.5 | 0.106046 | 0.493233 | 4.975113
0.4 | 30 | 20 | 0.0 | 0.064027 | 0.492414 | 3.721291
0.4 | 30 | 50 | 0.0 | 0.062604 | 0.496552 | 5.248947
0.4 | 30 | 100 | 0.0 | 0.058876 | 0.506207 | 5.469574
0.4 | 30 | 20 | 0.2 | 0.005598 | 0.504828 | 0.199921
0.4 | 30 | 50 | 0.2 | 0.068436 | 0.496552 | 7.558159
0.4 | 30 | 100 | 0.2 | 0.067228 | 0.493793 | 6.473572
0.4 | 30 | 20 | 0.5 | 0.066051 | 0.500690 | 4.153961
0.4 | 30 | 50 | 0.5 | 0.070924 | 0.511724 | 4.535126
0.4 | 30 | 100 | 0.5 | 0.060612 | 0.497931 | 6.441483
0.4 | 90 | 20 | 0.0 | 0.137105 | 0.512782 | 4.742943
0.4 | 90 | 50 | 0.0 | 0.070068 | 0.494737 | 4.054939
0.4 | 90 | 100 | 0.0 | 0.112514 | 0.500752 | 8.333633
0.4 | 90 | 20 | 0.2 | 0.092492 | 0.508271 | 5.157829
0.4 | 90 | 50 | 0.2 | 0.119598 | 0.503759 | 16.004845
0.4 | 90 | 100 | 0.2 | 0.106422 | 0.487218 | 6.550301
0.4 | 90 | 20 | 0.5 | 0.146293 | 0.508271 | 9.485603
0.4 | 90 | 50 | 0.5 | 0.074089 | 0.508271 | 1.592896
0.4 | 90 | 100 | 0.5 | 0.117845 | 0.484211 | 8.569008
Table 4. Performance comparison of Adam-LSTM and Frac-Adam-LSTM models on AAPL stock price forecasting for various fractional orders α .
Model | Alpha | Sharpe Ratio | Directional Accuracy | Cumulative Return
Adam-LSTM | 0.0 | 0.7800 | 48.99% | 89.04
Frac-Adam-LSTM | 0.1 | 0.9266 | 49.71% | 105.72
Frac-Adam-LSTM | 0.2 | 0.7298 | 50.14% | 83.32
Frac-Adam-LSTM | 0.3 | 0.8818 | 50.14% | 100.63
Frac-Adam-LSTM | 0.4 | 0.5713 | 50.00% | 65.25
Frac-Adam-LSTM | 0.5 | 0.8330 | 49.86% | 95.07
Frac-Adam-LSTM | 0.6 | 1.0821 | 50.87% | 123.38
Frac-Adam-LSTM | 0.7 | 1.3051 | 51.73% | 148.66
Frac-Adam-LSTM | 0.8 | 1.0792 | 51.88% | 123.06
Frac-Adam-LSTM | 0.9 | 0.2846 | 48.55% | 32.52
Table 5. Performance comparison of RMSprop-LSTM and Frac-RMSprop-LSTM models for AMZN stock forecasting based on RMSE, Sharpe ratio, and directional accuracy. FracLSTM explores different fractional orders α .
Model | RMSE | Sharpe Ratio | Directional Accuracy
Standard LSTM | 7.8237 | 0.1949 | 50.14%
FracLSTM α = 0.0 | 5.4988 | 0.1949 | 50.00%
FracLSTM α = 0.1 | 6.8128 | 0.1949 | 50.29%
FracLSTM α = 0.2 | 7.2629 | 0.1949 | 51.30%
FracLSTM α = 0.3 | 7.1341 | 0.1949 | 50.43%
FracLSTM α = 0.4 | 7.9988 | 0.1949 | 49.42%
FracLSTM α = 0.5 | 7.5969 | 0.1949 | 50.43%
Table 6. Performance comparison of the Adagrad and Frac-Adagrad LSTM models on GOOGL stock price.
Model | Alpha | Sharpe Ratio | Directional Accuracy | Cumulative Return
Adagrad-LSTM | 0.0 | −0.339907 | 0.501445 | −34.440094
Frac-Adagrad-LSTM | 0.1 | 0.054720 | 0.514451 | 5.545547
Frac-Adagrad-LSTM | 0.2 | −0.136651 | 0.497110 | −13.848495
Frac-Adagrad-LSTM | 0.3 | 0.285306 | 0.513006 | 28.909805
Frac-Adagrad-LSTM | 0.4 | 0.350022 | 0.523121 | 35.464493
Frac-Adagrad-LSTM | 0.5 | 0.074067 | 0.518786 | 7.506226
Frac-Adagrad-LSTM | 0.6 | 0.163547 | 0.520231 | 16.573822
Frac-Adagrad-LSTM | 0.7 | 0.628578 | 0.527457 | 63.653625
Frac-Adagrad-LSTM | 0.8 | 0.547424 | 0.523121 | 55.445953
Frac-Adagrad-LSTM | 0.9 | 0.601502 | 0.528902 | 60.915802
Table 7. Performance of the Frac-ADAM, Frac-Adagrad, Frac-RMSprop, ARIMA and GARCH Models.
Model | Alpha | Sharpe Ratio | Directional Accuracy | Final Cumulative Return
Frac-ADAM | 0.01 | −0.4231 | 0.4812 | −42.86
Frac-ADAM | 0.06 | −0.1869 | 0.4957 | −18.94
Frac-ADAM | 0.11 | −0.4928 | 0.4769 | −49.92
Frac-ADAM | 0.16 | 0.1097 | 0.4971 | 11.12
Frac-ADAM | 0.21 | −0.4882 | 0.4754 | −49.46
Frac-ADAM | 0.26 | −0.0111 | 0.5029 | −1.12
Frac-ADAM | 0.31 | −0.3418 | 0.4870 | −34.63
Frac-ADAM | 0.36 | −0.5918 | 0.4769 | −59.94
Frac-ADAM | 0.41 | −0.1234 | 0.5014 | −12.50
Frac-ADAM | 0.46 | −0.1559 | 0.4942 | −15.80
Frac-ADAM | 0.51 | −0.3372 | 0.5014 | −34.16
Frac-ADAM | 0.56 | −0.6069 | 0.4754 | −61.47
Frac-Adagrad | 0.01 | −0.2910 | 0.5029 | −29.49
Frac-Adagrad | 0.06 | −0.6234 | 0.4841 | −63.13
Frac-Adagrad | 0.11 | −0.7387 | 0.4841 | −74.79
Frac-Adagrad | 0.16 | −0.1376 | 0.5101 | −13.95
Frac-Adagrad | 0.21 | −0.0700 | 0.5159 | −7.10
Frac-Adagrad | 0.26 | 0.0852 | 0.5058 | 8.64
Frac-Adagrad | 0.31 | 0.3215 | 0.5116 | 32.58
Frac-Adagrad | 0.36 | 0.7404 | 0.5275 | 74.95
Frac-Adagrad | 0.41 | 0.4570 | 0.5188 | 46.30
Frac-Adagrad | 0.46 | −0.1770 | 0.5029 | −17.94
Frac-Adagrad | 0.51 | 0.6209 | 0.5303 | 62.88
Frac-Adagrad | 0.56 | 0.3759 | 0.5231 | 38.08
Frac-RMSprop | 0.01 | −0.6345 | 0.4827 | −64.25
Frac-RMSprop | 0.06 | −0.5548 | 0.4827 | −56.19
Frac-RMSprop | 0.11 | −0.6762 | 0.4827 | −68.47
Frac-RMSprop | 0.16 | −0.2074 | 0.4884 | −21.01
Frac-RMSprop | 0.21 | −0.4239 | 0.4884 | −42.95
Frac-RMSprop | 0.26 | −0.1025 | 0.5043 | −10.39
Frac-RMSprop | 0.31 | −0.4032 | 0.4971 | −40.85
Frac-RMSprop | 0.36 | −0.1236 | 0.4942 | −12.53
Frac-RMSprop | 0.41 | 0.1449 | 0.5029 | 14.68
Frac-RMSprop | 0.46 | 0.1229 | 0.5014 | 12.46
Frac-RMSprop | 0.51 | −0.6096 | 0.4913 | −61.73
Frac-RMSprop | 0.56 | −0.1501 | 0.4986 | −15.21
ARIMA | N/A | −0.8201 | 0.0852 | −0.0076
GARCH | N/A | 1.067e6 | 0.0000 | 0.4740
Table 8. Performance metrics across various stocks under different hyperparameter settings.
Stock | Alpha | Mem Win | Sharpe | Accuracy | CPU (s) | GPU (s) | Speedup
AAPL | 0.3 | 60 | 0.1867 | 0.5151 | 0.0007 | 0.0006 | 1.10
AAPL | 0.3 | 90 | 0.1765 | 0.5128 | 0.0006 | 0.0006 | 0.98
AAPL | 0.5 | 60 | 0.1938 | 0.5180 | 0.0005 | 0.0007 | 0.79
AAPL | 0.5 | 90 | 0.1719 | 0.5068 | 0.0006 | 0.0006 | 0.96
MSFT | 0.3 | 60 | 0.1631 | 0.4892 | 0.0005 | 0.0005 | 0.98
MSFT | 0.3 | 90 | 0.1454 | 0.4842 | 0.0007 | 0.0007 | 0.98
MSFT | 0.5 | 60 | 0.1573 | 0.4892 | 0.0005 | 0.0005 | 1.06
MSFT | 0.5 | 90 | 0.1181 | 0.4842 | 0.0006 | 0.0006 | 1.06
GOOGL | 0.3 | 60 | 0.1237 | 0.5036 | 0.0005 | 0.0005 | 1.04
GOOGL | 0.3 | 90 | 0.1067 | 0.4902 | 0.0006 | 0.0007 | 0.88
GOOGL | 0.5 | 60 | 0.1255 | 0.4921 | 0.0007 | 0.0007 | 0.93
GOOGL | 0.5 | 90 | 0.0848 | 0.4947 | 0.0007 | 0.0006 | 1.04
AMZN | 0.3 | 60 | −0.0108 | 0.5108 | 0.0005 | 0.0005 | 0.87
AMZN | 0.3 | 90 | −0.0614 | 0.5038 | 0.0010 | 0.0011 | 0.89
AMZN | 0.5 | 60 | −0.0092 | 0.5050 | 0.0005 | 0.0005 | 1.02
AMZN | 0.5 | 90 | −0.0575 | 0.5008 | 0.0006 | 0.0007 | 0.95
META | 0.3 | 60 | −0.0422 | 0.4719 | 0.0007 | 0.0008 | 0.92
META | 0.3 | 90 | −0.0819 | 0.4707 | 0.0006 | 0.0007 | 0.89
META | 0.5 | 60 | −0.0443 | 0.4705 | 0.0005 | 0.0005 | 1.04
META | 0.5 | 90 | −0.0826 | 0.4677 | 0.0010 | 0.0007 | 1.36
NVDA | 0.3 | 60 | 0.0929 | 0.4950 | 0.0005 | 0.0005 | 1.05
NVDA | 0.3 | 90 | 0.0841 | 0.4917 | 0.0007 | 0.0007 | 0.88
NVDA | 0.5 | 60 | 0.0947 | 0.4950 | 0.0005 | 0.0005 | 1.01
NVDA | 0.5 | 90 | 0.0760 | 0.4932 | 0.0006 | 0.0007 | 0.86
JPM | 0.3 | 60 | 0.1313 | 0.4863 | 0.0005 | 0.0005 | 0.91
JPM | 0.3 | 90 | 0.1389 | 0.4932 | 0.0006 | 0.0006 | 0.99
JPM | 0.5 | 60 | 0.1295 | 0.4921 | 0.0005 | 0.0005 | 0.99
JPM | 0.5 | 90 | 0.1315 | 0.4977 | 0.0007 | 0.0007 | 1.04
V | 0.3 | 60 | 0.1098 | 0.4835 | 0.0007 | 0.0007 | 0.98
V | 0.3 | 90 | 0.0803 | 0.4902 | 0.0006 | 0.0006 | 1.07
V | 0.5 | 60 | 0.1020 | 0.4950 | 0.0005 | 0.0005 | 1.03
V | 0.5 | 90 | 0.0741 | 0.4887 | 0.0007 | 0.0006 | 1.14
UNH | 0.3 | 60 | 0.3906 | 0.4806 | 0.0006 | 0.0006 | 0.90
UNH | 0.3 | 90 | 0.3649 | 0.4842 | 0.0007 | 0.0007 | 0.98
UNH | 0.5 | 60 | 0.4259 | 0.4878 | 0.0006 | 0.0006 | 1.01
UNH | 0.5 | 90 | 0.3301 | 0.4812 | 0.0007 | 0.0006 | 1.06
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
