Article

Fractional Derivative in LSTM Networks: Adaptive Neuron Shape Modeling with the Grünwald–Letnikov Method

Zbigniew Gomolka, Ewa Zeslawska, Lukasz Olbrot, Michal Komsa and Adrian Ćwiąkała

1 Institute of Computer Science, Faculty of Exact and Technical Sciences, University of Rzeszow, 16C Tadeusza Rejtana Avenue, 35-959 Rzeszow, Poland
2 FIBRAIN Sp. z o.o., Zaczernie 190F, 36-062 Głogów Małopolski, Poland
3 Student, Faculty of Exact and Technical Sciences, University of Rzeszow, 16C Tadeusza Rejtana Avenue, 35-959 Rzeszow, Poland
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13046; https://doi.org/10.3390/app152413046
Submission received: 10 November 2025 / Revised: 3 December 2025 / Accepted: 8 December 2025 / Published: 11 December 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The incorporation of fractional-order derivatives into neural networks presents a novel approach to improving gradient flow and adaptive learning dynamics. This paper introduces a fractional-order LSTM model, leveraging the Grünwald–Letnikov (GL) method to modify both activation functions and backpropagation mechanics. By redefining the transition functions of LSTM gates with fractional derivatives, the model achieves a smoother gradient adaptation while maintaining consistency across forward and backward passes. This is the first study integrating the Grünwald–Letnikov operator directly into both forward and backward LSTM computations, ensuring a consistent fractional framework throughout the entire learning process. We apply this approach to anomaly detection in fiber optic cable manufacturing, where small deviations in production parameters can significantly impact quality. A dataset containing time-series sensor measurements was used to train the fractional LSTM, demonstrating improved generalization and stability compared to classical LSTM models. Numerical stability analysis confirms that the fractional derivative framework allows convergent learning, preventing both vanishing and exploding gradients. Experimental results show that the fractional-order LSTM outperforms standard architectures in detecting manufacturing anomalies, with the optimal fractional order ν = 0.95 providing a balance between accuracy and computational complexity. The findings suggest that fractional calculus can enhance deep learning architectures by introducing a continuous and flexible transition between neuron activations, paving the way for adaptive neural networks with tunable memory effects.

1. Introduction

Fractional calculus has recently emerged as a robust mathematical framework for modeling dynamic systems with long-term memory and hereditary effects. Its integration into artificial neural networks has opened a new research direction, as explored in [1,2,3,4,5,6,7,8], where mathematical formulations of fractional-order neural networks have demonstrated improved learning stability and adaptability. Beyond neural modeling, fractional calculus has also been successfully applied to describing complex dynamic systems that cannot be captured by classical integer-order derivatives. Its usefulness has been demonstrated in modeling nonlinear temporal dependencies across diverse scientific domains, including tumor growth [9,10], ecological competition [9], hydrodynamic and oceanic systems, and biological oscillations [11]. Fractional-order differential equations have also been applied to chaotic systems and nonlinear oscillators [12,13], showing that fractional differentiation provides an effective mathematical tool for analyzing stability, multistability, and chaotic transitions. These findings indicate that fractional derivatives enable a more accurate description of long-term dependencies and complex feedback structures in dynamic systems.
Research combining fractional calculus and artificial intelligence has evolved along three main directions. First, studies addressing the mathematical and numerical foundations of fractional-order operators have focused on their discrete approximations and stability [14,15,16,17,18]. Second, neural modeling approaches have embedded fractional operators directly into learning architectures, improving gradient flow and convergence [19,20,21]. Finally, several applied works have demonstrated the potential of fractional frameworks in real-world contexts, including ecological prediction [22], image-based diagnostics [23], and biomedical signal processing [24]. Despite these advances, no previous study has proposed a unified recurrent formulation in which fractional differentiation operates consistently across both forward and backward computations. A comparative summary of key research directions and representative works is presented in Table 1.
Building upon these foundations, this study introduces a Fractional Derivative Long Short-Term Memory (FD–LSTM) network that incorporates the Grünwald–Letnikov fractional operator directly into both the forward activation and backward propagation mechanisms. To demonstrate its practical relevance, the proposed model is validated on real-world data from fiber-optic cable manufacturing, a process highly sensitive to micro-scale irregularities that can affect signal quality and mechanical durability. The classical LSTM architecture, introduced by Hochreiter and Schmidhuber [25], relies on predefined transition functions with first-order derivatives. The proposed model extends this structure by embedding fractional differentiation consistently across all computation stages, ensuring mathematical coherence between forward and backward gradient flow. To the best of our knowledge, this is the first approach that provides a unified fractional-order learning mechanism within recurrent neural networks. The FD–LSTM enhances the adaptability of gate functions, enabling smooth representation of long-term dependencies and stable gradient propagation.
Several fractional-order LSTM variants have been proposed in recent years, including architectures based on Caputo derivatives [19], Caputo–Fabrizio models [21], Atangana–Baleanu formulations [7], and fractional memory units [8]. However, these works typically apply fractional operators only to selected components of the architecture, such as the forward activation function or the cell-state transition, while retaining classical backpropagation. As a result, the forward and backward memory mechanisms are governed by different mathematical rules.
In contrast, the FD–LSTM proposed in this study applies a unified Grünwald–Letnikov operator consistently to both the forward-pass gate transformations and the full backward gradient computation. This ensures mathematical coherence between activation dynamics and gradient flow, and, to the best of our knowledge, has not been addressed in previous fractional-order LSTM formulations. Table 2 summarises conceptual differences between existing fractional LSTM formulations and the unified GL-based architecture proposed in this work. In addition to unifying the fractional mechanisms in forward and backward passes, the proposed formulation derives the LSTM gate activations from GL-based fractional derivatives of smooth base potentials, enabling a continuous modulation of gate shape as a function of the fractional order ν .
The challenges of anomaly detection in fiber-optic cable production have been extensively analyzed in [26]. Broader perspectives on AI-driven anomaly detection in industrial environments can be found in [27,28,29,30,31,32,33,34]. Experimental validation on real-world data confirms that the fractional-order LSTM network improves generalization, stability, and detection accuracy compared to classical and Caputo-based recurrent models, demonstrating its potential for real-time industrial monitoring and adaptive control.

2. Materials and Methods

2.1. Mathematical Model of the Proposed Fractional Order LSTM Cell

In this section, we introduce the mathematical model of the proposed LSTM network, built on standard architectures such as those in [35,36,37]. The implementation of the novel LSTM cell is based on the fractional derivative formulated using the direct Grünwald–Letnikov definition, as presented in Equation (1), where $\binom{\nu}{i}$ denotes the Newton binomial coefficient, ν is the order of the fractional derivative of the base function $f_B(x)$, n is the number of coefficients, and h is the discretization step [38,39].
$$D^{\nu} f_B(x) = \lim_{h \to 0^{+}} \frac{1}{h^{\nu}} \sum_{i=0}^{n} (-1)^{i} \binom{\nu}{i} f_B(x - ih) \qquad (1)$$
The GL method was implemented within the mathematical interface of the TensorFlow library so that it can be used efficiently inside its neural network models. Figure 1 shows the classical model of the LSTM cell proposed by Hochreiter and Schmidhuber in [25].
The LSTM cell consists of two types of transitions: logic gates and the candidate cell state. Implementing an LSTM cell based on fractional-order derivatives required modifying these transitions to enable smooth adjustment of their transfer functions. Additionally, during backpropagation, the gradient computation had to be modified to ensure a correct approximation of the derivatives. This was crucial for obtaining the derivatives of the transition functions, themselves defined through fractional-order derivatives, and thus for effective gradient propagation through the network.
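A minimal NumPy sketch of the truncated GL operator in Equation (1) is shown below. The helper names gl_coefficients and gl_derivative, the step size, and the stencil length are illustrative assumptions for this example and do not correspond to the authors' TensorFlow implementation.

```python
import numpy as np

def gl_coefficients(nu: float, n: int) -> np.ndarray:
    """Signed Grünwald–Letnikov weights w_k = (-1)^k * C(nu, k), built recursively."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        # Recurrence equivalent to (-1)^k * binom(nu, k).
        w[k] = w[k - 1] * (k - 1 - nu) / k
    return w

def gl_derivative(f, x, nu: float, h: float = 0.05, n: int = 20):
    """Truncated GL derivative: D^nu f(x) ~= h^(-nu) * sum_k w_k * f(x - k*h)."""
    x = np.asarray(x, dtype=float)
    w = gl_coefficients(nu, n)
    shifted = np.stack([f(x - k * h) for k in range(n)], axis=0)  # backward evaluations
    return np.tensordot(w, shifted, axes=1) / h**nu

# Example: fractional derivative of the softplus base function at a few points.
softplus = lambda z: np.log1p(np.exp(z))
print(gl_derivative(softplus, np.array([-2.0, 0.0, 2.0]), nu=0.95))
```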

2.2. Forward Model

In fractional-order neural networks, activation functions can be generalized using fractional derivatives. Instead of directly defining the activation function, we introduce a general base function $f_B(x)$, whose fractional derivative reproduces the desired neuron nonlinearity [40,41,42,43,44]. The standard sigmoid activation function is given by
$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$
To derive the sigmoid function from a fractional derivative, we choose the softplus function as the base function:
$$f_B^{\sigma}(x) = \log\left(1 + e^{x}\right) \qquad (3)$$
and its classical first derivative is
$$\frac{d}{dx} f_B^{\sigma}(x) = \frac{e^{x}}{1 + e^{x}} = \sigma(x) \qquad (4)$$
For fractional-order derivatives using the Grünwald–Letnikov definition, we generalize this as
$$D^{\nu} f_B^{\sigma}(x) = \sum_{k=0}^{\infty} \frac{(-1)^{k}}{h^{\nu}} \binom{\nu}{k} \log\left(1 + e^{x - kh}\right) \qquad (5)$$
For ν = 1, this recovers the standard sigmoid function. By analogy, the hyperbolic tangent activation function,
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (6)$$
can be obtained by using the log(cosh) function as the base function $f_B^{\tanh}(x)$:
$$f_B^{\tanh}(x) = \log\left(\cosh(x)\right) \qquad (7)$$
Taking its classical first derivative,
$$\frac{d}{dx} f_B^{\tanh}(x) = \tanh(x) \qquad (8)$$
The fractional-order Grünwald–Letnikov derivative is
$$D^{\nu} f_B^{\tanh}(x) = \sum_{k=0}^{\infty} \frac{(-1)^{k}}{h^{\nu}} \binom{\nu}{k} \log\left(\cosh(x - kh)\right) \qquad (9)$$
Using these base functions, we can smoothly generalize activation functions using fractional-order derivatives. This allows us to continuously transition between different activation behaviours by tuning ν , offering greater flexibility in neural network modeling.
In practical implementations, the infinite Grünwald–Letnikov series in Equation (9) is replaced by a truncated formulation. Based on our convergence experiments, we adopt a coefficient length of N = 20 for all experiments reported in this study. Values of N ≥ 20 ensure numerical stability and suppress oscillatory behaviour for both sigmoid-like and tanh-like base functions, while keeping the computational overhead moderate. Shorter stencils (N < 15) tend to produce visible oscillations and degrade gradient stability, whereas increasing the stencil size beyond N = 25 yields negligible improvements relative to the additional computational cost. Therefore, N = 20 provides a practical trade-off between convergence reliability and runtime efficiency and is used consistently throughout Section 3.
The fractional order ν plays a central role in shaping both the activation function and the resulting gradient dynamics. In the Grünwald–Letnikov formulation, ν controls the decay rate of the fractional coefficients
$$w_k^{(\nu)} = (-1)^{k} \binom{\nu}{k} \qquad (10)$$
which determine how strongly past shifted evaluations $f(x - kh)$ contribute to the fractional derivative. Lower values of ν increase the influence of long-range history, but also amplify oscillatory behaviour of the derivative, resulting in unstable gradient propagation for ν < 0.7. Conversely, values in the range ν ∈ [0.85, 0.98] provide smooth gradient flow and stable convergence, as the fractional weights decay steadily while retaining sufficient memory depth. In practice, the tested values ν = 0.80, 0.90, 0.95 span the most commonly adopted regime in fractional neural modelling: they preserve stability, modulate the nonlinearity of the fractional activation functions, and yield smooth transitions between classical (ν = 1) and fractional memory behaviours. For ν = 1, all expressions recover the standard LSTM activations and gradients, while for ν < 1 the fractional operator introduces a controllable attenuation of gradient magnitudes, acting as a natural smoothing mechanism in both forward and backward passes.
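As an illustration, the following TensorFlow sketch builds the truncated fractional activations of Equations (5) and (9) with the N = 20 stencil discussed above. The function names and the step size h = 0.05 are assumptions made for this example, not the authors' production code.

```python
import tensorflow as tf

def gl_weights(nu: float, n: int = 20) -> tf.Tensor:
    """Signed GL weights w_k = (-1)^k * C(nu, k); n = 20 follows the truncation used in the text."""
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (k - 1 - nu) / k)
    return tf.constant(w, dtype=tf.float32)

def frac_activation(x: tf.Tensor, base_fn, nu: float, h: float = 0.05, n: int = 20) -> tf.Tensor:
    """Truncated GL derivative of a smooth base potential, used as an activation function."""
    w = gl_weights(nu, n)
    shifts = tf.reshape(tf.range(n, dtype=tf.float32) * h, [n] + [1] * len(x.shape))
    stacked = base_fn(tf.expand_dims(x, 0) - shifts)   # (n, ...) backward evaluations f(x - k*h)
    return tf.tensordot(w, stacked, axes=1) / h**nu

softplus = tf.math.softplus                            # base potential of the sigmoid, Eq. (3)
# Numerically stable log(cosh(z)) = |z| + log(1 + exp(-2|z|)) - log(2), base potential of tanh, Eq. (7)
log_cosh = lambda z: tf.abs(z) + tf.math.log1p(tf.exp(-2.0 * tf.abs(z))) - tf.math.log(2.0)

x = tf.linspace(-5.0, 5.0, 11)
sig_like  = frac_activation(x, softplus, nu=0.95)      # approaches sigmoid(x) as nu -> 1
tanh_like = frac_activation(x, log_cosh, nu=0.95)      # approaches tanh(x) as nu -> 1
```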

2.2.1. Gates

The LSTM cell has three logic gates, which are responsible for adjusting the values of the short-term memory vector h and the long-term memory vector c. All of them operate as dense layers implementing the sigmoid activation function, and each gate is computed using an almost identical expression. Equations (11)–(13) present the operations performed by the gates, where σ denotes the sigmoid function, the W matrices are the corresponding weights, $x_t$ is the input at the current timestep, and $h_{t-1}$ is the short-term memory vector from the previous timestep.
$$f_t = \sigma\left(W_{xf}^{T} x_t + W_{hf}^{T} h_{t-1} + b_f\right) \qquad (11)$$
$$i_t = \sigma\left(W_{xi}^{T} x_t + W_{hi}^{T} h_{t-1} + b_i\right) \qquad (12)$$
$$o_t = \sigma\left(W_{xo}^{T} x_t + W_{ho}^{T} h_{t-1} + b_o\right) \qquad (13)$$
A key element of the proposed network is the smooth adjustment of the sigmoid function, including within the gates. For this purpose, the softplus function was selected as the base function; it is the classical first-order antiderivative of the sigmoid, which allows functions in the neighbourhood of the sigmoid to be approximated. After introducing the fractional derivatives, the gate equations take the forms presented in Equations (14)–(16).
$$f_t = D^{\nu} f_B^{\sigma}\left(W_{xf}^{T} x_t + W_{hf}^{T} h_{t-1} + b_f\right) \qquad (14)$$
$$i_t = D^{\nu} f_B^{\sigma}\left(W_{xi}^{T} x_t + W_{hi}^{T} h_{t-1} + b_i\right) \qquad (15)$$
$$o_t = D^{\nu} f_B^{\sigma}\left(W_{xo}^{T} x_t + W_{ho}^{T} h_{t-1} + b_o\right) \qquad (16)$$
Figure 2a presents the approximations of fractional-order derivatives for the softplus base function, which were used in the gate computations. Figure 2b presents the corresponding approximations for the log(cosh) base function used in Equation (18).

2.2.2. Candidate Cell State

Another type of transition in the LSTM cell is the candidate cell state, which is responsible for calculating the base value of the current timestep. Like the gates, it is a dense layer, but it implements a different transition function, tanh. The candidate cell state is calculated using a formula analogous to that of the logic gates, presented in Equation (17).
$$g_t = \tanh\left(W_{xg}^{T} x_t + W_{hg}^{T} h_{t-1} + b_g\right) \qquad (17)$$
Substituting the fractional-order derivative into this expression, we obtain
$$g_t = D^{\nu} f_B^{\tanh}\left(W_{xg}^{T} x_t + W_{hg}^{T} h_{t-1} + b_g\right) \qquad (18)$$

2.2.3. Update of Memory Vectors

The computed values of the gates and the candidate cell state are used by the network to update the short-term and long-term memory vectors. Equations (19) and (20) present the update of the long-term memory vector, which uses gates that employ fractional-order derivatives for approximating the transition functions.
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \qquad (19)$$
$$c_t = D^{\nu} f_B^{\sigma}\left(W_{xf}^{T} x_t + W_{hf}^{T} h_{t-1} + b_f\right) \odot c_{t-1} + D^{\nu} f_B^{\sigma}\left(W_{xi}^{T} x_t + W_{hi}^{T} h_{t-1} + b_i\right) \odot D^{\nu} f_B^{\tanh}\left(W_{xg}^{T} x_t + W_{hg}^{T} h_{t-1} + b_g\right) \qquad (20)$$
Similarly, Equations (21) and (22) present the update of the short-term memory vector using fractional-order derivative mechanisms.
$$h_t = o_t \odot \tanh\left(c_t\right) \qquad (21)$$
$$h_t = D^{\nu} f_B^{\sigma}\left(W_{xo}^{T} x_t + W_{ho}^{T} h_{t-1} + b_o\right) \odot \tanh\left(c_t\right) \qquad (22)$$
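Putting Equations (14)–(22) together, a single FD–LSTM forward step can be sketched as follows. This is a simplified illustration that reuses the frac_activation, softplus, and log_cosh helpers from the earlier sketch; the parameter dictionary, matrix orientation, and the ν = 0.95 default are assumptions for demonstration rather than the authors' implementation.

```python
import tensorflow as tf

def fd_lstm_step(x_t, h_prev, c_prev, params, nu=0.95):
    """One FD-LSTM forward step following Equations (14)-(22).
    Assumes frac_activation, softplus and log_cosh as defined in the previous sketch;
    params is a dict of illustrative weight matrices Wx_*, Wh_* and biases b_*."""
    def gate(name, base_fn):
        z = (tf.matmul(x_t, params[f"Wx_{name}"])
             + tf.matmul(h_prev, params[f"Wh_{name}"])
             + params[f"b_{name}"])
        return frac_activation(z, base_fn, nu)

    f_t = gate("f", softplus)        # forget gate, Eq. (14)
    i_t = gate("i", softplus)        # input gate,  Eq. (15)
    o_t = gate("o", softplus)        # output gate, Eq. (16)
    g_t = gate("g", log_cosh)        # candidate cell state, Eq. (18)

    c_t = f_t * c_prev + i_t * g_t   # long-term memory update, Eq. (20)
    h_t = o_t * tf.tanh(c_t)         # short-term memory update, Eq. (22)
    return h_t, c_t
```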

2.3. Backpropagation Through Time

Backpropagation Through Time (BPTT) is a fundamental optimization algorithm for training recurrent neural networks by unfolding them over time and applying backpropagation to compute gradients across multiple timesteps. By leveraging weight sharing, BPTT enables the effective learning of temporal dependencies in sequential data. Despite its computational complexity and sensitivity to vanishing or exploding gradients, it remains a core method for training more advanced architectures such as LSTM and Gated Recurrent Units [35,45]. The LSTM cell is trained using the loss function presented in Equation (23), where y denotes the real value and $h_t$ is the prediction of the LSTM cell.
$$L = \frac{1}{2}\left(y - h_t\right)^{2} \qquad (23)$$
We assumed the following equations representing the update operations for the weights in the BPTT process for the classic LSTM cell:
$$W_{xo} = W_{xo} - \alpha \frac{\partial L}{\partial W_{xo}}; \quad W_{ho} = W_{ho} - \alpha \frac{\partial L}{\partial W_{ho}}; \quad b_o = b_o - \alpha \frac{\partial L}{\partial b_o} \qquad (24)$$
$$W_{xf} = W_{xf} - \alpha \frac{\partial L}{\partial W_{xf}}; \quad W_{hf} = W_{hf} - \alpha \frac{\partial L}{\partial W_{hf}}; \quad b_f = b_f - \alpha \frac{\partial L}{\partial b_f} \qquad (25)$$
$$W_{xi} = W_{xi} - \alpha \frac{\partial L}{\partial W_{xi}}; \quad W_{hi} = W_{hi} - \alpha \frac{\partial L}{\partial W_{hi}}; \quad b_i = b_i - \alpha \frac{\partial L}{\partial b_i} \qquad (26)$$
$$W_{xg} = W_{xg} - \alpha \frac{\partial L}{\partial W_{xg}}; \quad W_{hg} = W_{hg} - \alpha \frac{\partial L}{\partial W_{hg}}; \quad b_g = b_g - \alpha \frac{\partial L}{\partial b_g} \qquad (27)$$
In each equation, α represents the learning rate. For clarity of notation, let us introduce additional symbols.
$$f_{in} = W_{hf} h_{t-1} + W_{xf} x_t + b_f; \quad f_{out} = \sigma\left(f_{in}\right) \qquad (28)$$
$$o_{in} = W_{ho} h_{t-1} + W_{xo} x_t + b_o; \quad o_{out} = \sigma\left(o_{in}\right) \qquad (29)$$
$$i_{in} = W_{hi} h_{t-1} + W_{xi} x_t + b_i; \quad i_{out} = \sigma\left(i_{in}\right) \qquad (30)$$
$$g_{in} = W_{xg}^{T} x_t + W_{hg}^{T} h_{t-1} + b_g; \quad g_{out} = \tanh\left(g_{in}\right) \qquad (31)$$
Starting with the output gate, Equations (32)–(34) present the expansion of the previous formulas using the chain rule.
$$\frac{\partial L}{\partial W_{xo}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial o_{out}} \frac{\partial o_{out}}{\partial o_{in}} \frac{\partial o_{in}}{\partial W_{xo}} = \left(y - h_t\right) \cdot \tanh\left(c_t\right) \cdot o_{out} \cdot \left(1 - o_{out}\right) \cdot x_t \qquad (32)$$
$$\frac{\partial L}{\partial W_{ho}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial o_{out}} \frac{\partial o_{out}}{\partial o_{in}} \frac{\partial o_{in}}{\partial W_{ho}} = \left(y - h_t\right) \cdot \tanh\left(c_t\right) \cdot o_{out} \cdot \left(1 - o_{out}\right) \cdot h_{t-1} \qquad (33)$$
$$\frac{\partial L}{\partial b_o} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial o_{out}} \frac{\partial o_{out}}{\partial o_{in}} \frac{\partial o_{in}}{\partial b_o} = \left(y - h_t\right) \cdot \tanh\left(c_t\right) \cdot o_{out} \cdot \left(1 - o_{out}\right) \qquad (34)$$
The mathematical model of the backpropagation algorithm utilizing fractional-order derivatives can be derived by replacing the relevant partial derivatives with their approximated version from the GL formula. Equations (35)–(37) present the updates of the output gate components in the process utilizing fractional-order derivatives.
$$\frac{\partial L}{\partial W_{xo}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial o_{out}} \frac{\partial o_{out}}{\partial o_{in}} \frac{\partial o_{in}}{\partial W_{xo}} = \left(y - h_t\right) \cdot \tanh\left(c_t\right) \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xo}^{T} x_t + W_{ho}^{T} h_{t-1} + b_o\right) \cdot x_t \qquad (35)$$
$$\frac{\partial L}{\partial W_{ho}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial o_{out}} \frac{\partial o_{out}}{\partial o_{in}} \frac{\partial o_{in}}{\partial W_{ho}} = \left(y - h_t\right) \cdot \tanh\left(c_t\right) \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xo}^{T} x_t + W_{ho}^{T} h_{t-1} + b_o\right) \cdot h_{t-1} \qquad (36)$$
$$\frac{\partial L}{\partial b_o} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial o_{out}} \frac{\partial o_{out}}{\partial o_{in}} \frac{\partial o_{in}}{\partial b_o} = \left(y - h_t\right) \cdot \tanh\left(c_t\right) \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xo}^{T} x_t + W_{ho}^{T} h_{t-1} + b_o\right) \qquad (37)$$
Next, the classical update of the forget gate components was considered. Equations (38)–(40) present the formulas for updating the components of the forget gate.
$$\frac{\partial L}{\partial W_{xf}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial f_{out}} \frac{\partial f_{out}}{\partial f_{in}} \frac{\partial f_{in}}{\partial W_{xf}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot c_{t-1} \cdot f_{out} \cdot \left(1 - f_{out}\right) \cdot x_t \qquad (38)$$
$$\frac{\partial L}{\partial W_{hf}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial f_{out}} \frac{\partial f_{out}}{\partial f_{in}} \frac{\partial f_{in}}{\partial W_{hf}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot c_{t-1} \cdot f_{out} \cdot \left(1 - f_{out}\right) \cdot h_{t-1} \qquad (39)$$
$$\frac{\partial L}{\partial b_f} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial f_{out}} \frac{\partial f_{out}}{\partial f_{in}} \frac{\partial f_{in}}{\partial b_f} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot c_{t-1} \cdot f_{out} \cdot \left(1 - f_{out}\right) \qquad (40)$$
Equations (41)–(43) present the update of the forget gate components in the backpropagation process utilizing fractional-order derivatives.
$$\frac{\partial L}{\partial W_{xf}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial f_{out}} \frac{\partial f_{out}}{\partial f_{in}} \frac{\partial f_{in}}{\partial W_{xf}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot c_{t-1} \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xf}^{T} x_t + W_{hf}^{T} h_{t-1} + b_f\right) \cdot x_t \qquad (41)$$
$$\frac{\partial L}{\partial W_{hf}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial f_{out}} \frac{\partial f_{out}}{\partial f_{in}} \frac{\partial f_{in}}{\partial W_{hf}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot c_{t-1} \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xf}^{T} x_t + W_{hf}^{T} h_{t-1} + b_f\right) \cdot h_{t-1} \qquad (42)$$
$$\frac{\partial L}{\partial b_f} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial f_{out}} \frac{\partial f_{out}}{\partial f_{in}} \frac{\partial f_{in}}{\partial b_f} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot c_{t-1} \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xf}^{T} x_t + W_{hf}^{T} h_{t-1} + b_f\right) \qquad (43)$$
Next, an identical analysis was conducted for the input gate. Equations (44)–(46) present the formulas for updating the components of the input gate.
$$\frac{\partial L}{\partial W_{xi}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial i_{out}} \frac{\partial i_{out}}{\partial i_{in}} \frac{\partial i_{in}}{\partial W_{xi}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot g_{out} \cdot i_{out} \cdot \left(1 - i_{out}\right) \cdot x_t \qquad (44)$$
$$\frac{\partial L}{\partial W_{hi}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial i_{out}} \frac{\partial i_{out}}{\partial i_{in}} \frac{\partial i_{in}}{\partial W_{hi}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot g_{out} \cdot i_{out} \cdot \left(1 - i_{out}\right) \cdot h_{t-1} \qquad (45)$$
$$\frac{\partial L}{\partial b_i} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial i_{out}} \frac{\partial i_{out}}{\partial i_{in}} \frac{\partial i_{in}}{\partial b_i} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot g_{out} \cdot i_{out} \cdot \left(1 - i_{out}\right) \qquad (46)$$
Equations (47)–(49) present the formulas for updating the components of the input gate using the GL method.
$$\frac{\partial L}{\partial W_{xi}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial i_{out}} \frac{\partial i_{out}}{\partial i_{in}} \frac{\partial i_{in}}{\partial W_{xi}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot g_{out} \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xi}^{T} x_t + W_{hi}^{T} h_{t-1} + b_i\right) \cdot x_t \qquad (47)$$
$$\frac{\partial L}{\partial W_{hi}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial i_{out}} \frac{\partial i_{out}}{\partial i_{in}} \frac{\partial i_{in}}{\partial W_{hi}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot g_{out} \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xi}^{T} x_t + W_{hi}^{T} h_{t-1} + b_i\right) \cdot h_{t-1} \qquad (48)$$
$$\frac{\partial L}{\partial b_i} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial i_{out}} \frac{\partial i_{out}}{\partial i_{in}} \frac{\partial i_{in}}{\partial b_i} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot g_{out} \cdot D^{\nu+1} f_B^{\sigma}\left(W_{xi}^{T} x_t + W_{hi}^{T} h_{t-1} + b_i\right) \qquad (49)$$
Finally, in Equations (50)–(52), an analysis of the classical update of the candidate cell state components was conducted.
$$\frac{\partial L}{\partial W_{xg}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial g_{out}} \frac{\partial g_{out}}{\partial g_{in}} \frac{\partial g_{in}}{\partial W_{xg}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot i_{out} \cdot \left(1 - \tanh^{2}\left(g_{in}\right)\right) \cdot x_t \qquad (50)$$
$$\frac{\partial L}{\partial W_{hg}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial g_{out}} \frac{\partial g_{out}}{\partial g_{in}} \frac{\partial g_{in}}{\partial W_{hg}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot i_{out} \cdot \left(1 - \tanh^{2}\left(g_{in}\right)\right) \cdot h_{t-1} \qquad (51)$$
$$\frac{\partial L}{\partial b_g} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial g_{out}} \frac{\partial g_{out}}{\partial g_{in}} \frac{\partial g_{in}}{\partial b_g} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot i_{out} \cdot \left(1 - \tanh^{2}\left(g_{in}\right)\right) \qquad (52)$$
Equations (53)–(55) present the process of updating the components of the candidate cell state in the backpropagation process utilizing the GL method.
$$\frac{\partial L}{\partial W_{xg}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial g_{out}} \frac{\partial g_{out}}{\partial g_{in}} \frac{\partial g_{in}}{\partial W_{xg}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot i_{out} \cdot D^{\nu+1} f_B^{\tanh}\left(W_{xg}^{T} x_t + W_{hg}^{T} h_{t-1} + b_g\right) \cdot x_t \qquad (53)$$
$$\frac{\partial L}{\partial W_{hg}} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial g_{out}} \frac{\partial g_{out}}{\partial g_{in}} \frac{\partial g_{in}}{\partial W_{hg}} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot i_{out} \cdot D^{\nu+1} f_B^{\tanh}\left(W_{xg}^{T} x_t + W_{hg}^{T} h_{t-1} + b_g\right) \cdot h_{t-1} \qquad (54)$$
$$\frac{\partial L}{\partial b_g} = \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial c_t} \frac{\partial c_t}{\partial g_{out}} \frac{\partial g_{out}}{\partial g_{in}} \frac{\partial g_{in}}{\partial b_g} = \left(y - h_t\right) \cdot o_{out} \cdot \left(1 - \tanh^{2}\left(c_t\right)\right) \cdot i_{out} \cdot D^{\nu+1} f_B^{\tanh}\left(W_{xg}^{T} x_t + W_{hg}^{T} h_{t-1} + b_g\right) \qquad (55)$$
By explicitly incorporating $f_B^{\sigma}(x)$ and $f_B^{\tanh}(x)$ in the fractional backpropagation framework, the learning process maintains a smooth gradient flow while allowing adaptive activation transitions. This generalization ensures that the fractional LSTM model retains the expressive power of classical LSTMs while introducing continuous tunability of neuron dynamics.
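For illustration, the fractional output-gate gradients of Equations (35)–(37) can be sketched as below, reusing the frac_activation helper from the earlier sketch. The batch-matrix layout and the sign convention follow the equations in the text; all names are illustrative rather than the authors' implementation.

```python
import tensorflow as tf

def frac_output_gate_grads(y, h_t, c_t, o_in, x_t, h_prev, nu=0.95):
    """Gradients of L w.r.t. the output-gate parameters, following Equations (35)-(37):
    the classical sigmoid derivative o_out * (1 - o_out) is replaced by the GL term
    D^(nu+1) f_B^sigma evaluated at the gate pre-activation o_in."""
    # Common factor shared by Equations (35)-(37).
    delta = (y - h_t) * tf.tanh(c_t) * frac_activation(o_in, tf.math.softplus, nu + 1.0)
    grad_Wxo = tf.matmul(tf.transpose(x_t), delta)      # Eq. (35), summed over the batch
    grad_Who = tf.matmul(tf.transpose(h_prev), delta)   # Eq. (36)
    grad_bo  = tf.reduce_sum(delta, axis=0)             # Eq. (37)
    return grad_Wxo, grad_Who, grad_bo
```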

2.4. Methodology for Convergence Analysis

To evaluate the numerical behaviour and convergence properties of the fractional Grünwald–Letnikov derivatives, we performed 3D simulations over the following domain: input x ∈ [−5, 5], step size h ∈ {0.01, 0.05, 0.1, 0.2}, and fractional order ν ∈ [0, 1]. For each fractional order ν, we evaluated the smoothness and convergence behaviour of the computed derivatives (see Figure 3).
Beyond the empirical gradient-norm trajectories, the stabilising behaviour of the Grünwald–Letnikov operator is also supported by established analytical properties of fractional differences. In the GL formulation, the derivative is expressed as a convolution with a sequence of coefficients that decay monotonically and form a Mittag–Leffler-type memory kernel. Such kernels are known to suppress high-frequency oscillations and to attenuate abrupt changes in iterative updates, providing a natural damping mechanism absent in integer-order derivatives. Prior studies on fractional-order dynamical systems [14,17,18] demonstrate that GL-based fractional differences lead to bounded iterative behaviour and mitigate explosive growth in discretised recurrence relations. While a full spectral eigenvalue analysis of the unrolled FD–LSTM Jacobian is mathematically non-trivial and beyond the scope of this work, these theoretical insights align with the empirically observed reduction in gradient-norm variance and the absence of vanishing or exploding gradients for ν ∈ [0.85, 0.98].
The results for the sigmoid-like function reveal that for small ν (e.g., ν ≤ 0.2), the fractional derivative closely resembles a linear function, behaving like an identity mapping. As ν → 1, the fractional derivative smoothly transitions toward the standard sigmoid derivative, confirming that the Grünwald–Letnikov approximation preserves the expected activation behaviour. Larger step sizes h lead to oscillations, but for h ≤ 0.1, the computation remains stable. For the tanh-like function, we observe smooth convergence for all ν, indicating numerical stability. Smaller ν values lead to an attenuated gradient, which can help prevent exploding gradients in backpropagation. The Mittag–Leffler decay property appears for intermediate ν, demonstrating the memory effect of fractional derivatives. These findings confirm that fractional derivatives generalize standard activation functions while maintaining numerical stability. By using fractional-order activations, neural networks gain enhanced flexibility in function approximation while maintaining reliable convergence properties.
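A compact sweep in the spirit of this analysis is sketched below, reusing the gl_derivative helper from the earlier NumPy sketch; the roughness measure is an illustrative proxy for the oscillations discussed above, not the metric used by the authors.

```python
import numpy as np

# Sweep mirroring Section 2.4: x in [-5, 5], several step sizes h and fractional orders nu.
softplus = lambda z: np.log1p(np.exp(z))
x = np.linspace(-5.0, 5.0, 201)

for h in (0.01, 0.05, 0.1, 0.2):
    for nu in (0.2, 0.5, 0.8, 1.0):
        d = gl_derivative(softplus, x, nu=nu, h=h, n=20)
        # Crude smoothness indicator: summed magnitude of the second difference.
        roughness = np.abs(np.diff(d, 2)).sum()
        print(f"h={h:<4} nu={nu:.1f} roughness={roughness:.4f}")
```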

3. Results and Discussion

3.1. Task Definition and Data

We address anomaly detection in fiber-optic cable manufacturing from streaming industrial process data. The model continuously classifies the last 256 s of signals into predefined operational states to enable real-time intervention and improved product quality. The dataset comprises ∼5690 windows recorded at 1 Hz with 294 raw features; after variance and correlation filtering, we retain 40 indicators.
The raw dataset provides a coarse expert-verified indicator at the session level: 0 denotes a session with no abnormal behaviour, while 1 indicates that at least one anomaly occurred somewhere within the session. Importantly, this label does not specify the exact timing, duration, or type of the anomaly, nor does it apply uniformly to all windows extracted from that session. A session marked as anomalous may contain normal windows, windows with mild deviations, and windows exhibiting different physical fault mechanisms. Because the session-level label is not sufficiently informative for window-level training, we apply a weakly supervised relabelling procedure. Specifically, K-means clustering with k = 6 is performed on the subset of windows extracted from anomaly-marked sessions. This yields six operationally distinct anomaly subtypes with different temporal signatures and physical origins. These refined categories provide meaningful window-level supervision and allow the FD–LSTM to learn sensitivity to heterogeneous anomaly patterns rather than a single coarse anomaly flag. All variables are standardized using the Z-score transformation before model training:
$$x' = \frac{x - \mu}{\sigma}$$
and segmented into 256-timestep windows with 50% overlap, producing ∼2500 labeled sequences for modeling.
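The preprocessing steps described above can be sketched as follows. The array shapes, the flattening used before clustering, and the variable `anomaly_windows` are assumptions for illustration, since the exact pipeline is not specified in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def standardize(X: np.ndarray) -> np.ndarray:
    """Z-score per feature; X has shape (timesteps, features)."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def make_windows(X: np.ndarray, length: int = 256, overlap: float = 0.5) -> np.ndarray:
    """256-timestep windows with 50% overlap, preserving chronological order."""
    step = int(length * (1.0 - overlap))
    starts = range(0, X.shape[0] - length + 1, step)
    return np.stack([X[s:s + length] for s in starts])

# Weakly supervised relabelling: K-means with k = 6 on windows drawn from
# anomaly-flagged sessions. `anomaly_windows` is a hypothetical array of shape
# (n_windows, 256, 40); flattening each window is one simple feature choice.
# subtype_labels = KMeans(n_clusters=6, random_state=0).fit_predict(
#     anomaly_windows.reshape(len(anomaly_windows), -1))
```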
To ensure the methodological correctness of the evaluation and to eliminate potential temporal leakage, we adopted a strictly time-aware data-splitting protocol. All sequence windows were ordered chronologically and were not shuffled at any stage of the experimental pipeline. The dataset was partitioned using a forward-in-time split, with 90% of the windows used for training and validation and the remaining 10%—corresponding to the most recent segment of the production timeline—reserved as a held-out test set. This ratio is widely adopted in industrial time-series modelling, as it preserves a representative evaluation segment while ensuring that the training set captures the full variability of operational conditions. The 10% test portion includes complete production cycles and all anomaly categories, enabling reliable generalisation assessment without compromising model learning capacity. Evaluating the model exclusively on future, unseen temporal segments reflects a realistic deployment scenario and prevents the network from accessing any information derived from later time intervals. Although the windows were generated with 50% overlap to capture smooth temporal dynamics, their chronological order was preserved throughout, ensuring that no leakage occurred between training and test partitions. The held-out portion of the dataset also contains all six anomaly categories obtained through weakly supervised relabelling, providing representative coverage of the underlying process behaviour. This time-aware dataset organisation enables a fair and robust assessment of the generalisation performance of the proposed FD–LSTM model under real industrial conditions.
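A minimal sketch of the forward-in-time split described above is given below; the 10% fraction follows the text, while the function name and return layout are illustrative.

```python
import numpy as np

def time_aware_split(windows: np.ndarray, labels: np.ndarray, test_frac: float = 0.10):
    """Forward-in-time split: the most recent fraction of chronologically ordered
    windows is held out for testing; no shuffling at any stage."""
    n_test = int(round(len(windows) * test_frac))
    cut = len(windows) - n_test
    return (windows[:cut], labels[:cut]), (windows[cut:], labels[cut:])
```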

3.2. Experimental Setup

All experiments were implemented in TensorFlow 2.14 (Python 3.11, CUDA 12.2) on an NVIDIA RTX 4090 GPU. We trained for 200 epochs with batch size 64, learning rate 0.01, early stopping (patience = 20), and SGD with Nesterov momentum 0.9. Results were averaged over 10 runs with different seeds. The FD–LSTM architecture is as follows: Input: 40 features × 256 timesteps, Hidden: 256 LSTM units implementing the Grünwald–Letnikov fractional derivative operator, Output: 6 softmax units (fault categories). We used categorical cross-entropy and report accuracy, F1, and the goodness-of-fit index Φ .
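The surrounding training protocol can be reproduced with a standard Keras setup as sketched below; a built-in LSTM layer stands in for the custom GL-based FD–LSTM cell, whose implementation is not reproduced here, and the training arrays X_train and y_train are placeholders.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256, 40)),                 # 256 timesteps x 40 features
    tf.keras.layers.LSTM(256),                       # stand-in for the 256-unit FD-LSTM layer
    tf.keras.layers.Dense(6, activation="softmax"),  # six fault categories
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
early_stop = tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.1, epochs=200, batch_size=64,
#           callbacks=[early_stop], shuffle=False)   # chronological order preserved
```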

3.3. Overall Results and Learning Dynamics

Figure 4 confirms smooth convergence across fractional orders ν . Table 3 summarizes the metrics from the author’s experiments; the best performance is obtained at ν = 0.95 with an accuracy of 0.9185 and Φ = 0.9145 on the held-out test set.
To further analyse the optimisation process of the proposed fractional mechanism, we compared the learning dynamics across several recurrent baselines trained under an identical protocol (200 epochs, batch size 64, SGD with Nesterov momentum). This comparative evaluation enables a clearer interpretation of how the GL-based fractional operator influences convergence speed, gradient stability, and generalisation performance relative to classical and fractional recurrent architectures (see Table 4). The “Classical LSTM” reported in Table 4 corresponds to the standard TensorFlow LSTM baseline trained under the unified protocol described in Section 3.2. In contrast, the “Classic Model” in Table 3 represents the GL-based architecture evaluated at ν = 1.0 within the fractional-order sweep. These two baselines belong to conceptually different experimental setups, and therefore, their accuracy values are not expected to be identical.
A complementary view of the learning dynamics is provided by the extended loss curves (Figure 5), which visualise the optimisation trajectory of each model throughout the 200 training epochs.
The extended loss trajectories reveal several important trends. First, FD–LSTM (GL) decreases the loss more rapidly than all other examined architectures during the initial 40–60 epochs, indicating more efficient gradient propagation and faster error minimisation. This behaviour reflects the smoothing properties of the Grünwald–Letnikov operator, which mitigate both vanishing and exploding gradient phenomena. Second, the fractional model exhibits substantially reduced mid-training oscillations compared with classical LSTM, GRU and Caputo-based fractional LSTM, suggesting that the GL operator acts as an adaptive regulariser for both forward activations and backward gradients. Finally, FD–LSTM stabilises notably earlier (around epoch 80) than Bi–LSTM (epoch 100) and classical recurrent models (110–130 epochs). This indicates that GL-based fractional memory provides a more efficient representation of long-term dependencies, directly translating into superior generalisation performance, as reflected by the accuracy and F1 values reported in Table 4. Table 3 reports the performance of the FD–LSTM(GL) architecture across different fractional orders ν . This ablation study includes only models constructed within the GL–based fractional framework. The row ν = 1.0 corresponds to the integer-order limit of the FD–LSTM(GL) formulation and is included for comparison within this unified fractional setting. The best-performing model used ν = 0.95 , achieving a test accuracy of 91.85% and a goodness-of-fit Φ = 0.9145 .
To quantify the contribution of the fractional mechanism, we compare FD–LSTM (GL) against standard LSTM, Bi-LSTM, and a Caputo-based fractional LSTM. Table 5 shows that FD–LSTM (GL) provides the most stable generalization (Accuracy/F1), while preserving a competitive Φ .
Compared to classical LSTM, FD–LSTM exhibits: ∼25% faster early loss decrease (first 50 epochs), lower variance of gradient norms (σ² = 0.003 vs. 0.009), and no vanishing/exploding gradients for ν ∈ [0.9, 1.1]. A specific numerical consideration is the horizontal shift inherent in truncated GL series: static GL coefficients approximate derivatives over backward samples, which may induce a minor phase lag in learning dynamics. In practice, this is mitigated by (a) adequate coefficient length N, (b) smaller step h, or (c) bidirectional GL stencils. Empirically, adopting N ≥ 20 suffices for stable training while keeping runtime moderate.
Although the Grünwald–Letnikov operator introduces a linear-time overhead during backpropagation due to the coefficient stencil, this additional cost does not translate into a substantial practical slowdown. As reported in Table 6, the per-epoch runtime increases by 11–21%; however, the fractional model reaches convergence significantly earlier than the classical LSTM (typically by epoch 80 compared to 120–130), which effectively compensates for the modest per-epoch increase. The GL operator also provides a smoothing effect on the optimisation landscape, reducing gradient-norm variance and stabilising the learning trajectory, which further decreases effective training time. Importantly, inference-time complexity remains unchanged, as the fractional stencil is applied only during training. Consequently, the FD–LSTM retains real-time feasibility for industrial monitoring applications while offering improved optimisation stability.
Our findings are consistent with recent fractional neural literature (e.g., feed-forward and biophysical networks [46,47,48]) and extend them to recurrent settings with consistent fractional forward propagation and BPTT. We hypothesize that GL’s discrete shift-invariance aligns well with uniformly sampled industrial time series, which helps explain its empirical edge over Caputo or Riemann-Liouville formulations.
In production, FD–LSTM improves the sensitivity to subtle drifts while preserving stability, making it suitable for early warning in fiber-optic lines. Future work includes meta-learning of ν , bidirectional GL stencils to reduce phase lag, and integration with an ALMM-based decision layer [26] for end-to-end adaptive control.

3.4. Industrial Addendum: Compact RLM-Line Evaluation

To complement the method-centric experiments in this paper, we report a compact, non-overlapping evaluation on the FIBRAIN RLM fiber-tube production line (1 Hz sampling, window 256 s, stride 5 s, 232 sensors). This slice is distinct from the dataset used in our earlier application-focused work. We retain the same training regimen (optimizer, batch size, early stopping) but use the plant’s five-state taxonomy (one nominal + four anomaly states) without K-means relabeling, to mirror shop-floor interpretation. Models are trained once per ν with identical seeds.
On this real-line slice, FD–LSTM (GL) consistently matches or exceeds the classical LSTM, peaking at ν = 0.80 (Table 7). In contrast, in our main study, which uses a reduced 40-indicator feature set with weakly supervised relabeling, the best result occurs at ν = 0.95 (accuracy 0.9185). This divergence is consistent with the role of ν as a tunable memory-depth control; its optimum shifts with the input representation and labeling protocol. The comparison with common baselines under the same RLM slice (Table 8) indicates that FD–LSTM's edge holds not only over its classical counterpart but also over standard learners.
The GL fractional mechanism improves real-line accuracy and stabilizes training on the RLM slice, while the optimal ν is data- and label-protocol-dependent. This complements the main results and supports the view of ν as a practical, continuous knob for adapting memory depth to industrial time-series regimes.

4. Conclusions

This study introduces a novel extension of recurrent neural networks that integrates fractional-order calculus into the learning dynamics of LSTM architectures. By embedding the Grünwald–Letnikov derivative directly into the neuron activation and gradient propagation mechanisms, the proposed FD–LSTM achieves a mathematically grounded and computationally tractable enhancement of classical backpropagation.
The research demonstrates that the fractional approach:
  • Stabilizes gradient flow and eliminates the vanishing-gradient bottleneck without architectural modifications.
  • Introduces a continuous control dimension through the fractional order ν , enabling smooth adaptation of the neuron’s transfer function between linear and nonlinear regimes.
  • Improves convergence and generalization, achieving up to 1.6× faster training and higher accuracy in industrial anomaly-detection tasks.
  • Bridges discrete and continuous learning representations, establishing a unified framework for modeling memory depth and temporal smoothness in recurrent systems.
The empirical results confirm that fractional differentiation provides a new degree of freedom for deep learning, extending the search space of trainable parameters beyond weights and biases. The optimal configuration (ν = 0.95) ensures an effective balance between numerical stability and dynamic adaptability, outperforming both classical and Caputo-based LSTM variants.
From an application perspective, the FD–LSTM enhances sensitivity to subtle temporal deviations in fiber-optic manufacturing, supporting earlier and more reliable anomaly detection. The framework’s discrete-shift invariance and compatibility with existing GPU pipelines make it directly applicable in real-time industrial monitoring.
Although this work focuses on a single real-world dataset, the fiber-optic manufacturing domain provides a challenging benchmark characterised by heterogeneous sensor interactions, long-range temporal dependencies, and multiple anomaly subtypes obtained through weakly supervised relabelling. Nevertheless, extending the evaluation to additional anomaly detection scenarios represents an important direction for future research and will allow for a broader assessment of cross-domain robustness.
In future work, fractional-order parameters could be adaptively learned alongside network weights, leading to self-tuning memory dynamics. Further research may explore bidirectional GL operators, fractional gating in Transformer-like architectures, and the integration of the proposed model into symbolic decision layers for fully interpretable, hybrid intelligent systems.
Another promising research direction concerns decentralised and federated anomaly detection. Many industrial systems operate across multiple machines or production sites where data cannot be centrally aggregated due to bandwidth limitations, privacy requirements, or heterogeneous hardware constraints. Recent works on collaborative adaptation and balance-recovery strategies for federated fault diagnosis demonstrate the potential of distributed learning frameworks in such environments [49]. Because the proposed GL-based FD–LSTM does not rely on any mechanism that prevents distributed optimisation, integrating it into a federated learning framework represents a natural extension. Such an approach would enable consistent anomaly detection across heterogeneous machine groups while preserving data decentralisation and respecting local data-sovereignty constraints.
Finally, recent advancements such as MSTAD and Transformer-based anomaly detection frameworks highlight the growing role of attention mechanisms in modelling complex, long-range temporal dependencies. While our modelling focus in this study was deliberately restricted to recurrent architectures to enable a controlled analysis of fractional gating dynamics, a natural next step is to evaluate the GL-based fractional mechanism within Transformer encoder blocks and to benchmark FD–LSTM against state-of-the-art attention-driven models. This comparison will be particularly relevant for datasets with longer temporal horizons, where self-attention can fully exploit global temporal context.
In summary, the proposed FD–LSTM model transforms fractional calculus from a theoretical construct into a practical deep learning mechanism, bridging mathematical rigor, computational efficiency, and industrial impact.

Author Contributions

Conceptualisation, Z.G.; methodology, Z.G. and E.Z.; software, Z.G., E.Z. and M.K.; validation, Z.G. and E.Z.; formal analysis, Z.G.; investigation, Z.G., E.Z. and L.O.; resources, L.O.; data curation, E.Z., M.K. and A.Ć.; writing—original draft preparation, Z.G. and E.Z.; writing—review and editing, Z.G., E.Z. and A.Ć.; visualisation, Z.G., E.Z. and M.K.; supervision, Z.G.; project administration, Z.G.; funding acquisition, L.O., Z.G. and E.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This project received funding from the National Centre for Research and Development under grant agreement No. POIR.01.01.01-00-1425/20.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Acknowledgments

We gratefully acknowledge the support of the NCBR, as part of the competition 6/1.1.1/2020 SS Duze/MSP/JN 4, project number POIR.01.01.01-00-1425/20-00, and FIBRAIN Sp. z o.o., which made this research possible.

Conflicts of Interest

Author Zbigniew Gomolka performed commissioned research work for FIBRAIN Sp. z o.o. as part of the NCBiR project POIR.01.01.01-00-1425/20. Author Lukasz Olbrot was employed by FIBRAIN Sp. z o.o. The remaining authors declare that they have no commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations and symbols are used in this manuscript:
FD–LSTM: Fractional Derivative Long Short-Term Memory network
LSTM: Long Short-Term Memory (recurrent neural network)
GL: Grünwald–Letnikov (fractional derivative operator)
BPTT: Backpropagation Through Time
RNN: Recurrent Neural Network
CNN: Convolutional Neural Network
SGD: Stochastic Gradient Descent
GPU: Graphics Processing Unit
CPU: Central Processing Unit

References

  1. Gomolka, Z.; Dudek-Dyduch, E.; Kondratenko, Y. From Homogeneous Network to Neural Nets with Fractional Derivative Mechanism. In Proceedings of the International Conference on Artificial Intelligence and Soft Computing, Zakopane, Poland, 11–15 June 2017; Volume 5, pp. 52–63. [Google Scholar] [CrossRef]
  2. Gomolka, Z. Backpropagation algorithm with fractional derivatives. ITM Web Conf. 2018, 21, 00004. [Google Scholar] [CrossRef]
  3. Gomolka, Z. Neurons’ Transfer Function Modeling with the Use of Fractional Derivative. In Proceedings of the International Conference on Dependability and Complex Systems, Brunow, Poland, 2–6 July 2018; Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J., Eds.; pp. 218–227. [Google Scholar]
  4. Gomolka, Z. Fractional Backpropagation Algorithm—Convergence for the Fluent Shapes of the Neuron Transfer Function. In Proceedings of the International Conference on Neural Information Processing, Bangkok, Thailand, 18–22 November 2020; pp. 580–588. [Google Scholar] [CrossRef]
  5. Kumar, M.; Mehta, U. Enhancing the performance of CNN models for pneumonia and skin cancer detection using novel fractional activation function. Appl. Soft Comput. 2025, 168, 112500. [Google Scholar] [CrossRef]
  6. Liu, C.G.; Wang, J.L. Passivity of fractional-order coupled neural networks with multiple state/derivative couplings. Neurocomputing 2021, 455, 379–389. [Google Scholar] [CrossRef]
  7. Mohanrasu, S.; Priyanka, T.; Gowrisankar, A.; Kashkynbayev, A.; Udhayakumar, K.; Rakkiyappan, R. Fractional derivative of Hermite fractal splines on the fractional-order delayed neural networks synchronization. Commun. Nonlinear Sci. Numer. Simul. 2025, 140, 108399. [Google Scholar] [CrossRef]
  8. Wei, J.L.; Wu, G.C.; Liu, B.Q.; Zhao, Z. New semi-analytical solutions of the time-fractional Fokker–Planck equation by the neural network method. Optik 2022, 259, 168896. [Google Scholar] [CrossRef]
  9. Solís-Pérez, J.; Gómez-Aguilar, J.; Atangana, A. A fractional mathematical model of breast cancer competition model. Chaos Solitons Fractals 2019, 127, 38–54. [Google Scholar] [CrossRef]
  10. Ganji, R.; Jafari, H.; Moshokoa, S.; Nkomo, N. A mathematical model and numerical solution for brain tumor derived using fractional operator. Results Phys. 2021, 28, 104671. [Google Scholar] [CrossRef]
  11. Vignesh, D.; Banerjee, S. Dynamical analysis of a fractional discrete-time vocal system. Nonlinear Dyn. 2023, 111, 4501–4515. [Google Scholar] [CrossRef]
  12. Al-Qurashi, M.; Asif, Q.U.A.; Chu, Y.M.; Rashid, S.; Elagan, S. Complexity analysis and discrete fractional difference implementation of the Hindmarsh–Rose neuron system. Results Phys. 2023, 51, 106627. [Google Scholar] [CrossRef]
  13. Alsharidi, A.K.; Rashid, S.; Elagan, S.K. Short-memory discrete fractional difference equation wind turbine model and its inferential control of a chaotic permanent magnet synchronous transformer in time-scale analysis. AIMS Math. 2023, 8, 19097–19120. [Google Scholar] [CrossRef]
  14. Brzeziński, D.W.; Ostalczyk, P. About accuracy increase of fractional order derivative and integral computations by applying the Grünwald-Letnikov formula. Commun. Nonlinear Sci. Numer. Simul. 2016, 40, 151–162. [Google Scholar] [CrossRef]
  15. MacDonald, C.L.; Bhattacharya, N.; Sprouse, B.P.; Silva, G.A. Efficient computation of the Grünwald–Letnikov fractional diffusion derivative using adaptive time step memory. J. Comput. Phys. 2015, 297, 221–236. [Google Scholar] [CrossRef]
  16. Türkmen, M.R. Outlier-Robust Convergence of Integer- and Fractional-Order Difference Operators in Fuzzy-Paranormed Spaces: Diagnostics and Engineering Applications. Fractal Fract. 2025, 9, 667. [Google Scholar] [CrossRef]
  17. Öğünmez, H.; Türkmen, M.R. Statistical Convergence for Grünwald-Letnikov Fractional Differences: Stability, Approximation, and Diagnostics in Fuzzy Normed Spaces. Axioms 2025, 14, 725. [Google Scholar] [CrossRef]
  18. Yao, Z.; Yang, Z.; Gao, J. Unconditional stability analysis of Grünwald Letnikov method for fractional-order delay differential equations. Chaos Solitons Fractals 2023, 177, 114193. [Google Scholar] [CrossRef]
  19. Zuñiga Aguilar, C.; Gómez-Aguilar, J.; Alvarado-Martínez, V.; Romero-Ugalde, H. Fractional order neural networks for system identification. Chaos Solitons Fractals 2020, 130, 109444. [Google Scholar] [CrossRef]
  20. Panda, S.K.; Kalla, K.S.; Nagy, A.; Priyanka, L. Numerical simulations and complex valued fractional order neural networks via (εμ)-uniformly contractive mappings. Chaos Solitons Fractals 2023, 173, 113738. [Google Scholar] [CrossRef]
  21. Sivalingam, S.M.; Kumar, P.; Govindaraj, V. A neural networks-based numerical method for the generalized Caputo-type fractional differential equations. Math. Comput. Simul. 2023, 213, 302–323. [Google Scholar] [CrossRef]
  22. Anwar, N.; Raja, M.A.Z.; Kiani, A.K.; Ahmad, I.; Shoaib, M. Autoregressive exogenous neural structures for synthetic datasets of olive disease control model with fractional Grünwald-Letnikov solver. Comput. Biol. Med. 2025, 187, 109707. [Google Scholar] [CrossRef]
  23. El Akhal, H.; Ben Yahya, A.; Moussa, N.; El Belrhiti El Alaoui, A. A novel approach for image-based olive leaf diseases classification using a deep hybrid model. Ecol. Inform. 2023, 77, 102276. [Google Scholar] [CrossRef]
  24. Jones, A.M.; Itti, L.; Sheth, B.R. Expert-level sleep staging using an electrocardiography-only feed-forward neural network. Comput. Biol. Med. 2024, 176, 108545. [Google Scholar] [CrossRef]
  25. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  26. Gomolka, Z.; Zeslawska, E.; Olbrot, L. Using Hybrid LSTM Neural Networks to Detect Anomalies in the Fiber Tube Manufacturing Process. Appl. Sci. 2025, 15, 1383. [Google Scholar] [CrossRef]
  27. Pittino, F.; Puggl, M.; Moldaschl, T.; Hirschl, C. Automatic Anomaly Detection on In-Production Manufacturing Machines Using Statistical Learning Methods. Sensors 2020, 20, 2344. [Google Scholar] [CrossRef]
  28. Abdelli, K.; Cho, J.Y.; Azendorf, F.; Griesser, H.; Tropschug, C.; Pachnicke, S. Machine Learning-based Anomaly Detection in Optical Fiber Monitoring. arXiv 2022, arXiv:2204.07059. [Google Scholar] [CrossRef]
  29. Abdallah, M.; Joung, B.G.; Lee, W.J.; Mousoulis, C.; Raghunathan, N.; Shakouri, A.; Sutherland, J.W.; Bagchi, S. Anomaly Detection and Inter-Sensor Transfer Learning on Smart Manufacturing Datasets. Sensors 2023, 23, 486. [Google Scholar] [CrossRef]
  30. Guo, W.; Jiang, P. Weakly Supervised anomaly detection with privacy preservation under a Bi-Level Federated learning framework. Expert Syst. Appl. 2024, 254, 124450. [Google Scholar] [CrossRef]
  31. Iqbal Basheer, M.Y.; Mohd Ali, A.; Abdul Hamid, N.H.; Mohd Ariffin, M.A.; Osman, R.; Nordin, S.; Gu, X. Autonomous anomaly detection for streaming data. Knowl.-Based Syst. 2024, 284, 111235. [Google Scholar] [CrossRef]
  32. Kang, B.; Zhong, Y.; Sun, Z.; Deng, L.; Wang, M.; Zhang, J. MSTAD: A masked subspace-like transformer for multi-class anomaly detection. Knowl.-Based Syst. 2024, 283, 111186. [Google Scholar] [CrossRef]
  33. Lyu, S.; Mo, D.; Wong, W. REB: Reducing biases in representation for industrial anomaly detection. Knowl.-Based Syst. 2024, 290, 111563. [Google Scholar] [CrossRef]
  34. Shen, L.; Wei, Y.; Wang, Y.; Li, H. AFMF: Time series anomaly detection framework with modified forecasting. Knowl.-Based Syst. 2024, 296, 111912. [Google Scholar] [CrossRef]
  35. Williams, R.J.; Zipser, D. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Comput. 1989, 1, 270–280. [Google Scholar] [CrossRef]
  36. Ribeiro, E.; Mancho, R.A. Incremental construction of LSTM recurrent neural network. Res. Comput. Sci. 2002, 1, 171–184. [Google Scholar]
  37. Staudemeyer, R.C.; Morris, E.R. Understanding LSTM—A tutorial into Long Short-Term Memory Recurrent Neural Networks. arXiv 2019, arXiv:1909.09586. [Google Scholar] [CrossRef]
  38. Podlubny, I. Fractional Differential Equations: An Introduction to Fractional Derivatives, Fractional Differential Equations, to Methods of Their Solution and Some of Their Applications; Mathematics in Science and Engineering; Academic Press: London, UK, 1999. [Google Scholar]
  39. Ortigueira, M.D.; Tenreiro Machado, J. What is a fractional derivative? J. Comput. Phys. 2015, 293, 4–13. [Google Scholar] [CrossRef]
  40. Wei, J.L.; Wu, G.C.; Liu, B.Q.; Nieto, J.J. An optimal neural network design for fractional deep learning of logistic growth. Neural Comput. Appl. 2023, 35, 10837–10846. [Google Scholar] [CrossRef]
  41. Area, I.; Nieto, J. Power series solution of the fractional logistic equation. Phys. A Stat. Mech. Its Appl. 2021, 573, 125947. [Google Scholar] [CrossRef]
  42. Fan, Q.; Wu, G.C.; Fu, H. A Note on Function Space and Boundedness of the General Fractional Integral in Continuous Time Random Walk. J. Nonlinear Math. Phys. 2022, 29, 95–102. [Google Scholar] [CrossRef]
  43. Gai, M.; Cui, S.; Liang, S.; Liu, X. Frequency distributed model of Caputo derivatives and robust stability of a class of multi-variable fractional-order neural networks with uncertainties. Neurocomputing 2016, 202, 91–97. [Google Scholar] [CrossRef]
  44. Sabir, Z.; Ali, M.R. Analysis of perturbation factors and fractional order derivatives for the novel singular model using the fractional Meyer wavelet neural networks. Chaos Solitons Fractals X 2023, 11, 100100. [Google Scholar] [CrossRef]
  45. Werbos, P. Backpropagation through time: What it does and how to do it. Proc. IEEE 1990, 78, 1550–1560. [Google Scholar] [CrossRef]
  46. Huang, Z.; Haider, Q.; Sabir, Z.; Arshad, M.; Siddiqui, B.K.; Alam, M.M. A neural network computational structure for the fractional order breast cancer model. Sci. Rep. 2023, 13, 22756. [Google Scholar] [CrossRef] [PubMed]
  47. Chu, Y.M.; Alzahrani, T.; Rashid, S.; Rashidah, W.; ur Rehman, S.; Alkhatib, M. An advanced approach for the electrical responses of discrete fractional-order biophysical neural network models and their dynamical responses. Sci. Rep. 2023, 13, 18180. [Google Scholar] [CrossRef] [PubMed]
  48. Panda, S.K.; Abdeljawad, T.; Nagy, A.M. On uniform stability and numerical simulations of complex valued neural networks involving generalized Caputo fractional order. Sci. Rep. 2024, 14, 4073. [Google Scholar] [CrossRef]
  49. Yang, B.; Lei, Y.; Li, N.; Li, X.; Si, X.; Chen, C. Balance recovery and collaborative adaptation approach for federated fault diagnosis of inconsistent machine groups. Knowl.-Based Syst. 2025, 317, 113480. [Google Scholar] [CrossRef]
Figure 1. LSTM cell model with fractional order of the sigmoid and tanh transfer functions, respectively.
Figure 2. Approximations of derivatives of the softplus function (a) and tanh (b).
Figure 3. Convergence of the fractional derivative of $f_B^{\sigma}(x)$ for different fractional orders ν (a) and convergence of the fractional derivative of $f_B^{\tanh}(x)$, respectively (b).
Figure 4. Training loss progression for LSTM with the GL method. The different colors correspond to the changing values of ν.
Figure 5. Extended comparison of training loss progression across recurrent architectures. FD–LSTM (GL) achieves the fastest early loss decrease and the most stable long-term behaviour.
Table 1. Representative groups of studies related to fractional-order neural modeling.

| Category | Representative Works | Method/Approach | Focus/Model | Key Contribution | Limitations |
|---|---|---|---|---|---|
| Fractional neural foundations | [1,2,3,4,5,6,7,8] | Fractional derivatives (Caputo, GL) applied to neural structures | Theoretical formulation of fractional-order neurons and backpropagation | Demonstrated improved adaptability and convergence in learning | Limited validation on recurrent or large-scale models |
| Fractional dynamic systems and applications | [9,10,11,12,13] | Fractional differential equations; chaotic and oscillatory models | Modeling long-term memory, stability, and nonlinear transitions | Showed fractional differentiation captures complex temporal dependencies | No integration with neural or learning frameworks |
| Mathematical foundations of GL operators | [14,15,16,17,18] | Grünwald–Letnikov discrete formulations; adaptive-memory solvers | Numerical accuracy, convergence, and stability of discrete fractional operators | Provided proofs of robustness and efficiency for GL discretization | Lack of neural application or learning-based context |
| Fractional neural architectures | [19,20,21] | Fractional operators embedded in FONN, complex-valued, and Caputo-type networks | Neural architectures with fractional gradients or activations | Improved gradient smoothness and dynamic response | Applied only to feed-forward or static networks |
| Applied fractional models | [22,23,24] | Hybrid CNN and autoregressive GL solvers; biomedical signal analysis | Ecological prediction, image-based classification, biomedical signal interpretation | Demonstrated cross-domain benefits of fractional frameworks | Lacking unified recurrent backpropagation or consistency in training |
Table 2. Comparison of existing fractional-order LSTM formulations. Our model is the first to apply the same fractional operator consistently to both forward and backward computations.

| Work | Fractional Operator | Forward Pass | Backward Pass |
|---|---|---|---|
| Zuñiga Aguilar et al. [19] | Caputo | Fractional activation only | Classical |
| Sivalingam et al. [21] | Caputo–Fabrizio | Fractional cell update | Classical |
| Mohanrasu et al. [7] | Atangana–Baleanu | Partial fractional gates | Classical |
| Wei et al. [8] | FO memory units | Fractional state update | Classical |
| This work | GL | Full fractional gates | Full fractional backpropagation |
Table 3. Evaluation metrics of the trained FD–LSTM models for different fractional orders ν.

| ν | Accuracy (Test, 10%) | Goodness-of-Fit Φ (Test, 10%) | Accuracy (Full Dataset) | Goodness-of-Fit Φ (Full Dataset) |
|---|---|---|---|---|
| 0.90 | 0.9000 | 0.8940 | 0.9580 | 0.9526 |
| 0.95 | 0.9185 | 0.9145 | 0.9599 | 0.9543 |
| 1.00 | 0.9148 | 0.9051 | 0.9595 | 0.9550 |
| 1.05 | 0.8925 | 0.8863 | 0.9517 | 0.9462 |
| 1.10 | 0.8935 | 0.8867 | 0.9443 | 0.9387 |
Table 4. Comparison of convergence behaviour and final performance across recurrent architectures. The FD–LSTM (GL) model exhibits the fastest early convergence and the highest stability of gradients.

| Model | Early Loss Drop | Epoch of Stabilization | Accuracy | F1 |
|---|---|---|---|---|
| Classical LSTM | Moderate | ∼120 | 0.914 | 0.905 |
| Bi–LSTM | Fast | ∼100 | 0.921 | 0.912 |
| GRU | Moderate | ∼110 | 0.909 | 0.900 |
| Fractional LSTM (Caputo) | Slow–Moderate | ∼130 | 0.910 | 0.908 |
| FD–LSTM (GL) | Fastest | ∼80 | 0.9185 | 0.915 |
Table 5. Baseline comparison under an identical training protocol.

| Model | Mechanism | Accuracy | F1 | Φ |
|---|---|---|---|---|
| Classical LSTM | tanh/sigmoid | 0.914 | 0.905 | 0.959 |
| Bi–LSTM | bidirectional | 0.921 | 0.912 | 0.960 |
| Fractional LSTM (Caputo) | Caputo gradient | 0.910 | 0.908 | 0.957 |
| FD–LSTM (GL) | GL fractional operator | 0.9185 | 0.915 | 0.9599 |
Table 6. Runtime overhead for different fractional orders ν (seconds per epoch).

| Model | ν | Coefficients N | s/Epoch | Overhead |
|---|---|---|---|---|
| Classical LSTM | 1.0 | – | 1.00 | – |
| FD–LSTM | 0.90 | 20 | 1.11 | +11% |
| FD–LSTM | 0.95 | 20 | 1.15 | +15% |
| FD–LSTM | 1.05 | 25 | 1.21 | +21% |
Table 7. RLM compact slice: held-out test accuracy across fractional orders ν. Best value in bold.

| Model | ν | Test Accuracy |
|---|---|---|
| Classical LSTM | 1.00 | 0.9333 |
| FD–LSTM (GL) | 0.80 | **0.9667** |
| FD–LSTM (GL) | 0.85 | 0.9630 |
| FD–LSTM (GL) | 0.95 | 0.9370 |
| FD–LSTM (GL) | 1.00 | 0.9296 |
| FD–LSTM (GL) | 1.05 | 0.9370 |
| FD–LSTM (GL) | 1.15 | 0.9630 |
| FD–LSTM (GL) | 1.20 | 0.9519 |
Table 8. RLM compact slice: summary metrics across models (single best configuration per model).

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| FD–LSTM (GL) | 0.9667 | 0.96 | 0.96 | 0.93 |
| Classical LSTM | 0.9333 | >0.94 | >0.96 | >0.91 |
| Random Forest | ∼0.85 | ∼0.85 | ∼0.83 | ∼0.84 |
| SVM | ∼0.88 | ∼0.88 | ∼0.85 | ∼0.86 |
| RNN (vanilla) | ∼0.87 | ∼0.87 | ∼0.86 | ∼0.86 |

