A Training Algorithm for Locally Recurrent Neural Networks Based on the Explicit Gradient of the Loss Function

Carcangiu, Sara; Montisci, Augusto

doi:10.3390/a18020104

by Sara Carcangiu and Augusto Montisci *

Department of Electrical and Electronic Engineering, University of Cagliari, 09124 Cagliari, Italy

* Author to whom correspondence should be addressed.

Algorithms 2025, 18(2), 104; https://doi.org/10.3390/a18020104

Submission received: 6 January 2025 / Revised: 5 February 2025 / Accepted: 12 February 2025 / Published: 14 February 2025

(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))


Abstract

In this paper, a new algorithm for the training of Locally Recurrent Neural Networks (LRNNs) is presented, which aims to reduce computational complexity and at the same time guarantee the stability of the network during the training. The main feature of the proposed algorithm is the capability to represent the gradient of the error in an explicit form. The algorithm builds on the interpretation of Fibonacci’s sequence as the output of an IIR second-order filter, which makes it possible to use Binet’s formula that allows the generic terms of the sequence to be calculated directly. Thanks to this approach, the gradient of the loss function during the training can be explicitly calculated, and it can be expressed in terms of the parameters, which control the stability of the neural network.

Keywords: closed-form error gradient; dynamic systems; gradient-based training; locally recurrent neural networks

1. Introduction

When machine learning techniques are applied to dynamic systems, it is preferable that the model itself has a dynamic structure, namely that time appears explicitly as a variable in the algebraic structure of the model. More specifically, in the case of Artificial Neural Networks, the dynamics of the model are obtained by including delay blocks in the structure [1]. The dynamics can be introduced in the neural network by maintaining its feedforward structure (Time Delay Neural Networks, TDNN [2,3,4,5]), or by introducing feedback [6]. In the latter case, the delays are mandatory; otherwise, the calculation of the neuron output cannot be resolved. Both feedforward and feedback neural networks are suitable for modeling dynamic systems; thus, the choice of which paradigm to use depends on the requirements of the problem to be addressed and the available resources.

The feedforward paradigm has the advantage of leveraging the same algorithms used for training static NNs. In particular, when the delay blocks are placed only in a delay line at the input (Focused Time Delay Neural Networks, FTDNN [7]), the downstream part of the NN is structured as a static one, and so any static paradigm can be implemented. A different training strategy is adopted depending on the stationary or non-stationary behavior of the system to be modeled. In the first case, the whole evolution of the system, assumed as a training set, can be used to train the NN iteratively in batch mode, in the same way static NNs are trained. If the physical system is not stationary, the batch mode is no longer suitable, and the NN model must be adapted dynamically during the evolution of the system [8]. To this purpose, the training set at each iteration consists only of the last few samples of the physical signal, while the sensitivity of the model with respect to past samples tends to vanish with time. From this point of view, the training strategy is the same as that of adaptive filters [9,10], with the added value of exploiting nonlinearity.

From a topological point of view, FTDNNs can be seen as the cascade of a Finite Impulse Response (FIR) linear filter with a Multi-Layer Perceptron [11]. As with FIR filters, these kinds of NNs have a short-term memory, represented by the samples stored in the delay line, and a long-term memory, represented by the weights of the connections. FIR filters take their name from the fact that the impulse response has a duration equal to the number of delays in the delay line (Memory Depth), after which it is null. This implies that a proper number of delays must be defined a priori to guarantee that the dynamics of the system under study will be properly modeled. Since the dynamics are unknown a priori, a trial-and-error procedure is adopted to design the delay line.

As an alternative, feedback can be introduced in the structure of the NNs [12,13,14]. Different strategies are used depending on whether the feedback connects each pair of neurons in the network, or neurons belonging to different layers, or whether the output of the NN is fed back to the input, or, finally, whether the feedback is localized within the neurons. This last category is called Locally Recurrent Neural Networks (LRNNs [15,16]), and for many applications, it represents the best compromise between performance and computational burden. The advantage of feedback is that the memory depth can be arranged by modifying a parameter, rather than changing the topology of the NN. The drawbacks are a larger computational cost with respect to feedforward NNs, and the fact that they are subject to instabilities [4,17]. LRNNs allow one to limit the computational cost, but the stability issue remains.

The global structure of these networks is the same as that of a Multi-Layer Perceptron (MLP), but internally each neuron is structured as an Infinite Impulse Response (IIR) linear filter, optionally combined with an FIR filter, while the nonlinearity is placed downstream of the linear filter. Like FIR filters, IIR filters have a delay line, but its taps are fed back to the input rather than forwarded to the output. They take their name from the fact that the duration of the impulse response is theoretically infinite, even if after a sufficiently long period of time, the response is negligible. The main advantage of IIR filters is that the vanishing time of the impulse response can be set by changing the feedback parameters while keeping the topology unchanged. Unfortunately, some values of these parameters make the impulse response unstable; therefore, some measures are needed to prevent this event.

The standard algorithm for training feedback NNs is Backpropagation Through Time (BPTT [18]), which can be adapted to any feedback topology. This algorithm has been adapted to LRNNs in [17,18,19,20], essentially by unfolding the feedback loop a number of times sufficient to consider the impulse response extinct. Such a measure makes it possible to handle the structure even though the feedback makes it non-causal, and to use for training the same procedures defined for feedforward structures. Nonetheless, some issues remain: the number of times the loop should be unfolded is unknown a priori, so the assumed value could be too small, giving rise to interference among different impulse responses, or too large, in which case the computational cost would be needlessly inflated. Furthermore, the stability of the impulse response remains a major issue to solve [17].

In the present work, a new training algorithm is presented that addresses the problems of training and stability at the same time, thereby extending the applicability of LRNNs. This paper is organized as follows. Section 2 introduces the neural model and provides a detailed description of the proposed training algorithm. In Section 3, the method is applied to two different forecasting problems: the prediction of a chaotic time series and the estimation of power demand. Finally, Section 4 presents the conclusions, highlighting the main findings and discussing potential implications.

2. Neural Model

The global structure of an LRNN is like that of MLP, where neurons are organized in layers, but dynamic properties are achieved using neurons with internal feedback. In Figure 1, the assumed structure of NN is shown. For the sake of simplicity and without prejudice to generality, the NN has a single-input, single-output structure with only one hidden layer, where the dynamic element of the network is concentrated, and a linear activation function is assigned to the output neuron. In the rest of the paper, we will refer to this neural structure since such treatment has the advantage of simplicity and matches the exigencies of the paper.

As shown in Figure 1, $\{i\}$ represents the input signal sequence, $\underline{W} = [w_{1}, w_{2}, \dots, w_{K}]$ is the weight matrix of the links between the input neuron and the hidden layer, $K$ is the number of hidden neurons, $\{\underline{s}\} = [\{s_{1}\}, \{s_{2}\}, \dots, \{s_{K}\}]$ and $\{\underline{y}\} = [\{y_{1}\}, \{y_{2}\}, \dots, \{y_{K}\}]$ are, respectively, the input and output vector sequences of the hidden layer, $\underline{V} = [v_{1}, v_{2}, \dots, v_{K}]$ is the weight matrix of the links between the hidden layer and the output neuron, and $\{u\}$ is the output signal sequence.

The dynamic part of the network is an ARMA filter, where the parameters $a_{i}$ form the IIR part and the parameters $b_{i}$ the FIR part. The FIR part is a feedforward structure, for which the literature provides a range of efficient methodologies; therefore, this work will focus only on the IIR part.

The output $u (t)$ of the NN is calculated as

u (t) = \sum_{k = 1}^{K} v_{k} \cdot y_{k} (t)

(1)

Let us consider the calculation of the state $x (t)$ of the $k$-th hidden neuron:

x (t) = {\underline{a}}^{T} \cdot \underline{x} (t) + s_{k} (t)

(2)

where $\underline{a}$ is the vector of feedback gains, ${(\cdot)}^{T}$ indicates the transpose operator, $\underline{x} (t)$ is the state vector of the delay line with depth $r$ at the time $t = 0, 1, 2, \dots, \infty$, which includes the samples of the state sequence from $x (t - 1)$ to $x (t - r)$, and $s_{k} (t)$ is the current input of the neuron. By referring to Figure 1b, the output of the $k$-th hidden neuron is calculated as:

y_{k} (t) = f [b_{0} \cdot s_{k} (t) + {\underline{b}}^{T} \cdot \underline{x} (t)]

(3)

where $\underline{b}$ is the vector of forward gains, $b_{0}$ is the a-dynamic (instantaneous) weight, and $f (\cdot)$ is the activation function of the neuron. Equation (3) allows one to write a dynamic loss function to be used for the training (or adaptation) of the NN. Let $\{d\}$ be the desired output sequence of the NN in response to the input sequence $\{i\}$. A loss function can be defined as the squared error of the output with respect to the desired sequence, accumulated over the whole sequence:

J = \frac{1}{2} \sum_{t = 1}^{T} {[u (t) - d (t)]}^{2}

(4)

where $T$ is the duration of the sequence for which the NN must be trained. The simplest procedure to minimize (4) is based on the gradient of $J$ calculated with respect to all the parameters of the NN. Referring to Figure 1, no difficulties arise in computing the derivatives of $J$ with respect to the global parameters $w_{k}$, as well as the internal parameters $b_{j}$, with $j = 0, \dots, r$. Concerning the parameters $v_{k}$, the gradient method is not convenient, as the optimal solution corresponds to the regression hyperplane of the training samples in the product space $Y \times U$, the former being the output space of the hidden layer and the latter the output space of the neural network. Therefore, once all the parameters upstream of the hidden layer are established, the optimal set of weights of the output connections is univocally defined [21]. However, calculating the derivatives with respect to the internal parameters $a_{j}$, with $j = 1, \dots, r$, is troublesome, as the derivative of a given sample depends on all the previous ones:

\frac{\partial J}{\partial a_{k m}} = \sum_{t = 1}^{T} [u (t) - d (t)] \cdot \frac{\partial u (t)}{\partial a_{k m}} = \sum_{t = 1}^{T} [u (t) - d (t)] \cdot v_{k} \cdot f'_{k} (t) \cdot \sum_{j = 1}^{r} b_{j} \frac{\partial x (t - j)}{\partial a_{k m}}

(5)

The last derivative cannot be solved explicitly because, owing to the feedback, each component $x (t - j)$ of the state vector depends on the entire previous sequence. Consequently, the derivative of the loss function can be calculated only offline. Nonetheless, many applications require the online adaptation of the network, so (5) should be expressed in explicit terms, which requires that the generic term of the impulse response of the IIR can be calculated. This issue has been addressed in [19] by introducing the Causal Backpropagation Through Time (CBPTT) algorithm, which consists of establishing a limit on the past time to be considered when calculating the derivatives. In practice, the feedback is unfolded, and an equivalent forward structure (FIR filter) is used to represent it. The impulse response of a feedback filter (IIR) is theoretically infinite, but in practice, it is appreciable only for a few time constants. Therefore, the equivalent FIR filter must have a memory depth equal to the time over which the impulse response of the IIR is appreciable. This trick allows one to approximate the calculation of the derivative (5) without considering the whole past sequence.
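The truncation idea can be illustrated with a minimal Python sketch. It is not the CBPTT algorithm of [19]; it only compares the state recursion of Equation (2) with an FIR filter whose taps are the first samples of the IIR impulse response. The function names, the unit-step test signal, and the chosen depths are assumptions made for illustration, and time is indexed from zero.

```python
def iir_state(a, s):
    """State recursion of Equation (2): x(t) = sum_j a_j * x(t - j) + s(t)."""
    x = []
    for t, s_t in enumerate(s):
        # a[j] multiplies x(t - 1 - j); samples before t = 0 are taken as zero.
        fb = sum(a[j] * x[t - 1 - j] for j in range(len(a)) if t - 1 - j >= 0)
        x.append(fb + s_t)
    return x


def truncated_fir_state(a, s, depth):
    """Approximate the same state with an FIR filter whose taps are the
    first `depth` samples of the IIR impulse response."""
    impulse = [1.0] + [0.0] * (depth - 1)
    h = iir_state(a, impulse)
    return [
        sum(h[n] * s[t - n] for n in range(depth) if t - n >= 0)
        for t in range(len(s))
    ]


def max_abs_error(a, s, depth):
    """Largest gap between the exact recursion and its truncated version."""
    exact = iir_state(a, s)
    approx = truncated_fir_state(a, s, depth)
    return max(abs(e - p) for e, p in zip(exact, approx))


# Hypothetical test signal: a unit step of 200 samples.
step = [1.0] * 200
for a1 in (0.5, 0.9, 0.99):
    errors = [max_abs_error([a1], step, depth) for depth in (10, 50, 100)]
    print(a1, ["%.2e" % e for e in errors])
```

In this sketch the memory depth needed for a given accuracy grows as the feedback gain approaches 1, which is why a depth chosen before an update may no longer be adequate after it.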

Nevertheless, two main issues persist. The first one is that, during training, the appropriate memory depth to be assigned to the equivalent FIR filter is unknown a priori, as it depends on the updated feedback coefficient. Consequently, an initial estimate is assigned to the memory depth of the FIR filter, which must subsequently be validated after the feedback coefficient has been updated. To the best of our knowledge, no general criterion exists for determining this parameter, which means that multiple iterations may be required to find an appropriate value. The second issue concerns the stability of the network following each update of the parameters [17,22]. The stability depends on the poles of the transfer function of the IIR filter, which, in turn, are influenced by the feedback coefficients. Since the parameters updated during network training are the feedback coefficients, the training process lacks direct control over the poles and, consequently, over the stability of the system.
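The link between feedback coefficients and poles can be made concrete for a second-order filter. The sketch below assumes $r = 2$ and the recursion $x(t) = a_1 x(t-1) + a_2 x(t-2) + s(t)$, whose poles are the roots of $z^2 - a_1 z - a_2$; the function names and the numerical values are illustrative only.

```python
import cmath


def poles_second_order(a1, a2):
    """Roots of z^2 - a1*z - a2, the characteristic polynomial of
    x(t) = a1*x(t-1) + a2*x(t-2) + s(t)."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + disc) / 2.0, (a1 - disc) / 2.0


def is_stable(a1, a2):
    """Stable when every pole lies strictly inside the unit circle."""
    return all(abs(z) < 1.0 for z in poles_second_order(a1, a2))


print(is_stable(0.5, 0.3))   # both poles inside the unit circle
print(is_stable(0.5, 0.6))   # a moderate change in a2 pushes one pole outside
print(is_stable(1.0, 1.0))   # unit feedback weights: one pole has modulus > 1
```

A gradient step taken on $a_1$ and $a_2$ gives no direct indication of whether the updated pair still corresponds to poles inside the unit circle; the check has to be made after the update.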

The method presented in this paper provides a solution to both issues. First, the impulse response of the IIR filter is expressed in exact terms and in explicit form. Second, the training is performed by directly updating the parameters that control the stability of the network. This is obtained by generalizing Binet’s formula for the terms of the Fibonacci series, as described in the next subsection.

2.1. The Fibonacci’s Series and Binet’s Formula

The Fibonacci series $F_{n} = \{1, 1, 2, 3, 5, 8, \dots\}$ is a numeric sequence that owes its popularity to the fact that it reflects a ubiquitous scheme of growth in nature. The generic term of the sequence is described analytically by the following finite-difference equation:

F_{n} = F_{n - 1} + F_{n - 2}, \quad \text{with } n \geq 2

(6)

As can be readily demonstrated, the Fibonacci sequence can also be interpreted as the impulse response of a second-order IIR filter with unitary recursive weights, as depicted in Figure 2. The structure of this filter closely resembles the initial portion of the internal architecture of a hidden neuron, as illustrated in Figure 1b.
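This interpretation can be checked numerically. The following Python sketch, with a hypothetical helper name, feeds a unit impulse to a recursion with two delays and unit feedback weights and prints the first ten samples of the response.

```python
def impulse_response(a, n_samples):
    """First n_samples of the response of
    h(t) = a[0]*h(t-1) + ... + a[r-1]*h(t-r) + delta(t) to a unit impulse."""
    h = []
    for t in range(n_samples):
        # Samples before the impulse are zero, so they are simply skipped.
        fb = sum(a[j] * h[t - 1 - j] for j in range(len(a)) if t - 1 - j >= 0)
        h.append(fb + (1 if t == 0 else 0))
    return h


print(impulse_response([1, 1], 10))  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```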

As can be seen, Equation (6) defines the sequence implicitly, as each term is the sum of the two preceding terms. Fortunately, Jacques Philippe Marie Binet (1786–1858) provided a formula that allows one to directly calculate any value of the Fibonacci series without calculating the previous ones, stating it as follows:

The $n$-th term of the Fibonacci sequence, denoted as $F_{n}$, can be expressed explicitly using Binet’s formula:

F_{n} = \frac{ϕ^{n} - ψ^{n}}{\sqrt{5}}

(7)

where $ϕ = \frac{1 + \sqrt{5}}{2}$ represents the golden ratio and $ψ = \frac{1 - \sqrt{5}}{2}$ is its conjugate.

Here, the development of that formula is briefly summarized.

Demonstration of Binet’s formula. As said before, the Fibonacci sequence is defined recursively as:

F_{0} = 0, F_{1} = 1, \dots, F_{n} = F_{n - 1} + F_{n - 2}

Let us assume that an explicit function, which provides the terms of the Fibonacci series, exists, and for the n-th term, it has the general expression:

F_{n} = C \cdot z^{n}

(8)

with $C$ and $z$ constant values to be determined. By substituting (8) in (6), the following expression is obtained:

C \cdot z^{n} = C \cdot z^{n - 1} + C \cdot z^{n - 2} \Rightarrow C \cdot z^{n} - C \cdot z^{n - 1} - C \cdot z^{n - 2} = 0

(9)

from which:

C \cdot z^{n - 2} (z^{2} - z^{1} - 1) = 0

(10)

Equation (10) has two trivial solutions ($C = 0$; $z = 0$), which must be excluded because they could not generate the series, and two non-trivial solutions, namely the roots of the polynomial within brackets. These two solutions are:

\begin{matrix} z_{1} = \frac{1 + \sqrt{5}}{2}; & z_{2} = \frac{1 - \sqrt{5}}{2} \end{matrix}

(11)

It is worth noting that the first root $z_{1}$ in (11) is the golden ratio. Given that the recurrence relation (9) is linear, the sum of the two solutions is also a solution. The sought function can therefore be obtained as a linear combination of the two solutions corresponding to the two roots $z_{1}$ and $z_{2}$:

F_{n} = C_{1} z_{1}^{n} + C_{2} z_{2}^{n}

(12)

with $C_{1}$ and $C_{2}$ to be determined. To this end, we can impose the correspondence with two arbitrary values of the series, for example, the first two: 1, 1.

\begin{cases} F_{1} = 1 \Leftrightarrow C_{1} \cdot z_{1} + C_{2} \cdot z_{2} = 1 \\ F_{2} = 1 \Leftrightarrow C_{1} \cdot z_{1}^{2} + C_{2} \cdot z_{2}^{2} = 1 \end{cases}

(13)

The solution of the system (13) is

\begin{matrix} C_{1} = \frac{1}{\sqrt{5}}; & C_{2} = - \frac{1}{\sqrt{5}} \end{matrix}

from which the following expression of Binet’s formula comes:

F_{n} = \frac{1}{\sqrt{5}} [{(\frac{1 + \sqrt{5}}{2})}^{n} - {(\frac{1 - \sqrt{5}}{2})}^{n}]

(14)
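As a quick numerical check of (14), the Python sketch below compares Binet’s formula with the recursion (6). The function names are illustrative, and the comparison is limited to small indices because the closed form is evaluated in floating point and rounded to the nearest integer.

```python
import math


def fib_recursive(n):
    """F_n from the recursion F_n = F_{n-1} + F_{n-2}, with F_0 = 0, F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_binet(n):
    """F_n from Binet's formula, Equation (14), in floating point."""
    sqrt5 = math.sqrt(5.0)
    phi = (1.0 + sqrt5) / 2.0
    psi = (1.0 - sqrt5) / 2.0
    return (phi ** n - psi ** n) / sqrt5


for n in (1, 2, 10, 30):
    print(n, fib_recursive(n), round(fib_binet(n)))
```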

2.2. Exploitation of Binet’s Formula to Calculate the IIR Impulsive Response

The method used to derive Binet’s formula can be applied to calculate the impulse response of an IIR filter representing the AR part of the $k$-th neuron, by considering the discrete time $t$ instead of the index $n$ of the Fibonacci sequence and a generic number $r$ of delays. When the input of the neuron is a unitary impulse, i.e., $\{s_{k}\} = \{1, 0, 0, 0, \dots\}$, Equation (2) gives the impulse response $\{h_{k}\} = \{x_{k}\}$.

Let there be an IIR filter, such as the one described in Figure 3, whose impulsive response is:

\begin{array}{c} h (1) = 1 \\ h (2) = a_{1} h (1) \\ h (3) = a_{1} h (2) + a_{2} h (1) \\ ⋮ \\ h (t) = a_{1} h (t - 1) + \dots + a_{r} h (t - r) \end{array}

(15)

Let us assume that a function exists, which provides the arbitrary term of its impulsive response.

h (t) = C \cdot z^{t}

(16)

By replacing (16) within the generic expression in (15), it results in:

C \cdot z^{t} = a_{1} C \cdot z^{t - 1} + \dots + a_{r} C \cdot z^{t - r}

(17)

and finally:

C \cdot z^{t - r} \cdot (z^{r} - a_{1} z^{r - 1} - \dots - a_{r - 1} z - a_{r}) = 0

(18)

The non-trivial solutions of (18) are the $r$ roots of the polynomial in brackets. Therefore, by combining the $r$ solutions of the polynomial in (18), it is possible to derive the following general solution, which contains $r$ degrees of freedom:

h (t) = a_{1} C_{1} \cdot z_{1}^{t} + \dots + a_{r} C_{r} \cdot z_{r}^{t}

(19)

with $C_{1}, \dots, C_{r}$ to be determined. To this end, the first $r$ samples of the impulse response are calculated and imposed in the following system of linear equations:

\begin{cases} h (1) = a_{1} C_{1} \cdot z_{1} + \dots + a_{r} C_{r} \cdot z_{r} \\ h (2) = a_{1} C_{1} \cdot z_{1}^{2} + \dots + a_{r} C_{r} \cdot z_{r}^{2} \\ \vdots \\ h (r) = a_{1} C_{1} \cdot z_{1}^{r} + \dots + a_{r} C_{r} \cdot z_{r}^{r} \end{cases}

(20)

The solution of system (20) represents the coefficients of the function (19).
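The procedure of this subsection can be sketched in plain Python under a few simplifying assumptions: the roots are real and distinct, the product of constants multiplying each $z_{j}^{t}$ in (19) is lumped into a single coefficient `c[j]`, and the linear system is solved with a small Gaussian elimination routine written for this example. All function names and the sample roots are hypothetical.

```python
def coefficients_from_roots(roots):
    """Feedback gains a_1..a_r such that z^r - a_1 z^(r-1) - ... - a_r
    has the given roots."""
    poly = [1.0]  # monic polynomial, highest power first
    for z in roots:
        # Multiply the current polynomial by (x - z).
        poly = [p - z * q for p, q in zip(poly + [0.0], [0.0] + poly)]
    return [-c for c in poly[1:]]


def recursive_impulse_response(a, n_samples):
    """Recursion (15): h(1) = 1, h(t) = a_1 h(t-1) + ... + a_r h(t-r)."""
    h = []  # h[i] stores h(i + 1)
    for i in range(n_samples):
        fb = sum(a[j] * h[i - 1 - j] for j in range(len(a)) if i - 1 - j >= 0)
        h.append(fb + (1.0 if i == 0 else 0.0))
    return h


def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for k in range(col, n + 1):
                M[i][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def closed_form_impulse_response(roots, n_samples):
    """Fit h(t) = sum_j c_j z_j^t on the first r samples, then evaluate it."""
    r = len(roots)
    a = coefficients_from_roots(roots)
    first = recursive_impulse_response(a, r)
    A = [[z ** t for z in roots] for t in range(1, r + 1)]
    c = solve_linear(A, first)
    return [
        sum(cj * z ** t for cj, z in zip(c, roots))
        for t in range(1, n_samples + 1)
    ]


roots = [0.9, -0.5, 0.3]  # hypothetical roots, all inside the unit circle
a = coefficients_from_roots(roots)
exact = recursive_impulse_response(a, 30)
explicit = closed_form_impulse_response(roots, 30)
print(a)
print(max(abs(e - x) for e, x in zip(exact, explicit)))
```

With repeated roots the system above becomes singular, so this sketch would need a different basis in that case.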

2.3. Derivative of the Loss Function with Respect to the Feedback Parameters

Equation (5) describes the derivative of the loss function with respect to the generic feedback parameter. As remarked in the previous sections, such derivatives require the sequence of all the previous samples; therefore, an explicit calculation is impossible unless a limit is imposed on the duration of the impulse response. As said before, the vanishing time of the impulse response depends on the feedback parameters being calculated, so this assumption is subject to uncertainty. The procedure described in Section 2.2 allows one to fix this problem. The IIR filter is a linear system; therefore, its response to a generic input sequence can be expressed as the convolution product between the input signal and the impulse response. Given the impulse response (19) and denoting with $*$ the convolution product, the sequence of the state variable $\{x_{k}\}$ can be expressed as

\{x_{k}\} = \{s_{k} * h_{k}\}

(21)

Equation (21) allows us to calculate the derivatives with respect to the roots $z_{j}$ rather than the feedback coefficients $a_{j}$. This makes it possible to take control of the stability of the NN. In fact, as the state of the neurons depends on the roots $z_{j}$, if they are constrained to have a modulus less than 1, the stability of the network is guaranteed regardless of the other parameters. Therefore, it is convenient to train the network by adapting the roots of the polynomials, and then calculating the corresponding feedback parameters, which are the coefficients of the polynomials having the $z_{j}$ as roots. Equation (5), used to calculate the derivatives of the loss function with respect to the feedback parameters, is replaced by the following one:

\frac{\partial J}{\partial z_{k m}} = \sum_{t = 1}^{T} [u (t) - d (t)] \cdot v_{k} \cdot f'_{k} (t) \cdot \sum_{j = 1}^{r} b_{j} \left[ s_{k} * \frac{\partial h_{k}}{\partial z_{k m}} \right] (t - j)

(22)

where $f'_{k} (t)$ is the derivative of the activation function of the hidden layer. From (19) it follows that $h_{k} (j)$ is a polynomial in $z_{k}$, and therefore only one term of its derivative with respect to $z_{k m}$ is non-null. The derivative finally reads:

\frac{\partial J}{\partial z_{k m}} = \sum_{t = 1}^{T} [u (t) - d (t)] \cdot v_{k} \cdot w_{k} \cdot f'_{k} (t) \cdot C_{k m} \sum_{j = 1}^{r} b_{j} \cdot i (t - j) * z_{k m}^{t - j - 1}

(23)

The convolution product in (23) implies that a limit must be assumed for the duration of the impulsive response, but thanks to the use of the roots, this term can be established a priori.
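The idea of differentiating the state through the convolution (21) can be illustrated for the single-delay case $r = 1$. The Python sketch below is a simplified illustration, not a transcription of (23): it indexes time from zero, so that the impulse response is $h(n) = z^{n}$, takes the full derivative $n z^{n-1}$ of each sample, and checks the result against a central finite difference. The input samples and the value of the root are hypothetical.

```python
def state_sequence(z, s):
    """x(t) = z * x(t-1) + s(t), written as the convolution of s with
    the impulse response h(n) = z**n, n = 0, 1, 2, ..."""
    return [
        sum(s[t - n] * z ** n for n in range(t + 1))
        for t in range(len(s))
    ]


def state_derivative(z, s):
    """d x(t) / d z, obtained by convolving s with d h(n) / d z = n * z**(n-1)."""
    return [
        sum(s[t - n] * n * z ** (n - 1) for n in range(1, t + 1))
        for t in range(len(s))
    ]


def finite_difference(z, s, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    plus = state_sequence(z + eps, s)
    minus = state_sequence(z - eps, s)
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


s = [1.0, 0.5, -0.3, 0.8, 0.0, 0.2]  # hypothetical neuron input
z = 0.7                              # hypothetical root inside the unit circle
explicit = state_derivative(z, s)
numeric = finite_difference(z, s)
print(max(abs(e - n) for e, n in zip(explicit, numeric)))
```

Because the derivative is taken directly with respect to the root, the quantity being updated is the same one that decides stability.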

2.4. The Training Algorithm

A simple gradient descent method [23] is used to train the LRNN.

Γ_{p + 1} = Γ_{p} - η \cdot \nabla J

(24)

where $Γ$ is the set of all the parameters of the NN, either forward or feedback parameters, $p$ is the iteration index, and $η$ is the learning rate. The gradient is calculated with respect to all the independent parameters, considering that both the feedback coefficients $a_{j}$ and the coefficients of the impulse responses $C_{j}$ are univocally determined once the values of the roots $z_{j}$ have been established. It is worth noting that the values of the coefficients $C_{j}$ cannot affect the stability of the NN. The only parameters that affect stability are the roots $z_{j}$, so the iterative procedure applying (24) must be understood as a constrained one. This means that, when updating the set of parameters $Γ$, roots with a modulus greater than 1 must be avoided by truncating the increment or by suitably projecting the move. The use of more advanced procedures, even if possible, is beyond the scope of this study.

Two stopping criteria have been assumed: a maximum number of iterations and a maximum number of iterations without improvement.
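A constrained update of this kind, together with the two stopping criteria, can be sketched as follows. This is a toy illustration under stated assumptions: a single real root is updated, the projection is a radial rescaling onto a disc of radius `max_modulus` (one possible way of projecting the move), and the loss is a hypothetical quadratic whose unconstrained minimum lies outside the unit circle. The names `train`, `patience`, and `max_modulus` are not taken from the paper.

```python
def project_inside_unit_circle(z, max_modulus=0.99):
    """Scale a root back so that its modulus does not exceed max_modulus."""
    modulus = abs(z)
    return z if modulus <= max_modulus else z * (max_modulus / modulus)


def train(loss, grad, z0, eta=0.1, max_iter=1000, patience=20):
    """Projected gradient descent on one root, following update (24)."""
    z, best, stale, n_iter = z0, loss(z0), 0, 0
    # Stop on the iteration budget or after `patience` steps without improvement.
    while n_iter < max_iter and stale < patience:
        z = project_inside_unit_circle(z - eta * grad(z))
        n_iter += 1
        current = loss(z)
        if current < best - 1e-12:
            best, stale = current, 0
        else:
            stale += 1
    return z, n_iter


# Hypothetical loss with its unconstrained minimum at z = 1.5 (unstable).
def toy_loss(z):
    return (z - 1.5) ** 2


def toy_grad(z):
    return 2.0 * (z - 1.5)


z_final, iterations = train(toy_loss, toy_grad, z0=0.2)
print(z_final, iterations)
```

In this toy run the root moves toward the unstable minimum, is held at the boundary of the admissible disc, and the loop ends through the no-improvement criterion rather than the iteration budget.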

3. Results

To assess the effectiveness and versatility of the proposed algorithm, two distinct benchmark tests have been analyzed. The first test case, in the field of chemistry, focuses on the modeling of the Willamowski–Rössler reaction [24,25], a nonlinear dynamical system widely studied for its complex oscillatory behavior. The second test case pertains to power demand forecasting within a medium-voltage (MV) distribution network [26]. By evaluating the algorithm in these two different domains, the study aims to demonstrate its robustness and applicability across diverse scientific and engineering contexts.

3.1. Test Case 1: Willamowski–Rössler Reaction

This benchmark is widely known because it was the first case showing that deterministic chaos can be generated by a chemical reaction. The process consists of a multi-step catalytic reaction in an open system, which involves five species, divided between initiators and products, and three intermediates. The following equations describe the five steps of the process:

\begin{array}{c} A_{1} + X \underset{K_{- 1}}{\overset{K_{1}}{\rightleftharpoons}} 2 X \\ X + Y \underset{K_{- 2}}{\overset{K_{2}}{\rightleftharpoons}} 2 Y \\ A_{5} + Y \underset{K_{- 3}}{\overset{K_{3}}{\rightleftharpoons}} A_{2} \\ X + Z \underset{K_{- 4}}{\overset{K_{4}}{\rightleftharpoons}} A_{3} \\ A_{4} + Z \underset{K_{- 5}}{\overset{K_{5}}{\rightleftharpoons}} 2 Z \end{array}

(25)

where $A_{1}$, $A_{4}$, and $A_{5}$ are the initiators, $A_{2}$ and $A_{3}$ are the products, and $X$, $Y$, and $Z$ are the intermediates. By combining the five steps, the following overall reaction is obtained:

A_{1} + A_{4} + A_{5} \Leftrightarrow A_{2} + A_{3}

(26)

In an open system, the concentrations of both initiators and products are constant in nominal conditions, while the concentrations of the three intermediate species exhibit chaotic behavior. By assuming $X$, $Y$, and $Z$ as state variables, the phase space can be represented graphically. The evolution of the state variables can be studied by means of the following set of differential equations, in which each term follows from the corresponding step in (25):

\left\{ \begin{array}{l} \dot{X} = K_{1} X - K_{- 1} X^{2} - K_{2} X Y + K_{- 2} Y^{2} - K_{4} X Z + K_{- 4} \\ \dot{Y} = K_{2} X Y - K_{- 2} Y^{2} - K_{3} Y + K_{- 3} \\ \dot{Z} = - K_{4} X Z + K_{- 4} + K_{5} Z - K_{- 5} Z^{2} \end{array} \right.

(27)

In Figure 4, the phase space trajectory of the Willamowski–Rössler model, depicted in a three-dimensional coordinate system, is shown for the following reaction rates: $K_{1} = 30$, $K_{- 1} = 0.25$, $K_{2} = 1$, $K_{- 2} = 10^{- 4}$, $K_{3} = 10$, $K_{- 3} = 10^{- 3}$, $K_{4} = 1$, $K_{- 4} = 10^{- 3}$, $K_{5} = 16.5$, $K_{- 5} = 0.5$, and for the initial conditions $X_{0} = 0.21$, $Y_{0} = 0.01$, and $Z_{0} = 0.12$ [25]. The plotted trajectory, shown in blue, illustrates the evolution of the system’s state over time, revealing the characteristic nonlinear and oscillatory behavior of the model. The trajectory exhibits a complex structure with looping and spiraling patterns, which indicate the presence of non-trivial dynamics. Initially, the system follows a transient trajectory before settling into a more structured oscillatory regime.
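A trajectory of this kind can be generated with a short numerical integration. The Python sketch below writes the rate terms from the mass-action kinetics of the five steps in (25) and integrates them with a fixed-step fourth-order Runge–Kutta scheme. The integrator, the number of substeps, and all function names are assumptions made for illustration; the paper does not state which solver was used.

```python
# Reaction rates and initial conditions quoted in the text [25].
K1, Km1 = 30.0, 0.25
K2, Km2 = 1.0, 1e-4
K3, Km3 = 10.0, 1e-3
K4, Km4 = 1.0, 1e-3
K5, Km5 = 16.5, 0.5


def rhs(state):
    """Mass-action rates of the intermediates X, Y, Z from the steps in (25)."""
    x, y, z = state
    dx = K1 * x - Km1 * x * x - K2 * x * y + Km2 * y * y - K4 * x * z + Km4
    dy = K2 * x * y - Km2 * y * y - K3 * y + Km3
    dz = -K4 * x * z + Km4 + K5 * z - Km5 * z * z
    return (dx, dy, dz)


def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, scale):
        return tuple(s + scale * ki for s, ki in zip(state, k))

    k1 = rhs(state)
    k2 = rhs(shifted(k1, dt / 2.0))
    k3 = rhs(shifted(k2, dt / 2.0))
    k4 = rhs(shifted(k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state, duration=20.0, sample_time=0.005, substeps=5):
    """Sample the trajectory every sample_time seconds (assumed substepping)."""
    n_samples = round(duration / sample_time)
    trajectory = [state]
    for _ in range(n_samples):
        for _ in range(substeps):
            state = rk4_step(state, sample_time / substeps)
        trajectory.append(state)
    return trajectory


trajectory = simulate((0.21, 0.01, 0.12))
print(len(trajectory), trajectory[-1])
```

Since the system is chaotic, the individual samples depend on the solver and the step size, so only the qualitative shape of the attractor should be compared with Figure 4.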

In the chemical process, chaotic behavior is necessary to obtain an efficient mixing of the reacting species, avoiding spending a great quantity of energy. Therefore, it is important to forecast any variation in the parameters before the chaotic behavior of the system is lost.

Figure 5 illustrates the evolution of the system’s state resulting from a variation in one of the reaction rates. The trends depicted in Figure 5 have been obtained by setting $K_{1} = 29$. The results clearly demonstrate that the chaotic behavior characteristic of the system is entirely suppressed across all state variables after only a few seconds. This loss of chaotic dynamics highlights the sensitivity of the system’s behavior to changes in its parameters, as well as the stabilizing effect introduced by this specific value of $K_{1}$.

Three LRNNs have been trained, one for each state variable, to predict the value of the respective variable one time step ahead. The optimal network structure has been determined through a trial-and-error approach to maximize forecasting accuracy. The neural network used to perform the test has a 1-25-1 structure ($K = 25$), with only one delay ($r = 1$) for each hidden neuron and no delay in the output neuron. As an activation function, the hyperbolic tangent has been assigned to the hidden neurons, while the output neuron is linear. A sequence of 20 s of each state variable of the system in nominal conditions has been acquired with a sample time of 5 ms, obtaining sequences of 4000 samples, and these sequences have been used to train the neural networks. Figure 6 presents the evolution of the Mean Squared Error (MSE) observed during the training phase for variable $X$. The plot demonstrates that the MSE decreases steadily during training, reaching a value of approximately $0.003$ after around 350 iterations. This indicates that the adapted NN effectively approximates the chaotic behavior of the system, successfully learning and reproducing the complex dynamics inherent to the system with a high level of accuracy. Furthermore, as expected, by keeping the absolute value of the roots under control, the neural network does not exhibit unstable behavior during the adaptation.
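The data preparation for a one-step-ahead predictor can be sketched as follows. The pairing of each sample with its successor is an assumption about how the single-input, single-output network is fed, and the synthetic sine wave is only a stand-in for one state variable; the helper names are hypothetical.

```python
import math


def one_step_ahead_pairs(sequence):
    """Pair each sample with its successor: input i(t), target d(t) = i(t + 1)."""
    return list(zip(sequence[:-1], sequence[1:]))


def mean_squared_error(outputs, targets):
    """Average of the squared differences between outputs and targets."""
    return sum((u - d) ** 2 for u, d in zip(outputs, targets)) / len(targets)


# 20 s sampled every 5 ms gives 4000 samples, as in the text.
n_samples = round(20.0 / 0.005)
signal = [math.sin(0.01 * t) for t in range(n_samples)]  # stand-in signal

pairs = one_step_ahead_pairs(signal)
inputs = [i for i, _ in pairs]
targets = [d for _, d in pairs]

# Reference value: a "persistence" predictor that repeats the current sample.
print(len(pairs), mean_squared_error(inputs, targets))
```

The persistence error printed at the end is not a result from the paper; it is only a convenient reference against which a trained predictor can be judged.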

3.2. Test Case 2: Load Forecasting

To train and validate the LRNN employed for load forecasting, active power consumption data provided by the Commission for Energy Regulation (CER) [27] have been utilized. This dataset contains half-hourly energy consumption records collected from smart meters (SMs) installed in 6445 customers’ premises as part of the Electricity Smart Metering Customer Behavior Trials [28]. The data, gathered over an 18-month period (from 14 July 2009 to 31 December 2010) from various distribution network locations in Ireland, represent different customer categories, including residential users, small and medium enterprises (SMEs), and others. In this study, the model has been applied to forecast the power demand of residential loads. The same dataset has been used in [26], where a multilayer perceptron (MLP) NN has been trained to predict power demand one step ahead at each medium-voltage node with limited error, using both exogenous and historical measurements.

The dataset consists of approximately 25,000 data points (Figure 7), of which the first 2000 have been used for network training. As can be observed from Figure 7, there is a higher energy demand during the summer period and a lower demand during the winter period. In Figure 8, the energy demand over a week of data acquisition is shown. The observed trend exhibits a repetitive pattern, indicating a periodic variation in energy consumption. This periodicity suggests a consistent demand cycle, likely influenced by daily or operational routines. The data points, represented by markers, highlight fluctuations in power demand.

The optimized neural network adopts a 1-10-1 architecture ($K = 10$), with six delays ($r = 6$) for each hidden neuron and no delay in the output neuron. After about 40 iterations, the MSE reaches a value of approximately $10^{- 7}$. To compare the performance of the LRNN model with that of the MLP in [26], the Mean Absolute Error (MAE), defined as follows, has been used:

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |o_{i} - {\hat{o}}_{i}|

(28)

where $o_{i}$ is the actual load value, ${\hat{o}}_{i}$ is the corresponding predicted load value, and $n$ is the number of training or test samples. The forecasting performance is better when the MAE values are smaller.
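The metric in (28) amounts to a few lines of Python. The function name and the sample values below are illustrative and are not taken from the dataset.

```python
def mean_absolute_error(actual, predicted):
    """MAE of Equation (28): average absolute gap between load and forecast."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(o - p) for o, p in zip(actual, predicted)) / len(actual)


print(mean_absolute_error([1.0, 2.0, 4.0], [1.5, 2.0, 3.0]))  # 0.5
```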

Table 1 presents a comparison of the performance achieved using the proposed LRNN and the one implemented in [26]. As can be noted, the MLP model achieves lower error during training but exhibits a higher test error, suggesting potential overfitting. In contrast, the LRNN has a slightly higher training error but a lower test error, indicating better generalization.

Finally, in Figure 9, two plots illustrating the results of the load forecasting model implemented using the LRNN trained with the proposed algorithm are reported. The upper plot presents the power demand over a dataset of 8000 samples. A vertical dashed line separates the training and test phases, with the training set comprising the first 2000 samples and the test set extending beyond this point. The model successfully captures the underlying periodic patterns of the power demand, though some high-frequency variations appear more pronounced in the target data. The lower plot, labeled as the last 400 samples, focuses on the final segment of this portion of the dataset, providing a detailed comparison of the target and output values. The predicted values closely follow the actual power demand, with minimal deviation, indicating a high level of accuracy in the model’s forecasting capability. The periodic nature of the load is well preserved, further demonstrating the effectiveness of the proposed training algorithm in capturing and generalizing the temporal dependencies in the dataset.

4. Discussion and Conclusions

The present work introduces a new paradigm for the training of Locally Recurrent Neural Networks. Such networks are relevant because they combine the advantages of feedback with a relatively moderate computational complexity, since their dynamics are obtained by means of local linear structures, such as FIR and IIR filters. A wide body of literature confirms that this simplification does not reduce the potential of these networks with respect to networks with global feedback; on the contrary, better performance is achieved because they can be trained at a moderate computational cost. Nonetheless, open issues remain, owing to the implicit dependence on past samples and to the fact that stability cannot be established a priori. The method introduced in this paper addresses both issues. By extending Binet's formula for the calculation of the terms of the Fibonacci series, a method is presented that allows the terms of the impulsive response of the IIR structure to be calculated directly. Consequently, the effect of a change in the filter parameters can be estimated a priori. Furthermore, rather than in terms of feedback coefficients, the IIR filter is expressed in terms of polynomials, which makes it possible to evaluate a priori the effect of a parameter change on stability. The efficiency of the method has been evaluated by developing a predictor for the chaotic behavior of a chemical reactor and by addressing a load forecasting problem. The aim of the present paper was to present the theoretical basis of the method and to demonstrate its efficiency in solving non-trivial problems. The method has been applied to several benchmarks retrieved from the literature, and no drawbacks or limitations have emerged in any of them. However, an extensive comparison with other methods from the literature was beyond the scope of the paper and will be the subject of future work.
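The idea of computing the impulsive response directly, rather than recursively, can be illustrated with the sketch below. It relies on the textbook partial-fraction result for an all-pole filter with distinct poles $p_1, \ldots, p_r$, namely $h[n] = \sum_{j} p_j^{\,n+r-1} / \prod_{m \neq j} (p_j - p_m)$, which reduces to Binet's formula when the poles are the golden ratio and its conjugate. This is not necessarily the exact formulation adopted in the paper; the function names are hypothetical, and the stable pole set is an arbitrary example.

```python
import math


def coefficients_from_poles(poles):
    """Expand prod_j (z - p_j) and return the feedback gains a_1..a_r."""
    poly = [1.0]  # coefficients of z^r, z^(r-1), ..., z^0
    for p in poles:
        poly = [c - p * prev for c, prev in zip(poly + [0.0], [0.0] + poly)]
    # Denominator is z^r - a_1 z^(r-1) - ... - a_r, hence the sign change.
    return [-c for c in poly[1:]]


def impulse_response_recursive(a, n_terms):
    """h[n] = sum_i a_i * h[n-i], driven by a unit impulse at n = 0."""
    h = []
    for n in range(n_terms):
        value = 1.0 if n == 0 else 0.0
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                value += a_i * h[n - i]
        h.append(value)
    return h


def impulse_response_closed_form(poles, n):
    """Direct evaluation of h[n] from the poles (distinct poles assumed)."""
    r = len(poles)
    total = 0.0
    for j, p_j in enumerate(poles):
        denom = 1.0
        for m, p_m in enumerate(poles):
            if m != j:
                denom *= p_j - p_m
        total += p_j ** (n + r - 1) / denom
    return total


def is_stable(poles):
    """A causal IIR filter is stable when all poles lie inside the unit circle."""
    return all(abs(p) < 1.0 for p in poles)


# Second-order filter with unitary recursive weights: Fibonacci impulse response.
phi = (1 + math.sqrt(5)) / 2
psi = (1 - math.sqrt(5)) / 2
fib_recursive = impulse_response_recursive([1.0, 1.0], 10)
fib_direct = [impulse_response_closed_form([phi, psi], n) for n in range(10)]

# Arbitrary stable third-order example, parameterized directly by its poles.
stable_poles = [0.5, -0.3, 0.8]
a_stable = coefficients_from_poles(stable_poles)
h_recursive = impulse_response_recursive(a_stable, 20)
h_direct = [impulse_response_closed_form(stable_poles, n) for n in range(20)]
```

In this sketch the Fibonacci filter is reported as unstable, since the golden ratio exceeds one in magnitude, whereas the third-order example is stable by construction. Parameterizing the filter by its poles rather than by its feedback gains is what makes the stability check immediate.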

Author Contributions

Conceptualization, A.M.; Methodology, S.C. and A.M.; Software, S.C. and A.M.; Validation, S.C.; Formal Analysis, A.M.; Investigation, S.C.; Resources, S.C.; Data Curation, S.C.; Writing—Original Draft Preparation, A.M.; Writing—Review and Editing, S.C.; Visualization, S.C.; Supervision, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been developed within the project funded under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.3 – Call for tender No. 1561 of 11.10.2022 of Ministero dell’Università e della Ricerca (MUR); funded by the European Union–NextGenerationEU. Award Number: Project Code PE0000021, Concession Decree No. 1561 of 11.10.2022 adopted by Ministero dell’Università e della Ricerca (MUR), CUP F53C22000770007, Project title “Network 4 Energy Sustainable Transition–NEST”.

Data Availability Statement

Research data are readily provided upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Han, Y.; Huang, G.; Song, S.; Yang, L.; Wang, H.; Wang, Y. Dynamic Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7436–7456.
2. Waibel, A. Modular Construction of Time-Delay Neural Networks for Speech Recognition. Neural Comput. 1989, 1, 39–46.
3. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature Verification using a “Siamese” Time Delay Neural Network. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 29 November–2 December 1993; Morgan-Kaufmann: San Francisco, CA, USA, 1993; Volume 6.
4. Cao, J.; Wang, J. Global asymptotic and robust stability of recurrent neural networks with time delays. IEEE Trans. Circuits Syst. Regul. Pap. 2005, 52, 417–426.
5. Huang, W.; Yan, C.; Wang, J.; Wang, W. A time-delay neural network for solving time-dependent shortest path problem. Neural Netw. 2017, 90, 21–28.
6. Nerrand, O.; Roussel-Ragot, P.; Personnaz, L.; Dreyfus, G.; Marcos, S. Neural Networks and Nonlinear Adaptive Filtering: Unifying Concepts and New Algorithms. Neural Comput. 1993, 5, 165–199.
7. Htike, K.K.; Khalifa, O.O. Rainfall forecasting models using focused time-delay neural networks. In Proceedings of the International Conference on Computer and Communication Engineering (ICCCE’10), Kuala Lumpur, Malaysia, 11–12 May 2010; pp. 1–6.
8. Grossberg, S. Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Netw. 1988, 1, 17–61.
9. Haykin, S.S. Adaptive Filter Theory; Pearson Education: London, UK, 2002; ISBN 978-81-317-0869-9.
10. Widrow, B.; Glover, J.R.; McCool, J.M.; Kaunitz, J.; Williams, C.S.; Hearn, R.H.; Zeidler, J.R.; Dong, E., Jr.; Goodlin, R.C. Adaptive noise cancelling: Principles and applications. Proc. IEEE 1975, 63, 1692–1716.
11. Pedro, J.C.; Maas, S.A. A comparative overview of microwave and wireless power-amplifier behavioral modeling approaches. IEEE Trans. Microw. Theory Tech. 2005, 53, 1150–1163.
12. Krotov, D. A new frontier for Hopfield networks. Nat. Rev. Phys. 2023, 5, 366–367.
13. Jordan, M.I. Chapter 25 – Serial Order: A Parallel Distributed Processing Approach. In Advances in Psychology; Donahoe, J.W., Packard Dorsel, V., Eds.; North-Holland Publishing Company: Amsterdam, The Netherlands, 1997; Volume 121, pp. 471–495.
14. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211.
15. Back, A.D.; Tsoi, A.C. FIR and IIR Synapses, a New Neural Network Architecture for Time Series Modeling. Neural Comput. 1991, 3, 375–385.
16. Tsoi, A.C. Recurrent neural network architectures: An overview. In Adaptive Processing of Sequences and Data Structures: International Summer School on Neural Networks “E.R. Caianiello” Vietri sul Mare, Tutorial Lectures, Salerno, Italy, 6–13 September 1997; Giles, C.L., Gori, M., Eds.; Springer: Berlin/Heidelberg, Germany, 1998; pp. 1–26. ISBN 978-3-540-69752-7.
17. Campolucci, P.; Piazza, F. Intrinsic stability-control method for recursive filters and neural networks. IEEE Trans. Circuits Syst. II Analog Digit. Signal Process. 2000, 47, 797–802.
18. Werbos, P.J. Backpropagation through time: What it does and how to do it. Proc. IEEE 1990, 78, 1550–1560.
19. Campolucci, P.; Uncini, A.; Piazza, F. Causal back propagation through time for locally recurrent neural networks. In Proceedings of the 1996 IEEE International Symposium on Circuits and Systems, Circuits and Systems Connecting the World, ISCAS 96, Atlanta, GA, USA, 12 May 1996; Volume 3, pp. 531–534.
20. Campolucci, P.; Uncini, A.; Piazza, F. Fast adaptive IIR-MLP neural networks for signal processing applications. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA, 9 May 1996; Volume 6, pp. 3529–3532.
21. Montisci, A. A free from local minima algorithm for training regressive MLP neural networks. arXiv 2023, arXiv:2308.11532.
22. Carini, A.; Mathews, V.J.; Sicuranza, G.L. Sufficient stability bounds for slowly varying discrete-time recursive linear filters. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 21–24 April 1997; Volume 3, pp. 1877–1880.
23. Battiti, R. First- and Second-Order Methods for Learning: Between Steepest Descent and Newton’s Method. Neural Comput. 1992, 4, 141–166.
24. Willamowski, K.-D.; Rössler, O.E. Irregular Oscillations in a Realistic Abstract Quadratic Mass Action System. Z. Für Naturforschung A 1980, 35, 317–318.
25. Niu, H.; Wang, H.; Zhang, Q. The Chaos Anti-control in the Willamowski-Rössler Reaction. In Proceedings of the 2010 International Workshop on Chaos-Fractal Theories and Applications, Kunming, China, 29–31 October 2010; pp. 87–91.
26. Carcangiu, S.; Fanni, A.; Pegoraro, P.A.; Sias, G.; Sulis, S. Forecasting-Aided Monitoring for the Distribution System State Estimation. Complexity 2020, 2020, 4281219.
27. Irish Social Science Data Archive. Home, Irish Social Science Data Archive. Available online: https://www.ucd.ie/issda/data/commissionforenergyregulationcer/ (accessed on 4 February 2025).
28. Martin, G. Electricity Smart Metering Customer Behaviour Trials Findings Report; Technical Report; CER Commission for Energy Regulation: Calgary, AB, USA, 2011; pp. 1–146.

Figure 1. Structure of the LRNN. In (a) the global structure is shown, which is identical to that of an MLP. The symbols between curly brackets are time sequences, while the values without brackets indicate the weights. In (b) the internal structure of a hidden neuron is represented. The blocks with $z^{-1}$ constitute the delay line. The bank of $b_{i}$ gains contains the weights of the FIR filter, while the $a_{i}$ gains are the IIR weights. The gain $b_{0}$ represents the a-dynamic (delay-free) connection with the output, while $f$ is the nonlinear activation function of the neuron.

Figure 2. IIR filter of the second order with unitary recursive weights, whose impulsive response corresponds to the Fibonacci series.

Figure 3. IIR filter with a generic memory depth of $r$.

Figure 4. Phase space trajectory of the Willamowski–Rössler model.

Figure 5. State evolution in the presence of a 3% deviation of the reaction rate $K_{1}$.

Figure 6. Prediction error of the state variable X during the adaptation of the NN.

Figure 7. Trend of residential load $res_{1}$ over a period of 18 months (from 14 July 2009 to 31 December 2010), obtained by aggregating the energy consumption of 523 randomly chosen residential loads.

Figure 8. Power demand over a week of data acquisition.

Figure 9. Load forecasting results. The upper plot shows the overall power demand prediction, with a clear distinction (vertical dark dashed line) between the training and test phases. The lower plot focuses on the last 400 samples. The target (true values) and output (model predictions) are depicted in blue and orange, respectively.

Table 1. MAE performances.

Model	Train MAE [kW]	Test MAE [kW]
MLP [26]	10.7	16.9
LRNN (present work)	12.4	15.9

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
