1. Introduction
Machine learning techniques applied to dynamic systems should themselves have a dynamic structure, that is, time should appear explicitly as a variable in the algebraic structure of the model. More specifically, in the case of Artificial Neural Networks (NNs), the dynamics of the model are obtained by including delay blocks in the structure [1]. The dynamics can be introduced either by maintaining the feedforward structure of the network (Time Delay Neural Networks, TDNNs [2,3,4,5]) or by introducing feedback [6]. In the latter case, the delays are mandatory; otherwise, the output of a neuron would depend on itself at the same instant and could not be computed. Both feedforward and feedback neural networks are suitable for modeling dynamic systems; thus, the choice of paradigm depends on the requirements of the problem to be addressed and on the available resources.

The feedforward paradigm has the advantage of leveraging the same algorithms used for training static NNs. In particular, when the delay blocks are placed only in a delay line at the input (Focused Time Delay Neural Networks, FTDNNs [7]), the downstream part of the NN is structured as a static network, so any static paradigm can be implemented. The training strategy depends on whether the system to be modeled is stationary or non-stationary. In the first case, the whole evolution of the system, taken as the training set, can be used to train the NN iteratively in batch mode, in the same way static NNs are trained. If the physical system is not stationary, batch mode is no longer suitable, and the NN model must be adapted dynamically during the evolution of the system [8]. To this end, the training set at each iteration consists only of the last few samples of the physical signal, while the sensitivity of the model with respect to past samples tends to vanish with time. From this point of view, the training strategy is the same as that of adaptive filters [9,10], with the added value of exploiting nonlinearity.

From a topological point of view, FTDNNs can be seen as the cascade of a Finite Impulse Response (FIR) linear filter and a Multi-Layer Perceptron [11]. Like FIR filters, these NNs have a short-term memory, represented by the samples stored in the delay line, and a long-term memory, represented by the weights of the connections. FIR filters take their name from the fact that the impulse response has a duration equal to the number of delays in the delay line (memory depth), after which it is null. This implies that a proper number of delays must be defined a priori to guarantee that the dynamics of the system under study are properly modeled. Since these dynamics are unknown a priori, a trial-and-error procedure is adopted to design the delay line.

As an alternative, feedback can be introduced in the structure of the NN [12,13,14]. Different strategies are used depending on whether the feedback connects each pair of neurons in the network, connects neurons belonging to different layers, feeds the output of the NN back to the input, or is localized within the neurons. This last category is called Locally Recurrent Neural Networks (LRNNs [15,16]) and, for many applications, it represents the best compromise between performance and computational burden. The advantage of feedback is that the memory depth can be adjusted by modifying a parameter rather than by changing the topology of the NN. The drawbacks are a larger computational cost with respect to feedforward NNs and the risk of instability [4,17]. LRNNs limit the computational cost, but the stability issue remains.

The global structure of these networks is the same as that of a Multi-Layer Perceptron (MLP), but each neuron internally contains an Infinite Impulse Response (IIR) linear filter, optionally combined with an FIR filter, with the nonlinearity placed downstream of the linear filter. IIR filters have a delay line like FIR filters, but the taps are fed back to the input rather than forwarded to the output. They take their name from the fact that the duration of the impulse response is theoretically infinite, even if the response becomes negligible after a sufficiently long time. The main advantage of IIR filters is that the vanishing time of the impulse response can be set by changing the feedback parameters while keeping the topology unchanged. Unfortunately, some values of these parameters make the impulse response unstable; therefore, measures are needed to prevent this.
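The difference between the two filter types can be illustrated with a minimal Python sketch. The function names, the tap values, and the feedback gain are illustrative assumptions, not taken from the paper. A three-tap FIR filter and a first-order IIR filter are driven by a unit impulse: the FIR response is null after three samples, whereas the IIR response decays geometrically without ever becoming exactly zero.

```python
def fir_response(taps, x):
    """Feedforward delay line: y[t] = sum_k taps[k] * x[t-k]."""
    return [sum(b * x[t - k] for k, b in enumerate(taps) if t - k >= 0)
            for t in range(len(x))]


def iir_response(feedback, x):
    """Delay line fed back to the input: s[t] = x[t] + sum_k feedback[k] * s[t-1-k]."""
    s = []
    for t in range(len(x)):
        acc = x[t]
        for k, a in enumerate(feedback):
            if t - 1 - k >= 0:
                acc += a * s[t - 1 - k]
        s.append(acc)
    return s


# Unit impulse followed by zeros.
impulse = [1.0] + [0.0] * 11

h_fir = fir_response([0.5, 0.3, 0.2], impulse)  # memory depth fixed by the number of taps
h_iir = iir_response([0.5], impulse)            # memory depth set by the feedback gain

print("FIR:", h_fir)
print("IIR:", h_iir)
```

In the FIR case a longer memory requires more taps, that is, a change of topology; in the IIR case it only requires a feedback gain closer to 1 in modulus.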
The standard algorithm for training feedback NNs is Backpropagation Through Time (BPTT [18]), which can be adapted to any feedback topology. This algorithm has been adapted to LRNNs in [17,18,19,20], essentially by unfolding the feedback loop a number of times sufficient to consider the impulse response extinct. This measure makes it possible to adapt the network even though the feedback structure is non-causal, and to use for training the same procedures defined for feedforward structures. Nonetheless, some issues remain: the number of times the loop should be unfolded is unknown a priori, so the assumed value could be too small, giving rise to interference among different impulse responses, or too large, in which case it would needlessly increase the computational cost. Furthermore, the stability of the impulse response remains a major open issue [17].
In the present work, a new training algorithm is presented, which overcomes the problems of training and stability at the same time, thus extending the applicability of LRNNs. The paper is organized as follows. Section 2 introduces the neural model and provides a detailed description of the newly proposed training algorithm. In Section 3, the method is applied to two different forecasting problems: the prediction of a chaotic time series and the estimation of power demand. Finally, Section 4 presents the conclusions, highlighting the main findings and discussing potential implications.
2. Neural Model
The global structure of an LRNN is like that of an MLP, where neurons are organized in layers, but dynamic properties are achieved using neurons with internal feedback. In Figure 1, the assumed structure of the NN is shown. For the sake of simplicity and without loss of generality, the NN has a single-input, single-output structure with only one hidden layer, where the dynamic element of the network is concentrated, and a linear activation function is assigned to the output neuron. In the rest of the paper, we will refer to this neural structure, since it is simple and sufficient for the purposes of the paper.
As shown in
Figure 1,
represents the input vector signal,
is the weights matrix of the links between the input neuron and the hidden layer,
is the number of the hidden neurons,
and
are, respectively, the input and the output vector sequences of the hidden layer,
is the weights matrix of the links between the hidden layer and the output neuron, and
is the output vector signal.
The dynamic part of the network is an ARMA filter, in which the feedback gains form the IIR part and the forward gains form the FIR part. The FIR part is a feedforward structure, for which the literature provides a range of efficient methods; therefore, this work will focus only on the IIR part.
The output
of the NN is calculated as
Let us consider the calculation of the state
of the
-th hidden neuron
where
is the vector of feedback gains,
indicates the transpose operator,
is the vector state of the delay line with depth
, at the time
, which includes samples of the state sequence from
to
,
is the current input of the neuron. By referring to
Figure 1b, the output of the
hidden neuron is calculated as:
where
is the vector of forward gains,
is the static (non-dynamic) weight, and
is the activation function of the neuron. Equation (3) makes it possible to write a dynamic loss function to be used for training (or adapting) the NN. Let
be the desired output sequence of the NN as an answer to the input sequence
. A loss function can be defined as the mean squared error of the output with respect to the desired sequence:
where
is the duration of the sequence for which the NN must be trained. The simplest procedure to minimize (4) is based on the gradient of
calculated with respect to all the parameters of the NN. Referring to
Figure 1, no difficulties arise in computing the derivatives of
with respect to the global parameters
, as well as the internal parameters
, with
. Concerning the parameters
, the gradient method is not convenient, as the optimal solution corresponds to the regression hyperplane of training samples in the product space
, the former being the output space of the hidden layer and the latter the output space of the neural network. Therefore, once all the parameters upstream of the hidden layer are fixed, the optimal set of weights of the output connections is uniquely determined [21]. However, calculating the derivatives with respect to the internal parameters
, with
is troublesome, as the derivative of a given sample depends on all the previous ones:
The last derivative cannot be expressed explicitly because, owing to the feedback, each component of the state vector depends on the entire previous sequence. Consequently, the derivative of the loss function can be calculated only offline. Nonetheless, many applications require the online adaptation of the network, so (5) should be expressed in explicit terms, and the generic term of the impulse response of the IIR filter should be computable. This issue was addressed in [19] by introducing the Causal Backpropagation Through Time (CBPTT) algorithm, which consists of establishing a limit on the past time considered when calculating the derivatives. In practice, the feedback is unfolded, and an equivalent feedforward structure (FIR filter) is used to represent it. The impulse response of a feedback (IIR) filter is theoretically infinite, but in practice it is appreciable only for a few time constants. Therefore, the equivalent FIR filter must have a memory depth equal to the time over which the impulse response of the IIR filter is appreciable. This approach allows one to approximate the derivative (5) without considering the whole past sequence.
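The effect of this truncation can be illustrated with a minimal sketch, assuming a first-order feedback filter s[t] = a·s[t-1] + u[t]. This is a simplification chosen for illustration; the helper names and numerical values are hypothetical and are not taken from [19]. The feedback loop is unfolded into an FIR filter whose taps are the first samples of the impulse response, and the output is compared with that of the exact recursion for two feedback gains.

```python
def iir_impulse_response(a, n):
    """First n samples of h[t] = a*h[t-1] + delta[t], i.e. h[t] = a**t."""
    h, prev = [], 0.0
    for t in range(n):
        prev = a * prev + (1.0 if t == 0 else 0.0)
        h.append(prev)
    return h


def iir_filter(a, u):
    """Exact recursion s[t] = a*s[t-1] + u[t] with zero initial state."""
    s, prev = [], 0.0
    for x in u:
        prev = a * prev + x
        s.append(prev)
    return s


def unfolded_fir_filter(a, u, depth):
    """Equivalent FIR filter obtained by unfolding the loop `depth` times."""
    taps = iir_impulse_response(a, depth)
    return [sum(taps[k] * u[t - k] for k in range(depth) if t - k >= 0)
            for t in range(len(u))]


def max_error(a, depth, u):
    """Largest deviation between the exact and the unfolded filter outputs."""
    exact = iir_filter(a, u)
    approx = unfolded_fir_filter(a, u, depth)
    return max(abs(e - p) for e, p in zip(exact, approx))


u = [1.0] * 200                      # unit step input
err_fast = max_error(0.5, 10, u)     # fast-decaying response: depth 10 is ample
err_slow = max_error(0.95, 10, u)    # slow-decaying response: depth 10 is far too short

print(err_fast, err_slow)
```

With the same memory depth, the approximation is accurate for the smaller gain and poor for the larger one, which is why the depth cannot be chosen independently of the feedback coefficient.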
Nevertheless, two main issues persist. The first one is that, during training, the appropriate memory depth to be assigned to the equivalent FIR filter is unknown a priori, as it depends on the updated feedback coefficient. Consequently, an initial estimate is assigned to the memory depth of the FIR filter, which must subsequently be validated after the feedback coefficient has been updated. To the best of our knowledge, no general criterion exists for determining this parameter, which means that multiple iterations may be required to find an appropriate value. The second issue concerns the stability of the network following each update of the parameters [17,22]. The stability depends on the poles of the transfer function of the IIR filter, which, in turn, are influenced by the feedback coefficients. Since the parameters updated during network training are the feedback coefficients, the training process lacks direct control over the poles and, consequently, over the stability of the system.
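The indirect link between feedback coefficients and stability can be made concrete with a small sketch for a second-order filter. The recursion s[t] = a1·s[t-1] + a2·s[t-2] + u[t], the function names, and the numerical values are assumptions made for illustration. The poles are the roots of the characteristic polynomial x² − a1·x − a2, and the filter is stable only if both lie strictly inside the unit circle.

```python
import cmath


def poles_second_order(a1, a2):
    """Roots of x**2 - a1*x - a2 = 0, the characteristic polynomial of
    s[t] = a1*s[t-1] + a2*s[t-2] + u[t]."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + disc) / 2.0, (a1 - disc) / 2.0


def is_stable(a1, a2):
    """True when both poles have modulus strictly less than 1."""
    return all(abs(p) < 1.0 for p in poles_second_order(a1, a2))


print(is_stable(0.5, 0.3))   # both poles inside the unit circle
print(is_stable(0.5, 0.6))   # a modest change of one coefficient moves a pole outside
print(is_stable(1.0, 1.0))   # unit feedback weights: one pole is the golden ratio
```

An update rule acting on a1 and a2 has no built-in notion of the unit circle, so a gradient step that looks small in coefficient space may cross the stability boundary.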
The method presented in this paper provides a solution to both issues. First, the impulse response of the IIR filter is expressed in exact terms and in explicit form. Second, the training is performed by directly updating the parameters that control the stability of the network. This is obtained by generalizing Binet’s formula for the Fibonacci sequence, as described in the next subsection.
2.1. The Fibonacci Sequence and Binet’s Formula
The Fibonacci sequence $F_n$ (1, 1, 2, 3, 5, 8, ...) is a numeric sequence, which owes its popularity to the fact that it reflects a ubiquitous scheme of growth in nature. The generic term of the sequence is described analytically by the following recurrence (finite-difference) equation:
$$F_n = F_{n-1} + F_{n-2}, \qquad F_1 = F_2 = 1 \tag{6}$$
As can be readily demonstrated, the Fibonacci sequence can also be interpreted as the impulse response of a second-order IIR filter with unit recursive weights, as depicted in Figure 2. The structure of this filter closely resembles the initial portion of the internal architecture of a hidden neuron, as illustrated in Figure 1b.
As can be seen, Equation (6) defines the sequence implicitly, as each term is the sum of the two preceding terms. Fortunately, Jacques Philippe Marie Binet (1786–1858) provided a formula that allows one to calculate any term of the Fibonacci sequence directly, without calculating the previous ones. The n-th term of the Fibonacci sequence, denoted as $F_n$, can be expressed explicitly using Binet’s formula:
$$F_n = \frac{\varphi^n - \psi^n}{\sqrt{5}} \tag{7}$$
where $\varphi = (1+\sqrt{5})/2$ represents the golden ratio and $\psi = (1-\sqrt{5})/2$ is its conjugate. Here, the development of that formula is briefly summarized.
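Binet’s formula can be checked numerically with a short sketch. The indexing F_1 = F_2 = 1 is assumed, in agreement with the first two values (1, 1) used in the demonstration, and the function names are illustrative. The first function applies the recurrence term by term, while the second evaluates the closed form and rounds the result to absorb floating-point error.

```python
import math


def fibonacci_iterative(n):
    """n-th Fibonacci number from the recurrence, with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a


def fibonacci_binet(n):
    """n-th Fibonacci number from Binet's closed form (rounded to the nearest integer)."""
    sqrt5 = math.sqrt(5.0)
    phi = (1.0 + sqrt5) / 2.0   # golden ratio
    psi = (1.0 - sqrt5) / 2.0   # its conjugate
    return round((phi ** n - psi ** n) / sqrt5)


# Both functions are expected to agree over the tested range of indices.
for n in range(1, 41):
    assert fibonacci_iterative(n) == fibonacci_binet(n)
```

The closed form needs no previous terms, which is the property exploited later for the impulse response of the IIR filter. For very large indices the floating-point evaluation loses precision, so the comparison is limited here to the first 40 terms.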
Demonstration of Binet’s formula. As said before, the Fibonacci sequence is defined recursively as:
$$F_n = F_{n-1} + F_{n-2}$$
Let us assume that an explicit function, which provides the terms of the Fibonacci series, exists, and for the n-th term, it has the general expression:
$$F_n = c\,x^n \tag{8}$$
with $c$ and $x$ constant values to be determined. By substituting (8) in (6) the following expression is obtained
$$c\,x^n = c\,x^{n-1} + c\,x^{n-2} \tag{9}$$
from which:
$$c\,x^{n-2}\left(x^2 - x - 1\right) = 0 \tag{10}$$
Equation (10) has two trivial solutions ($c = 0$ and $x = 0$), which must be excluded because they could not generate the series, and two non-trivial solutions, namely the roots of the polynomial within brackets. These two solutions are:
$$x_{1,2} = \frac{1 \pm \sqrt{5}}{2} \tag{11}$$
It is worth noting that the first root $x_1$ in (11) is the golden ratio value. Given that the recurrence relation (9) is linear, the sum of the two solutions is also a solution. The sought function can therefore be obtained as a linear combination of the two solutions corresponding to the two roots $x_1$ and $x_2$:
$$F_n = c_1 x_1^n + c_2 x_2^n \tag{12}$$
with $c_1$ and $c_2$ to be determined. To this end, we can impose the correspondence with two arbitrary values of the series, for example, the first two: 1, 1.
$$\begin{cases} c_1 x_1 + c_2 x_2 = 1 \\ c_1 x_1^2 + c_2 x_2^2 = 1 \end{cases} \tag{13}$$
The solution of the system (13) is
$$c_1 = \frac{1}{\sqrt{5}}, \qquad c_2 = -\frac{1}{\sqrt{5}}$$
from which the following expression of Binet’s formula comes:
$$F_n = \frac{1}{\sqrt{5}}\left[\left(\frac{1+\sqrt{5}}{2}\right)^n - \left(\frac{1-\sqrt{5}}{2}\right)^n\right]$$
2.2. Exploitation of Binet’s Formula to Calculate the IIR Impulse Response
The procedure used to derive Binet’s formula can be applied to calculate the impulse response of the IIR filter that represents the AR part of the neuron, by considering the discrete time instead of the index of the Fibonacci sequence, and a generic number of delays instead of two. When the input of the neuron is a unit impulse, (2) gives the impulse response.
Let there be an IIR filter, such as the one described in Figure 3, whose impulse response is:
Let us assume that a function exists, which provides the arbitrary term of its impulsive response.
By replacing (16) within the generic expression in (15), it results in:
and finally:
The non-trivial solutions of (18) are the r roots of the polynomial between brackets. Therefore, by combining the r solutions of the polynomial in (18), it is possible to derive the following general solution, which contains r degrees of freedom:
with the r coefficients of the combination to be determined. To this end, the first r samples of the impulse response are calculated and imposed in the following system of linear equations:
The solution of system (20) gives the coefficients of the function (19).
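The procedure of this subsection can be sketched in Python under stated assumptions: the filter obeys h[t] = a_1·h[t-1] + ... + a_r·h[t-r] + δ[t] with zero initial state, the r roots are distinct, and all names (feedback_from_roots, closed_form_coefficients, and so on) are hypothetical. Starting from the roots, the sketch builds the feedback coefficients, computes the first r samples of the impulse response by recursion, solves the linear system for the combination coefficients, and compares the closed form with the recursion.

```python
def feedback_from_roots(roots):
    """Coefficients a_1..a_r such that x**r - a_1*x**(r-1) - ... - a_r has the given roots."""
    poly = [1.0]                                   # monic polynomial, highest power first
    for x in roots:                                # multiply the polynomial by (X - x)
        poly = [p - x * q for p, q in zip(poly + [0.0], [0.0] + poly)]
    return [-p for p in poly[1:]]


def impulse_response(a, n):
    """First n samples of h[t] = sum_i a[i-1]*h[t-i] + delta[t], zero initial state."""
    h = []
    for t in range(n):
        acc = 1.0 if t == 0 else 0.0
        for i, ai in enumerate(a, start=1):
            if t - i >= 0:
                acc += ai * h[t - i]
        h.append(acc)
    return h


def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting (also works with complex entries)."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def closed_form_coefficients(roots, a):
    """Impose the first r samples of the impulse response on sum_i c_i * x_i**t."""
    r = len(roots)
    h_first = impulse_response(a, r)
    A = [[x ** t for x in roots] for t in range(r)]
    return solve_linear(A, h_first)


def closed_form(c, roots, t):
    """Explicit expression of the t-th sample of the impulse response."""
    return sum(ci * x ** t for ci, x in zip(c, roots))


roots = [0.9, -0.5, 0.3]                 # illustrative, distinct, inside the unit circle
a = feedback_from_roots(roots)
c = closed_form_coefficients(roots, a)
h = impulse_response(a, 30)
max_gap = max(abs(closed_form(c, roots, t) - h[t]) for t in range(30))
print(a, max_gap)
```

Repeated roots would make the linear system singular and require additional terms in the general solution; that case is outside the scope of this sketch.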
2.3. Derivative of the Loss Function with Respect to the Feedback Parameters
Equation (5) describes the derivative of the loss function with respect to the generic feedback parameter. As remarked in the previous sections, such derivatives require the sequence of all the previous samples; therefore, an explicit calculation is impossible unless a limit is imposed on the duration of the impulse response. As said before, the vanishing time of the impulse response depends on the very feedback parameters being calculated, so any such assumption is subject to uncertainty. The procedure described in Section 2.2 solves this problem. The IIR filter is a linear system; therefore, its response to a generic input sequence can be expressed as the convolution product between the input signal and the impulse response:
Given the impulse response (19) and denoting with
the convolution product, the sequence of the state variable
can be expressed as
Equation (21) allows us to calculate the derivatives with respect to the roots
rather than the feedback coefficients
. This makes it possible to take control of the stability of the NN. In fact, as the state of the neurons depends on the roots
, if they are constrained to have a modulus less than 1, the stability of the network is guaranteed regardless of the other parameters. Therefore, it is convenient to train the network by adapting the roots of the polynomials and then calculating the corresponding feedback parameters, which are the coefficients of the polynomials having the
as roots. Equation (5) to calculate the derivatives of the loss function with respect to the feedback parameters is substituted by the following one:
where
is the derivative of the activation function of the hidden layer. From (19), it follows that
is a polynomial in
and so only one term of its derivative is nonzero. The derivative can finally be written as:
The convolution product in (23) implies that a limit must be assumed for the duration of the impulse response but, because the roots are used as parameters, this limit can be established a priori.
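A minimal sketch of this idea is given below for the first-order case only, where the impulse response is exactly h[k] = x^k for a single root x. This is a simplifying assumption, and the names and tolerances are hypothetical. The memory depth is fixed a priori from the modulus of the root, the derivative of the state with respect to the root is obtained by convolving the input with k·x^(k-1), and the result is compared with a central finite difference of the exact recursion.

```python
import math


def memory_depth(root, tol=1e-10):
    """Smallest K such that |root|**K <= tol (requires 0 < |root| < 1)."""
    return math.ceil(math.log(tol) / math.log(abs(root)))


def state(root, u):
    """Exact recursion s[t] = root*s[t-1] + u[t] with zero initial state."""
    s, prev = [], 0.0
    for x in u:
        prev = root * prev + x
        s.append(prev)
    return s


def state_derivative(root, u, depth):
    """d s[t] / d root as the convolution of u with k * root**(k-1), truncated at depth."""
    return [sum(k * root ** (k - 1) * u[t - k] for k in range(1, depth) if t - k >= 0)
            for t in range(len(u))]


u = [math.sin(0.3 * t) for t in range(200)]
root, eps = 0.8, 1e-6
K = memory_depth(root)                              # known before any simulation
analytic = state_derivative(root, u, K)
numeric = [(p - m) / (2 * eps)
           for p, m in zip(state(root + eps, u), state(root - eps, u))]
max_gap = max(abs(x - y) for x, y in zip(analytic, numeric))
print(K, max_gap)
```

Since the depth depends only on the modulus of the root and on the chosen tolerance, it can be recomputed cheaply after every update of the root.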
2.4. The Training Algorithm
A simple gradient descent method [23] is used to train the LRNN.
where
is the set of all the parameters of the NN, either forward or feedback parameters,
is the iteration index,
is the learning rate. The gradient is calculated with respect to all the independent parameters, considering that both the feedback coefficients
and the coefficients of the impulsive responses
are uniquely determined once the values of the roots
have been established. It is worth noting that the values of the coefficients
cannot affect the stability of the NN. The only parameters that affect stability are the roots
, so the iterative procedure applying (24) must be understood as a constrained one. This means that, when updating the set of parameters
, roots with a modulus greater than 1 must be avoided by truncating the increment or by suitably projecting the update. The use of more advanced procedures, even if possible, is beyond the scope of this study.
Two stopping criteria are used: a maximum number of iterations and a maximum number of consecutive iterations without improvement.
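The constrained update and the two stopping criteria can be sketched on a deliberately reduced problem. The sketch assumes a single linear first-order neuron whose only trainable parameter is the root, a synthetic target generated with root 0.7, and hypothetical values for the learning rate, the bound on the modulus, and the patience. It is not the full network of Figure 1, and it is meant only to show the structure of the loop.

```python
import math


def depth_for(root, tol=1e-8, max_depth=2000):
    """Number of samples after which |root|**k falls below tol (needs |root| < 1)."""
    mag = max(abs(root), 1e-12)
    return min(max_depth, max(1, math.ceil(math.log(tol) / math.log(mag))))


def state_and_sensitivity(root, u):
    """s[t] = sum_k root**k * u[t-k] and its derivative with respect to the root."""
    K = depth_for(root)
    h = [root ** k for k in range(K)]
    dh = [0.0] + [k * root ** (k - 1) for k in range(1, K)]
    s, ds = [], []
    for t in range(len(u)):
        kmax = min(K, t + 1)
        s.append(sum(h[k] * u[t - k] for k in range(kmax)))
        ds.append(sum(dh[k] * u[t - k] for k in range(kmax)))
    return s, ds


def loss_and_grad(root, u, d):
    """Mean squared error against the target d and its derivative w.r.t. the root."""
    s, ds = state_and_sensitivity(root, u)
    T = len(u)
    loss = sum((si - di) ** 2 for si, di in zip(s, d)) / T
    grad = 2.0 * sum((si - di) * gi for si, di, gi in zip(s, d, ds)) / T
    return loss, grad


def train(u, d, root=0.2, lr=0.01, max_iter=300, patience=10, bound=0.99):
    """Gradient descent on the root, projected onto [-bound, bound]."""
    best_loss, best_root, stale = float("inf"), root, 0
    for _ in range(max_iter):                      # criterion 1: maximum iterations
        loss, grad = loss_and_grad(root, u, d)
        if loss < best_loss - 1e-12:
            best_loss, best_root, stale = loss, root, 0
        else:
            stale += 1
            if stale >= patience:                  # criterion 2: no improvement
                break
        root = root - lr * grad
        root = max(-bound, min(bound, root))       # keep the root inside the unit circle
    return best_root, best_loss


u = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t) for t in range(200)]
d, _ = state_and_sensitivity(0.7, u)               # synthetic target sequence
initial_loss, _ = loss_and_grad(0.2, u, d)
root, final_loss = train(u, d)
print(root, initial_loss, final_loss)
```

Clipping the root is the simplest form of projection; with complex-conjugate pairs of roots the same idea would apply to the modulus, which this one-parameter sketch does not cover.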