Multiple-Reservoir Hierarchical Echo State Network

Leaky Integrator Echo State Network (Leaky-ESN) is a useful training method for handling time series prediction problems. However, the singular coupling of all neurons in the reservoir makes Leaky-ESN less effective for sophisticated learning tasks. In this paper


Introduction
For both pattern recognition and time series prediction, implementation first requires an accurate model of the system. Although it is theoretically proven that fuzzy models can approximate any nonlinear system with arbitrary accuracy, their ability to map nonlinearities is considerably limited by the linear functions used in the consequent parts of the fuzzy rules. For systems with strong nonlinearity, a very large number of rules is required to model the system accurately.
The echo state network (ESN) is an optimized variant of the recurrent neural network (RNN) [1] that replaces the hidden layer with a reservoir consisting of a large number of sparsely connected neurons. It memorizes data through the states of the neurons within the reservoir. The main feature of the ESN is that training computes only the connection weights from the reservoir to the output layer. Because the other weights remain unchanged, it avoids the local minima that tend to occur in traditional RNN training. This simplified training method enables the ESN to achieve good performance when the echo state requirements are met [2]. ESNs have been used effectively in fields such as time series forecasting [3], dynamic pattern recognition [4], speech recognition [5], and nonlinear signal processing [6].
However, the neurons in conventional ESN reservoirs are randomly connected. The tunable parameters are the number of reservoir neurons and the spectral radius of the internal connection matrix. In fact, ESNs with the same size and spectral radius can differ significantly in performance after training [7]. Therefore, many scholars have proposed improvements to the ESN, e.g., modifying the topology of the reservoir [8,9], refining the state update equation of the network, and choosing a better algorithm for the training process [10,11].
In [12], wavelet neurons were introduced into the internal structure of the reservoir of the conventional ESN to improve both prediction accuracy and speed. Minimum complexity reservoirs and simple cycle reservoirs (SCR), which have properties similar to those of stochastic reservoirs, are discussed in [13]. In [14], a novel small-world ESN is proposed, which dynamically corrects the stochastic sparse network of the reservoir with an improved Newman-Watts (NW) model to increase the prediction accuracy and convergence speed of the ESN. In [15], the optimization of the ESN is constrained using the penalty function interior point method, and the parameters of the ESN are then further optimized. The double activation functions echo state network (DAF-ESN) substitutes the single activation function of the reservoir with a weighted superposition of several activation functions; this modifies the ESN state transition equations and makes the network model more flexible and responsive to different input signals [16]. An optimization procedure is designed to evaluate the output weights based on a fixed number of training patterns, and the evaluation results demonstrate that this procedure can increase the prediction performance of the ESN [17]. An echo state network with a growing reservoir that clusters its neurons to improve the prediction accuracy is proposed in [18]. In [19], the author applies an improved gravity search algorithm to the hyperparameter optimization of the echo state network to improve its prediction performance.
Since the ESN reservoir has a stochastic network topology, its stochastic nature makes it difficult to meet the demands of predicting time series with different characteristics. Furthermore, an ESN with only a single reservoir has significant limitations. It was not previously possible to train an ESN to operate as a Multiple Superimposed Oscillator (MSO) [20]. Wierstra states that the MSO problem fails because the neurons are all connected in a coupled way, while the solution requires multiple couplings in the reservoir [21].
In this study, we introduce an innovative ESN model to overcome the above-mentioned limitations, termed the Multiple-Reservoir Hierarchical Echo State Network (MH-ESN). This model consists of several sub-reservoirs of different complexities joined together to form a new total reservoir, which reduces the coupling effect between the neurons in the total reservoir of the conventional ESN. Furthermore, the Xwish function [22] is used as the new activation function to mitigate the vanishing-gradient problem during the training process of the network. The proposed model is compared with GRU and Leaky-ESN models on two synthetic time series and a real dataset. The results show that MH-ESN greatly improves time series prediction performance, which demonstrates the effectiveness of the proposed model.

Leaky-ESN and Xwish Function

Leaky-ESN
The leaky integrator echo state network (Leaky-ESN) is a modified network model of the ESN [23]. It replaces the normal neurons in the ESN reservoir with leaky-integrator neurons, giving the sparse network the ability to learn slow, continuous dynamics. Leaky-integrator neurons act as low-pass filters on the reservoir neurons, thereby improving the short-term memory capacity of the ESN and enabling it to adapt to the temporal characteristics of the learning task in diverse ways. Leaky-ESN and the regular ESN have identical network structures, which comprise three primary components: the input layer, the reservoir, and the output layer. The reservoir replaces the fixed hidden layer of the RNN and functions as the core of the ESN. Throughout the training sequence, the reservoir maps input data from a low-dimensional input space to a nonlinear, high-dimensional state variable. This state variable is then linearly fitted to the desired output, simplifying the network training process.

The classic layout of the Leaky-ESN is shown in Figure 1. In Figure 1, the input variable at time n is denoted as u(n), the internal state variable of the reservoir is denoted as x(n), and y(n) represents the output vector of the output layer. The input connection weight matrix of the Leaky-ESN is represented as $W^{in}$, whose elements are randomly generated numbers between 0 and 1. Meanwhile, W denotes the randomly generated weight matrix for the internal connections within the reservoir, which must satisfy the sparsity requirement to ensure the stability of the internal states. The output connection weight matrix of the network is represented by $W^{out}$. Additionally, $W^{fb}$ represents the weight matrix of the output feedback connections, which is generated in a similar way to $W^{in}$. The formula for updating the state can be expressed as follows:

$$x(n+1) = (1-a)\,x(n) + f\big(s^{in} W^{in} u(n+1) + \rho W x(n) + W^{fb} y(n)\big)$$

where $a \in [0, 1]$ is the leakage rate (when the leakage rate equals 1, the leaky-integrator neuron assumes the characteristics of a regular neuron); $s^{in}$ is the input scaling factor; and ρ is the spectral radius of the internal state matrix of the reservoir.
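The leaky-integrator update described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the update has the form x(n+1) = (1 − a)x(n) + f(s_in W_in u(n+1) + ρ W x(n)) with no output feedback, and that W has been rescaled to unit spectral radius so that ρ acts as the spectral radius. The function name `leaky_esn_step`, the sizes, and all parameter values are hypothetical.

```python
import numpy as np

def leaky_esn_step(x, u, W_in, W, a, s_in, rho, f=np.tanh):
    """One leaky-integrator state update without output feedback (assumed form)."""
    pre_activation = s_in * (W_in @ u) + rho * (W @ x)
    return (1.0 - a) * x + f(pre_activation)

rng = np.random.default_rng(0)
N, K = 20, 1                                   # hypothetical reservoir and input sizes
W_in = rng.uniform(0.0, 1.0, size=(N, K))      # input weights drawn from [0, 1)
W = rng.uniform(-1.0, 1.0, size=(N, N))
W[rng.random((N, N)) > 0.2] = 0.0              # keep roughly 20% of the connections
W /= np.max(np.abs(np.linalg.eigvals(W)))      # rescale to unit spectral radius

x = np.zeros(N)
for n in range(100):
    u = np.array([np.sin(0.2 * n)])            # toy scalar input
    x = leaky_esn_step(x, u, W_in, W, a=0.8, s_in=0.5, rho=0.9)
```

With a = 1 the term (1 − a)x(n) vanishes and the step reduces to the update of a regular (non-leaky) neuron.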
The activation function f employed by the internal neurons of the reservoir is typically a sigmoidal function, such as the Sigmoid or Tanh function. Meanwhile, the output activation function $f^{out}$ is typically defined to be the identity function. In the whole training process, W and $W^{in}$ are generated randomly during the initialization of the network, and $W^{out}$ is calculated by network training. Unlike the gradient-based training algorithms used for RNNs, the ESN usually trains the output weights $W^{out}$ with linear-regression methods. This paper solves for $W^{out}$ by ridge regression, in which θ is a hyperparameter and I is the identity matrix. The normalized root mean square error (NRMSE) is used both to evaluate the training performance of Leaky-ESN and as a benchmark for gauging the effectiveness of the improved network model that follows. In its expression, y(i) denotes the ith data point of the actual output; d(i) denotes the ith data point of the desired output; ‖·‖ denotes the Euclidean norm; and σ(d) denotes the standard deviation of the desired output.
The stability of the ESN during the training process depends on whether the network model satisfies the echo state property (ESP) [24]. The echo state property means that when the network reaches a stable state, the initial state has almost no impact on the network, and the state of the reservoir neurons is determined solely by the current input and output. At this point, the network gradually approaches a balanced state and exhibits the echo state property. The echo state property is the foundation for ensuring the stability of ESN training. In practical training, to guarantee the echo state property, the spectral radius of the reservoir connection weight matrix should be less than 1. To simplify echo state network prediction, the ESN model typically ignores the influence of output feedback on the internal state by setting the output feedback matrix to zero.
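The readout training and the error measure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's exact formulation: the reservoir states are assumed to be stacked row-wise in a matrix `X` and the desired outputs in `D` (both names are hypothetical), the standard ridge regression solution is used, and the NRMSE is written for a scalar output as the root of the mean squared error divided by the variance of the desired output.

```python
import numpy as np

def ridge_readout(X, D, theta):
    """Solve (X^T X + theta * I) W^T = X^T D for the output weights W_out."""
    A = X.T @ X + theta * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ D).T

def nrmse(y, d):
    """Root mean squared error normalized by the standard deviation of d."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    return np.sqrt(np.mean((y - d) ** 2) / np.var(d))

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))        # 200 time steps, 5 state variables (toy data)
W_true = rng.standard_normal((1, 5))
D = X @ W_true.T                         # targets that are exactly linear in the states
W_out = ridge_readout(X, D, theta=1e-8)
error = nrmse(X @ W_out.T, D)
```

Solving the regularized normal equations with `np.linalg.solve` avoids forming an explicit matrix inverse; a larger θ trades fitting accuracy for smaller, more stable output weights.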
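The spectral-radius condition above is usually enforced by rescaling a randomly generated reservoir matrix. The sketch below is one common way to do this, offered as an assumption rather than the authors' procedure; the function name `scaled_reservoir`, the weight range, and the density value are hypothetical.

```python
import numpy as np

def scaled_reservoir(n, density, rho, rng):
    """Random sparse matrix rescaled so that its spectral radius equals rho."""
    W = rng.uniform(-1.0, 1.0, size=(n, n))
    W[rng.random((n, n)) > density] = 0.0          # zero out most connections
    radius = np.max(np.abs(np.linalg.eigvals(W)))  # largest eigenvalue magnitude
    if radius == 0.0:
        raise ValueError("reservoir has zero spectral radius; resample it")
    return W * (rho / radius)

rng = np.random.default_rng(2)
W = scaled_reservoir(n=50, density=0.1, rho=0.9, rng=rng)
spectral_radius = np.max(np.abs(np.linalg.eigvals(W)))
```

Because eigenvalues scale linearly with the matrix, dividing by the measured radius and multiplying by the target value gives the desired spectral radius up to floating-point error.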

Xwish Function
The reservoir state update equation shows the significance of the activation function in the ESN: the activation function determines how the reservoir state is updated. Traditional ESN models generally choose a sigmoidal function as the activation function.
To mitigate the effect on prediction accuracy of the vanishing gradient caused by S-shaped functions during network training, the Xwish function [22] is used in this paper as a new activation function for the echo state network. Its formula contains a parameter β that is subject to adjustment. In Figure 2, the Xwish function is depicted for various values of β, and it is smooth over the real numbers. Figure 3 plots the first-order derivative of the Xwish function for different values of β, which is also smooth over the real numbers. The Xwish function slows down the vanishing of the gradient, which improves convergence accuracy, and it is centered on 0, which improves efficiency when the weights are updated. In addition, its weights can be continuously updated without affecting the next training process, which maintains the diversity of the input data and results in better convergence.

MH-ESN Model
Because the neurons inside the reservoir are randomly and sparsely connected, its topology is unclear. We therefore conjecture that a reservoir with a clearer topology may have better prediction performance. Some studies have shown that a hierarchical reservoir topology can improve the prediction performance of the ESN [25]. By integrating the hierarchical topology with the reservoir design of the ESN, a multi-reservoir hierarchical echo state network (MH-ESN) is proposed based on the Leaky-ESN model. The MH-ESN comprises multiple layers, with each layer consisting of a multi-reservoir echo state network model. Its topology is shown in Figure 4. The sub-reservoirs of each layer are connected through main neurons, each of which represents the state of its sub-reservoir by averaging the states of the neurons within it. The connections between main neurons can form a virtual sub-reservoir, as shown by the red dashed box, which differs from the actual sub-reservoirs in that the states of the neurons in the virtual reservoir can vary greatly. This paper uses a two-layer network structure, with each layer having three sub-reservoirs.
In Figure 4, the input signal is denoted by u(n), the state of the whole reservoir is represented by x(n), and the output signal is denoted by y(n). The numbers of neurons in the two layers are N1 and N2, respectively. The first layer consists of sub-reservoirs with sizes $N1_i$ (i ∈ {1, 2, 3}), and the second layer consists of sub-reservoirs with sizes $N2_i$ (i ∈ {1, 2, 3}). The size of the total reservoir N is equal to the sum of the sizes of all sub-reservoirs. The connection weight matrix for the neurons of each sub-reservoir is generated in the same way as the connection weight matrix of a single reservoir. The weight matrix of the total reservoir W is given in block matrix form, in which $W_A$ represents the total connections within the two-layer reservoir, $W_B$ corresponds to the interconnections within the virtual reservoir, and $W_{AB}$ pertains to the connections between the main neuron and the other neurons within each sub-reservoir. $W_A$ is in turn composed of $W_1$ and $W_2$, the first- and second-layer weight connection matrices. In this paper, the state update equation of the MH-ESN model after discretization and matrix normalization, using the Xwish function as the new activation function, is given by Equation (8), where T is the total two-layer reservoir state, $W^{in}_A$ is the two-layer input weight matrix, $W^{in}_B$ is the virtual sub-reservoir input weight matrix, $W^{fb}$ is the feedback weight matrix, and Φ is the Xwish function.
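Part of the structure described above can be illustrated with a short NumPy sketch. It rests on several assumptions that are not taken from the paper: the sub-reservoirs are placed as independent blocks on the diagonal of $W_A$, the sub-reservoir sizes 12, 10, and 8 are borrowed from the MSO experiment, the densities are invented for illustration, and the couplings $W_{AB}$ and $W_B$ are left out because their exact form is not recoverable here. The helper names are hypothetical. Only the averaging of neuron states into a main-neuron state follows directly from the description.

```python
import numpy as np

def sparse_block(n, density, rng):
    """Sparse random weight matrix for one sub-reservoir."""
    mask = rng.random((n, n)) < density
    return rng.uniform(-1.0, 1.0, size=(n, n)) * mask

def block_diagonal(blocks):
    """Place square blocks on the diagonal of a zero matrix."""
    total = sum(b.shape[0] for b in blocks)
    out = np.zeros((total, total))
    start = 0
    for b in blocks:
        stop = start + b.shape[0]
        out[start:stop, start:stop] = b
        start = stop
    return out

def main_neuron_states(x, sizes):
    """Average the neuron states inside each sub-reservoir."""
    boundaries = np.cumsum(sizes)[:-1]
    return np.array([part.mean() for part in np.split(x, boundaries)])

rng = np.random.default_rng(3)
sizes = [12, 10, 8, 12, 10, 8]                 # two layers of three sub-reservoirs
densities = [0.2, 0.25, 0.3, 0.2, 0.25, 0.3]   # hypothetical per-block sparsity
W_A = block_diagonal([sparse_block(n, p, rng) for n, p in zip(sizes, densities)])
x = rng.uniform(-1.0, 1.0, size=sum(sizes))    # toy reservoir state
main = main_neuron_states(x, sizes)            # one value per sub-reservoir
```

In this sketch, neurons in different sub-reservoirs have no direct connection, which is the property the block structure is meant to provide; any interaction between sub-reservoirs would have to pass through the main neurons.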

Optimizing the Global Parameters of MH-ESN
In order to give the MH-ESN model the echo state property and ensure the stability of the network during training, we set $W^{fb} = 0$, so that the model contains no output feedback. The reservoir state update equation of the MH-ESN model is then expressed as follows:

$$x(n+1) = (1-a)\,x(n) + \Phi\big(s^{in} W^{in} u(n+1) + \rho W x(n)\big) \qquad (10)$$

In Equation (10), the parameters a, $s^{in}$, and ρ should be optimized. The training objective of the MH-ESN model is to minimize the error: the output weight matrix $W^{out}$ is trained so that the forecasted output y(n) is as close as possible to the true value d(n), and the error e(n) is expressed in Equation (11). To facilitate the computation of the gradient, the error function is defined as shown in Equation (12).
Let q ∈ {a, $s^{in}$, ρ} denote any one of the global parameters, and apply stochastic gradient descent to optimize it. The partial derivative of the error function with respect to q is obtained by the chain rule through the output y(n). Since the partial derivative of the input variable u(n) with respect to q is 0, this derivative simplifies to Equation (14), in which only the reservoir state contributes. It is therefore also necessary to calculate the partial derivatives of the reservoir state with respect to q. To simplify the calculation, the identity function is chosen as the output activation function, and we set $X(n+1) = s^{in} W^{in} u(n+1) + \rho W x(n)$. The partial derivatives of the reservoir state with respect to a, $s^{in}$, and ρ can then be obtained. Substituting ∂x(n+1)/∂q back into Equation (14) yields the iterative formula for parameter optimization of the MH-ESN model during training, in which µ is the learning rate of q. The corrected values obtained from the optimization search must ensure that the MH-ESN model retains the echo state property.
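The chain of derivatives described above can be written out as a worked sketch. It is a reconstruction under stated assumptions, not a transcription of the paper's equations: the state update is taken as $x(n+1) = (1-a)\,x(n) + \Phi(X(n+1))$ with $X(n+1) = s^{in} W^{in} u(n+1) + \rho W x(n)$, the output as $y(n) = W^{out}[x(n); u(n)]$ with an identity output activation, the error as $e(n) = d(n) - y(n)$, and the error function as $E(n) = \tfrac{1}{2} e(n)^2$ for a scalar output. Here ⊙ denotes the element-wise product and Φ′ the derivative of the activation function.

$$
\begin{aligned}
\frac{\partial E(n+1)}{\partial q} &= -\,e(n+1)\, W^{out} \begin{bmatrix} \dfrac{\partial x(n+1)}{\partial q} \\[4pt] 0 \end{bmatrix}, \\
\frac{\partial x(n+1)}{\partial a} &= (1-a)\,\frac{\partial x(n)}{\partial a} - x(n) + \Phi'\big(X(n+1)\big) \odot \Big(\rho W \frac{\partial x(n)}{\partial a}\Big), \\
\frac{\partial x(n+1)}{\partial s^{in}} &= (1-a)\,\frac{\partial x(n)}{\partial s^{in}} + \Phi'\big(X(n+1)\big) \odot \Big(W^{in} u(n+1) + \rho W \frac{\partial x(n)}{\partial s^{in}}\Big), \\
\frac{\partial x(n+1)}{\partial \rho} &= (1-a)\,\frac{\partial x(n)}{\partial \rho} + \Phi'\big(X(n+1)\big) \odot \Big(W x(n) + \rho W \frac{\partial x(n)}{\partial \rho}\Big), \\
q(n+1) &= q(n) - \mu\,\frac{\partial E(n+1)}{\partial q}.
\end{aligned}
$$

Each state derivative is recursive: it is initialized to zero and propagated forward in time alongside the reservoir state, so the gradient step for q costs one extra matrix-vector product per parameter and time step.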

Simulation and Analysis
To evaluate the feasibility and effectiveness of the proposed MH-ESN model, two synthetic time series are selected for prediction in this section, namely the MSO time series and the Mackey-Glass chaotic time series, together with a real ECG dataset. The global parameters of the ESN models are trained using gradient descent. The normalized root mean square error (NRMSE) of the predictions is used as the performance metric to evaluate the MH-ESN, gated recurrent unit (GRU) [26], and Leaky-ESN models. The GRU is built with MATLAB R2022b internal functions. Because reservoir generation is random, the prediction results of the network differ even when the hyperparameters are exactly the same. We chose initial parameters that gave good results in empirical tuning, ran each experiment 30 times with them, and averaged the test results. The initial parameters of each network are shown in Table 1. In the table, the leakage rate a indicates the degree to which the state of the previous time step is retained when the reservoir state is updated; the spectral radius ρ is an important parameter for guaranteeing the ESP; the input scaling factor $s^{in}$ scales the input data; the learning rate µ is the step size with which the model parameters are updated in each iteration; and the reservoir size N is the number of neurons in the reservoir.

MSO Time Series
The MSO time series is generated as a superposition of sinusoids, and the expected output is d(n) = u(n − 5). In this experiment, the number of neurons in each layer of the MH-ESN model was made the same, i.e., N1 = N2 = 30. Each layer has three sub-reservoirs with 12, 10, and 8 neurons, and each sub-reservoir is given a different sparsity according to its size. The MH-ESN model has a total of six sub-reservoirs, so the virtual sub-reservoir formed by the main neurons differs in size from the actual sub-reservoirs. The neurons inside each sub-reservoir are sparsely connected. Using the same input sequence and parameters for excitation, the models are trained separately on the MSO time series, and the NRMSE is selected as the performance evaluation index.
Figure 5 shows the predicted versus expected values of the MH-ESN model for the MSO time series; the MH-ESN model fits the expected output curve well after the washout segment. Table 2 shows the prediction results of the three models trained on the MSO time series, and MH-ESN achieves better prediction accuracy than the GRU and Leaky-ESN models on this problem. This indicates that the topology of MH-ESN can enhance the predictive performance of the ESN on the MSO time series.
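A data-generation sketch for this task is given below. The target d(n) = u(n − 5) comes from the text, but the frequencies are an assumption: the sketch uses the two-sine form sin(0.2n) + sin(0.311n) that is common in the MSO literature, which may differ from the series used in this experiment. The function names are hypothetical.

```python
import numpy as np

def mso(n_steps, freqs=(0.2, 0.311)):
    """Sum of sinusoids; the frequencies are an assumed, commonly used choice."""
    n = np.arange(n_steps)
    return sum(np.sin(w * n) for w in freqs)

def delayed_pairs(u, delay=5):
    """Return inputs u(n) and targets d(n) = u(n - delay) for n >= delay."""
    return u[delay:], u[:-delay]

u = mso(300)
inputs, targets = delayed_pairs(u, delay=5)
```

The first five samples have no valid target, so they are dropped from the input sequence; in practice a longer washout segment is also discarded before training the readout.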

Mackey-Glass Chaotic Time Series
The Mackey-Glass chaotic time series is widely used to assess the ability of ESNs to model chaotic nonlinear systems. This time series is described by a first-order delay differential equation, which is given as follows:

$$\frac{du(t)}{dt} = \frac{\alpha\,u(t-\tau)}{1 + u(t-\tau)^{\beta}} + \gamma\,u(t)$$

where the chaos parameters are α = 0.2, β = 10, and γ = −0.1. The time series has chaotic properties when τ > 16.8. In this paper, we set τ = 17 and u(0) = 1.2. Using the same input vector, initialization parameters, and activation function for all models, the NRMSE is used as the measure of performance.
Figure 6 shows the predicted versus expected values of the MH-ESN model for the MG chaotic time series; after more than 1000 steps, the predicted curve of the MH-ESN model fits the expected output curve well. Table 3 shows the prediction results of the three models trained on the MG time series. The MH-ESN model has a higher prediction accuracy than the GRU and Leaky-ESN models, which shows that it gives better prediction results for the MG chaotic time series.
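The series can be generated numerically as sketched below, using the parameter values stated above (α = 0.2, β = 10, γ = −0.1, τ = 17, u(0) = 1.2) in the standard Mackey-Glass form du/dt = α u(t − τ) / (1 + u(t − τ)^β) + γ u(t). The integration scheme is an assumption, since the text does not say how the series was produced: the sketch uses forward Euler with a step of 0.1, a constant history u(t) = u(0) for t ≤ 0, and one output sample per unit of time. The function name is hypothetical.

```python
import numpy as np

def mackey_glass(n_steps, tau=17.0, alpha=0.2, beta=10, gamma=-0.1, u0=1.2, dt=0.1):
    """Forward-Euler integration of the Mackey-Glass delay equation."""
    lag = int(round(tau / dt))            # delay expressed in integration steps
    per_sample = int(round(1.0 / dt))     # integration steps per output sample
    total = n_steps * per_sample
    u = np.full(lag + total + 1, u0)      # indices 0..lag hold the constant history
    for t in range(lag, lag + total):
        delayed = u[t - lag]
        u[t + 1] = u[t] + dt * (alpha * delayed / (1.0 + delayed ** beta) + gamma * u[t])
    return u[lag::per_sample][:n_steps]   # sample once per unit of time

series = mackey_glass(2000)
```

A smaller step or a higher-order integrator such as fourth-order Runge-Kutta would give a more accurate trajectory; forward Euler is used here only to keep the sketch short.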

ECG
Electrocardiogram (ECG) data reflect the physiological activity of the human heart and are often used in medicine as one of the diagnostic criteria for heart conditions. With the continuous improvement of detection equipment, ECG signals have become easy to acquire, which facilitates their analysis. In this paper, the ECG signal is taken from the MIT-BIH arrhythmia database.
Figure 7 shows the predicted versus expected values of the MH-ESN model for the ECG time series; the MH-ESN model fits the expected output curve well after the washout segment. Table 4 shows the prediction results of the three models trained on the ECG time series. In this experiment, the prediction results of Leaky-ESN are not as good as those of GRU, but the MH-ESN model greatly improves the prediction accuracy and performs the best among the three.

Conclusions
This paper proposes a multi-reservoir hierarchical echo state network (MH-ESN). Compared with the reservoir structure of the traditional ESN, the reservoir topology of the MH-ESN model organizes the neurons hierarchically. Each layer is composed of multiple sub-reservoirs, and the sub-reservoirs in each layer are connected through main neurons. This design can more accurately simulate the hierarchical structure of a real biological neural network and improve the stability and prediction accuracy of the network. In the MH-ESN model, the connections between the main neurons are treated as a virtual sub-reservoir, and the virtual sub-reservoir is fully connected to the main neurons. The performance of MH-ESN is verified by predicting the MSO time series, the Mackey-Glass chaotic time series, and the ECG time series. The simulation results show that MH-ESN can further improve the prediction accuracy of the ESN and is a more reliable time series prediction model.

Figure 2. Graph of the Xwish function.

Figure 3. The derivative of the Xwish function.

Figure 5. Comparison of predicted and expected values of the MSO time series.

Figure 7. Comparison of predicted and expected values of the ECG time series.

Table 2. Comparison of the performance of the three models on the MSO time series.

Table 4. Comparison of the performance of the three models on the ECG time series.