Investigating the Surrogate Modeling Capabilities of Continuous Time Echo State Networks

Continuous Time Echo State Networks (CTESNs) are a promising yet under-explored surrogate modeling technique for dynamical systems, particularly those governed by stiff Ordinary Differential Equations (ODEs). A key determinant of the generalization accuracy of a CTESN surrogate is the method of projecting the reservoir state to the output. This paper shows that, of the two common projection methods (linear and nonlinear), surrogates developed via the nonlinear projection consistently outperform those developed via the linear method. CTESN surrogates are developed for several challenging benchmark cases governed by stiff ODEs, and for each case the performance of the linear and nonlinear projections is compared. The results demonstrate the applicability of CTESNs to a variety of problems while serving as a reference for important algorithmic and hyper-parameter choices for CTESNs.


Introduction
Modeling dynamic systems using scientific machine learning (SciML) techniques is a rapidly growing field, with advanced ML architectures being applied to model complex problems across a diverse range of applications. Some examples are rapid design optimization [1], real-time health monitoring [2], turbulent flow modeling [3], and materials discovery [4]. Many of these applications utilize a "surrogate model" that makes real-time predictions of the system behavior in place of full-order models that would be too slow or expensive for the application.
Recently, Echo State Networks (ESNs) have seen an increase in popularity for modeling highly nonlinear and chaotic phenomena in domains such as optimal control planning [5], chaotic time series prediction [6,7], signals analysis [8], and even turbulent fluid flow [9]. These applications leverage the ability of ESNs to capture highly nonlinear transient behavior accurately, as well as the extremely low cost of training them; the empirical success of ESNs on a wide range of approximation tasks is discussed and explained in several works [10,11]. One of the biggest limitations, however, of using the standard ESN implementation for surrogate modeling of nonlinear dynamical systems is that the available training data may not be uniformly sampled in time. A particular example is the numerical solution of stiff systems of Ordinary Differential Equations (ODEs) produced by ODE solvers. An ODE system is said to be stiff if, for any initial conditions and in certain intervals, the numerical method solving it is forced to use a timestep that is very small compared to the smoothness of the exact solution [12]. Numerical ODE solvers overcome the instability due to stiffness by varying the time step size during the solve, leading to uneven sampling of the solution in time.
There have been several attempts to apply reduced-order modeling to stiff ODEs. Ji et al. [13] used Physics-Informed Neural Networks [14], Kim et al. [15] used Neural ODEs [16], and Goswami et al. [17] used Deep Operator Networks [18] to solve several stiff systems such as the ROBER [19] and POLLU [20] problems. However, these methodologies require assumptions and scalings that may not generalize, or require large amounts of training data and compute resources to train deep neural networks. Work by Anantharaman et al. [21] also showed the failure of popular architectures such as Long Short-Term Memory (LSTM) networks and standard ESNs in modeling stiff systems.
To use the attractive properties of ESNs (i.e., the capacity to model highly nonlinear signals and ease of training) for modeling stiff systems, a variant of ESNs called Continuous Time Echo State Networks (CTESNs) has been proposed [21] to address this issue. CTESNs have been successfully employed in various applications, from accelerating predictions of power system dynamics [22] to accelerating solutions of pharmacology models [23].
In recent literature, two ways of using CTESNs for surrogate modeling have emerged and been applied to a range of problems: the Linear Projection CTESN (LPCTESN) [21] and the Nonlinear Projection CTESN (NLPCTESN) [22,23]. However, there is currently a lack of work critically examining the accuracy of both methods and how they compare to one another. Further, both projection methods use a Radial Basis Function (RBF) for interpolation, which comes with several algorithmic choices that need to be explored. This study aims to fill this gap by investigating the effects of these algorithmic choices on surrogates created to solve several stiff ODE systems, namely Robertson's equations, the Sliding Basepoint model of automobile collision, and the POLLU air pollution model. The findings show that, for the same hyper-parameter settings of the CTESN, the NLPCTESN outperforms the LPCTESN on all benchmarks shown. Further, it is shown that for the interpolating RBFs used within CTESNs, k-Nearest Neighbor (k-NN) polynomial-augmented RBFs outperform standard RBFs in predictive accuracy.
This paper is organized as follows: Section 2 introduces the concepts of the ESN and CTESN, along with a detailed discussion of the LPCTESN and NLPCTESN projection methods described above. Section 3 demonstrates the application of the methods to several challenging stiff ODE problems, with a qualitative and quantitative discussion of the results. Section 4 summarizes the work along with possible future directions.

An Echo State Network [24], depicted in Figure 1, is a form of reservoir computing that is very similar in architecture to the more popular Recurrent Neural Network (RNN). ESNs make predictions by following a recurrence relation (Equation 1) to update a latent state vector and using that vector to map to a given output. Unlike in RNNs, the parameters of the recurrent network in the "reservoir" are fixed and are not updated during training; only the mapping from the latent space to the output space is learned. Depending on the implementation, this makes the training of ESNs very fast (sometimes as fast as a simple least squares fit) and computationally cheap.

Standard Echo State Networks
The governing equation for an echo state network reads:

r_{n+1} = σ(W_in x_n + W r_n),    (1)

where r_n ∈ R^{N_r} is the latent state at timestep n, σ is an activation function (most commonly tanh), W_in ∈ R^{N_r × N_x} and W ∈ R^{N_r × N_r} are the reservoir matrices, which are fixed and randomly initialized, and x_n ∈ R^{N_x} is the system state at timestep n. The matrix W is sparse, typically with about 1% nonzero entries.
The output projection reads:

x_n = Φ(r_n),    (2)

where Φ is decided by the projection method. The most popular projection method is the linear projection,

x_n = W_out r_n,    (3)

where W_out ∈ R^{N_x × N_r} is the linear projection matrix that must be fitted to the training data. Like most machine learning algorithms, ESNs have a set of hyper-parameters that need to be tuned, and several works [25,26] can serve as guides for selecting them. A key hyper-parameter in ESNs is the spectral radius of W. In this study, the spectral radius is fixed to 0.01 for all models created. Although somewhat smaller than the values conventionally used in standard ESNs, it was found in this study on CTESNs that this slightly smaller value maximized predictive accuracy while also ensuring stability of the reservoir solution r. To fit the trainable matrix, one only has to solve an ordinary linear least squares problem: if X = [x_1; x_2; …; x_N] and R = [r_1; r_2; …; r_N], then

W_out = argmin_W ||W R − X||_F^2.    (4)
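As an illustration only (a toy sketch with made-up data, not the paper's code), the discrete ESN recurrence and the least-squares readout fit can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
N_r, N_x, N_steps = 200, 3, 500

# Fixed, randomly initialized reservoir matrices; W is kept ~1% nonzero
# and rescaled to spectral radius 0.01, as the paper prescribes.
W_in = rng.uniform(-0.5, 0.5, size=(N_r, N_x))
W = rng.uniform(-0.5, 0.5, size=(N_r, N_r))
W[rng.random((N_r, N_r)) > 0.01] = 0.0
rho = np.max(np.abs(np.linalg.eigvals(W)))
if rho > 0:
    W *= 0.01 / rho

# Drive the reservoir with a training signal x_n (a toy sinusoid here).
t = np.linspace(0, 10, N_steps)
X = np.stack([np.sin(t), np.cos(t), np.sin(2 * t)], axis=1)  # (N_steps, N_x)

R = np.zeros((N_steps, N_r))
r = np.zeros(N_r)
for n in range(N_steps - 1):
    r = np.tanh(W_in @ X[n] + W @ r)  # recurrence of Equation 1
    R[n + 1] = r

# Linear readout: ordinary least squares fit of x_n = W_out r_n.
W_out = np.linalg.lstsq(R, X, rcond=None)[0].T  # (N_x, N_r)
X_pred = R @ W_out.T
```

Only the readout `W_out` is trained; the reservoir matrices stay fixed, which is what makes ESN training as cheap as a single least squares solve.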

Continuous Time Echo State Networks (CTESN)
Continuous Time Echo State Networks are a variant of ESNs that model time as a continuous rather than a discrete quantity. The model equation for a CTESN is given by:

dr/dt = σ(W_in x(t) + W r(t)),    (5)

with all variables having the same definitions as in the previous section, where x(t) is the training solution driving the reservoir. The projection equation reads:

x(t) = Φ(r(t)).    (6)

As per the literature, there are two ways of modeling the relation between the latent space and the outputs [27]. The first is the linear method (called the Linear Projection CTESN, or LPCTESN), described by:

x(t) = W_out r(t).    (7)

The second is the nonlinear method (called the NonLinear Projection CTESN, or NLPCTESN), where the projection Φ is a nonlinear mapping. Many possible functions can be used, but the literature on NLPCTESNs [22,27] currently uses standard Radial Basis Functions (RBFs) to write the projection.
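A minimal sketch of the continuous-time variant, assuming SciPy's `solve_ivp` and a toy stand-in for the training solution x(t) (names and data here are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.interpolate import interp1d

rng = np.random.default_rng(1)
N_r, N_x = 100, 2

# Fixed sparse reservoir matrices, rescaled to a small spectral radius.
W_in = rng.uniform(-1, 1, (N_r, N_x))
W = rng.uniform(-1, 1, (N_r, N_r))
W[rng.random((N_r, N_r)) > 0.01] = 0.0
rho = np.max(np.abs(np.linalg.eigvals(W)))
if rho > 0:
    W *= 0.01 / rho

# Toy "training solution" x(t); in practice this comes from a stiff ODE
# solve, on an adaptive (hence non-uniform) time grid.
t_train = np.sort(rng.uniform(0, 5, 80))
x_train = np.stack([np.exp(-t_train), np.tanh(t_train)], axis=1)
x_of_t = interp1d(t_train, x_train, axis=0, fill_value="extrapolate")

# Continuous-time reservoir: dr/dt = tanh(W_in x(t) + W r), Equation 5.
def reservoir_rhs(t, r):
    return np.tanh(W_in @ x_of_t(t) + W @ r)

sol = solve_ivp(reservoir_rhs, (t_train[0], t_train[-1]), np.zeros(N_r),
                t_eval=t_train, rtol=1e-6, atol=1e-8)
R = sol.y.T  # reservoir trajectory, (N_ts, N_r)

# LPCTESN readout: least-squares fit of x(t) = W_out r(t).
W_out = np.linalg.lstsq(R, x_train, rcond=None)[0].T
```

Because the reservoir is itself an ODE, it can be evaluated at exactly the non-uniform timestamps the stiff solver produced, which is what the discrete ESN cannot do.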

Surrogate Modeling via CTESNs
To create and use a surrogate model via the CTESN method, a few steps are followed. First, a Design of Experiments (DoE) space is created and N sample combinations of query parameters are drawn; call this set P = {p_1, p_2, …, p_N}. Each p_i represents a set of conditions at which the ODE is solved, and the surrogate is expected to capture the change in the solution due to changes in the value of p_i.
Examples include the initial conditions (see Sections 3.1.2 and 3.2), the rate constants of the ODE (see Sections 3.1.1 and 3.3), etc. The ODE is solved numerically at each p_i ∈ P to return the solution set Y = {y_1, y_2, …, y_N}, y_i ∈ R^{N_x × N_ts^i}. A single parameter combination p* ∈ P is drawn at random, and the reservoir ODE (given by Equation 5) is solved using a numerical ODE solver. This returns r(t) ∈ R^{N_r × N_ts}, where N_ts is the number of timesteps in the solution at p*. Then, depending on whether the method follows the linear or nonlinear projection, either Algorithm 1 or Algorithm 2 is followed to fit and query the surrogate.
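As an illustrative sketch of the linear-projection fit-and-query pipeline (toy data and hypothetical names, not the paper's implementation): fit one readout matrix per training parameter against the shared reservoir trajectory, interpolate those readouts over the parameter space with an RBF, then evaluate the RBF at a query point and project:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(2)
N, N_ts, N_r, N_x = 30, 60, 40, 2

# Shared reservoir trajectory r(t), solved once at p* (random stand-in here).
R = rng.standard_normal((N_ts, N_r))
t = np.linspace(0, 1, N_ts)

# Toy parametrized "ODE solutions" y(t; p), one per DoE sample.
params = rng.uniform(0.5, 1.5, (N, 1))
Y = [np.stack([p[0] * np.sin(2 * np.pi * t),
               p[0] ** 2 * np.cos(2 * np.pi * t)], axis=1)
     for p in params]

# Fit one W_out per training parameter (flattened for interpolation)...
W_flat = np.stack([np.linalg.lstsq(R, y, rcond=None)[0].T.ravel() for y in Y])

# ...then interpolate the readout matrices over the parameter space.
w_interp = RBFInterpolator(params, W_flat)

# Query: evaluate the RBF at p_test and project the shared reservoir.
p_test = np.array([[1.0]])
W_test = w_interp(p_test)[0].reshape(N_x, N_r)
y_pred = R @ W_test.T  # predicted solution, (N_ts, N_x)
```

The expensive ODE solves all happen offline; querying a new parameter reduces to one RBF evaluation and one matrix product.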
Algorithm 1: Linear Projection CTESN surrogate fitting and querying.

Algorithms 1 and 2 both use an interpolating RBF in their procedure, and in this article it is demonstrated that polynomial-augmented k-nearest neighbor (k-NN) RBFs deliver superior generalization to new test problems for CTESN-based surrogate models, compared to standard RBFs. The reader is encouraged to read Appendix Section A for a detailed explanation of k-NN polynomial-augmented RBFs.
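For illustration (not the paper's own code), SciPy's `RBFInterpolator` exposes both ingredients of the Appendix A construction: the `neighbors` argument restricts each evaluation to its k nearest collocation points, and `degree` sets the order of the appended polynomial tail:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(3)
pts = rng.uniform(-1, 1, (200, 2))
vals = np.sin(np.pi * pts[:, 0]) * pts[:, 1]  # toy target function

# Standard global RBF: every collocation point enters every evaluation.
global_rbf = RBFInterpolator(pts, vals, kernel="thin_plate_spline")

# k-NN RBF augmented with a degree-1 polynomial: each query uses only its
# 10 nearest collocation points plus a linear polynomial tail.
knn_rbf = RBFInterpolator(pts, vals, neighbors=10, degree=1,
                          kernel="thin_plate_spline")

q = rng.uniform(-0.9, 0.9, (50, 2))
pred = knn_rbf(q)
```

The k-NN variant also keeps each evaluation's linear system small (k × k rather than N × N), which matters when the collocation set is large.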

Applications and Results
CTESN surrogates are created for three problems. First, Robertson's equations are solved, parametrizing the rates of reaction and the initial condition separately. Then, the Sliding Basepoint system is modeled, explicitly comparing the difference due to the projection method and parametrizing the initial conditions of the problem. Finally, the POLLU system is solved, again parametrizing the initial conditions and comparing the results obtained using the differing projection methods. Unless mentioned otherwise, the training data was sampled randomly from the DoE space.
In all results in this section, the MAE is computed as

MAE = (1/N_test) Σ_{j=1}^{N_test} (1/N_ts^j) Σ_i |y^i_{j,pred} − y^i_{j,true}|,

where N_test is the number of test cases, N_ts^j is the number of time steps in the j'th case, and y^i_{j,pred} and y^i_{j,true} are the prediction and true solution, respectively, at the i'th time step of the j'th test case.
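Under this reading of the formula (averaging over time steps within each case, then over cases), a small helper with hypothetical naming would be:

```python
import numpy as np

def surrogate_mae(preds, trues):
    """MAE averaged over the time steps of each test case, then over cases.

    preds/trues: lists with one array per test case; the time grids may
    differ in length between cases, as with adaptive stiff solvers.
    """
    per_case = [np.mean(np.abs(p - t)) for p, t in zip(preds, trues)]
    return float(np.mean(per_case))
```

Averaging within each case first prevents long trajectories from dominating the metric.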

Robertson's Equation
The first demonstrated application is the surrogate modeling of the Robertson equations [19]. They are written as:

dy_1/dt = −r_1 y_1 + r_3 y_2 y_3,
dy_2/dt = r_1 y_1 − r_3 y_2 y_3 − r_2 y_2^2,
dy_3/dt = r_2 y_2^2,

where y_1, y_2 and y_3 represent the concentrations of the reactant species and r_1, r_2 and r_3 represent the fixed rates of reaction. The system is usually subject to initial conditions [y_1, y_2, y_3] = [1, 0, 0]. It is a system of ODEs that describes a standard rate process and is often used to benchmark numerical ODE solvers due to its stiffness. More specifically, the long-time integration of Robertson's equations is known to be a challenging problem for numerical ODE solvers, and the system hence serves as a good test for surrogate models. In this work, models are created by parametrizing the system in two ways: first, the rates of reaction are parametrized, which has been the focus of several other papers on CTESNs [21,27]; second, the initial conditions of the system are parametrized.
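For reference, the system with its classical rate constants (r_1 = 0.04, r_2 = 3×10^7, r_3 = 10^4, the common assignment in the literature; the paper's parametrized ranges are not reproduced here) can be integrated with a stiff method, and the adaptive solver's non-uniform time grid illustrates the sampling issue discussed in the introduction:

```python
import numpy as np
from scipy.integrate import solve_ivp

r1, r2, r3 = 0.04, 3e7, 1e4  # classical Robertson rate constants

def robertson(t, y):
    y1, y2, y3 = y
    return [-r1 * y1 + r3 * y2 * y3,
             r1 * y1 - r3 * y2 * y3 - r2 * y2 ** 2,
             r2 * y2 ** 2]

# A stiff solver (BDF) is required: y2's transient lives on a time scale
# orders of magnitude shorter than the 1e5 integration horizon.
sol = solve_ivp(robertson, (0, 1e5), [1.0, 0.0, 0.0], method="BDF",
                rtol=1e-8, atol=1e-10)
```

The total concentration y_1 + y_2 + y_3 is conserved analytically, which gives a quick sanity check on the solve.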

Parametrizing Rates of Reaction
In this section, a Design of Experiments (DoE) space is constructed over the rates of reaction. The focus is limited to presenting the results of the prediction of the variable y_2. This variable has a sharp transient that occurs at a time scale much smaller than those of the other two variables, and hence causes the system to be stiff.
The average MAE for y_2, averaged across the test cases listed in Table 1, is shown in Table 2, sorted in descending order of generalization MAE. Predictions of y_2 for a particular test parameter are shown in Figure 2. For the same hyper-parameters, it can be observed that the absence of either the augmenting polynomial or the k-NN interpolation (i.e., using all collocation points to predict the solution in the RBF) significantly increases the prediction error. It can also be seen that the nonlinear projection performs better on average than the linear projection.
In the next section, the initial condition of Robertson's ODE is parametrized, and a similar error analysis is done.

Parametrizing Initial Conditions
The initial condition of the system is parameterized. This problem can be challenging for lower initial values of y_1(0), as a smaller y_1(0) leads to a smaller and sharper transient in y_2, which is more difficult for the surrogate to capture accurately. The DoE space is given by a range on y_1(0), with y_2(0) = 0, and the condition for y_3(0) decided on the basis that the sum of all quantities should always equal 1. Once more, the focus is on comparing the predicted results for y_2. The average MAE across all test cases is shown in Table 4, sorted in descending order. Figure 3 shows the comparison for a test parameter across 5 different configurations of the CTESN surrogate model. A similar trend of hyper-parameter performance as in Table 2 is noted, in that the absence of the k-NN interpolation increases the prediction error. For the same hyper-parameters, the nonlinear projection once again achieves a lower generalization MAE than the linear projection.

Sliding Basepoint Model
Finally, the discussed methods are applied to the surrogate modeling of a realistic crash safety design problem. A system of ODEs proposed by Horváth et al. [28] that approximately models an automobile collision is solved via a created surrogate model. This system, called the Sliding Basepoint model, has parameters that were fitted to realistic crash data and is assumed to model a collision problem accurately. Figure 4 depicts the system at its initial state and at a later time. In the governing system, m_1, m_2, F_1, F_2, x_1, x_2, v_1, v_2 refer to the masses, total forces on, positions, and velocities of the two bodies, respectively. In reality, m_1 reflects the mass of the chassis of the car, and x_1 behaves similarly to the deformation of the bumper. F_s represents the spring force between the two masses, and P represents the power of dissipation. Finally, the forces are computed based on best-fit models described in the paper [28].

Table 6: Parameter values taken from [28] with their SI units (flattened in extraction: v_0/25, k_0/10, 24.9, 3.38, 0.459, 145000, 0.1).
The values of the fixed parameters are given in Table 6; the reader is referred to [28] for further details. The system is subject to initial conditions on the positions and velocities of the two bodies. The mechanics of crash and impact problems are known to involve sharp transients and highly oscillatory behavior, making their numerical solutions slow and costly to compute across a wide range of parameters; CTESNs are therefore a good surrogate modeling tool for the problem. In this work, the initial conditions of the problem (v_0 and k_0) are parametrized to simulate different impact velocities and directions (the spring constant of the bumper can be assumed to differ between directions). The state space of the problem is the vector of positions and velocities of the two bodies. The DoE space is a range chosen to represent a wide variety of speeds and stiffness constants. It was sampled using a space-filling sampling strategy [29]; the surrogate models were trained on 100 data points, and the data was generated using a stiff ODE solver with the solver parameters given in [28].
Table 7 shows the average MAE for the state variable v_2 for the linear and nonlinear projection methods, with all other hyper-parameters kept the same, for the test parameters listed in Table 5. v_2 was chosen to demonstrate the accuracy of the surrogates because, as visible in Figure 5, it has sharp, nonlinear, oscillatory transients starting from the moment of collision, which are difficult for surrogate models to capture. It can be seen from Table 7 and Figure 5 that the LPCTESN performs worse on the test cases than the NLPCTESN.
Of particular interest is the speedup obtained by using the surrogate: predictions of the solution are up to 200x faster, with the ODE solver taking roughly 0.02 seconds per solve of the system. This is important because the transients in Figure 5 occur on the same time scale, O(10^-2 s). With a 200x speedup, if the surrogate is deployed on board a vehicle, it can judge the severity of a collision much more quickly by computing impact forces from the predicted response, enabling closed-loop control and safety measures. In this case, it would effectively function as a digital twin for collision monitoring.

The POLLU Model
The CTESN approach is used to model the POLLU air pollution model developed at the Dutch National Institute of Public Health and Environmental Protection [20]. It consists of 20 species and 25 reactions, modeled by nonlinear ODEs of the form

dy/dt = P(y) − L(y) y,

where y is the concentration vector (in ppm) of the reacting species, P is the production term, and L is a diagonal matrix representing the loss term for every species in the system. Table 8 shows the production and loss rates for each species of the system. The values of the rate constants r are given in Table 11. The reader is referred to [20] for a complete description of the reaction system.
The system is a common benchmark for stiff ODE solvers and represents a difficult problem to solve due to the large number of species and ODEs involved. When such systems have to be solved at many grid points, say in a computational mesh, they represent a very expensive computation; this example is hence an ideal application for testing surrogates of stiff ODEs. In this work, the initial concentrations y_{2,0}, y_{4,0} and y_{7,0} of the respective species are simultaneously parametrized. The initial conditions for the rest of the species take their default values from [20].
The training data was sampled using 100 data points within this DoE space, selected randomly. The reservoir size N_r was set to 100, and the number of neighbors N_neigh queried by the k-NN RBF was set to 10. Table 10 shows the mean absolute errors computed across several test cases (listed in Table 9) for a few species in the reaction, for the linear and nonlinear projection methods. It can be observed that the nonlinear projection CTESN outperforms the linear projection CTESN when all other hyper-parameters are kept the same. Figure 6 shows the comparison graphically; the LPCTESN prediction is much noisier than the NLPCTESN prediction at later times. This was also observed with Robertson's equations, where the LPCTESN predictions tended to become noisier at larger time scales.

Table 8: Species involved in the POLLU reaction system, with their production and loss rates, derived from [20].

Discussion and Conclusion
From the numerical experiments conducted, it was observed that polynomial-augmented k-nearest neighbor RBFs outperform standard RBFs in accuracy when used within the CTESN surrogate modeling algorithm. It was also observed that the nonlinear projection (NLPCTESN) method consistently outperformed the linear projection (LPCTESN) method, achieving superior accuracy for identical hyper-parameters on a variety of problems. The NLPCTESN remained accurate across problems with sharp transients, varying time scales, and long integration horizons, whereas the LPCTESN had a higher generalization error on all test examples. The surrogate was also shown to achieve a speedup of several orders of magnitude over an ODE solver on a realistic collision problem, and could be used in applications such as design optimization and online collision severity monitoring.

There are several directions in which this work could be extended. The CTESN is a black-box, data-driven method, and future work should investigate how well the model learns the physics of the problem, or apply physics-constrained modeling approaches to the outputs of the CTESN. The speedup of the surrogate model becomes most apparent when solving many instances of the ODE, which happens either when the ODE system is very large or when a small system must be solved repeatedly, as in coupled ODE-PDE systems. Examples include chemically reacting flows similar to the POLLU system, in which a large stiff system of ODEs has to be solved at every compute node, and FEM-based crash solvers, which require accurate modeling of sharp transients similar to those demonstrated in this work.

B POLLU Reaction rates
The rates of reaction of the POLLU system are shown in Table 11.

Figure 1 :
Figure 1: Depiction of a standard Echo State Network.

Figure 2 :
Figure 2: Time history of y_2 for a test parameter set (P2 from Table 1). Each trial mentioned below is referenced from Table 2. The NLPCTESN prediction is the best out of all models. (a) Trial 1 (b) Trial 2 (c) Trial 3 (d) Trial 4 (e) Trial 5

Figure 3 :
Figure 3: Time history of y_2 for a test parameter set (P4 from Table 3). Each trial mentioned below is referenced from Table 4. The NLPCTESN performs the best out of all models. (a) Trial 1 (b) Trial 2 (c) Trial 3 (d) Trial 4 (e) Trial 5

Figure 4 :
Figure 4: The Sliding Basepoint system at its initial state and at a later time.

Figure 6 :
Figure 6: Solutions y_1, y_2, y_14 and y_20 for a test parameter (P2 from Table 9). The left and right columns contain results from the linear and nonlinear projections, respectively. The NLPCTESN outperforms the LPCTESN in this example.

Table 1 :
Test parameter values for rate parametrization of Robertson's ODE system.
Trial No. | N_r | N_data | N_neigh

Table 2 :
Average Mean Absolute Error (MAE) in y 2 , averaged across all test cases.

Table 3 :
Test parameter values for initial condition parametrization of Robertson's ODE system.
Trial No. | N_r | N_data | N_neigh

Table 4 :
Average Mean Absolute Error (MAE) in y_2, averaged across all test cases when parametrizing the initial condition. Table 4 tabulates the average MAE across all test cases for the problem, listed in Table 3.

Table 5 :
Test parameter values for the collision modeling problem

Table 10 :
Mean Absolute Error (MAE) for the POLLU problem, averaged across all test cases in Table 9.

Table 11 :
Reaction rates for the POLLU problem