Automated differential equation solver based on the parametric approximation optimization

Numerical methods for solving differential equations produce a discrete field that converges towards the solution, provided the method is applied to a suitable problem. Nevertheless, each numerical method converges, with a given parameter set or range, only on a restricted class of equations for which convergence is proved. Only a few "cheap and dirty" numerical methods converge on a wide class of equations without parameter tuning, at the price of a lower approximation order. The article presents a method that uses an optimization algorithm to obtain a solution via a parameterized approximation. The result may not be as precise as an expert-crafted one. However, it allows solving a wide class of equations in an automated manner without changing the algorithm's parameters.


Introduction
Differential equations, ordinary (ODE) and partial (PDE), are classical ways to express physical laws. The number of differential equations analyzed in the mathematical physics domain is limited, at least by the number of variational principles in physics. Modern data-driven methods [11,19,9] may be considered a source of not-yet-analyzed equations. During the equation discovery process, we obtain models of arbitrary form, which may be a challenge for modern equation solvers.
An expert can always find a numerical solution for the obtained equation. However, a precise numerical solution of an equation of an unknown type is a challenging task. There are many expert systems and solvers for ODE systems that alleviate the expert's work [17]. However, proper method selection does not guarantee convergence for an arbitrary equation, and the expert system should contain not only methods but also proper parameter sets. Most ODE solvers [5] are very precise tools, yet demanding with respect to the equation form and parameter tuning. The solution of partial differential equations is an undeniably traditional topic in mathematical physics and applications [12]. A wide variety of methods is established, from finite-difference schemes through the finite element method to modern spectral-like analytical methods. In the classical analysis, it is assumed that the operator properties and possible boundary condition types are known a priori, since they are defined by the type of the process and the physical nature of the problem considered.
arXiv:2205.05383v1 [math.NA] 11 May 2022
The classical finite-difference [21] and finite-element [20] (FEM) methods have established areas of applicability. For example, FEM is widely used to solve elliptic equations occurring in different areas, such as mechanics.
Without a doubt, decades of development made FEM a fast method for solving known physics- and mechanics-related problems [14]. However, it is not possible to apply finite-difference and finite element methods to arbitrary equations without significant research into every given problem. Finite-difference methods can be applied to linear equations; however, for every PDE solution problem it is required to linearize the equation, derive a finite-difference scheme, and study its stability.
Spectral methods [1] are the most modern analytical-numerical methods for PDE solution. However, their application to an arbitrary problem is restricted by automatic differentiation of the polynomial decomposition series, which restricts the class of solutions.
Emerging neural differential operator methods are somewhat dependent on the training dataset and require a neural network to be retrained every time a new problem arises [7]. Recent research shows that combining them with a transition to the spectral domain may be promising [6], even though applying them directly to the Fourier spectrum also restricts their applicability to linear methods.
To sum up, there is no universal method that solves every equation fast and precisely (and the numerous "no free lunch" theorems for various numerical algorithms suggest this is impossible for differential equations as well). As in the ODE case, a decision support tree could be expanded every time a new equation and/or boundary condition type appears. Therefore, such a system would exchange the mathematical physics expert's time for coding and tree-forming time. One could name the Wolfram Mathematica decision support tree for ODEs and PDEs as the system closest to universal. However, it has the drawback of being proprietary software and thus is not easily embedded in other algorithms.
Fig. 1: Rough typization of differential equation solvers. The class of equations to which a solver may be applied is shown with color (ODE (violet), PDE (blue), or both).
The differential equation solution scheme is based on the classification provided in Fig. 1. We note that real systems contain a much more extended classification. For every combination "equation - boundary condition type - domain", the expert can use a separate solver. For example, the Runge-Kutta method could be chosen for a non-stiff ODE with initial conditions and a connected domain. For stiff equations, more advanced ODE solvers are required. The stiffness assessment and the transformation of the equation and initial conditions to the conventional system form for Runge-Kutta-like methods are left out of scope. We emphasize that the equation transformation for a general equation type may be complicated and is usually done by an expert. In the PDE case, we see the same picture. For canonical elastic media equations with arbitrary connected domains, the FEM family is usually used. Other types of canonical equations usually have finite-difference solutions.
Another critical question is what should be done if the equation cannot be classified using Fig. 1. We obtain various non-canonical equations with non-canonical boundary conditions during the equation discovery process. Whenever a new equation appears, we should either extend the classification and add a new solver or apply all available solvers and possibly obtain an incorrect solution. Thus, it is urgent to solve equations in an automated and unified manner to compare solutions with observational data. The less supervised the solution algorithm is, the more viable equation discovery becomes. Despite being powerful tools in an expert's hands, most of the methods described above are usually applied manually to every given problem. Thus, the topic of automated solution of arbitrary differential equations arises in the literature [16].
The second layer of the problem is the choice of programming language. Historically, solvers are programmed with performance in mind, i.e., using C, C++, and Fortran. Namely, the well-known ODEPACK solvers (in Fortran) [3,4] and the solver included in the C++ Boost libraries [13] are widely used in science. However, modern machine learning and PDE discovery algorithms use Python as a standard, meaning that using machine learning tools contradicts the classical approach to differential equation solution. As a compromise, several methods were implemented in the Julia language [15]. Currently, the Julia language is not very popular among data scientists. In Python, established solvers are available (for example, the scikit-learn solver and the PyCheb package [8]). As an alternative approach, meaning a universal solver, possibly a single piece of software [10] could be named.
In this paper, we describe an automated algorithm for PDE solution using neural networks and its open-source implementation, which reduces computational time and can work with equation discovery algorithms written in Python. A similar approach has already been described in several references and is mainly known as PINN [18]. However, we aim only at the PDE solution part, and thus we omit the physics analogy and replace it with Sobolev space optimization, which is more suitable for general processes.
In contrast to the methods described in the literature, we aim to combine the ability to solve a wide class of equations with a high level of automation of the process. This means that, without the help of an expert, we combine classical numerical methods and neural networks to obtain a field that approximates the equation's solution without changing the algorithm's parameters. The obtained approximation may not have the best quality compared to classical solvers. The main advantage is that a solution is obtained for every equation regardless of its type. We use the Sobolev space norm to measure the solution's "quality". We validate the algorithm on classical ODEs, the Legendre equation and the Painlevé transcendents, and on PDEs: the wave equation, the heat equation, and the Korteweg-de Vries equation.
The paper is organized as follows: Sec. 2 contains the definitions and the description of the algorithm used in the article, Sec. 4 contains the application of the given algorithm to particular PDEs, and Sec. 5 summarizes the paper and proposes directions for future work.

Problem statement
From the mathematical point of view, the information about the equation during the discovery process is minimal, since the algorithms do not use any a priori assumptions about the process governing the data. Using the example of a time and single-dimensional space domain, we solve the boundary PDE problem defined on a subdomain (x, t) ∈ Ω ⊂ R² with a boundary ∂Ω in the form of Eq. 1:

Lu = f, (x, t) ∈ Ω
bu = g, (x, t) ∈ ∂Ω    (1)
In Eq. 1 we assume that the differential operator L and the boundary operator b and the arbitrary functions f, g are defined such that the boundary problem is correct. We do not know any a priori information on the type of L (it may be non-linear, of arbitrary order, with variable coefficients, order and coefficients are subject to change during the discovery process).
However, we may fix the form of the operator b. In classical equation discovery algorithms, boundary conditions are not used, and the equation is found using domain interior. If the dimensionality of data and operator agrees, we may impose classical Dirichlet conditions (function values on the boundary are fixed and taken from data for every equation) to solve a discovered equation. However, the order of operator L is usually not constant in the discovery process, and we may have to take non-conventional conditions, such as the data within the domain interior, to make the problem well-posed. It is worth mentioning that the problem posedness is not used in the algorithm, and theoretically, we may solve under and over-defined problems. In this case, boundary conditions are satisfied in an "averaged" manner.
Hypothesis: for such a problem setup, it is more natural to use the optimization methods to obtain the approximation parameters that solve the equation instead of a straightforward solution.

Theoretical formulation
Most numerical methods assume that the solution field is found on a discrete subset of Ω in the form of a mesh function. This means that for a two-dimensional equation we have the following function representation (Eq. 2):

ū = {u(x⁽ⁱ⁾, t⁽ⁱ⁾) : (x⁽ⁱ⁾, t⁽ⁱ⁾) ∈ X}    (2)

We emphasize that the approach also works for higher dimensions. Without loss of generality, we assume that the field discretization X = {x⁽ⁱ⁾, t⁽ⁱ⁾} ⊂ Ω is fixed during the process of PDE solution. For the experiments, we use a uniform mesh. However, the discretization for the method described below may be chosen arbitrarily.
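As a minimal sketch of the uniform discretization used in the experiments, the rectangle Ω = [0, 1] × [0, 1] can be turned into the point set X; the function name and ranges below are illustrative, not part of the described implementation:

```python
import numpy as np

def uniform_mesh(n_x, n_t, x_range=(0.0, 1.0), t_range=(0.0, 1.0)):
    """Uniform discretization X = {(x_i, t_j)} of the rectangle Omega."""
    x = np.linspace(*x_range, n_x)
    t = np.linspace(*t_range, n_t)
    xx, tt = np.meshgrid(x, t, indexing="ij")
    return np.stack([xx.ravel(), tt.ravel()], axis=1)  # one row per grid node

X = uniform_mesh(10, 10)  # e.g. the 10 x 10 grid used in the PDE experiments
```

Any other point set of the same shape could be substituted, since the method below does not rely on the mesh being uniform.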
We formulate a minimization problem to find the solution field as Eq. 3:

||Lū − f||ᵢ + λ||bū − g||ⱼ → min over ū    (3)
In Eq. 3, arbitrary norms || · ||ᵢ and || · ||ⱼ may be chosen; usually the l₂ and l₁ norms are taken. The operator L is assumed to be the "precise" operator that gives the exact value of the derivative of the original function u(x, t) at the mesh points. We note that in Eq. 3, λ is an arbitrarily chosen constant, which does not influence the resulting solution, only the convergence speed, if the boundary conditions are correctly defined. In this case, there is no doubt that the solution of the optimization problem converges point-wise to the boundary problem solution.
Since the solution is not known, we use numerical differentiation methods to obtain the values of Lū for a given solution approximation candidate ū. In practice, the differential and boundary operators are also approximations that have an error, and the minimization algorithm is a numerical algorithm with its own error. Therefore, the final problem solved in the article is formulated as Eq. 4:

||L̂ū − f||_X + λ||b̂ū − g||_X → min over ū    (4)
In Eq. 4, L̂ and b̂ are the approximate differential and boundary operators (meaning that the derivatives are replaced with approximations), X is the discretization, and the values of ū are taken at the given grid points. It should be emphasized that the choice of the approximation of the operator L̂ should be considered a separate problem.
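The discrete functional of Eq. 4 can be sketched for a one-dimensional operator L = d/dt; this is an illustrative toy (the operator, the helper name, and the l₂ choice are assumptions for the example), but it shows the structure: an interior residual over X plus a λ-weighted boundary residual:

```python
import numpy as np

def discrete_loss(u, t, f, boundary_idx, g, lam=1.0, h=1e-3):
    """Discrete functional of Eq. 4 for the toy operator L = d/dt.

    u: candidate solution as a callable; t: grid points X;
    f: right-hand side values at t; boundary_idx/g: condition u(t[i]) = g[i].
    """
    # Approximate operator L-hat via central differences; the step h is
    # independent of the grid X (the "mesh-free" viewpoint used later).
    Lu = (u(t + h) - u(t - h)) / (2 * h)
    interior = np.sum((Lu - f) ** 2)
    boundary = np.sum((u(t[boundary_idx]) - g) ** 2)
    return interior + lam * boundary

# Toy check: u(t) = t^2 solves u' = 2t with u(0) = 0 exactly.
t = np.linspace(0.0, 1.0, 11)
loss = discrete_loss(lambda s: s**2, t, 2 * t, np.array([0]), np.array([0.0]))
```

For the exact candidate the functional is (numerically) zero, while any wrong candidate yields a positive value, which is what the optimizer exploits.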

Numerical realization
The following questions should be answered in order to solve the problem with any numerical PDE solution method: how is the function represented; how do we take the derivative; and how do we obtain the solution approximation parameters.
The numerical solution scheme is shown in Fig. 2. As an example, for a finite-difference scheme we represent the function as its values at the discrete grid points using a matrix or, more generally, an n-dimensional array representation (Eq. 5):

ū = {u_{i,j}} = {u(x_i, t_j)}    (5)

In this case, numerical differentiation is a condition that binds adjacent nodes and forms a system of linear equations. For example, we can use the first-order finite-difference scheme (the approximation order is O(h), where h is the uniform grid step in the discretized dimension). We use both the forward and backward schemes for the boundaries in the form of Eq. 6:

∂ū/∂x (x_i, t_j) ≈ (u_{i+1,j} − u_{i,j}) / h (forward),  ∂ū/∂x (x_i, t_j) ≈ (u_{i,j} − u_{i−1,j}) / h (backward)    (6)
We note that in Eq. 5 and Eq. 6 different notations for the grid position are used. For the interior points we use the central scheme Eq. 7, which is more stable and of higher order:

∂ū/∂x (x_i, t_j) ≈ (u_{i+1,j} − u_{i−1,j}) / (2h)    (7)
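The stated orders of the two scheme families can be checked numerically: halving h should roughly halve the forward-difference error (O(h)) and quarter the central-difference error (O(h²)). The sketch below does this for u = sin, whose derivative is known:

```python
import numpy as np

def forward_diff(u, x, h):   # Eq. 6-style one-sided scheme, O(h)
    return (u(x + h) - u(x)) / h

def central_diff(u, x, h):   # Eq. 7-style central scheme, O(h^2)
    return (u(x + h) - u(x - h)) / (2 * h)

# Error ratios under h -> h/2 expose the approximation order.
x, exact = 0.3, np.cos(0.3)
err = lambda scheme, h: abs(scheme(np.sin, x, h) - exact)
r_fwd = err(forward_diff, 1e-3) / err(forward_diff, 5e-4)   # ~ 2
r_cen = err(central_diff, 1e-3) / err(central_diff, 5e-4)   # ~ 4
```

This is why the central scheme is preferred in the interior, while the one-sided schemes are only needed where a neighbor is missing.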
As the third component of the numerical PDE solution algorithm, the schemes Eq. 6-Eq. 7, together with the boundary conditions, are used to form a system of n × m equations for the values at the grid points. We note that such simplicity is not typical for finite-difference schemes; usually, an expert is required to form the finite-difference scheme and the solution.
All three components are required to define most numerical PDE solution algorithms. To make the process of PDE solution more automated, we use machine learning models.
As the first component of the numerical solution, we approximate the solution u(x, t) of an equation Lu = f with a continuous parametrized function ū(x, t; Θ) : R² → R represented by a machine learning model. The parameter set Θ = {θ₁, ..., θ_N} is an arbitrary set that determines the predefined function form; as the simplest example, ū(x, t; Θ) = θ₁x + θ₂t + θ₃ is a linear regression.
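The point is that any parametrized callable works as the approximator. A sketch, assuming nothing about the paper's actual architecture: the linear regression named in the text, plus a one-hidden-layer tanh network with illustrative sizes and initialization, both exposing the same ū(x, t; Θ) interface:

```python
import numpy as np

class LinearModel:
    """The simplest parametrized form from the text:
    u(x, t; Theta) = theta_1 * x + theta_2 * t + theta_3."""
    def __init__(self, theta):
        self.theta = np.asarray(theta, dtype=float)

    def __call__(self, x, t):
        th1, th2, th3 = self.theta
        return th1 * x + th2 * t + th3

class TinyMLP:
    """One-hidden-layer tanh network standing in for the neural
    approximator; hidden size and random init are illustrative."""
    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(2, hidden)) / np.sqrt(2)
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(size=(hidden,)) / np.sqrt(hidden)
        self.b2 = 0.0

    def __call__(self, x, t):
        z = np.stack([np.atleast_1d(np.asarray(x, dtype=float)),
                      np.atleast_1d(np.asarray(t, dtype=float))], axis=1)
        return np.tanh(z @ self.W1 + self.b1) @ self.W2 + self.b2
```

Swapping one model for the other changes only Θ and the function class, not the rest of the pipeline, which is what makes the scheme modular.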
As the second component we use finite-difference schemes Eq. 6-Eq. 7. We explicitly build the finite difference schemes and combine them to a complete operator for the higher-dimensional derivatives to speed up the computation. However, the finite differences are not used to bind the values in the adjacent grid points. Instead, we find an approximation of whole tangent spaces representing required order derivatives at the given point.
To find the parameter set Θ = {θ₁, ..., θ_N}, we use the formulation of the problem Eq. 4 in the form of Eq. 8:

Θ_opt = argmin over Θ of ( ||L̂ū(x, t; Θ) − f||_X + λ||b̂ū(x, t; Θ) − g||_X )    (8)
In this case, we obtain a so-called "mesh-free" method: for the neural network, the differentiation step h may be chosen arbitrarily, without connection to the discretization grid X. Such solutions are usually referred to as "mesh-free".
We note that the schemes Eq. 6-Eq. 7 are proven in finite-difference analysis to converge at the grid points; thus, taking h equal to the grid resolution is close to optimal. We note that different schemes, such as ones using x + 1/2 h, exist but are not considered for brevity.
To sum up, we encode every operator for use in the neural network training process. The same procedure is done for the boundary conditions. Starting from an arbitrary set of neural network weights Θ_init, we use the optimization algorithm to obtain the optimal parameter set Θ_opt that minimizes the difference between the operator applied to the function approximated by the neural network, L̂ū, and the function f over all discretization points X. Additionally, we add the difference between the applied boundary operator and the function g.
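An end-to-end sketch of this optimization viewpoint on a toy problem, under stated assumptions: u'(t) = u(t) with u(0) = 1 on [0, 1] (exact solution eᵗ), a polynomial model instead of a neural network, and, since the residual is then linear in Θ, the Eq. 8-type functional minimized in closed form by linear least squares rather than by gradient descent:

```python
import numpy as np

# Toy problem: u'(t) = u(t), u(0) = 1 on [0, 1]; exact solution u = e^t.
# Model: u(t; Theta) = sum_k theta_k * t^k (polynomial stand-in for the NN).
t = np.linspace(0.0, 1.0, 20)    # discretization X
h = 1e-3                         # differentiation step, independent of X
deg, lam = 6, 1e3                # model size and boundary weight lambda

# Column k of A holds (central-difference derivative of t^k) - t^k,
# i.e. the approximated operator (L-hat - I) applied to each basis function.
A = np.column_stack([
    ((t + h) ** k - (t - h) ** k) / (2 * h) - t ** k
    for k in range(deg + 1)
])
e0 = np.zeros(deg + 1)
e0[0] = 1.0                      # boundary operator: u(0; Theta) = theta_0

# Minimize ||A @ theta||^2 + lam * (theta_0 - 1)^2 as stacked least squares;
# a neural network model would run gradient descent on the same functional.
M = np.vstack([A, np.sqrt(lam) * e0])
rhs = np.concatenate([np.zeros(len(t)), [np.sqrt(lam)]])
theta = np.linalg.lstsq(M, rhs, rcond=None)[0]

u_at_1 = np.polynomial.polynomial.polyval(1.0, theta)  # close to e
```

The boundary condition is satisfied only in the "averaged" sense weighted by λ, exactly as described for the general method.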
It should be noted that such a learning process differs from the classical neural network training process, where L_p norm convergence is usually considered. In this case, the neural network converges rather in a Sobolev space [2].

Modular approach
To emphasize that the proposed scheme is not unique, we propose the module structure of the resulting solver shown in Fig. 3. As an alternative, we can use two components of the finite-difference method, Eq. 5 and Eq. 6-Eq. 7, for the differential and boundary operator approximation. In this case, the system is not solved directly; instead, the matrix values at the grid nodes are optimized so that they satisfy the finite-difference scheme and the boundary conditions via Eq. 4. Such an approach trades quality and continuity for convergence speed on lower-dimensional problems.
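A sketch of this "matrix" module on a toy boundary value problem: the unknowns are the grid values themselves, and, since the toy problem is linear, the optimum of the Eq. 4-type functional is obtained by solving the assembled linear system directly (the problem, u''(t) = 6t with u(0) = 0, u(1) = 1 and exact solution u = t³, is chosen for illustration and is not from the paper):

```python
import numpy as np

# "Matrix" module sketch: unknowns are the grid values u_i themselves.
# Toy BVP: u''(t) = 6t, u(0) = 0, u(1) = 1; exact solution u(t) = t^3.
n = 50
t = np.linspace(0.0, 1.0, n)
h = t[1] - t[0]

A = np.zeros((n, n))
rhs = np.zeros(n)
A[0, 0] = 1.0;   rhs[0] = 0.0      # left boundary condition
A[-1, -1] = 1.0; rhs[-1] = 1.0     # right boundary condition
for i in range(1, n - 1):          # Eq. 7-style central second difference
    A[i, i - 1] = 1.0 / h**2
    A[i, i]     = -2.0 / h**2
    A[i, i + 1] = 1.0 / h**2
    rhs[i] = 6.0 * t[i]

u = np.linalg.solve(A, rhs)        # grid values satisfying scheme + BCs
```

For a non-linear equation the same node values would have to be found iteratively, which is where the optimization formulation of Eq. 4 takes over.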
For neural network training, we use gradient descent. However, the algorithm may be replaced with evolutionary optimization to make the convergence process more stable, owing to the possibility of averaging stochastically obtained results. Theoretically, the convergence process may be more stable if proper mutation and cross-over operators are used.
Our experiments show that the problem formulation Eq. 4 is more important than the particular realizations of the field approximations, numerical differentiation realization, optimization algorithm, and initial field interpolation or neural network solution upscaling. Thus, we mark corresponding modules in the scheme as replaceable.

Caching of approximate models
We use initial field interpolation to achieve faster and better convergence. This module replaces the "initial field interpolation" module shown in Fig. 3. The effect of the initial guess on the optimization is shown in Sec. 4. It is realized as a "cache" of models for neural networks. As the initial step of every algorithm run, we search the library for the model, of the same or another architecture, with the lowest Sobolev space norm (the sum over all grid points X of the functional in Eq. 8) for the given equation and boundary conditions. If the model has a different input or output dimension, we change the input or output layer and compute the norm. If the models' architectures differ, we train the input architecture on the values of the "cached" one.
After the algorithm stops, the weights of the neural network and the optimizer state (gradient value and related gradient parameters) are saved for further use. Thus, when the same equation is solved again, the shortest possible time is needed to obtain the approximation. The caching technique makes the optimization process exploit the existing solution, and thus the optimization tends to converge to the same local optima. To avoid this locality of the solution, the network weights are perturbed every time the best model is taken from the cache.
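The save/warm-start/perturb cycle can be sketched as follows; the class, its keying by a problem "signature" string, and the Gaussian perturbation magnitude are all illustrative assumptions, not the paper's API:

```python
import numpy as np

class ModelCache:
    """Sketch of the caching module: keep the best parameters per
    problem signature and warm-start later runs from a perturbed copy."""
    def __init__(self, noise=1e-2, seed=0):
        self._store = {}                        # signature -> (loss, params)
        self._rng = np.random.default_rng(seed)
        self.noise = noise

    def save(self, signature, loss, params):
        best = self._store.get(signature)
        if best is None or loss < best[0]:      # keep only the best model
            self._store[signature] = (loss, np.array(params, copy=True))

    def warm_start(self, signature, default):
        """Perturbed cached parameters, or `default` on a cache miss."""
        if signature not in self._store:
            return np.array(default, dtype=float, copy=True)
        _, params = self._store[signature]
        # Perturb to avoid re-converging to exactly the same local optimum.
        return params + self.noise * self._rng.normal(size=params.shape)

cache = ModelCache()
cache.save("kdv|dirichlet|10x10", loss=0.5, params=[1.0, 2.0])
cache.save("kdv|dirichlet|10x10", loss=0.1, params=[0.9, 2.1])  # better run
theta0 = cache.warm_start("kdv|dirichlet|10x10", default=[0.0, 0.0])
```

In the described solver the stored state also includes the optimizer's gradient parameters; the sketch keeps only the weights for brevity.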
As a result, the proposed approach can solve ODEs and PDEs alike without parameter changes. Even though parameter tuning may reduce the optimization time, the overall solution quality remains the same over a broad parameter range. Such an approach does not challenge the classical methods. On the contrary, the solution in the most challenging cases may be incorrect, but the approach allows one to compare two equations during the discovery process without the algorithm stopping due to "inappropriate equation for solution" errors.

Numerical experiments
The following experiments show the broad range of equations that can be solved with a single neural network architecture and hyperparameter set. Every experiment and picture is supported by a repository 1 with code and experimental data. The experiments show that the cache allows faster convergence, that adding points to the grid leads to a better solution, and that the error between the exact and obtained solutions is negligible for the equation discovery application.
We note that using neural networks does not make it possible to reproduce singularity points. Therefore, all equations are considered in variable ranges that contain no singularities.
As the exact solution, we chose the Wolfram Mathematica 13 DSolve or NDSolve output unless stated otherwise.

Ordinary differential equations
Ordinary differential equations may be used to determine the possible class of functions obtained as a solution. This subsection considers two different sets of equations: the Legendre equation and the Painlevé transcendents.
Legendre equation. The Legendre equation may determine the maximal Taylor series decomposition order that may be obtained as a solution. In this section, we consider the problem in the form of Eq. 9:

(1 − t²) u''(t) − 2t u'(t) + n(n + 1) u(t) = 0    (9)

The solution of the problem is a Legendre polynomial of degree n.
In Eq. 10, L_n(t) is the Legendre polynomial of degree n:

u(t) = L_n(t)    (10)

As seen in Fig. 4 (left), a reasonable initial guess, as expected, makes the optimization converge faster. As a drawback, such an approach makes the optimization more "stiff", and thus, in some cases, it may get stuck in local minima. The initial guess parameters are perturbed with noise of small magnitude to partially mitigate this stiffness.
The root mean square errors (RMSE) for the same setup are shown in Fig. 4 (right). The error is computed using the analytical solution, the Legendre polynomial values of the corresponding order in the range t ∈ [0, 1] (the half range is taken due to the symmetry property), for every grid point. Since the maximal value of a Legendre polynomial on the range t ∈ [0, 1] is 1, the RMSE can be interpreted as an absolute error.
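This error metric can be sketched directly; the helper name and point count are illustrative, and the reference values come from NumPy's Legendre basis rather than the paper's code:

```python
import numpy as np

def legendre_rmse(u_approx, n, num_points=100):
    """RMSE against the analytical solution P_n on t in [0, 1]
    (half range, exploiting the symmetry noted in the text)."""
    t = np.linspace(0.0, 1.0, num_points)
    exact = np.polynomial.legendre.Legendre.basis(n)(t)
    return float(np.sqrt(np.mean((u_approx(t) - exact) ** 2)))

# Since max |P_n| = 1 on [0, 1], the RMSE reads as an absolute error:
# the exact P_3 gives ~0, a uniformly offset candidate gives its offset.
exact_p3 = lambda t: 0.5 * (5 * t**3 - 3 * t)
rmse_exact = legendre_rmse(exact_p3, 3)
rmse_off = legendre_rmse(lambda t: exact_p3(t) + 0.01, 3)
```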
In Fig. 4 (right) we see that the locality of the solution described in Sec. 3.4 appears when the cache is used. Namely, the error dispersion is lower when an initial guess is used. As a positive effect, this allows the algorithm to find a solution with a lower Sobolev space norm. Therefore, the initial guess allows faster convergence, and the solution is likely to have a better norm.
Overall, the ability of the solver to converge towards a Legendre polynomial solution means that the solver can converge towards any analytical solution (a solution that may be represented in the form of a Taylor series). We may also answer the question of the maximal Taylor series decomposition order: we obtain at least the ninth term of the decomposition with good precision (less than 10% error) for 100 points, without parameter changes or an increase in optimization time.
To reduce the error and to show a different module from Fig. 3, we show the solution using the matrix differentiation prototype. The results for the same grid setup for the Legendre equation Eq. 9 are shown in Fig. 5. Comparing the matrix and neural network approximations, we see that whereas the matrix version has a lower error, it has a higher computation time. Thus, additional techniques, such as consequent field up-sampling or another form of result "caching", are required. The matrix-based algorithm is, however, out of the paper's scope; therefore, further experiments are conducted using the neural network approximation only.
Painlevé transcendents. The Painlevé transcendents are a series of differential equations, each with a different special function class in its solution, meaning that a higher order of transcendent moves a step closer to the general hypergeometric function. This subsection considers different aspects of the solution of the Painlevé transcendents as essentially non-linear ODEs with variable coefficients. The scheme of functions that form the general solution of a given Painlevé transcendent is shown in Fig. 6.
Fig. 6: Scheme of the Painlevé transcendents' solution "complexity". Each next class of functions contains the previous one. The color shows the predicted relative "performance" of a conventional solver for every transcendent.
The exact initial and boundary problems are placed in Appendix A, since the form of the equation is not important for further experiments. We note that the value range is taken such that the solution does not contain any singularity points.
Whereas the Legendre equation is a relatively simple linear equation with variable coefficients, the Painlevé transcendents are significantly non-linear and have a wider solution space than the polynomials. Additionally, the maximal sequential number of the transcendent shows which class of functions the solver can reproduce. A hypergeometric function is the broadest possible class for a real-valued ODE solution.
We note that the solver parameters and the neural network model are the same as for the Legendre polynomial in Sec. 4.1 in most Painlevé transcendent experiments (with a possibly variable stop criterion, meaning that in some cases the optimization is stopped earlier or later depending on the equation).
For the first three Painlevé transcendents, we repeat the same experimental setup as in Sec. 4.1. However, this experiment series is used to show convergence by changing the number of uniformly taken points grid_res from a value range t ∈ T. The error and optimization time distributions are shown in Fig. 7. The error is computed using the Wolfram Mathematica 13 numerical solution at the optimization grid points.
As seen from Fig. 7, using a proper initial weight distribution reduces optimization time and possibly leads to a lower error.
More complex transcendents have increased solving time, as shown in Fig. 8 (left) (we emphasize that a logarithmic scale for time is used).
We note that the optimization time mostly depends on the number of terms in the equation; assessing the influence of the solution complexity itself would require a separate study. For all six Painlevé transcendents, the mean error over all experiments using the cache (for PIV-PVI only one run was performed) is shown in Fig. 8 (right). We emphasize that the error is normalized by the maximum error achieved during all experiments over all grid points.
In summary, the algorithm "converges" towards a solution for every transcendent when there are no discontinuity points. Thus, we may conclude that the optimization problem always has a solution close to the true solution in the range where the solution is analytic. However, such a statement requires a more rigorous proof that is out of the paper's scope.

Partial differential equations
Unlike ODEs, PDEs contain an interconnection between derivatives with respect to different variables. Therefore, the ability to solve ODEs does not imply the ability to solve PDEs. In this section, two canonical equations, the wave and heat equations, are considered a "lower" bound of the complexity, and the Korteweg-de Vries equation is considered a more complex example of a non-linear equation. We note that it is impossible to pick an "upper" bound for PDE solution due to the non-classifiability of the characteristic surface of equations starting from order three.
Wave equation. We assess the convergence of the algorithm on the solution of the wave equation with boundary conditions in the form of Eq. 11. We use the formulation Eq. 3 to obtain the solution of the equation for ten runs, consequently increasing the number of discretization points from 10 × 10 points in Ω (since the mesh is assumed uniform, this equals h = 1/(N − 1) = 1/9, where N is the number of points in the time and space dimensions, i.e., we take 10 points in the range [0, 1] including the boundaries) to 100 × 100 points with a step of 100 points. Following the ODE experiments, we show only the "cache" version, meaning that every time we start with the best possible initial neural network weights (see Sec. 3.4).
We take the analytical solution from the Wolfram Mathematica 13 software as the exact solution. The solution has an analytical form and is evaluated at the grid points for every grid used in the optimization process. We record the optimization time and the root mean square error (RMSE) between the Wolfram Mathematica solution and the proposed algorithm's solution on the same grid. The time and error distributions for 10 runs are shown in Fig. 9 (left).

Heat equation
The second equation is a typical equation of parabolic type, the heat equation. To demonstrate that the algorithm converges in the case of an incorrectly posed boundary problem, we use the formulation shown in Eq. 12. The boundary problem Eq. 12 for the heat equation has an analytical solution in the form of Eq. 13.
u(x, t) = 500 + x + …    (13)

We use the same grid setup for the experiments as for the wave equation in Sec. 4.2: ten runs from 10 × 10 to 100 × 100 points taken uniformly from Ω = [0, 1] × [0, 1]. The error in this case is computed using the analytical solution Eq. 13 with the first 100 terms of the sum taken. The error and time plots are shown in Fig. 9 (middle).

Korteweg-de Vries equation
To show a more sophisticated PDE solution, the Korteweg-de Vries equation (Eq. 14) was used.
The following forcing, initial, and boundary conditions were applied, as shown in Eq. 15:

f(x, t) = cos t sin x,  u(x, 0) = 0    (15)

Due to the extended computation time, the KdV equation solution was tested on 10 × 10, 20 × 20, and 30 × 30 points taken uniformly from the range x × t ∈ [0, 1] × [0, 1]. The results of 10 consequent runs for every experiment are shown in Fig. 9 (right). We note that the initial solution was obtained within a 2000-second time range.
Even though the computation time is higher than in the ODE case, the algorithm allows obtaining at least an approximate coarse solution, which can be used in an equation discovery algorithm within a reasonable time. Moreover, the error is below 10% of the maximal field value on the initial 10 × 10 grid.

Conclusions
The paper proposes a unified numerical differential equation solver based on optimization methods. It has the following advantages:
- It can solve PDEs without the involvement of an expert, which is most useful for data-driven equation discovery methods;
- Its precision is good enough for the equation discovery application;
- It can be easily parallelized;
- It has a flexible modular structure: the modules can be replaced to achieve better speed or better precision.
It is also seen that the optimization time is the main drawback of the method's experimental realization. We propose the following optimization speed-up directions:
- Usage of the power of GPUs to perform the optimization using fast memory and built-in matrix instructions;
- Better usage of the initial approximation;
- More intelligent use of numerical differentiation;
- Usage of joint neural network and matrix-based optimization methods.
All experimental data and scripts that allow reproducing the experiments are available at the GitHub repository 2 .

Acknowledgments
This research is financially supported by The Russian Scientific Foundation, Agreement #21-71-00128.