Theory of Connections Applied to Support Vector Machines to Solve Differential Equations

Differential equations are used as numerical models to describe physical phenomena throughout the field of engineering and science, including heat and fluid flow, structural bending, and systems dynamics. Although there are many other techniques for finding approximate solutions to these equations, this paper looks to compare the application of the Theory of Connections (ToC) with one based on Support Vector Machines (SVM). The ToC method uses a constrained expression (an expression that always satisfies the differential equation constraints), which transforms the process of solving a differential equation into an unconstrained problem, and is ultimately solved via least-squares. In addition to individual analysis, the two methods are merged into a new methodology, called constrained SMVs (CSVM), by incorporating the SVM method into the ToC framework to solve unconstrained problems. Numerical tests are conducted on three sample problems: one first order linear ODEs, one first order non-linear ODE, and one second order linear ODE. Using the SVM method as a benchmark, a speed comparison is made for all the problems by timing the training period, and an accuracy comparison is made using the maximum error and mean-squared error on the training and test sets. In general, ToC is shown to be slightly faster (by an order of magnitude or less) and more accurate (by multiple orders of magnitude) over the SVM and CSVM approaches.


INTRODUCTION
Differential equations (DE) and their solutions remain an important topic in science and engineering. The solutions drive the design of predictive systems models and optimization tools. Currently, these equations are solved by a variety of existing approaches with the most popular based on the Runge-Kutta family [1]. Other methods include those which leverage low-order Taylor expansions, namely Guass-Jackson [2] and Chebyshev-Picard Iteration [3], [4], [5] which have proven to be highly effective. More recently developed techniques are based on spectral collocation methods [6]. This approach discretizes the domain about collocation points, and the solution of the DE is expressed by a sum of "basis" functions with unknown coefficients that are approximated in order to satisfy the DE as closely as possible. Yet, in order to incorporate boundary conditions, one or more equations must be replaced by equations to enforce the constraints.
The Theory of Connections (ToC) is a new technique that first analytically derives a constraint expression which satisfies the problem's constraints exactly while maintaining a function that can be freely chosen [7]. This process transforms the DE into an unconstrained optimization problem where the free function is used to search for the solution of the DE. Prior studies [8], [9], [10], have defined this free function as a summation of basis functions; more specifically, Chebyshev polynomials. Motivated by recent results that solve ordinary DEs with support vector machines (SVMs) [11], this article explores the ability to solve DEs by leveraging the flexibility of the ToC framework and incorporating an SVM as the free function. The SVM model for function estimation [12] also uses a linear combination of functions that depend on the input data points. While in the first uses of SVMs the prediction for an output value was made based on a linear combination of the inputs x i , a later technique uses a mapping of the inputs to feature space, and the model SVM becomes a linear combination of feature functions ϕ(x). Further, with the kernel trick, the function to be evaluated is determined based on a linear combination of kernel functions, with Gaussian kernels being a popular choice.
This article compares this combined method, referred to hereafter as CSVM for constrained SVMs, to vanilla versions of ToC [8], [9] and SVM [11] over a variety of DEs. In all cases, the vanilla version of ToC outperforms both the SVM and the CSVM methods in terms of accuracy and speed. However, the CSVM method is easily scalable to higher dimensional problems, PDEs for example, whereas the vanilla ToC is not. In addition, the CSVM method satisfies the boundary conditions of the problem exactly, whereas the vanilla SVM method solves the boundary condition with the same accuracy as the remainder of the data points in the problem. Thus, the CSVM method combines the scalability of the SVM method with the exact boundary condition satisfaction of ToC. Of course, SVMs are not the only machine learning algorithm that can be substituted as the free function in ToC. Any machine learning algorithm that is differentiable up to the order of the DE can be seamlessly incorporated in ToC. Future work will leverage this benefit to analyze the ability to solve DEs via neural networks.

BACKGROUND ON THE Theory of Connections
The Theory of Connections (ToC), as established in [7], is a generalized interpolation method. This technique allows for the construction of all functions satisfying k linear constraints with the general form, where g(t) represents a "freely chosen" function, η i are the coefficients derived from the k linear constraints, and p i (t) are user selected functions that must be linearly independent from g(t). Using this approach, least-squares solutions for initial-value (IVP), boundary-value (BVP), and multivalue (MVP) problems on both linear [8] and nonlinear [9] DEs have been solved with machine error accuracy. Additionally, this method extends to relative, integral, and infinite constraints [10], establishing a unified framework to solve DEs. This method is best understood through example. Let us consider a general DE subject to initial value constraints on the function and its derivative such that, L(ÿ,ẏ, y, t) = 0 subject to: To solve this DE we start with Eq. (1) where k = 2 and p i (t) is chosen to be p 1 (t) = 1 and p 2 (t) = t. Using these parameters, Eq. (1) becomes, To apply the constraints, this equation is evaluated at the constraint locations producing two equations: where the only unknowns are η 1 and η 2 . With these equations a linear system can be constructed, and the two unknowns are solved through matrix inversion, leading to the η coefficients of These two terms are now plugged into Eq. (2) to form the final expression for y(t) that always satisfies the constraints regardless of the function chosen for g(t), known as the constrained expression. Analyzing this equation, it can be seen that if t = t 0 then the equation reduces to y(t 0 ) = y 0 , and if the derivative is taken and also evaluated at t = t 0 then it becomesẏ(t 0 ) =ẏ 0 . Now, with the constraints fully incorporated in Eq. (3) and g(t) representing all possible functions passing through those constraints, the solution to the DE is found by "searching" for g(t). In prior work [8], [9], this was done by representing g(t) as a basis of Chebyshev orthogonal polynomials, where ξ is the unknown vector of coefficients to be solved and h(x) represents a vector of m basis functions. In general the independent variable t ∈ [t 0 , t f ] does not coincide with the range of Chebyshev polynomials which are strictly defined on the interval x ∈ [−1, +1]. Therefore, a change of variable is needed in order to correctly incorporate these basis functions. This can be done by linearly mapping x to t, It follows that the first derivative of g(t) becomes, where it can be seen that Using this, the constrained expression defined in Eq. (3) and its derivatives become, which can be written in the form, where a(x) is zero when evaluated at the boundary conditions and b(t) is equal to the constraints. The constrained expression and its derivatives can be plugged into the DE reducing the equation to an expression with the only unknown being the vector ξ.
The resulting equation is discretized over a set of N values of x (and inherently t). In expressing the solution of a DE as a basis of Chebyshev orthogonal polynomials, the optimal point distribution is provided by collocation points [13], [14], defined as, where N is the number of point in the interval. As compared to the uniform distribution, the collocation point distribution is the most efficient way to numerically represent Chebyshev polynomials. By discretizing the DE over the domain of x creates a system of equations that is linear with respect to the constant term ξ. Then, Eq. (6) can be rewritten in the form, In prior work [8], [9], a scaled QR decomposition approach was determined to be optimal, and thus, Eq. (7) can be solved by, In the case of a nonlinear DE an iterative least-squares approach is needed, which is developed in Ref. [9]. The equation is solved by first using an initial guess for ξ 0 , obtained by fitting an initial estimated solution of the DE, and then starting the nonlinear Newton iteration whose k-th iteration is, where L k is the loss function defined by the residuals of the DE which are specified using ξ k at all values of the vector x. The convergence is obtained when the L 2 norm, where ε is a given tolerance. In the end, the final value of ξ constitutes the solution of the DE. This ξ vector is plugged back into the constrained expression Eq. (4) providing an analytical solution of the DE that always satisfies the constraint conditions. This feature allows for analytical differentiation and integration of the solution.

AN OVERVIEW OF SVMS
The technique of Support Vector Machines (SVMs) was originally introduced to solve classification problems (as pattern recognition [12]). A classification problem consists of determining if a given input, x, belongs to one of the two possible classes. The proposed solution was to find a decision boundary surface that separates the two classes. The equation of the separating boundary could depend only on a few input vectors called the support vectors.
The first algorithm that learned from examples to perform classification was Rosenblatt's perceptron introduced in 1957. One of the ideas introduced with the perceptron was to use a mapping of the inputs x to a feature space ϕ(x), and find a decision surface that separates the classes in the new feature space.
A set of examples (x 1 , y 1 ), . . . (x n , y n ) is available, where y i = 1 if x i belongs to the first class and y i = −1 if x i belongs to the second class. The separating hyperplane is built using a linear combination of feature vectors with the weights w to be determined such that the model performs the correct classification for new inputs.
Later the ideas were extended to function estimation. A function that is linear in parameters is used as a model, The perceptron algorithm updated the weights iteratively, and found a separating boundary, but not an optimal one. The initial SVM algorithm introduced in the 1960's was looking for an optimal boundary, but the search for a separating boundary in the feature space was not used until the 1980s. Vapnik in [12] formulates the goal of the statistical learning models as finding the function that minimizes the empirical risk on the set of functions from the class chosen as the model, The empirical risk defined above is the Mean Squared Error (MSE) for the function f that tries to estimate the output y given input-output data pairs (x 1 , y 1 ), . . . , (x n , y n ) like in linear regression. Minimizing the empirical risk is an optimization problem, for which one of the popular solutions is to use a gradient descent algorithm, updating the parameters (weights) of the model iteratively.
In 1959, the LMS (least mean squares) algorithm was introduced as a simple way of training linear adaptive systems with mean square error minimization. LMS uses steepest-descent algorithm to update the weight vector, where w n is the weight vector, h is the step size, x n is the input vector and e n = y n − f n .
Without mapping to feature space, working with the original input vectors x multiplied by the weights w the function to minimize is written as, In this context the function to be minimized is called a cost function and it gives rise to the usual least squares regression model which can be solved either by an iterative gradient descent algorithm or by using the normal equations, which require finding the inverse of the matrix X T X where X is the matrix formed by using the input vectors as columns. The regression problem also has a probabilistic interpretation. Minimizing the empirical risk also corresponds to finding the Maximum Likelihood (ML) parameters when measurements are assumed to be corrupted by Gaussian noise Minimizing the expected square error is equivalent to minimizing the logarithm of the likelihood of the data, which is a product of the probabilities of observing the given outputs for the selected values of the parameters. For normally distributed errors, the probabilities are proportional to exponential functions with the negative of the squared error as the argument. The density of the normal distribution gives from which we can write the conditional probabilities The independence of the errors allows writing the likelihood of the model as the probability of the observed y's given the inputs x, By taking a logarithm, maximizing the likelihood L is seen as equivalent to minimizing the sum of squared errors (the least squares problem) [15]. In the SVM model for classification an additional term, b, called bias, is added to the linear combination of features ϕ(x) with weights w i leading to the model, Assuming that the training data is separable by a linear decision boundary, a separating hyperplane with equation The parameters are rescaled such that closest training point, to the hyperplane, let's say (x k , y k ), is on a parallel surface with equation w T ϕ(x) + b = 1. By using the formula for orthogonal projection, if x satisfies the equation of the hyperplane then the distance to the origin of the space has magnitude given by w T x/w T w.
It follows that the distance between the two hyperplanes, called the "separating margin" is 1/w T w. Thus to find the largest separating margin, one needs to minimize w T w. Instead of just maximizing the margin between separating hyperplanes, a regularization term is added to the objective function,

The kernel trick
In Eq. (8) the function ϕ(t) always shows up in a dot product, and thus the dot product kernel can be used in the computation, Dot products involving derivatives can be expressed in terms of the partial derivatives of the kernel. We will use the notation, where K is a kernel, popularly defined as the RBF (radial basis function) kernel. This is also known as Gaussian kernel, and has the formula,

Using SVMs to Solve Differential Equations
We follow the method of solving DEs using RBF kernels proposed in [11]. As an example, we take a first order linear initial value problem, y − p(t)y = r(t), subject to: y(t 0 ) = y 0 , to be solved on the interval [t 0 , t f ]. The domain is partitioned into N sub-intervals using grid points t 0 , t 1 , . . . , t N = t f , which, from a machine learning perspective, represents the training points. The model given in Eq. (8), is proposed for the solution y(t), where the functions ϕ(t) have the property that, and the parameter σ is a value that is learned in the training, together with the coefficients w i , i = 1, . . . , N . Note that the number of coefficients w i equals the number of grid points t i , and thus the system of equations used to solve for the coefficient is a square matrix. Let e be the vectors of residuals obtained when we use the model solutionŷ(t) in the DE, that is, e i is the amount by whichŷ(t i ) fails to satisfy the DE,ŷ which gives, and for the initial condition, it is desired that, is satisfied exactly. In order to have the model close to the exact solution, the sum of the squares of the residuals, e T e, is to be minimized. This expression can be viewed as a regularization term added to the objective of maximizing the margin between separating hyperplanes. The problem is formulated as an optimization problem with constraints: subject to: Using the method of Lagrange multipliers, a multiplier (factor) is introduced for each constraint, introducing a penalty term to me minimized, leading to Eq. (9). The values where the gradient of L is zero give candidates for the minimum such that, Note that the conditions found by differentiating L with respect to α i and β are simply the constraint conditions, while the remaining conditions are the standard Lagrange multiplier conditions that the gradient of the function to be minimized is a linear combination of the gradients of the constraints. Using, we obtain a new formulation of the approximate solution given by Eq. (10), where the inner products can be expressed in terms of the kernel, and its partial derivatives, To find the actual solution one needs to train and choose a good value for σ and solve for the coefficients α i , β, and b. Substituting the results from the equations on the gradient of L with respect to w and e i into the equation on the gradient of L with respect to α i we obtain Eq. (11). The last condition, the constraint w T ϕ(t 0 ) + b − y 0 = 0 becomes Eq. (12). Replacing all the inner products by the corresponding kernel values, we have a system with the same number of equations as unknowns. The third condition, could also be used to reduce the number of unknowns in the system.

Constrained SVM (CSVM) Technique
In the Theory of Connections (ToC) [7] the general constrained expression can be written for an initial value constraint by, where g(t) is a "freely chosen" function. In prior studies [8], [9], this free function was defined by a set of orthogonal basis functions, but this function can also be defined using SVMs by, where g 0 becomes, This leads to the equation, such that the initial value constraint is always satisfied regardless of the value of w and ϕ(t). Through this process, the constraints only remain on the residuals and the problem becomes, Again, using the method of Lagrange multipliers, a term is introduced for each of the constraints, as shown in Eq. (13), and the values where the gradient of L is zero give candidates for the minimum, we obtain a new formulation of the approximate solution, that can be expressed in terms of the kernel K(u, v) = ϕ(u) T ϕ(v) and its derivatives. Combining the three equations obtained when taking the gradients of L, we can obtain a linear system with unknowns α i , The coefficient matrix is given by Eq. (14), where K(t 0 , t 0 ) = 1, and we use the notation given in Eq. (15). Finally, in terms of the kernel matrix,

Non-linear ODEs
Non-linear ODEs with initial value boundary conditions can be written generally using the form, The solution form is again the one given in Eq. (8) and the domain discretized into N , sub-intervals, t 0 , t 1 , . . . , t N (training points). Let e i be the residuals for the solution y(t i ), To minimize the error, the sum of the squares of the residuals is minimized, min(e T e) = min N i=1 e 2 i . As in the linear case, the regularization term w T w is added to the expression to be minimized. Now, the problem can be formulated as an optimization problem with constraints: subject to: The variables y i are introduced into the optimization problem to keep track of the nonlinear function f at the values corresponding to the grid points. The method of Lagrangian multiplies is used for this optimization problem, as shown in Eq. (16). The values where L are zero give candidates for the minimum.
A system of equations can be setup by substituting the results found by differentiating L with respect to w and e i into the remaining five equations found by taking the gradients of L. This will lead to a set of 3N + 2 equations and 3N + 2 unknowns which are α i , η i , y i , β, and b. This system of equations is given in Eq. (17).
The system of equations given in Eq. (17) is solved using a multivariate Newton's method. This leads to the same matrices in Eq. (20) of Ref. [11] with one exception: the regularization term, I/γ, in the second row of the second column entry will be missing from this set of equations. The reason is, while running the experiments presented in this paper, that regularization term had an insignificant effect on the overall accuracy of the method. Moreover, as has been demonstrated here, it is not necessary in the setup of L(w, b, e, y, α, β, η) = 1 2 the problem. Once the set of equations has been solved, the model solution is given in the dual form by, As with the linear ODE case, the set of 3N +2 equations that need to be solved and dual form of the model solution can be written in terms of the Kernel matrix and its derivatives.

NUMERICAL RESULTS
This section compares the methodologies described in the previous sections on a subset of the problems given in [11]. Problem #1 is a 1st order linear ODE. Problem #2 is a 1st order non-linear ODE, and problem #3 is a 2nd order linear ODE. All problems were solved in MATLAB on a Windows 10 operating system running on an Intel(R) Core(TM) i7-7700 CPU at 3.60GHz and 16.0 GB of RAM. The tabulated results from this comparison are included in the appendix at the end of the paper. A summary of those tabulated values is included in the subsections that follow, along with a short description of each problem.
The accuracy gain of ToC and CSVM compared to SVM for problem #1 is shown in Fig. 1. The results were obtained using 100 training points. The top plot shows the error ratio between the SVM and ToC solutions, and the bottom plot shows the error ratio between the SVM and CSVM solutions. Values greater than one indicate that the compared method is more accurate than the SVM method, and viceversa for values less than one.  Figure 1 shows that ToC is the most accurate of the three methods followed by CSVM and finally SVM. The error reduction when using CSVM instead of SVM is typically an order of magnitude or less. However, the error reduction when using ToC instead of the other two methods is multiple orders of magnitude. The attentive reader will notice that the plot that includes ToC solution has less data points in Fig. 1 than the other methods. This is because the calculated points and the true solutions vary less than machine level accuracy and when the subtraction operation is used the resulting number becomes zero. Tables 1 through 3 in the appendix compare the three methods for various numbers of training points when solv-ing problem #1. Additionally, these tables show that solving the DE using ToC is both the quickest solution (i.e. it has the shortest training time) and the most accurate solution (i.e. it has the lowest maximum error and MSE on both the training set and test set) of the three. The CSVM results are the slowest, but they are more accurate than the SVM results. The accuracy gained when using CSVM compared to SVM is small, typically less than an order of magnitude. On the other hand, the accuracy gained when using ToC is multiple orders of magnitude. Moreover, the speed gained when using SVM compared to CSVM is typically less than an order of magnitude, whereas the speed gained when using ToC is approximately one order of magnitude. An accuracy versus speed comparison is shown graphically in Fig. 2, where the mean-squared error and training time are plotted for five specific cases: 8, 16, 32, 50, and 100 training points.
The accuracy gain of ToC compared to SVM for problem #2 is shown in Fig. 3. The figure was created using 100 training points. The plot shows the error in the SVM solution divided by the ToC solution. Values greater than one indicate that ToC is more accurate than the SVM method, and values less than one indicate that ToC is less accurate than the SVM method. Figure 3 shows that ToC is the most accurate of the two methods. The ToC error is orders of magnitude lower than the SVM method. Tables 4 through 5 in the appendix compare the two methods for various numbers of training points when solving problem #2. Additionally, these tables show that solving the DE using ToC is faster than using the SVM method for all cases except the second case. However, the speed gained using ToC is less than an order of magnitude. Furthermore, ToC is more accurate, by multiple orders of magnitude, than the SVM method for all of the test cases. In addition, ToC is able to continue reducing the MSE and maximum error on
subject to y(0) = 1 and y (0) = 1. The accuracy gain of ToC compared to SVM for problem #3 is shown in Fig. 5. The figure was created using 100 training points. The plot shows the error in the SVM solution divided by the ToC solution. Values greater than one indicate that ToC is more accurate than the SVM method, and values less than one indicate that ToC is less accurate than the SVM method. Figure 5 shows that ToC is the most accurate of the two methods. The ToC error is orders of magnitude lower than the SVM method. Tables 6 through 7 in the appendix compare the two methods for various numbers of training points when solving problem #3. Additionally, these tables show that solving the DE using ToC is faster than using the SVM method for all cases. The speed gained by using ToC is approximately an order of magnitude or less. Furthermore, ToC is more accurate than the SVM method for all of the test cases. One interesting note is that when moving from 16 to 32 training points the ToC actually loses a bit of accuracy, whereas the SVM method continues gain accuracy. Despite this, the ToC is still multiple orders of magnitude more accurate than the SVM method. An accuracy versus speed comparison is shown graphically in Fig. 6, where the mean-squared error and training time are plotted for five specific cases: 8, 16, 32, 50, and 100 training points.

SUMMARY AND FUTURE WORK
This article presents three methods to solve various types of ODEs: ToC, SVM method, and a method called CSVM that incorporates SVMs into the ToC framework. In addition, four problems are presented that include linear first order ODEs, non-linear first order ODEs, and linear second order ODEs. In summation, these comparisons show that, in general, ToC is faster, by approximately an order of magnitude or less, and more accurate, by multiple orders of magnitude, than the other two methods. The CSVM method has similar performance to the SVM method, but the CSVM method satisfies the boundary constraints exactly whereas the SVM method does not. Moreover, the performance of the CSVM method shows that machine learning algorithms can be easily incorporated into the ToC framework.
This will be exploited in future studies and specifically for partial DEs where the scalability of SVMs (or other machine learning algorithms such as neural networks) give a major advantage over the orthogonal basis functions used in the vanilla version of ToC. In addition, future work will focus on comparing the SVM method and ToC for higher order linear and non-linear ODEs as well as PDEs. Additionally, comparison problems should be looked at other than initial value problems. For example, problems could be used for comparison that have boundary value constraints, differential constraints, and integral constraints.