Deep Neural Network Model for Approximating Eigenmodes Localized by a Confining Potential

We study eigenmode localization for a class of elliptic reaction-diffusion operators. As the prototype model problem we use a family of Schrödinger Hamiltonians parametrized by random potentials and study the associated effective confining potential. This problem is posed in the finite domain and we compute localized bounded states at the lower end of the spectrum. We present several deep network architectures that predict the localization of bounded states from a sample of a potential. For tackling higher dimensional problems, we consider a class of physics-informed deep dense networks. In particular, we focus on the interpretability of the proposed approaches. Deep network is used as a general reduced order model that describes the nonlinear connection between the potential and the ground state. The performance of the surrogate reduced model is controlled by an error estimator and the model is updated if necessary. Finally, we present a host of experiments to measure the accuracy and performance of the proposed algorithm.


Introduction
In this paper we study features of the spectral problem for the family of elliptic operators of the reaction-diffusion type posed in the finite domain Ω = [−B, B] n , B > 0. These operators are also known as Schrödinger operators or Schrödinger Hamiltonians and they are defined by the differential expression of the form: Here − is the distributional realisation of the Laplace operator and V ω is the multiplication operator with the real function V ω . The parameter ω describes a random perturbation of a given potential. The associated spectral problem is to find an eigenvalue ε and an eigenmode u = 0 such that u verifies: and a set of boundary conditions. We will consider boundary conditions that lead to the realization of the expression H as a self-adjoint operator in a Hilbert space. In particular, we will consider functions u ∈ H 1 0 (Ω) with Dirichlet boundary conditions and u ∈ H 1 π (Ω), where H 1 π (Ω) is the first-order Sobolev space of functions (square integrable functions whose gradient is also square integrable) that satisfy the periodic boundary conditions [1,2]. In what follows we will use · to denote the L 2 (Ω) norm of a square integrable function.
We will consider-as an academic prototype-short-range confining electrostatic potentials such as those considered in [3] (see also [1,2]) and a more challenging class of confining potentials related to the effect of Anderson localization [4]. By the effect of localization we mean that we search for eigenvalues ε such that u, u = 1, is essentially has the same eigenvalues as the operator H ω . Furthermore, ψ is an eigenmode of H ω if and only if φ = ψ/u ω is an eigenmode of A ω . Based on this equivalence, we call the function W = 1/u ω the effective confining potential defined by V ω . For more details see [5,7]. One possibility of obtaining data-sparse representations of the solutions of elliptic problems is through the use of tensor networks also known as tensor train decompositions or matrix product states [8,9]. We choose a more direct approach known as Variational Physical Informed Neural Network (VPINN) [10][11][12]. Realizations of these dense network architectures are trained to solve the variational formulation of the problem. This approach to training neural networks is a part of the unsupervised learning paradigm. It is a meshless approach that is capable of solving variational (physical) problems, by minimizing the loss functional, which combines the variational energy of the system together with penalty terms implementing further physical or normalization constraints.
The main physical constraint for the ground state is the positivity constraint. To deal with excited states we need to implement further symmetry constraints in the variational space. We opt for a different approach, also based on positivity constraints and a-posteriori error estimation. We use the neural network to approximate the solution of the source problem: − u + Vu = 1 with associated boundary conditions. The solution u is called the landscape function, and in this case we are interested in the mapping L : V → u. The landscape function is a positive function in C 1 (Ω) and its reciprocal W = 1/u is called the effective potential. The effective potential provides a mechanism that incurs localization on bounded states. To localize the excited states we use the approach of [5,7]. Let W min,i be the i-th lowermost minimum of the effective potential W. It was observed in [5] that the following heuristic formula: yields good approximations to the energies of excited states. Note that this relationship, given its simplicity, is also something we might reasonably hope to learn algorithmically from a sample of landscape functions. This was stated as a motivation to utilize neural networks in the study of the eigenvalue problem for Schrödinger operators [3].
For an eigenmode ψ of H with the eigenvalue ε we have the estimate: ψ(x) ≤ εu ψ L ∞ (Ω) , x ∈ Ω. This estimate can be obtained (see [13]) using the Feynman-Kac formula for the representation of the bounded state as an expectation of an integral of the Brownian motion. It was further argued in [13] that an eigenmode ψ with energy ε := ε(ψ) can only localize in the region: Subsequently, as a combination of (2) and (3) we get both information on the excited state energy and information on the location of the excited state's support.

The Motivation and the Contribution of this Paper
The use of neural networks as data-sparse representations of complex, high dimensional nonlinear mappings is an emerging trend in many disciplines. In particular, it has been used to tackle many body Schrödinger equations [14,15], the Black-Scholes equation, the Hamilton-Jacobi-Bellman equation, and the Allen-Cahn equation [10,11,[16][17][18][19][20].
In general, all of the above problems can be reduced to computing an approximation of the function u : Ω ⊂ R n → R. This approximation is constructed by optimizing (training) the parameters of the family of test functions (we chose the family of all realizations of a given neural network architecture) so that the value of the appropriate energy functional (for the chosen model) is minimal. The main challenge in such an approach is to assess the approximation accuracy and to ensure that the computed realization of the neural network satisfies further physical constraints, such as symmetry constraint or the boundary conditions.
Further physical, but also numerical, constraints can be built into the optimization model in several ways. The most scalable and flexible way is to use penalization [10,11,20,21]. A alternative more subtle, and more accurate way is to introduce the constraints directly into the family of test functions as it is done in the architecture of the PauliNet from [14] (see also [22,23]), or to construct a family of test functions using an ansatz that combines several components of the solution, which are themselves realizations of neural networks [12,24].
In this study we focus on the potentials for which the Hamiltonian satisfies the Krein-Rutman theorem (the scaled ground state is the unique positive and smooth function). Examples of such potentials are the effective potentials associated with the Anderson localization. Since this is a more restrictive class of potentials than those considered in [14], we opt for a direct approach. Our contribution is the introduction of the residual error estimator into the Deep Ritz Algorithm from [20]. This in turn allows us to use Temple-Kato [25,26] or similar inequalities [27,28] to ascertain if the ground state generated by the neural network is a certified small perturbation of a physical eigenstate. For activation functions that are smooth enough we can calculate the strong form of the eigenvalue residual and then compute its norm using a quadrature or quasi-Monte Carlo integration. Using the residual estimator we stop the optimization (training process) when the eigenvalue residual is small enough (satisfies the preset tolerance) and/or the convergence criterion for the optimization algorithm is met (Adam optimizer). The use of ansatz functions based on neural networks, such as [24], will undoubtedly be a method of choice for 2D or 3D problems. However, this method depends on an accurate representation of the boundary of the domain Ω and thus faces challenges in scaling to higher dimensions.
The treatment of physical symmetries becomes critical when approximating excited states. For dealing with this task, we reformulate the problem as an inverse problem based on the solution of the source problem Hu = 1. The main constraint that the solution must satisfy is again positivity, and we construct an error estimator to certify the quality of the solution.
The network architectures used so far are dense network architectures. Inspired by [3], we study a parameter-dependent family of potentials and present a fully convolutional encoder-decoder neural network as a reduced order (surrogate) model for this family of partial differential equations and the mapping L : V → u. We formulate a new certified surrogate modeling approach based on neural networks that is inspired by the work on certified surrogate modeling from [29] and the U-net architecture from [30]. We use the residual error estimator from the first part of the paper as a criterion for the surrogate (encoder-decoder) model update. For further details see Section 3.5.
Let us summarize the three classes of exemplar problems studied in this paper. First, we study the eigenvalue problem for approximating the unique positive normalized ground state u 0 . We aim to construct certified, robust and scalable-with respect to the increase in the dimension of the problem-approximation methods. Second, to approximate the eigenvalues higher in the spectrum we study the landscape function. The landscape function is obtained as a solution to the source problem Hu = 1. It is again a positive smooth function and positivity is the only physical constraint needed to study the localization phenomena for the associated eigenstates. Further, we use simple residual control to ensure that the computed solution is a small perturbation of the true landscape function. As the third class of problems, we present the data-based surrogate model of the map connecting a class of potentials to the associated landscape functions. Here we are concerned with the use of convolutional networks as a data-sparse reduced order model in the context of certified surrogate modeling of this mapping. In particular, we are interested in the possibility of updating the surrogate model based on the residual error estimator.

Theoretical Background
In order to be precise and explicit, we will present the theoretical foundations on a somewhat restricted set of neural network architectures. The network architectures that will be used in practical computations are presented in Appendix B. The change of the family of the realizations of neural networks over which the optimization is carried out does not change the presentation of the algorithms in any practical way. Our main contribution is in the introduction of the error control in the Deep Ritz algorithm from [20]. We will now summarize the basic definitions from [31], which are necessary to interpret the numerical experiments. Definition 1. Given n 1 , n 2 , L ∈ N, a neural network θ of depth D with the input dimension n 1 and the output dimension n 2 is the D-tuple θ = (A l , b l ) : l = 1, · · · , D where: By the convention n 1 = N 0 and n 2 = N D . In the case in which D = 2 we call the network shallow, and otherwise the network is called deep. The vector N = N 0 · · · N D is called the network architecture of the neural network θ.
We will use D(θ) to denote the depth of the given neural network θ. In the case in which the structure of matrices A l , l = 1, · · · , D is not further restricted, we say that the network is dense. In the case in which a sparsity pattern is assumed we have several subclasses of neural networks. For exemple, if the matrices A l , l = 1, · · · , D have a structure of a Toeplitz matrix (A l ) ij = (w l i−j ) ij -here w l , l = 1, · · · , D are parameter vectors defining a Toeplitz matrix [32]-we talk about convolutional neural networks.
Let ρ : R → R be a function that is not a polynomial. By ρ we denote the function ρ = ⊗ n l=1 ρ : R n → R n . We will now define a realization of the neural network θ with respect to the function ρ.
where T l (x) = A l x + b l , l = 1, · · · , D is called the realization of the neural network θ with respect to the activation function ρ.
Among various activation functions we single out the rectified linear unit (ReLU) function ρ LU = max{0, x} and the sigmoid function ρ S (x) = 1/(1 − exp(−x)). The set of all ReLU realizations of a neural network θ has a special structure. We call a function f : R n 1 → R n 2 piece-wise linear if there is a finite set of pairwise disjoint, closed polyhedra whose union is R n 1 such that a restriction of f onto a chosen polyhedron is an affine function. It has been shown in [33] that any piece-wise linear function can be represented by a ReLU neural network and that any ReLU realization of a neural network is piece-wise linear. This observation is key to linking the approximation theory for neural networks with the standard Sobolev space regularity theory for partial differential equations.
Let us now fix some further notation. Let m be the number of the degrees of freedom of the space of piece-wise linear functions associated to the fixed polyhedral tessellation of Ω. We use V m to denote the set of all piece-wise linear functions on this tessellation. We also use the notations P 1 , P 2 and P 3 to denote the space of piece-wise linear, quadratic and cubic functions, respectively. The corresponding interpolation operators (for continuous arguments) are denoted respectively by I P 1 , I P 2 and I P 3 .
We will now briefly review a-posteriori error estimates that are used in this work. Let us note the following convention. We use ε 0 and u 0 to denote the ground state energy and the normalized positive ground state. We use ε 1 to denote the energy of a first excited state and we note that the notation is generalized for higher excited states in an obvious way. We denote the Rayleigh quotient of the operator H for the state ψ by ε = ε(ψ) := (ψ, Hψ)/(ψ, ψ). The standard Kato-Temple estimate from [26] can be written in a dimension-free form, also known as the relative form: The quantity |ε|/|ε 1 − ε| is a measure of the so-called relative spectral gap [34,35] and it measures the distance to the unwanted part of the spectrum. It can be estimated by symmetry considerations or other a-priori information. In fact, a more careful analysis from [35] shows that the scaled residual is an asymptotically exact estimate of the relative error and so we will heuristically drop the measure of the gap even in the preasymptotic regime. A consequence of the Davis-Kahan theorem [36] is that the residual also estimates the eigenvector error: Subsequently, the error estimator ∆ 2 ε = Hψ − εψ 2 /(ε 2 ψ 2 ) is a good stopping criterion for a certified approximation of an eigenmode.
For the source problem Hu = 1, u ∈ H 1 0 (Ω) we note the following relationship: between the residual and the relative L 2 error. Subsequently we use τ = Hu − 1 L 2 / ( u L 2 1 L 2 ) as an error estimator for the source problem.

Algorithms
We will now present the modifications of algorithms that we used to study the localization phenomena. We modified the Deep Ritz algorithm from [20] with the introduction of the a-posteriori (residual) error estimator. We call our variant the Certified Deep Ritz Algorithm. It is motivated by the work on certified reduced-order modeling in [29]. In order to be able to formulate strong residuals, we chose smooth activation functions ρ, so that R θ,ρ can be used to form the strong residual.
The performance of the stochastic gradient descent, when applied to the loss function, can be highly sensitive to the choice of the learning rate. Furthermore, it can also suffer from oscillation effects introduced by the choice of the sampling method in the numerical integration routines. For this reason we have opted to use the Adam optimizer from [37], which determines the learning rate by adaptively using information from higher-order moments.
To enforce the positivity constraint, we composed a realization of the neural network with the function Ξ, Ξ > 0. We call the function Ξ a positivity mask and it must be chosen appropriately for the governing boundary conditions. As the positivity mask for Schrödinger Hamiltonians we either chose a smooth nonnegative function Ξ, which decays to zero as |x| → ∞, or set the positivity mask as the identity.
In Algorithm 1 the parameter β > 0 is the penalty parameter used to enforce the boundary conditions and the parameter η > 0 is used to normalize the eigenmode approximation. We solve the integral using a Gauss quadrature rule in 1D and for higher dimensional problems we use quasi-Monte Carlo integration from [38,39] or a sparse grid quadrature [40] to approximate the integrals in the loss function as well as for the final (more accurate) evaluation of the energy functional. Alternatively in 2D, we sometimes choose to compute the integrals by projecting the realization of a neural network into a finite element space and then use the finite element quadrature to compute the integral. According to the authors of [11,41], this is an appropriate approach for problems of moderate dimensions (Ω ⊂ R n , n ≤ 20). For higher-dimensional problems, Monte Carlo integration is the only scalable approach recommended.

Algorithm 1: Certified Deep Ritz Algorithm.
Result: Approximation ψ θ,ρ of the ground state Choose the activation function ρ, the positivity mask Ξ, the penalty parameters β, η, the maximal number of iterations maxit, the network architecture N, the integration method and the residual measure∆; Set∆ = 1 and chose the starting θ with the architecture N.
while∆ is not small enough and the number of iterations is strictly smaller than maxit do Set ψ θ,ρ = Ξ • R θ,ρ and compute L(ψ θ,ρ , β, η) using numerical integration. Update θ using the Adam optimizer for the loss function L(ψ θ,ρ , β, η). Compute∆ using numerical integration. end if maxit reached then Deepen N and start over. end To apply Algorithm 1 to an eigenvalue problem we choose: and set the normalization parameter η > 0 and β > 0 for the loss function: In the case of the source problem for the computation of the landscape function we set the normalization parameter η = 0 and β > 0 and define the loss function as: As the error indicator we take∆ Here we have chosen as an example the homogeneous Dirichlet boundary condition u ∂Ω = 0. Other self-adjoint boundary conditions can equally be implemented by penalizing the boundary conditions residual at the boundary ∂Ω. Note that computing derivatives of realizations of neural networks is efficiently implemented in many programming environments such as TensorFlow [42].

Results
In this section we present direct approximation methods for estimating the ground state u 0 , the ground state energy ε 0 and the landscape function u. We use the Certified Deep Ritz Algorithm presented as Algorithm 1.

Direct Approximations of the Ground State in 1D
We now present the results of the application of Algorithm 1 on the 1D Schrödinger operator H = −∂ xx + V. For the domain Ω = (−B, B), we choose the loss function (6) and set Ξ(x) = exp(−x 2 /10) as the positivity mask in Algorithm 1.
To benchmark the accuracy of the VPINN approximations we have solved the problem to high relative accuracy using the Chebyshev spectral method as implemented in the package chebfun [43,44]. We emphasize that chebfun was not used during the training of the network in any way. To compute the residuals and the energy of the ground state we used a Gaussian quadrature where the deep network is evaluated at the sufficient number-for the given interval (−B, B)-of Gaussian points.
We constructed the potential V as a linear combination of the finite well and two inverted Gaussian bell functions: We used 1 {x: |x−c|<t} to denote the indicator function and chose α i ∈ [8,12], The neural network has 1,162 trainable parameters and we used the DenseNet VPINN architecture N DenseNet (1, 4, 2, 10) with the activation function ρ(x) = exp(−x 2 /10); see Figure A1. Figure 1 shows the solution and the error estimate during 15,000 epochs of a run of the Adam optimizer with the learning rate δ = 10 −3 and a batch size of 1024. In this example we used 1024 quadrature points on an interval (−6, 6) and the penalty parameters β = 10 −3 and η = 20. One can observe robust, almost asymptotically exact, performance of the estimator: Let us also emphasize that ∆ ε measures the distance of the Rayleigh quotient (energy functional) to the nearest eigenvalue. For evaluation of the integrals in higher dimensions we refer the reader to Appendix B. Note that the final error for the approximation of the ground state energy was 0.1%, whereas the final relative L 2 error in the ground state was 0.9%. This is in line with the eigenmode error estimate (5).

Direct Approximations of the Ground State in Higher-Dimensional Spaces
We present VPINN approximation results for the ground state and the ground state energy of the Schrödinger equation with harmonic oscillator potential V(x) = x 2 using the dense network with the architecture depicted in Figure A1. We study the problem on the truncated domain [−3, 3] n , where n is the dimension of the space. Neural network architecture should be constructed with caution. There are multiple sources of instability when dealing with neural networks, e.g., exploding and vanishing gradients. We experimented with a variety of different activation functions: ρ S (x), ρ LU (x/10) 2 , xρ S (x), exp(−x 2 /10) etc. After training of the neural network using the quasi-Monte Carlo realization of the energy integrals to define the loss function we computed the approximate ground state energy using the approximation of the energy functional (Rayleigh quotient) using the Sobol sequences with 100,000 points. We also report on the results obtained using Smolyak grids of order 6 with the Gauss-Patterson rule. The results are presented in Table 1. The accuracy of the ground state energy approximations that were obtained using quasi-Monte Carlo integration are comparable with the accuracy reported in [20]. On the other hand, the results obtained by using Smolyak's points were unsatisfactory in dimensions higher than 3. This observation will be the subject of future research. It appears that the oscillation of the realizations of the neural network on the boundary of the computational domain together with the appearance of negative weights in sparse grid integral formulas contributed to the instability of the approach. We used the N = (n, 5,2,20) architectures for n ≤ 6 and N = (9, 3, 1, 10) in 9D and the swish function ρ(x) = xρ S (x) as the activation function. The positivity mask was chosen as the identity.

Approximations of the Landscape Function in 1D
We now present the result of the approximation method using the landscape function as a solution of the partial differential equation Hu = 1, u ∈ H 1 0 (Ω). The potential V ∈ C [−50, 50] is constructed as a random piece-wise linear function (see Figure 2). More to the point, let ξ k , k = −50, · · · , 50 be independently drawn numbers from the uniform distribution on the interval [0, 4]. We construct the potential V as the piece-wise linear interpolant of (k, ξ k ), k = −50, · · · , 50. We again used chebfun for benchmarking. We set η = 0 and defined the loss function as: Here we have used the architecture of the dense VPINN network; see Appendix C. We used β = 500 for the experiments. In Figure 2 we can see the six local minima of the effective potential W = 1/u obtained from the neural network and the first six eigenstates computed by the chebfun. Note that the potential W = 1/u is defined only in the interior of the domain (−50, 50). In Table 2 we present the results of the benchmarking of the approximation formula (2) against highly accurate chebfun eigenvalue approximations. The neural network has 6402 trainable parameters and we used the VPINN architecture N DenseNet = (1,5,2,20) with the activation function ρ LU (x/10) 2 . The positvitiy mask was chosen as the identity. The network was trained using 50,000 epochs of the Adam optimizer with the learning rate δ = 10 −3 and a batch size of 2048. In this example we also used 2048 quadrature points on an interval (−50, 50).

Direct VPINN Approximation of the Landscape Function in 2D
We now apply Algorithm 1 to the problem of approximating the landscape function in 2D. When presenting the examples we will report on the used network architecture N DenseNet = (2, k, l, m) as well as indicate the number of trainable parameters for each of the architectures. The activation function used for all neural networks in this subsection is the sigmoid function ρ S . We now set l = 2 and in the next table present a convergence study for the family of architectures N DenseNet = (2, k, 2, m).
The convergence histories of relative L 2 and H 1 errors, measured with respect to the benchmark FEniCS solution, are shown in Table 3. We can observe that the errors drop at a favorable rate with an increase in k. On the other hand, an increase in m causes a much more pronounced increase in the number of trainable parameters (complexity of the network) but incurs, in comparison, only a moderate improvement of the accuracy level. In Figure 3 we plot the effective potential W and the landscape function u. In (a) we plot the boundaries of the sets {x : ε u(x) ≥ 1} that localize the eigenstates. In ( b) we plot the circles of radius 1/ε i , forε i−1 = 3W min,i /2, i = 1, 2, 3, centered at the i-th lowermost local minimum W min,i .

Encoder-Decoder Network as a Reduced-Order Model for a Family of Landscape Functions
We now study the use of the sparse, U-Net-inspired [30], network architecture as a surrogate model for the function L : V → u. In Figure 3 we show the landscape function u with periodic boundary conditions for the potential V ω constructed as a lattice superposition of sixteen Gaussian bell functions G(x) = α exp(−|x 1 − c 1 | 2 /k 2 1 − |x 2 − c 2 | 2 /k 2 2 ). The centers c = (c 1 , c 2 ) were chosen randomly inside each block of the 4 × 4 uniform quadrilateral tessellation of Ω. The constants α, k i , i = 1, 2 were chosen randomly and independently from intervals [8,128] and [1, 1.5], respectively. To introduce local defects in the lattice, we have further randomly chosen three Gaussian bells and removed them from the potential. The choice of Gaussian bells to be removed was restricted, so that the boundary conditions were respected and that none of the erased bells were pairwise adjacent.
As the reduced order model for this family of problems, we have used the encoderdecoder fully convolutional neural network (FCNN) from Appendix C with 2,614,531 trainable parameters. This is a relatively small number of parameters in comparison with the typical convolutional neural network architectures with fully connected layers. The architecture of the neural network is shown in Figure A2.
To train the model we generated 98,400 potentials V ω and then used FEniCS to compute the associated landscape functions u ω , − u ω + V ω u ω = 1. The domain of the Hamiltonaian was Ω = [−10, 10] 2 , with the periodic boundary conditions. We used the uniform quadrilateral discretization with the step size h = 20/50 = 0.4 and P 2 elements to compute the training examples. To construct the reduced-order model, we projected (by interpolation) these P 2 functions onto the space of P 1 elements for the same mesh. After implementing the periodic boundary conditions we obtained exactly 2,500 free nodes for this space of P 1 functions.
We denote the values of the potential V in those nodes as the vector V ∈ R 2,500 and we tacitly identify the vector V with the function I P 1 V . Let I * P 1 be the extension operator from R 2,500 to the space of continuous piece-wise linear functions. Theñ L( V) := I P 1 L(I * P 1 I P 1 V) defines the mappingL : R 2,500 → R 2,500 . We used the learning rate δ = 10 −4 for the first 100 epochs of the Adam optimizer and the learning rate δ = 10 −5 for a further 50 epochs. The activation function ρ LU was used to promote sparsity and the MSE loss function was used for the training. The batch size for the Adam algorithm was 1024. We implemented the certified surrogate modeling approach by combining the evaluation of the neural network with the error estimator τ. In the case in which the residual measure τ for the function u = L(I * P 1 I P 1V ) is larger than the preset tolerance, we updated the surrogate model (neural network). To this end we solved in FEniCS the problem − u +Vu = 1 and used the standard update algorithm for the convolutional neural network and the new training example.
We evaluated the performance of the neural network reduced-order model on a set of 200 testing potentials that were not used in the training of the network, see Table 4. The benchmarking comparison against the FEniCS solution is presented in Figure 4.

Discussion
According to the authors of [45], deep learning approaches to dealing with partial differential equations fall into the following three categories: (a) Rayleigh-Ritz approximations, (b) Galerkin approximations and (c) least squares approximations. We have considered a hybrid approach that combines robust stochastic optimization of overparametrized networks based on the Rayleigh-Ritz approach with standard residual based error estimates, which together yield a hybrid approximation method. We were particularly influenced by the review in [45] and the Deep Ritz algorithm as described in [20].
In the example from Table 3 we further computed a piece-wise cubic and piece-wise quadratic approximation of the landscape function using the standard finite element method and an approximation of the landscape function using a variant of the Deep Ritz method. We measured the error of the P 2 and the VPINN approximation against the P 3 benchmark solution. The relative L 2 error of the piece-wise quadratic approximation was computed to be 0.36%, whereas the relative error of the VPINN approximation with 1203 free parameters was computed to be 1.21%. Note, however, that the piece-wise quadratic approximation required the training of 10,000 parameters. This indicates that the neural network achieves a considerable data compression when compared with a piece-wise polynomial approximation. The situation is even more interesting in 1D. There we compared the ground state approximation by the Chebyshev series, as implemented in chebfun. We needed 149 terms in the Chebyshev expansion to reach the order of machine precision. On the other hand, the Adam optimizer was able to find a realization of the dense neural network with VPINN architecture N DenseNet = (1, 2, 2, 2) from Appendix A. This architecture only has 30 trainable parameters and it achieves a relative distance (in the L 2 sense) of 1.05% to the benchmark chebfun solution. The relative error in the ground state energy is only 0.1%, see Figure 5. The integrals needed to approximate the energy functional were computed using Gaussian quadrature rules. Unfortunately, this approach does not yield stable methods in higher-dimensional problems. The reason is in the fact that sparse grid quadratures (e.g., [40]) also have nodes with negative weights and this was observed to cause severe numerical instability. An approach based on the quasi-Monte Carlo integration, which utilizes low-discrepancy sequences of integration nodes and has only positive weights (see [39]) yielded an efficient and stable method in higher-dimensional situations. Furthermore, since realizations of neural networks are frequently functions with many local extrema, computing their integrals needs to be handled with care. This is particularly relevant when enforcing discretizations of physical or normalization constraints by penalization. We point out that the scalability of sparse-grid integration schemes in this respect was not satisfactory.
Another promising technique for obtaining data-sparse compressed approximation of the solutions of partial differential equations is based on the concept of tensor networks, also known as matrix product states or tensor train decompositions [9]. This approach has been successfully converted into numerical approximation algorithms such as the quantized tensor train decompositions [8,46]. The scaling robust performance of this approach has been demonstrated on a class of multi-scale model problems in [46]. However, the numerical methods still have to be tailor-made for the chosen problems. On the other hand, there are many freely available robust and highly developed libraries for working with deep neural networks. This is the reason for our choice of the discretization method.

Conclusions
We have presented two types of neural network architectures. First, a dense deep network of the DenseNet type was used as a compressed approximation of the ground state and the landscape function of the problems under consideration. Remarkably, it achieved high accuracy and a good compression rate even when empirically compared with a Chebyshev expansion in 1D. Even though we managed to tackle problems in R n , this concept struggled to yield scalable numerical methods. We then took another approach and considered a problem of approximating a mapping L : R 2500 → R 2500 , which connects a mesh sample of a potential with the associated landscape function. A fully convolutional neural network architecture with the ReLU activation function, to further promote sparsity, turned out to be expressive enough to deal with this family of problems to a satisfactory level of accuracy (empirically measured on the test set). We have also seen that a hybrid approach-one that combines the expressivity of the set of neural network realizations with the standard error indicators-has a potential to lead to robust approximation methods. We have observed that it is particularly challenging to turn physical constraints-which are continuous-into their discrete realizations, which can be used to filter out, e.g., by judiciously applied penalization, the nonphysical neural network realizations from the set of all realizations of a given architecture. How to turn this into a robust mesh-less and scalable method for dealing with equations of mathematical physics will be a topic of further research.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: In the original Deep Ritz Algorithm from [20] the authors have used Monte Carlo integration to compute the duality products. This amounts to choosing a random sample χ i ∈ Ω, i = 1, · · · , M and then using: By contrast, in R n , we used the Smolyak quadrature [38,40] and low-discrepancy sequences for quasi-Monte Carlo integration routines [39]. For a given order of the quadrature there exist weights α i ∈ R and nodes ξ i ∈ Ω, i = 1, · · · , M such that: Unlike in the Monte Carlo approach, these nodes are fixed (for a given choice of parameters, see [38]).

Appendix B.1. Finite Element Quadrature for 2D Problems
For 2D problems we also used finite element quadrature to estimate the integrals and also to estimate the negative-order Sobolev norm. This approach does not scale to higher-dimensional problems, but is a good method for the purposes of validating algorithms based on the variational optimization and neural networks. Let V h ⊂ H 1 π (Ω) be a finite element space and let W h be another finite element space such that V h ⊂ W h ⊂ H 1 π (Ω). To V h and W h we associate standard interpolation operators I V h : C(Ω) → V h and I W h : C(Ω) → W h . For a given continuous realization of the neural network ψ θ = Ξ • R θ,ρ , we compute: and we use standard finite element quadratures to evaluate the integrals on the right-hand side, see for instance the FEniCS book [48]. Finally, to assess the negative-order Sobolev norm of the residual we use the auxiliary subspace W h and compute, using standard finite element calculus:

Appendix B.2. Direct Approximations for Higher Dimensional Problems
For higher-dimensional problems we use quadrature rules based on low discrepancy sequences [39] to compute the value of the energy functional (Rayleigh quotient) on the returned neural network realization R θ,ρ . To define the loss function, which consists of the energy functional and the normalization constraints, we use the quasi-Monte Carlo approach [49], but with fewer quadrature nodes. This is consistent with the approach taken in the original Deep Ritz algorithm. The use of Smolyak's rules can also be considered. This, however, does not lead to numerically stable optimization procedures. The realizations of neural networks are functions that can have many sharp local extrema and so optimizing using a fixed collection of nodes was observed to lead to degenerate solutions. The further problem stems from the fact that Smolyak's rules have, unlike Gaussian rules, a certain percentage of negative weights and so due to approximation errors it is possible to compute a negative approximation of an integral of a positive function. This is highly undesirable for a minimization procedure. Quasi-Monte Carlo integration routines are much less accurate than sparse grid rules, but all of their integration weights are positive and in addition they avoid the problem of overfitting. We use Sobol's points [39] to calculate nodes for the quasi-Monte Carlo approximation of the energy functional for higher-dimensional problems.

Appendix C. Architecture of the VPINN Neural Network
We will present the architectures of deep neural networks used in the paper. The network architecture N as defined in Definition 1 is sufficient to describe deep dense neural networks. Neural network architectures are presented graphically. Some further formal descriptions of the approximation classes can be found in [50]. The particular architecture that we use will have slightly more regularity in the dimensions of the layers. However, there will be more links between layers, which are inspired by the DenseNet concept [51]. The architecture is depicted in Figure A2 and we use the vector N DenseNet = (n, k, l, m) to describe the architecture of the network, which has k blocks with l layers of the size m. Realizations of this network are functions from R n to R.  Figure A2. FCNN encoder-decoder architecture inspired by the U-Net concept from [30].