Separable Gaussian Neural Networks: Structure, Analysis, and Function Approximations

The Gaussian-radial-basis-function neural network (GRBFNN) has been a popular choice for interpolation and classification. However, it is computationally intensive when the dimension of the input vector is high. To address this issue, we propose a new feedforward network, the Separable Gaussian Neural Network (SGNN), which takes advantage of the separable property of Gaussian functions by splitting the input data into multiple columns and feeding them sequentially into parallel layers formed by uni-variate Gaussian functions. This structure reduces the number of neurons from O(N^d) for GRBFNN to O(dN), which exponentially improves the computational speed of SGNN and makes it scale linearly as the input dimension increases. In addition, SGNN can preserve the dominant subspace of the Hessian matrix of GRBFNN in gradient descent training, leading to a similar level of accuracy to GRBFNN. It is experimentally demonstrated that SGNN can achieve a 100-fold speedup with a similar level of accuracy over GRBFNN on tri-variate function approximations. SGNN also has better trainability and is more tuning-friendly than DNNs with ReLU and Sigmoid functions. For approximating functions with complex geometry, SGNN can produce results three orders of magnitude more accurate than a ReLU-DNN with twice the number of layers and twice the number of neurons per layer.


Introduction
Radial-basis functions have many important applications in fields such as function interpolation (Dyn et al., 1986), meshless methods (Duan, 2008), clustering classification (Wu, 2012), surrogate models (Akhtar and Shoemaker, 2016), autoencoders (Daoud et al., 2019), and dynamic system design (Yu et al., 2011). The Gaussian-radial-basis-function neural network (GRBFNN) is a neural network with one hidden layer that produces output in the form

$f(\mathbf{x}) = \sum_{k} w_k G_k(\mathbf{x})$, (1)

where $G_k(\mathbf{x})$ is a radially-symmetric unit represented by a Gaussian function such as

$G_k(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_k\|^2}{2\sigma_k^2}\right)$. (2)

Herein, $\boldsymbol{\mu}_k$ and $\sigma_k$ are the center and width of the unit, which can be tuned to adjust its localized response. This locality is then utilized to approximate the output of a nonlinear mapping through a linear combination of Gaussian units. Although it has been shown that GRBFNN outperforms multilayer perceptrons (MLP) in generalization (Tao, 1993), tolerance to input noise (Moody and Darken, 1989), and learning efficiency with a small set of data (Moody and Darken, 1989), the network is not scalable for problems with high-dimensional input, because the number of neurons needed for accurate predictions, and the corresponding computations, increase exponentially with the number of dimensions. This paper aims to tackle this issue and make the network available for high-dimensional problems. GRBFNN was proposed by Moody and Darken (1989) and Broomhead and Lowe (1988) in the late 1980s for classification and function approximation. It was soon proved that GRBFNN is a universal approximator (Hornik et al., 1989; Park and Sandberg, 1991; Leshno et al., 1993) that can be arbitrarily close to a real-valued function when a sufficient number of neurons is provided. The proof of universal approximability for GRBFNN can be interpreted as a process beginning with partitioning the domain of a target function into a grid, followed by using localized radial-basis functions to approximate the target function in each grid cell, and then aggregating the localized
functions to globally approximate the target function. It is evident that this approach is not feasible for high-dimensional problems because it leads to exponential growth in the number of neurons as the input dimension increases. For example, $O(N^d)$ neurons are required to approximate a d-variate function when the domain of each dimension is divided into N segments.
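As a concrete toy illustration of Eqs. (1)-(2) and of the $O(N^d)$ grid growth, the sketch below evaluates a GRBFNN output directly; the grid placement, widths, and weights are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def grbfnn(x, centers, widths, weights):
    """Plain GRBFNN output: f(x) = sum_k w_k * exp(-||x - mu_k||^2 / (2 sigma_k^2)).

    x:       (d,) input vector
    centers: (K, d) Gaussian centers mu_k
    widths:  (K,) Gaussian widths sigma_k
    weights: (K,) output weights w_k
    """
    sq_dist = np.sum((x - centers) ** 2, axis=1)    # ||x - mu_k||^2 for each unit
    units = np.exp(-sq_dist / (2.0 * widths ** 2))  # localized Gaussian responses G_k(x)
    return float(weights @ units)

# Grid placement illustrates the O(N^d) blow-up: N centers per axis -> N^d units.
d, N = 3, 4
axes = [np.linspace(-8, 8, N)] * d
centers = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
print(centers.shape[0])  # N**d = 64 units for d = 3, N = 4
```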
To address this issue, researchers have focused heavily on selecting the optimal number of neurons, as well as their centers and widths, such that the features of the target nonlinear map are well captured by the network. This has mainly been investigated through two strategies: (1) using supervised learning with dynamic adjustment of neurons (e.g., numbers, centers, and widths) according to prescribed criteria, and (2) performing unsupervised-learning-based preprocessing on the input to estimate the placement and configuration of neurons.
For the former, Poggio and Girosi (1990) as well as Wettschereck and Dietterich (1991) applied gradient descent to train generalized-radial-basis-function networks that have trainable centers. Regularization techniques (Poggio and Girosi, 1990) were adopted to maintain the parsimonious structure of GRBFNN. Platt (1991) developed a two-layer network that dynamically allocates localized Gaussian neurons to the positions where the output pattern is not well represented. Chen et al. (1991) adopted an Orthogonal Least Squares (OLS) method and introduced a procedure that iteratively selects the optimal centers minimizing the error reduction ratio until the desired accuracy is achieved. Huang et al. (2005) proposed a growing-and-pruning strategy to dynamically add or remove neurons based on their contributions to learning accuracy.
The latter, unsupervised-learning-based preprocessing methods, have been more popular because they decouple the estimation of centers and widths from the computation of weights, which reduces the complexity of the program as well as the computational load. Moody and Darken (1989) used the k-means clustering method (Wu, 2012) to determine the centers that minimize the Euclidean distance between the training set and the centers, followed by the calculation of a uniform width obtained by averaging the distance to the nearest neighbor of all units. Carvalho and Brizzotti (2001) investigated different clustering methods, such as the iterative optimization (IO) technique, depth-first search (DF), and the combination of IO and DF, for target recognition by RBFNNs. Niros and Tsekouras (2009) proposed a hierarchical fuzzy clustering method to estimate the number of neurons and trainable variables.
The optimization of widths has been of great interest more recently. Yao et al. (2010) numerically observed that the optimal widths of radial-basis functions are affected by the spatial distribution of the training data and the nonlinearity of the approximated functions. With this in mind, they developed a method that determines the widths using the Euclidean distance between centers and second-order derivatives of the function. However, calculating the width of each neuron is computationally expensive. Instead of assigning each neuron a distinct width, it is more efficient to assign different widths to the neurons representing different clusters. Therefore, Yao et al. (2012) further proposed a method that optimizes widths by dividing a global optimization problem into several subspace optimization problems that can be solved concurrently and then coordinated to converge to a global optimum. Similarly, Zhang et al. (2019) introduced a two-stage fuzzy clustering method to split the input space into multiple overlapped regions that are then used to construct a local Gaussian-radial-basis-function network.
However, the aforementioned methods all suffer from the curse of dimensionality. As the input dimension grows, the selection of optimal neurons itself can become cumbersome. To compound the problem, the number of optimal neurons can also rise exponentially when approximating high-dimensional and geometrically complex functions. Furthermore, these methods are designed for CPU-based, general-purpose computing machines and are not appropriate for modern GPU-oriented machine-learning tools (Abadi et al., 2016; Paszke et al., 2019), whose computational efficiency drops significantly when handling branching statements and dynamic memory allocation. This gap motivates us to reevaluate the structure of GRBFNN. As stated previously, the localized property of Gaussian functions is beneficial for identifying a parsimonious structure of GRBFNN with low input dimensions, but it also leads to a blow-up in the number of neurons in high-dimensional situations.
Given that the recent development of deep neural networks has shown promise in solving such problems, the main goal of this paper is to develop a deep-neural-network representation of GRBFNN that can be used for very high-dimensional problems. We approach this problem by utilizing the separable property of Gaussian radial-basis functions: every Gaussian-radial-basis function can be decomposed into the product of multiple uni-variate Gaussian functions. Based on this property, we construct a new neural network, namely the separable Gaussian neural network (SGNN), whose number of layers equals the number of input dimensions, with the neurons of each layer formed by the corresponding uni-variate Gaussian functions. By dividing the input into multiple columns by dimension and feeding them into the corresponding layers, an output equivalent to that of a GRBFNN is constructed from multiplications and summations in the forward propagation. It should be noted that Poggio and Girosi (1990) reported the separable property of Gaussian-radial-basis functions and proposed using it for neurobiology as early as 1990.
SGNN offers several advantages.
• The number of neurons of SGNN is O(dN) and increases linearly with the input dimension, whereas the number of neurons of GRBFNN, O(N^d), grows exponentially. This reduction of neurons also decreases the number of trainable variables from O(N^d) to O(dN^2), yielding a more compact network than GRBFNN.
• The reduction of trainable variables further decreases the computational load during training and testing of the network. As shown in Section 3, this leads to a 100-fold speedup in training time for approximating tri-variate functions.
• SGNN is much easier to tune than other MLPs. Since the number of layers in SGNN is equal to the number of dimensions of the input data, the only tunable network-structural hyper-parameter is the layer width, i.e., the number of neurons in a layer. This can significantly alleviate the tuning workload as compared to other MLPs, which must simultaneously tune the width and depth of layers.
• SGNN holds a similar level of accuracy as GRBFNN, making it particularly suitable for approximating multi-variate functions with complex geometry. In Section 7, it is shown that SGNN can yield approximations of complex functions that are three orders of magnitude more accurate than those of MLPs with ReLU and Sigmoid functions.
The rest of this paper is organized as follows. In Section 2, we introduce the structure of SGNN and use it to approximate a multi-variate real-valued function. In Section 3, we compare SGNN and GRBFNN in terms of the number of trainable variables and the computational complexity of forward and backward propagation. In Section 4, we show that SGNN can preserve the dominant sub-eigenspace of the Hessian of GRBFNN in the gradient descent search. This property can help SGNN maintain a similar level of accuracy as GRBFNN while substantially improving computational efficiency. In Section 5, we show that the computational time of SGNN scales linearly with increasing dimension and demonstrate its efficacy in function approximation through numerous examples. In Sections 6 and 7, extensive comparisons between SGNN and GRBFNN and between SGNN and MLPs are performed. Finally, conclusions are summarized in Section 8.

Figure 1: The SGNN that approximates a tri-variate function. The input is divided and fed sequentially to each layer; therefore, the depth (number of layers) of the network is identical to the number of input dimensions. In this paper, the weights of the output layer are unity.
Remark. The Gaussian radial-basis function is separable and can be represented in the form

$G(\mathbf{x}) = \prod_{k=1}^{d} \varphi^{(k)}(x_k)$, where $\varphi^{(k)}(x_k) = \exp\left(-\frac{(x_k - \mu_k)^2}{2\sigma_k^2}\right)$.

The product chain in Eq. (5) can be constructed through the forward propagation of a feedforward network with a single neuron per layer, where $\varphi^{(k)}(x_k)$ is the neuron of the k-th layer. This way, the multi-variate Gaussian function $G(\mathbf{x})$ is reconstructed at the output of the network. By adding more neurons to each layer and assigning weights to all edges, we can eventually construct a network whose output is equivalent to that of a GRBFNN. Fig. 1 shows an example of an SGNN approximating a tri-variate function. Next, we use this property to define SGNN.
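The separability claim is easy to verify numerically; the following sketch (with arbitrary centers and widths) checks that the d-variate Gaussian equals the product of its uni-variate factors.

```python
import numpy as np

# Separability of the Gaussian radial-basis function:
# exp(-sum_k (x_k - mu_k)^2 / (2 sigma_k^2)) == prod_k exp(-(x_k - mu_k)^2 / (2 sigma_k^2))
rng = np.random.default_rng(0)
x = rng.normal(size=5)
mu = rng.normal(size=5)
sigma = rng.uniform(0.5, 2.0, size=5)

multivariate = np.exp(-np.sum((x - mu) ** 2 / (2.0 * sigma ** 2)))
univariate_product = np.prod(np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)))

print(np.isclose(multivariate, univariate_product))  # True
```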

Definition 2.2. The separable-Gaussian neural network (SGNN) with d-dimensional input can be constructed in the form

$\mathcal{N}_i^{(1)}(x_1) = \varphi_i^{(1)}(x_1)$,
$\mathcal{N}_i^{(l+1)}(x_{l+1}) = \varphi_i^{(l+1)}(x_{l+1}) \sum_{j=1}^{N_l} w_{ij}^{(l)} \mathcal{N}_j^{(l)}(x_l), \quad l = 1, 2, \ldots, d-1$,
$f(\mathbf{x}) = \sum_{i=1}^{N_d} \mathcal{N}_i^{(d)}(x_d)$,

where $\varphi_i^{(l)}(x_l) = \exp\left(-(x_l - \mu_i^{(l)})^2 / (2(\sigma_i^{(l)})^2)\right)$, $N_l$ (l = 1, 2, ..., d) represents the number of neurons of the l-th layer, and $\mathcal{N}_i^{(l)}$ represents the output of the i-th Gaussian neuron (activation function) of the l-th layer. Substitution of Eqs. (6) to (8) into Eq. (9) yields the explicit output of SGNN, where $w_{ij}^{(l)}$ (l = 1, 2, ..., d-1) represents the weight connecting the j-th neuron of the l-th layer to the i-th neuron of the (l+1)-th layer. The loss function of the SGNN is defined as the mean-squared error between the network output and the target values. The centers $\mu_i^{(l)}$ and widths $\sigma_i^{(l)}$ can also be treated as trainable; they are not included in this discussion for simplicity.
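A minimal numpy sketch of the layer recursion in Definition 2.2 (our reading of Eqs. (6)-(9); the random weights and the grid placement of centers are illustrative assumptions, with unit output weights as in Fig. 1):

```python
import numpy as np

def sgnn_forward(x, mu, sigma, W):
    """Forward pass of an SGNN.

    x:     (d,) input, one coordinate fed to each hidden layer
    mu:    (d, N) centers of the uni-variate Gaussian neurons
    sigma: (d, N) widths
    W:     list of d-1 weight matrices, W[l] has shape (N, N)
    Output-layer weights are unity.
    """
    d, N = mu.shape
    phi = np.exp(-(x[:, None] - mu) ** 2 / (2.0 * sigma ** 2))  # all uni-variate neurons
    h = phi[0]                                                  # first-layer outputs
    for l in range(d - 1):
        h = phi[l + 1] * (W[l] @ h)   # gate the weighted sum by layer-(l+1) Gaussians
    return float(np.sum(h))

d, N = 3, 20
rng = np.random.default_rng(1)
mu = np.tile(np.linspace(-8, 8, N), (d, 1))   # centers evenly spaced per dimension
sigma = np.full((d, N), 16.0 / (N - 1))       # width = spacing between adjacent centers
W = [rng.normal(size=(N, N)) for _ in range(d - 1)]
print(sgnn_forward(rng.normal(size=d), mu, sigma, W))
```

With identity weight matrices, the output reduces to a sum of N separable Gaussians, which matches the Remark above.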

SGNN vs. GRBFNN
Without loss of generality, the analysis below assumes that each hidden layer has N neurons. To understand how the weights of SGNN relate to those of GRBFNN, we equate Eqs. (1) and (10), which yields a nonlinear map from the SGNN weights to the GRBFNN weights whose explicit form is

$\bar{W}_{i_1 i_2 \cdots i_d} = \prod_{l=1}^{d-1} w_{i_{l+1} i_l}^{(l)}$.

It is evident that SGNN can be transformed into GRBFNN. However, GRBFNN can be converted into SGNN if and only if the mapping of Eq. (14) is invertible.
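For d = 2, the product map reduces to a single weight matrix, so the two networks are trivially mutually convertible; the sketch below (arbitrary centers, widths, and weights) checks the SGNN-GRBFNN equivalence numerically.

```python
import numpy as np

# For d = 2, unit (i, j) of the equivalent GRBFNN is the separable product
# phi2_i * phi1_j with weight w[i, j] (a numerical sketch, not a proof).
rng = np.random.default_rng(2)
N = 8
mu = np.tile(np.linspace(-8, 8, N), (2, 1))
sigma = np.full((2, N), 1.5)
w = rng.normal(size=(N, N))
x = rng.uniform(-8, 8, size=2)

phi = np.exp(-(x[:, None] - mu) ** 2 / (2.0 * sigma ** 2))
sgnn_out = np.sum(phi[1] * (w @ phi[0]))   # SGNN forward pass

grbfnn_out = sum(w[i, j] * phi[1, i] * phi[0, j]
                 for i in range(N) for j in range(N))
print(np.isclose(sgnn_out, grbfnn_out))  # True
```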
Because the mapping of Eq. (14) is not uniquely invertible, it is difficult to prove the universal approximability of SGNN. However, this paper presents extensive numerical experiments showing that SGNN can achieve comparable (occasionally even greater) accuracy with much less computational effort than GRBFNN. In addition, SGNN can have superior performance to deep neural networks with activation functions such as ReLU and Sigmoid when approximating complex functions, as shown in Section 7.
In the following, we demonstrate the computational efficiency of SGNN over GRBFNN in terms of trainable variables and the number of floating-point operations of forward and backward propagation.

Trainable Variables
Let us now treat the centers and widths of the uni-variate Gaussian functions in SGNN as trainable. The total number $N_t$ of trainable variables of SGNN is then

$N_t = (d-1)N^2 + 2dN$,

where $(d-1)N^2$ counts the weights between hidden layers and $2dN$ counts the centers and widths. Note that the number of trainable variables of GRBFNN is $N^d$, identical to its number of neurons. SGNN and GRBFNN have identical weights when the number of layers is smaller than or equal to two. In other words, they are mutually convertible and the mapping of Eq. (14) is invertible when d ≤ 2. However, for high-dimensional problems, as shown in Table 1, SGNN substantially reduces the number of trainable variables, making it more tractable than GRBFNN.
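The two counts can be tabulated directly; the helper below encodes $N_t = (d-1)N^2 + 2dN$ for SGNN (trainable centers and widths, unit output weights) against the $N^d$ weights of GRBFNN. The sample values of d and N are illustrative.

```python
def sgnn_params(d, N):
    """Trainable variables of SGNN: (d-1) weight matrices of size N x N between
    hidden layers, plus a center and a width per uni-variate neuron.
    Output-layer weights are unity and therefore not counted."""
    return (d - 1) * N * N + 2 * d * N

def grbfnn_params(d, N):
    """GRBFNN with fixed centers and widths: one weight per neuron on the N^d grid."""
    return N ** d

for d in (2, 3, 5, 10):
    print(d, sgnn_params(d, 20), grbfnn_params(d, 20))
```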

Forward Propagation
Assume the size of the input dataset is m. Using Eqs. (6) to (9), we can estimate the number of floating-point operations (FLOP) of the forward pass in SGNN. More specifically, the number of FLOP needed to calculate the output of the k-th layer (2 ≤ k ≤ d) from the output of the previous layer is

$m(2N^2 + 6N)$,

where $2N^2$ is the number of arithmetic operations in the product of the weights with the previous layer's outputs, $6N$ is the number of operations for evaluating the Gaussian functions of the layer, and m is the size of the input dataset. In addition, the numbers of FLOP associated with the first layer and the output layer are $6mN$ and approximately $mN$, respectively. Therefore, the total number of FLOP is approximately

$m\left[(d-1)(2N^2 + 6N) + 7N\right] = O(2mdN^2)$.

The number of operations increases linearly with the number of layers, i.e., with the dimension d of the input vector. On the other hand, the computational complexity of the forward pass of GRBFNN is $O(mdN^d)$, regardless of the trainability of the centers and widths of the Gaussian functions.
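A small helper makes the linear-in-d scaling explicit; the per-layer constants follow the text, while the first- and output-layer terms are approximations.

```python
def sgnn_forward_flops(m, d, N):
    """Approximate forward-pass FLOP count of SGNN for m samples, d layers,
    and N neurons per layer."""
    first = 6 * N                            # Gaussian evaluations of the first layer
    hidden = (d - 1) * (2 * N * N + 6 * N)   # weighted product + Gaussians, layers 2..d
    out = N                                  # final summation over the last layer
    return m * (first + hidden + out)

def grbfnn_forward_flops(m, d, N):
    """GRBFNN evaluates N^d d-variate Gaussians: O(m d N^d)."""
    return m * d * N ** d

for d in (2, 3, 4, 5):
    print(d, sgnn_forward_flops(1000, d, 20), grbfnn_forward_flops(1000, d, 20))
```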

Backward Propagation
Accurately estimating the computational complexity of backward propagation is challenging because techniques such as automatic differentiation (Baydin et al., 2018) and computational graphs (Abadi et al., 2016) optimize the underlying mathematical operations to improve performance. Automatic differentiation evaluates the derivatives of numerical functions using dual numbers, with the chain rule broken into a sequence of operations such as addition, multiplication, and composition. During forward propagation, intermediate values in the computational graph are recorded for use in backward propagation.
We analyze the operations of backward propagation with respect to a single neuron of the l-th layer. The partial derivatives of $f(\mathbf{x})$ with respect to the weights $w_{ji}^{(l)}$, center $\mu_j^{(l)}$, and width $\sigma_j^{(l)}$ are obtained through the chain rule. The backward propagation with respect to the j-th neuron of the l-th layer (1 ≤ l ≤ d − 1) can be divided into three steps: 1. Compute the gradient of f with respect to the output of the j-th neuron in the (l+1)-th layer, $\mathcal{N}_j^{(l+1)}$, as shown in Eq. (25), where the upstream gradient can be accessed from the backward propagation of the (l+2)-th layer. This leads to 2N FLOP due to the dot product of two vectors.

2. Calculate the partial derivatives of $\mathcal{N}_j^{(l+1)}$ with respect to the weights, center, and width. Since the calculation of these derivatives is computationally cheap, the analysis below neglects the operations used to evaluate them; this does not affect the conclusion. 3. Propagate the gradients backward. This produces N + 2 operations.
Therefore, the number of FLOP of the l-th layer is approximately $m(3N^2 + 2N)$, where m is the size of the input dataset. The backward propagation of the last layer leads to N operations. In total, the number of FLOP of backward propagation is approximately

$m\left[d(3N^2 + 2N) + N\right] = O(3mdN^2)$.

On the other hand, the number of backward-propagation FLOP of GRBFNN is $O(mdN^d)$.

Subspace Gradient Descent
As illustrated in Section 3, SGNN has exponentially fewer trainable variables than the associated GRBFNN for high-dimensional input. In other words, GRBFNN may be over-parameterized. Recent work (Sagun et al., 2017; L. et al., 2018; Gur-Ari et al., 2018) has shown that optimizing a loss function constructed from an over-parameterized neural network can lead to Hessian matrices that possess a few dominant eigenvalues and many near-zero ones, both before and after training. This means that gradient descent happens in a small subspace. Inspired by their work, we consider the infinitesimal variation of the loss function $\bar{J}$ of GRBFNN,

$\delta\bar{J} = \nabla\bar{J}^T \delta\bar{\theta} + \frac{1}{2}\,\delta\bar{\theta}^T \bar{H}\,\delta\bar{\theta} + O(\|\delta\bar{\theta}\|^3)$,

where $\bar{\theta}$ represents the vector of all trainable weights and $\bar{H}$ is the associated Hessian matrix. The centers and widths of the Gaussian functions are assumed to be constant for simplicity. Since the Hessian matrix $\bar{H}$ is symmetric, we can represent it in the form $\bar{H} = Q\,\mathrm{diag}(\Lambda_d, \Lambda_s)\,Q^T$, where $\Lambda_d = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k, \ldots, \lambda_{dN})$ contains the k dominant eigenvalues padded by (dN − k) non-dominant ones (assuming k < dN), and $\Lambda_s = \mathrm{diag}(\lambda_{dN+1}, \lambda_{dN+2}, \ldots, \lambda_{N^d})$ contains the remaining non-dominant eigenvalues.
Let $\theta$ be the weights of SGNN. The variation of the mapping $\bar{\theta} = g(\theta)$ in Eq. (13) reads

$\delta\bar{\theta} = \frac{\partial g}{\partial \theta}\,\delta\theta$,

where $\frac{\partial g}{\partial \theta} \in \mathbb{R}^{N^d \times dN}$. It should be noted that $\frac{\partial g}{\partial \theta}$ is a very sparse matrix.
Substitution of Eq. (32) into Eq. (29) yields

$\delta\bar{J} = \nabla\bar{J}^T \frac{\partial g}{\partial \theta}\,\delta\theta + \frac{1}{2}\,\delta\theta^T H\,\delta\theta + O(\|\delta\theta\|^3)$, with $H = \left(\frac{\partial g}{\partial \theta}\right)^T \bar{H}\,\frac{\partial g}{\partial \theta}$.

Therefore, the dominant eigenvalues of the Hessian of GRBFNN are also retained in the corresponding SGNN. This means that the gradient of SGNN can descend in the mapped dominant non-flat subspace of GRBFNN, which may explain the comparable accuracy and training efficiency of SGNN relative to GRBFNN, as discussed in Section 3.
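The projection structure of the mapped Hessian can be illustrated with a toy example; the dimensions and the dense random stand-in for the (actually sparse) Jacobian are assumptions for illustration only.

```python
import numpy as np

# Toy illustration: for a quadratic loss and a linear map tb = G theta, the
# SGNN-side Hessian is the projection G^T H G, whose spectrum is controlled
# by the big Hessian H and the map G.
rng = np.random.default_rng(3)
big, small = 12, 4
A = rng.normal(size=(big, big))
H = A @ A.T                              # symmetric PSD stand-in Hessian of GRBFNN
G = rng.normal(size=(big, small))        # dense stand-in for dg/dtheta

H_small = G.T @ H @ G                    # mapped Hessian seen by the SGNN weights
top_big = np.linalg.eigvalsh(H)[-1]      # eigvalsh returns ascending eigenvalues
top_small = np.linalg.eigvalsh(H_small)[-1]
# Rayleigh-quotient bound: lam_max(G^T H G) <= lam_max(H) * sigma_max(G)^2
print(top_small <= top_big * np.linalg.norm(G, 2) ** 2)  # True
```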

Candidate Functions
We consider ten candidate functions adapted from Andras (2014, 2018), as listed in Table 2. The functions cover a range of distinct features, including sinks, sources, flat and s-shaped surfaces, and multiple sinks and sources, which can assist in benchmarking the function approximations of different neural networks.
We generate uniformly distributed sample sets to train the neural networks for each run, with upper and lower bounds of each dimension ranging from -8 to 8. During the training process, we employ mini-batch gradient descent with the Adam optimizer in TensorFlow to update the model parameters. The optimizer uses its default training parameters and stops if no improvement of the loss value is achieved in four consecutive epochs. The dataset is divided into a training set comprising 80% of the data and a validation set consisting of the remaining 20%. The mini-batch size, number of neurons, and number of data points are selected to balance convergence speed and accuracy. All tests are performed on a Windows 10 desktop with a 3.6 GHz, 8-core Intel i7-9700K CPU and 64 GB of Samsung DDR3 RAM.
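The experiments rely on TensorFlow's Adam with mini-batches, an 80/20 split, and patience-based stopping; the loop below is a plain-numpy sketch of that procedure on a stand-in least-squares model (the model, learning rate, and batch size are illustrative assumptions, not the paper's tuned settings).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-8.0, 8.0, size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])                 # stand-in target, not a Table 2 function
split = int(0.8 * len(X))                          # 80/20 train/validation split
Xtr, ytr, Xval, yval = X[:split], y[:split], X[split:], y[split:]

theta = np.zeros(3)
m1, v1 = np.zeros(3), np.zeros(3)                  # Adam first/second moments
lr, b1, b2, eps, batch = 0.05, 0.9, 0.999, 1e-8, 64
best, bad, t = np.inf, 0, 0
for epoch in range(200):
    perm = rng.permutation(len(Xtr))
    for start in range(0, len(Xtr), batch):        # mini-batch gradient descent
        idx = perm[start:start + batch]
        grad = 2.0 * Xtr[idx].T @ (Xtr[idx] @ theta - ytr[idx]) / len(idx)
        t += 1
        m1 = b1 * m1 + (1 - b1) * grad
        v1 = b2 * v1 + (1 - b2) * grad ** 2
        theta -= lr * (m1 / (1 - b1 ** t)) / (np.sqrt(v1 / (1 - b2 ** t)) + eps)
    val_mse = float(np.mean((Xval @ theta - yval) ** 2))
    if val_mse < best:                             # patience-based early stopping
        best, bad = val_mse, 0
    else:
        bad += 1
        if bad >= 4:                               # no improvement in 4 consecutive epochs
            break

print(val_mse)
```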

Dimension Scalability
To understand the dimensional scalability of SGNN, we applied SGNN to candidate functions with input dimensions from two to five. For comparison, the number of data points was kept at 16384 such that sufficient data was sampled for the 5-D functions (d = 5). Each layer has 20 uni-variate Gaussian neurons, with initial centers evenly distributed in each dimension and widths equal to the distance between two adjacent centers. The training time per epoch grows linearly as the dimension increases, with an increment of 0.02 seconds/epoch per layer. For the majority of candidate functions, SGNN can achieve an accuracy level of 10^-4. It is sufficient to approximate the 5-D functions by an SGNN with 5 layers and in total 100 neurons. This configuration of SGNN cannot approximate the function f_5 well in 4-D; this can be easily resolved by adding more neurons to the network (see a similar example in Table 7). In summary, the computational time of SGNN scales linearly with the number of dimensions.

2-D Examples
First, SGNN is used to approximate the two-dimensional function $f_3(\mathbf{x}) = \frac{1}{5}e^{(x_1^2 + x_2^2)/50}$, which has four sharp peaks and one flat valley in the domain. As illustrated in Fig. 2(a), the optimizer converges in 400 steps, with the difference between the training and test sets at the magnitude level of 10^-4. Figs. 2(b)-(e) show that the prediction by SGNN is nearly identical to the ground truth, except for the domain near the boundaries. This can be attributed to fewer sampling points in the neighborhood of the boundaries. Better alignment can be achieved by adding extra boundary points to the input dataset.
SGNN maintains its level of accuracy as the candidate functions become more complex. For example, Fig. 3 presents the approximation of $f_4(\mathbf{x}) = \frac{1}{5}\left(e^{x_1^2/50}\sin x_2 + e^{x_2^2/50}\sin x_1\right)$. SGNN can approximate f_4 with the same level of accuracy as f_3 even with fewer training epochs, possibly owing to the localization property of the Gaussian function. The largest error again appears near the boundaries, with a percentage error of less than 8%. Inside the domain, the computed values precisely match the exact ones. As visualized in Figs. 3(d) and (e), the prediction by SGNN fully captures the features of the function. This finding is corroborated in Fig. 4, which presents the approximation of the function $f_5(\mathbf{x}) = \frac{1}{50}\left(x_1^2\cos x_1 + x_2^2\cos 2x_2\right)$ by SGNN. The function, unlike f_4, possesses peaks and valleys near the boundaries and becomes flat in the vicinity of the origin, as illustrated in Figs. 4(c)-(e). Interestingly, the neural network converges faster than the network for f_4. This indicates that the loss function may be more convex and contain fewer flat regions. One possible reason is that as the function becomes more complex, more Gaussian neurons are active and have larger weights, increasing the loss gradients. The largest error is again observed near the boundaries. As shown in Fig. 4, SGNN captures the features of the target function f_5 well. Due to the graded configuration of the color bar, a small offset with respect to the ground truth is visible near the origin, but the corresponding absolute errors are very small, as shown in Fig. 4(b).
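For reference, the three 2-D functions discussed above can be coded directly (as read from the text; the exact Table 2 definitions may differ in scaling), together with the uniform sampling on [-8, 8]^2 used for training.

```python
import numpy as np

def f3(x1, x2):
    """Four sharp corner peaks, flat central valley."""
    return 0.2 * np.exp((x1 ** 2 + x2 ** 2) / 50.0)

def f4(x1, x2):
    """Mixed exponential-sinusoidal surface."""
    return 0.2 * (np.exp(x1 ** 2 / 50.0) * np.sin(x2)
                  + np.exp(x2 ** 2 / 50.0) * np.sin(x1))

def f5(x1, x2):
    """Peaks and valleys near the boundaries, flat near the origin."""
    return 0.02 * (x1 ** 2 * np.cos(x1) + x2 ** 2 * np.cos(2.0 * x2))

# Uniform samples on [-8, 8]^2, the training domain used throughout the paper.
rng = np.random.default_rng(4)
X = rng.uniform(-8.0, 8.0, size=(1024, 2))
y = f4(X[:, 0], X[:, 1])
print(X.shape, y.shape)
```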

5-D Examples
The approximation of the five-dimensional functions f_1 to f_10 by SGNN is illustrated through cross-sectional plots in the x_1-x_2 plane with the three other variables fixed to zero, as shown in Figs. 5 and 6. A training set of 32768 points is generated for all functions in order to maintain consistency, although fewer points could be used when the function shape is simple (e.g., a single sink or source). The validation set used to produce the prediction plots is generated by uniformly partitioning the subspace, with the number of grid points per dimension equal to twice the number of neurons per layer.
SGNN can accurately capture the features of all candidate functions regardless of their geometric complexity. Although the predictions of SGNN show minor disagreements with the ground truth where the function (e.g., f_10) is constant, the differences are less than 3%.

Comparison of SGNN and GRBFNN
The performance of SGNN and GRBFNN in approximating the two-dimensional and three-dimensional candidate functions is presented in Tables 4 and 5, respectively. For comparison, the centers and widths of the Gaussian neurons of GRBFNN are set to be trainable variables as well. We focus on the differences in total epochs, training time per epoch, and losses. The results are obtained by averaging over 30 runs. As shown in Table 4, when approximating two-dimensional functions, SGNN achieves accuracy comparable to GRBFNN, with differences of less than one order of magnitude in most cases. The worst case occurs when approximating f_1. However, the absolute difference is around 1.0E-3, and SGNN still gives a reasonably good approximation. On the other hand, the training time per epoch of SGNN is roughly one-tenth of that of GRBFNN.
The advantage of SGNN becomes more evident in three-dimensional function approximations. SGNN gains a one-hundred-fold speedup over GRBFNN while maintaining a similar level of accuracy. Surprisingly, SGNN can also yield more accurate results when approximating f_3 to f_6.

Comparison with Deep NNs
In this section, we compare the performance of SGNN with deep ReLU and Sigmoid NNs, which are two popular choices of activation functions. Through the approximation of four-dimensional candidate functions, SGNN shows much better trainability and approximability than deep ReLU and Sigmoid NNs.
Although SGNN requires appreciably more training epochs, this also leads to more accurate predictions. The loss values of SGNN after training are uniformly smaller than those of ReLU-NN and Sigmoid-NN, except for f_10. In fact, for f_2, f_4, f_6, and f_7, the accuracy of SGNN is two orders of magnitude better than that of the other two models.
Despite the efficient training speed of Sigmoid-NN, the network is more difficult to train with random weight initialization for f_1 and f_5. In fact, the approximation of f_5 by Sigmoid-NN is nowhere close to the ground truth after training. When functions become more complex, SGNN outperforms ReLU-NN and Sigmoid-NN in minimizing the loss through stochastic gradient descent. This could be attributed to the locality of Gaussian functions, which increases the number of active neurons and reduces the flat subspace in which gradients diminish. Sigmoid-NN stops after significantly fewer epochs. This could be caused by the small derivatives of Sigmoid functions when the input stays within the saturation region, which makes the network more difficult to train.
Next, we further compare the trainability of SGNN with that of ReLU-DNNs. We train the two networks with different configurations to approximate the function f_5, which has a more complex geometry and is more difficult to approximate. The configurations of the NNs and the training performance are listed in Table 7.
Because the number of layers of SGNN is fixed by the number of function variables, its only tunable network hyper-parameter is the number of neurons per layer. Doubling the neurons per layer of SGNN further reduces its loss. The accuracy of ReLU-DNN also increases slightly with the width and depth of the model: a loss reduction of close to 50% is achieved by adding 7 more layers and 50 more neurons per layer. However, the error is still three orders of magnitude higher than that of a 4-layer SGNN with one-tenth the trainable variables and half the training time per epoch. Although, according to the universal approximation theorem, one could keep expanding the network structure to improve accuracy, this is contradicted by the observation in the last row. This is because the convergence of gradient descent can become a practical obstacle when the network is over-parameterized. In this situation, the network may impose a very high requirement on the initial weights to yield optimal solutions.
To visualize the differences in expressiveness between SGNN and ReLU-NN, the predictions of one run in Table 7 are selected and plotted through a cross-sectional cut in the x_1-x_2 plane with the other two variables x_3 and x_4 fixed at zero, as shown in Fig. 7. The network configurations are listed in Table 8. The predictions of SGNN in Fig. 7(b) match the ground truth in Fig. 7(a) well. Despite the minor differences in color near the origin, their maximum magnitude is less than 0.1. The ReLU-NN with the same structure yields a much worse approximation. Although the network gradually captures the main geometric features of f_5 when its structure is significantly augmented to 10 layers and 70 neurons per layer, the difference in magnitude can still be as large as 0.5, as shown in Fig. 7(f).

Conclusions
In this paper, we reexamined the structure of GRBFNN in order to make it tractable for problems with high-dimensional input. By using the separable property of Gaussian radial-basis functions, we proposed a new feedforward network, the Separable Gaussian Neural Network (SGNN), whose output is identical to that of a GRBFNN. Unlike traditional MLPs, SGNN splits the input data into multiple columns by dimension and feeds them into the corresponding layers in sequence. Compared with GRBFNN, SGNN significantly reduces the number of neurons, trainable variables, and the computational load of forward and backward propagation, leading to an exponential improvement in training efficiency. SGNN can also preserve the dominant subspace of the Hessian matrix of GRBFNN in gradient descent and therefore offers a comparable minimal loss. Extensive numerical experiments have been carried out, demonstrating that SGNN has superior computational performance over GRBFNN while maintaining a similar level of accuracy. In addition, SGNN is superior to MLPs with ReLU and Sigmoid units when approximating complex functions. Further investigation should focus on the universal approximability of SGNN and its applications to physics-informed neural networks (PINNs) and reinforcement learning.

Figure 5: Prediction vs. exact values of f_1-f_5 in five dimensions. The plots are generated by projecting the surface onto the x_1-x_2 plane with the other coordinates fixed to zero. The left panel shows the prediction; the right panel shows the exact values. Size of training dataset: 32768.

Figure 6: Prediction vs. exact values of f_6-f_10 in five dimensions. The plots are generated by projecting the surface onto the x_1-x_2 plane with the other coordinates fixed to zero. Left panel: prediction; right panel: ground truth. Size of training dataset: 32768.

Figure 7: Approximation of the four-dimensional f_5 using SGNN and ReLU-NNs. The plots are generated by projecting the surface onto the x_1-x_2 plane with the other coordinates all zero. (a) Ground truth; (b) SGNN; (c)-(f) ReLU-NNs with different network configurations. The layers and neurons per layer of the NNs are listed in Table 8.

Table 1: Neurons and trainable variables of SGNN and GRBFNN.

Table 2: Candidate functions and their features.

Table 3: The computation time per epoch of SGNN scales linearly with the increase of dimensions.
Next, two- and five-dimensional examples are selected to illustrate the expressiveness of SGNN in function approximation. The number of neurons, training size, and mini-batch size are fine-tuned to achieve optimal results.

Table 4: Two-dimensional function approximations using SGNN and GRBFNN. Data is generated by averaging the results of 30 runs. Sample points: 1024; mini-batch size: 64; neurons per layer: 10.

Table 5: Approximations of tri-variate functions using SGNN and GRBFNN. SGNN achieves a 100-fold speedup over GRBFNN, with even smaller loss values for functions f_3-f_5 (highlighted). Data was generated by averaging the results of 30 runs. Sample points: 2048; mini-batch size: 64; neurons per layer: 10.
Table 6 presents the training time per epoch, the total number of training epochs, and the loss after training for the three deep NNs, averaged over 30 runs. All NNs possess four hidden layers with 20 neurons per layer. The training-set size is fixed at 16384, with a mini-batch size of 256. As opposed to SGNN and Sigmoid-NN, which have stable training times per epoch across all candidate functions, the time of ReLU-NN fluctuates. This might be caused by the difference in calculating derivatives of a ReLU unit for inputs less than or greater than zero. SGNN has a longer training time per epoch because of the computation of the Gaussian function and the derivatives with respect to μ and σ. One may argue that this comparison is unfair because SGNN has extra trainable variables. However, SGNN has fewer trainable weights (see Table 7) because no weights connect the input and the first layer, and the output layer is not trainable.

Table 6: Performance comparison of SGNN and deep neural networks with ReLU and Sigmoid activation functions. Data is generated by averaging the results of 30 runs. All NNs have four hidden layers, with 20 neurons per layer.

Table 7: Comparison of SGNN and ReLU-based NNs in the approximation of f_5. Results are generated by averaging the data of 30 runs.

Table 8: Network configurations of the subplots of Fig. 7.