Nonparametric Estimation for High-Dimensional Spatial Models Based on a Deep Neural Network

Abstract: Owing to the high dimensionality and dependence of spatial data, traditional parametric methods suffer from the curse of dimensionality. The theoretical properties of deep neural network estimators for high-dimensional spatial models with dependence and heterogeneity have been investigated in only a few studies. In this paper, we propose a deep neural network with the ReLU activation function to estimate the unknown trend component, taking both spatial dependence and heterogeneity into account. We prove the consistency of the estimated components under spatial dependence conditions and provide an upper bound for the mean squared error (MSE). Simulations and an empirical study demonstrate that the convergence speed of the neural network method is significantly better than that of the local linear method.


Introduction
Spatial data arise in many fields, including environmental science, econometrics, epidemiology, image analysis, oceanography, geography, geology, plant ecology, archaeology, agriculture, and psychology. Spatial correlation and spatial heterogeneity are two significant features of spatial data. Various spatial modeling methods have been applied to explore the effects of spatial heterogeneity. Notably, numerous local spatial techniques have been proposed to accommodate spatial heterogeneity. For example, Hallin et al. [1] and Biau and Cadre [2] proposed a local linear method for the modeling of spatial heterogeneity. Bentsen et al. [3] used a graph neural network architecture to extract spatial dependencies, with different update functions to learn temporal correlations.
Spatial data often exhibit high dimensionality, large scale, heterogeneity, and strong complexity. These challenges often render traditional statistical methods ineffective, whereas statistical machine learning methods can address them effectively. Du et al. [4] pointed out that traditional spatial data are large-scale and generally complex, and summarized the effectiveness and application potential of four advanced machine learning approaches for handling complex spatial data: support vector machine (SVM)-based kernel learning, semi-supervised and active learning, ensemble learning, and deep learning. Farrell et al. [5] highlighted the challenges that high-dimensional spatial data, large data volumes, and multicollinearity among covariates pose to variable selection in traditional statistical models; three machine learning algorithms, namely maximum entropy (MaxEnt), random forests (RFs), and support vector machines (SVMs), were employed to mitigate multicollinearity in high-dimensional spatial data. Nikparvar et al. [6] noted that the properties of spatially explicit data are often ignored or inadequately addressed in machine learning applications within spatial domains, and argued that the future prospects of spatial machine learning are very promising.
Statistical machine learning methods have advanced rapidly, but their theoretical foundations are not yet well established. Schmidt-Hieber [7] investigated the nonparametric regression model

Y_i = f(X_i) + ε_i, i = 1, . . ., n,

where the noise variables ε_i are assumed to be i.i.d., and the covariate vectors X_i ∈ [0, 1]^d and responses Y_i ∈ R are independently and identically distributed. It was shown that estimators based on sparsely connected deep neural networks with the ReLU activation function and a suitably chosen network architecture achieve the minimax rates of convergence (up to log n factors) under a general composition assumption on the regression function f.
Considering the dependence and heterogeneity of spatial models, we study the following nonparametric high-dimensional spatial model:

Y_i = m(X_i) + R_i, i ∈ Λ_n := {(i_1, i_2, . . ., i_N) : i_j = 1, 2, . . ., n_j, j = 1, 2, . . ., N},

where Y_i ∈ R, X_i ∈ R^d, m(X_i) represents the trend function, R_i satisfies the α-mixing condition (the definition can be found at the beginning of Section 2.2), and n = n_1 + . . . + n_N. For example, Y may denote hourly ozone concentrations, and X a 5 × 1 vector of explanatory variables observed at each station: wind speed, air pressure, air temperature, relative humidity, and elevation. The observation locations are recorded by longitude i_1 and latitude i_2; in this case, d = 5 and N = 2; see [8].
Under general assumptions, we prove the consistency of the estimator and provide bounds for the mean squared error (MSE). On the simulation side, a comparison with the local linear regression method demonstrates that the neural network method converges much faster. In the empirical study, which considers the air pollution index, air pollutants, and environmental factors, the effectiveness of the neural network is demonstrated through a comparison with the local linear regression method, especially in small-sample cases.
Throughout the rest of the paper, bold letters represent vectors; for example, x := (x_1, . . ., x_d). We define |x|_∞ := max_i |x_i| and |x|_0 := ∑_i I(x_i ≠ 0), where I is the indicator function, so |x|_0 is the number of nonzero entries of x. We set |x|_p := (∑_{i=1}^d |x_i|^p)^{1/p}, and we write |f|_p := |f|_{L^p(D)} for the L^p norm on a domain D, which may differ between occurrences. For two sequences (a_n)_n and (b_n)_n, we write a_n ≲ b_n if there exists a constant C such that a_n ≤ C b_n for all n; moreover, a_n ≍ b_n means that a_n ≲ b_n and b_n ≲ a_n. log_2 denotes the logarithm to base 2, ln the natural logarithm, ⌈x⌉ the smallest integer ≥ x, and ⌊x⌋ the largest integer ≤ x.

Mathematical Modeling of Deep Network Features
Definition 1. Fitting a multilayer neural network requires the choice of an activation function σ : R → R and of the network architecture. Motivated by its importance in deep learning, we study the rectified linear unit (ReLU) activation function σ(x) = max(x, 0); see [7].
For v = (v_1, . . ., v_r) ∈ R^r, the shifted activation function σ_v : R^r → R^r is defined as

σ_v(y_1, . . ., y_r) := (σ(y_1 − v_1), . . ., σ(y_r − v_r)).

The network architecture (L, p) consists of a positive integer L, known as the number of hidden layers or depth, and a width vector p = (p_0, . . ., p_{L+1}) ∈ N^{L+2}. A neural network with architecture (L, p) is any function of the following form:

f(x) = W_L σ_{v_L} W_{L−1} σ_{v_{L−1}} · · · W_1 σ_{v_1} W_0 x, (3)

where W_i is a p_{i+1} × p_i weight matrix and v_i ∈ R^{p_i} is a shift vector, i = 1, 2, . . ., L. The network function is thus constructed by alternating matrix-vector multiplications with the action of the nonlinear activation σ_v. In Equation (3), the shift vectors can also be omitted by considering the input as (x, 1) and augmenting the weight matrices with an additional row and column. To fit the network to data generated by a d-dimensional nonparametric regression model, we must have p_0 = d and p_{L+1} = 1. Given a network function f, the network parameters are the entries of the matrices (W_j)_{j=0,...,L} and the vectors (v_j)_{j=1,...,L}; these parameters need to be estimated/learned from the data. In this context, "estimate" and "learn" are used interchangeably, as estimating the parameters from data is referred to as learning in neural networks and machine learning.
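The alternating structure of affine maps and shifted ReLU activations in Equation (3) can be sketched in a few lines. The following is a minimal NumPy sketch; the layer sizes and random weights are illustrative choices, not taken from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def network(x, weights, shifts):
    """Evaluate f(x) = W_L sigma_{v_L}( ... W_1 sigma_{v_1}(W_0 x) ... ).

    weights: list [W_0, ..., W_L], with W_i of shape (p_{i+1}, p_i);
    shifts:  list [v_1, ..., v_L] of shift (displacement) vectors.
    """
    h = weights[0] @ x
    for W, v in zip(weights[1:], shifts):
        h = W @ relu(h - v)   # shifted ReLU sigma_v, then the next affine map
    return h

# Architecture (L, p) with L = 2 hidden layers and widths p = (3, 4, 4, 1).
rng = np.random.default_rng(0)
p = [3, 4, 4, 1]
# All parameters bounded by 1, as assumed in the paper:
Ws = [rng.uniform(-1, 1, size=(p[i + 1], p[i])) for i in range(3)]
vs = [rng.uniform(-1, 1, size=p[i]) for i in (1, 2)]
y = network(rng.uniform(0, 1, size=3), Ws, vs)
print(y)
```

Note that p_0 = 3 plays the role of the input dimension d and p_{L+1} = 1 gives a scalar output, matching the regression setting.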
The purpose of this paper is to consider a framework that encompasses the fundamental characteristics of modern deep network architectures. In particular, we allow a large depth L and a very large number of potential network parameters, without requiring an upper bound on the number of network parameters for the main results. Consequently, this approach covers high-dimensional settings with more parameters than training data. Another characteristic of trained networks is that the learned network parameters are typically not very large; see [7]. In practice, the weights of trained networks often do not differ much from their initialized values, and since all entries of orthogonal matrices are bounded by 1, the weights of trained networks do not become excessively large either. However, existing theoretical results often require the size of the network parameters to tend to infinity. To be more consistent with what is observed in practice, all parameters considered in this paper are bounded by 1. This constraint can easily be incorporated into deep learning algorithms by projecting the network parameters onto the interval [−1, 1] at each iteration.
Let ||W_j||_∞ denote the maximum-entry norm of W_j. We consider the network function space with a given network architecture and all network parameters bounded by 1:

F(L, p) := { f of the form (3) : max_{j=0,...,L} ||W_j||_∞ ∨ |v_j|_∞ ≤ 1 },

where v_0 is a vector with all components equal to 0.
In this work, we model network sparsity by assuming that there are only a few nonzero/active network parameters. If ||W_j||_0 denotes the number of nonzero entries of W_j and || |f|_∞ ||_∞ stands for the sup-norm of the function x ↦ |f(x)|_∞, then the s-sparse networks are given by

F(L, p, s) := F(L, p, s, F) := { f ∈ F(L, p) : ∑_{j=0}^L (||W_j||_0 + |v_j|_0) ≤ s, || |f|_∞ ||_∞ ≤ F },

where F is a constant; the upper bound on the uniform norm of f is often unnecessary and is therefore omitted in the notation. We consider cases in which the number of active network parameters s is very small compared with the total number of parameters in the network.
For any estimator m̂_n that returns a network in the class F(L, p, s, F), the corresponding quantity is defined as

∆_n(m̂_n, m) := E_m[ (1/n) ∑_{i∈Λ_n} (Y_i − m̂_n(X_i))² − inf_{f∈F(L,p,s,F)} (1/n) ∑_{i∈Λ_n} (Y_i − f(X_i))² ].

The sequence ∆_n(m̂_n, m) measures the discrepancy between the expected empirical risk of m̂_n and the global minimum of the empirical risk over the class. The subscript m in E_m indicates that the expectation is taken with respect to the nonparametric regression model generated by the regression function m. Notice that ∆_n(m̂_n, m) ≥ 0, with ∆_n(m̂_n, m) = 0 if m̂_n is an empirical risk minimizer.
Therefore, ∆_n(m̂_n, m) is a critical quantity that, together with the minimax estimation rates, determines the convergence rate of m̂_n.
To evaluate the statistical performance of m̂_n under general assumptions, the mean squared error of the estimator is defined as

R(m̂_n, m) := E_m[ (m̂_n(X) − m(X))² ], (6)

where X has the same distribution as the X_i and is independent of the sample.

Estimation and Theoretical Properties
In order to obtain asymptotic results, we assume throughout this paper that {X_i, i ∈ Λ_n} satisfies the following α-mixing condition: there exists a function ϕ(t) ↓ 0 as t → ∞ with ϕ(0) = 1, such that, whenever Ξ, Ξ′ ⊆ Λ_n are finite sets,

α(B(Ξ), B(Ξ′)) := sup{ |P(A ∩ B) − P(A)P(B)| : A ∈ B(Ξ), B ∈ B(Ξ′) } ≤ ψ(card(Ξ), card(Ξ′)) ϕ(d(Ξ, Ξ′)),

where B(Ξ) denotes the Borel σ-field generated by {X_i, i ∈ Ξ}, card(Ξ) is the cardinality of Ξ, and d(Ξ, Ξ′) := min{ ||i − i′|| : i ∈ Ξ, i′ ∈ Ξ′ } is the distance between Ξ and Ξ′, with ||·|| the Euclidean norm; ψ : N² → R₊ is a symmetric positive function that is nondecreasing in each variable; see [8].
The theoretical performance of neural networks depends on the underlying function class, and a classic approach in nonparametric statistics is to assume that the regression function is β-smooth. In this paper, we assume that the regression function m is a composition of several functions, i.e.,

m = g_q ∘ g_{q−1} ∘ · · · ∘ g_1 ∘ g_0.

We denote the components of g_i by g_i = (g_{ij})_{j=1,...,d_{i+1}}, and we let t_i be the maximal number of variables on which each component g_{ij} depends; thus, each g_{ij} is effectively a function of t_i variables.
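As a toy illustration of this composition structure (our own example, not from the paper), the function below has d_0 = 3 inputs, but each component of the inner map g_0 depends on at most t_0 = 2 variables and the outer map g_1 on t_1 = 2, so the effective dimensions t_i are smaller than d:

```python
import numpy as np

# Inner layer g0: [0,1]^3 -> R^2; each component uses at most t0 = 2 inputs.
def g0(x):
    return np.array([np.sin(np.pi * x[0] * x[1]),  # depends on x1, x2 only
                     (x[1] - x[2]) ** 2])          # depends on x2, x3 only

# Outer layer g1: R^2 -> R, so t1 = 2.
def g1(u):
    return u[0] + np.exp(-u[1])

def m(x):
    """Composite regression function m = g1 ∘ g0."""
    return g1(g0(x))

val = m(np.array([0.2, 0.5, 0.9]))
print(val)
```

Rate results of the Schmidt-Hieber type are driven by the pairs (β_i, t_i) of the component functions rather than by the ambient dimension d, which is why such compositions escape the curse of dimensionality.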
Theorem 1. Consider the nonparametric regression model (2) with d variables and a composite regression function m = g_q ∘ g_{q−1} ∘ · · · ∘ g_1 ∘ g_0 in the class G(q, d, t, β, K). Let m̂_n be an estimator from the function class F(L, (p_i)_{i=0,...,L+1}, s, F) satisfying the following conditions: (1) F ≥ max(K, 1); (2) the depth L is at least of order ln n; (3) nφ_n ≲ min_{i=1,...,L} p_i; (4) s ≍ nφ_n ln n, where φ_n is a positive sequence. Then, there exist constants C and C′, depending only on q, d, t, β, and F, such that if ∆_n(m̂_n, m) ≤ C φ_n L ln² n, then R(m̂_n, m) ≤ C′ φ_n L ln² n, and if ∆_n(m̂_n, m) ≥ C φ_n L ln² n, then R(m̂_n, m) ≤ C′ ∆_n(m̂_n, m). In particular, taking L ≍ ln n and ∆_n(m̂_n, m) ≤ C φ_n ln³ n minimizes φ_n L ln² n and yields R(m̂_n, m) ≤ C′ φ_n ln³ n.
The convergence rate in Theorem 1 depends on φ_n and ∆_n(m̂_n, m). The reasoning below shows that φ_n serves as a lower bound for the minimax estimation risk over this class. For any empirical risk minimizer, for which by definition the ∆_n term vanishes, the following corollary holds.
Corollary 1. Let m̂_n be an empirical risk minimizer under the same conditions as in Theorem 1. Then there exists a constant C′, depending only on q, d, t, β, and F, such that R(m̂_n, m) ≤ C′ φ_n L ln² n.
Condition (1) in Theorem 1 is very mild and states only that the network functions should have at least the same sup-norm bound as the regression function. The other assumptions in Theorem 1 make clear that there is considerable flexibility in choosing a good network architecture, as long as the number of active parameters s is of the right order.
In a fully connected network, the number of network parameters is ∑_{i=0}^L p_i p_{i+1}. Theorem 1 therefore requires a sparse network. More precisely, the network must have at least ∑_{i=1}^L p_i − s completely inactive nodes, i.e., nodes whose incoming signals are all zero. Condition (4) chooses s ≍ nφ_n ln n to balance the squared bias and the variance. From the proof of this theorem (Appendix B), convergence rates for other orders of s can also be derived.
Deep learning excels over other methods only in the large-sample regime. This suggests that the method adapts to underlying structure in the data, which can produce fast convergence rates but with larger constants or remainder terms, leading to comparatively poor performance in small-sample scenarios.
The proof of the risk bounds in Theorem 1 is based on the following oracle-type inequality.
Theorem 2. Consider the d-dimensional nonparametric regression model given by Equation (2) with an unknown regression function m satisfying |m|_∞ ≤ F for some F ≥ 1, and let m̂_n be an arbitrary estimator taking values in the class F(L, p, s, F). Then, for any ε ∈ (0, 1], there exists a constant C_ε, depending only on ε, such that R(m̂_n, m) is bounded, up to the factor (1 + ε), by the sum of the approximation error of m over F(L, p, s, F), the term ∆_n(m̂_n, m), and a complexity penalty that grows with the number of active parameters s and the depth L and decays with the sample size n. In the context of oracle-type inequalities, an increase in the number of layers can worsen the upper bound on the risk. In practice, it has also been observed that too many layers can degrade performance; we refer to Section 4.4 in He et al. [9] and to He and Sun [10] for more details.
The proof relies on specific properties of the ReLU activation function that other activation functions lack. The first is its projection property,

σ ∘ σ = σ,

which concerns compositions of the ReLU activation. The foundation of the approximation theory lies in constructing small networks that perform simple tasks, and these subnetworks need not all have the same depth. To combine them, it is necessary to synchronize the network depths by adding hidden layers that do not alter the output. This can be achieved by choosing identity weight matrices (assuming equal widths for consecutive layers) and exploiting the projection property σ ∘ σ = σ. This property is useful not only in theory but also in practice, as it greatly aids passing a result to deeper layers through skip connections.
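The projection property is easy to verify numerically: since the activations leaving a ReLU layer are nonnegative, an inserted hidden layer with identity weights and zero shifts changes nothing. A minimal sketch with illustrative weights:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Projection property of ReLU: sigma ∘ sigma = sigma.
x = np.linspace(-2, 2, 9)
assert np.allclose(relu(relu(x)), relu(x))

# Depth synchronization: pad a network with a "do-nothing" hidden layer.
W0 = np.array([[0.5, -0.3], [0.2, 0.8]])
W1 = np.array([[1.0, -1.0]])

def shallow(z):
    return W1 @ relu(W0 @ z)

def padded(z):
    # Extra layer with identity weights and zero shift. Since relu(W0 @ z) >= 0,
    # applying relu again (sigma ∘ sigma = sigma) leaves it unchanged.
    h = relu(W0 @ z)
    h = relu(np.eye(2) @ h)   # inserted identity hidden layer
    return W1 @ h

z = np.array([0.7, -0.4])
assert np.allclose(shallow(z), padded(z))
print(shallow(z), padded(z))
```

This is exactly the mechanism used to equalize the depths of subnetworks before combining them in parallel.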
Next, we prove that φ_n serves as a lower bound for the minimax estimation risk over the class G(q, d, t, β, K) with t_i ≤ min(d_0, . . ., d_{i−1}). This means that, in the composition of functions, no additional dimensions are added at deeper abstraction layers; in particular, it rules out the case where t_i exceeds the input dimension d_0.
Theorem 3. Consider the nonparametric regression model (2), in which X_i is drawn from a distribution with a Lebesgue density on [0, 1]^d bounded above and below by positive constants. For any nonnegative integer q, arbitrary dimension vectors d and t with t_i ≤ min(d_0, . . ., d_{i−1}) for all i, any smoothness vector β, and all sufficiently large constants K > 0, there exists a positive constant c such that

inf_{m̂_n} sup_{m ∈ G(q,d,t,β,K)} R(m̂_n, m) ≥ c φ_n,

where the infimum is taken over all estimators m̂_n. Combining this minimax lower bound with the oracle-type inequality readily yields the following result. Lemma 1. Given β, K > 0 and d ∈ N, there exist constants c_1, c_2, depending only on β, K, and d, such that the corresponding approximation bound holds for ε ≤ c_2; then, for any width vector p with p_0 = d and p_{L+1} = 1, the stated bound on the network class follows.

Suboptimality of Wavelet Series Estimation
In this section, we show that wavelet series estimators are unable to exploit the underlying composition structure of the regression function and, in some setups, achieve much slower convergence rates. Wavelet estimation is susceptible to the curse of dimensionality, whereas neural networks can achieve faster convergence rates.
To construct a counterexample, it suffices to consider the nonparametric regression model (2). Projecting the observations onto a wavelet basis yields the empirical wavelet coefficients, which are unbiased estimates of the true wavelet coefficients. We study estimators that reconstruct the regression function from the empirical coefficients indexed by a subset I ⊂ Λ × · · · × Λ. The wavelet ψ ∈ L²(R) has compact support; without loss of generality, we assume that ψ vanishes outside [0, 2^q] for some integer q > 0.
Lemma 2. For any integer q > 0, ν := ⌈log_2 d⌉ + 1, any 0 < α ≤ 1, and any K > 0, there exists a nonzero constant c(ψ, d), depending only on d and the properties of the wavelet function ψ, such that for any j we can find a function of the form f_{j,α}(x) = h_{j,α}(x_1 + . . . + x_d).
Theorem 4. If m̂_n is a wavelet estimator with compactly supported ψ and an arbitrary index set I, then, for any 0 < α ≤ 1 and any Hölder radius K > 0, the risk of m̂_n over the corresponding Hölder class is bounded below by a constant multiple of n^{−2α/(2α+d)}. As a result, the convergence rate of the wavelet series estimator cannot be faster than n^{−2α/(2α+d)}, which becomes very slow when d is large. Therefore, wavelet estimation suffers from the curse of dimensionality, while neural networks can achieve rapid convergence.

Simulation Experiments
In this section, we conduct a comparative study based on 100 repeated simulations, evaluating the mean squared error of the estimates obtained by both the local linear regression method and a deep neural network with the ReLU activation function. We consider the following models.
Model 1: Y_i = m(X_i) + e_i, where e_i is a noise variable following the standard normal distribution.
We construct a neural network with a depth of 3 and a width of 32, in which the first two layers are fully connected layers and the output layer uses the ReLU activation function.
We consider four scenarios with the same sample sizes but different dimensions, where dim denotes the dimension and n the sample size. For each scenario, the mean squared error (MSE) is computed to compare the performance of the local linear regression method and the deep neural network method. The MSE in this study is defined as

MSE = (1/n) ∑_i (m̂_n(X_i) − m(X_i))².

Table 1 presents the estimates for different values of n. In this table, NE denotes the MSE of the nonparametric (local linear) estimation method, and DNN denotes the MSE of the deep neural network method; the closer the MSE is to 0, the higher the estimation accuracy. From Table 1, we observe that, for a fixed dimension, the DNN error tends towards 0 as the sample size increases. For the same sample size, the DNN error is significantly smaller than the NE error, and as the dimension increases, the superiority of the neural network method over the local linear regression method becomes more pronounced. Therefore, the neural network method achieves much higher estimation accuracy, especially for large sample sizes and high dimensions.
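The local linear baseline and the MSE criterion above can be sketched as follows. This is a minimal NumPy sketch with an illustrative two-dimensional trend function, a Gaussian product kernel, and a bandwidth of our own choosing; none of these specifics are taken from the paper's Model 1:

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear estimate of m(x0) from data (X, Y) with bandwidth h."""
    # Gaussian product-kernel weights around x0
    w = np.exp(-0.5 * np.sum(((X - x0) / h) ** 2, axis=1))
    # Local design matrix: intercept + centered covariates
    Z = np.hstack([np.ones((len(X), 1)), X - x0])
    WZ = Z * w[:, None]
    # Weighted least squares: (Z'WZ) beta = Z'WY; intercept = fit at x0
    beta = np.linalg.lstsq(WZ.T @ Z, WZ.T @ Y, rcond=None)[0]
    return beta[0]

rng = np.random.default_rng(1)
n, d = 500, 2
X = rng.uniform(0, 1, size=(n, d))
trend = lambda x: np.sin(np.pi * x[:, 0]) + x[:, 1] ** 2   # illustrative trend
Y = trend(X) + 0.1 * rng.standard_normal(n)

# MSE over the design points, as in the simulation section
mhat = np.array([local_linear(x0, X, Y, h=0.1) for x0 in X])
mse = np.mean((mhat - trend(X)) ** 2)
print(mse)
```

A neural network estimator would be evaluated with exactly the same MSE formula, replacing `mhat` by the fitted network values.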
As shown in Figure 1, where the x-axis represents the sample size n and the y-axis the mean squared error (MSE), the fit of the neural network surpasses that of the local linear regression method as the dimension increases. In higher dimensions, the MSE of the neural network method is almost zero, which is attributed to deep neural networks avoiding the curse of dimensionality. This demonstrates that the convergence rate of the deep neural network is superior to that of the local linear regression method and approaches the optimal rate.
Next, we consider high-dimensional spatial models with dependence structures to compare the MSEs of the two methods.
Model 2: Y_{i,j} = m(X_{i,j}) + R_{i,j}, where i = 1, . . ., n_1, j = 1, . . ., n_2, X_{i,j} follows a zero-mean second-order stationary process, and R_{i,j} follows the standard normal distribution. Similarly to Cressie and Wikle [12], the high-dimensional spatial processes X_{i,j} are generated using spectral methods.
In this construction, w(i, k), i = 1, 2, follows the standard normal distribution and is independent of r(k), k = 1, . . ., M, where the r(k) are independently and identically distributed from a uniform distribution on [−π, π]. As M → ∞, X_{i,j} converges to a Gaussian random process. We consider the case with dimension 5 and sample sizes 200, 600, 1000, 1400, 1800, and 2200. The network structure is the same as in Model 1.
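The spectral construction described above can be sketched as follows. This is our own reading of the Cressie-Wikle-style recipe (uniform random frequencies r(k), Gaussian amplitudes w(1, k) and w(2, k)); the number of spectral terms M and the normalization are illustrative assumptions:

```python
import numpy as np

def spectral_field(n1, n2, M=50, seed=0):
    """Generate a stationary spatial process on an n1 x n2 lattice via a
    random cosine/sine expansion; approaches a Gaussian process as M grows."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(-np.pi, np.pi, size=(M, 2))   # random frequencies r(k)
    w = rng.standard_normal(size=(2, M))          # Gaussian amplitudes w(i, k)
    i, j = np.meshgrid(np.arange(n1), np.arange(n2), indexing="ij")
    X = np.zeros((n1, n2))
    for k in range(M):
        phase = r[k, 0] * i + r[k, 1] * j
        X += w[0, k] * np.cos(phase) + w[1, k] * np.sin(phase)
    return X / np.sqrt(M)   # normalize so the pointwise variance is O(1)

X = spectral_field(20, 30)
print(X.shape)
```

Because neighboring lattice sites share the same random frequencies and amplitudes, the resulting field is spatially dependent, which is the feature Model 2 is designed to stress.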
As shown in Table 2, the convergence of the high-dimensional spatial model with dependence is slower than that of the model without dependence. However, compared with the local linear regression method, the MSE values of the neural network are much smaller, indicating better convergence performance. Figure 2, where the x-axis represents the sample size n and the y-axis the mean squared error (MSE), shows that for large samples and high-dimensional spatial models, the neural network converges better than the local linear regression method.

Case Study
To compare the consistency of the local linear regression method and of the deep neural network for high-dimensional spatial models, we consider the relationship between air pollution and respiratory diseases in the New Territories East of Hong Kong from 1 January 2000 to 15 January 2001, as studied by Wang et al. [13]. The dataset consists of 821 observations. We mainly consider the air pollution index X_1; five pollutants, namely sulfur dioxide (μg/m³) X_2, inhalable particulate matter (μg/m³) X_3, nitrogen compounds (μg/m³) X_4, nitrogen dioxide (μg/m³) X_5, and ozone (μg/m³) X_6; and two environmental factors, temperature (°C) X_7 and relative humidity (%) X_8. In this section, we examine the relationship between the levels of chemical pollutants in the New Territories East of Hong Kong and the daily hospital admissions for respiratory diseases (Y). The parameter settings are the same as in the numerical simulation section.
After nondimensionalization and standardization, we use 397 data points for the case study. Of these, 80% are used to train the model, and the remaining 20% are used to evaluate the trained model. The MSE values are shown in Table 3. The MSE values of the deep neural network method are much closer to 0, indicating that in this real-world case its mean squared error is much smaller than that of the local linear regression method; the deep neural network therefore shows better convergence performance. Figure 3 presents a visual comparison of the MSE values of the two methods, where the x-axis represents the sample size n and the y-axis the mean squared error (MSE). The graph shows that the deep neural network method exhibits a faster convergence rate than the local linear regression method.

Conclusions
In this study, we employ neural networks with the ReLU activation function for nonparametric estimation in high-dimensional spatial models. By constructing suitable network architectures, we estimate the unknown trend functions and prove the consistency of the estimators, while also comparing the deep neural network approach with traditional nonparametric methods.
The focus is on high-dimensional spatial models with unknown error distributions. Taking the spatial dependence and heterogeneity of the models into account, a deep neural network with the ReLU activation function is used to estimate the unknown trend functions. Under general assumptions, the consistency of the estimators is established, and bounds for the mean squared error (MSE) are provided. The estimators attain a convergence rate that depends on the sample size but not on the dimensionality d, thereby avoiding the curse of dimensionality; moreover, the proposed estimators achieve convergence speeds close to optimal.
In high-dimensional settings with spatial dependence and large sample sizes, the deep neural network method outperforms traditional nonparametric estimation methods.
Appendix A. Operations on Network Function Classes

Additional layer/depth synchronization: To synchronize the numbers of hidden layers of two networks, additional layers with identity weight matrices can be added, leaving the output unchanged. (A1)

Parallelization: Let f and g be two networks with the same number of hidden layers and the same input dimension, i.e., f ∈ F(L, p) and g ∈ F(L, p′) with p_0 = p′_0. The parallelized network (f, g) computes f and g simultaneously and belongs to the joint network class F(L, (p_0, p_1 + p′_1, . . ., p_{L+1} + p′_{L+1})).
Removing inactive nodes: We have

F(L, (p_0, p_1, . . ., p_L, p_{L+1}), s) = F(L, (p_0, p_1 ∧ s, p_2 ∧ s, . . ., p_L ∧ s, p_{L+1}), s). (A2)

To see this, let f ∈ F(L, p, s). If all entries of the j-th column of W_i are zero, we can remove this column together with the j-th row of W_{i−1} and the j-th entry of v_i without changing the function; this shows that f ∈ F(L, (p_0, . . ., p_{i−1}, p_i − 1, p_{i+1}, . . ., p_{L+1}), s). Since there are only s active parameters, this removal can be iterated at least p_i − s times for each i = 1, . . ., L, proving that f ∈ F(L, (p_0, p_1 ∧ s, p_2 ∧ s, . . ., p_L ∧ s, p_{L+1}), s).
In this paper, we often use the following fact: a fully connected network in F(L, p) has ∑_{ℓ=0}^L p_ℓ p_{ℓ+1} weight-matrix parameters and ∑_{ℓ=1}^L p_ℓ parameters from the bias vectors. Therefore, the total number of parameters is

∑_{ℓ=0}^L p_ℓ p_{ℓ+1} + ∑_{ℓ=1}^L p_ℓ. (A3)
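This count is easy to verify directly from a width vector p = (p_0, . . ., p_{L+1}); a quick sketch:

```python
def total_parameters(p):
    """Total parameter count of a fully connected network with widths p:
    sum_{l=0}^{L} p_l * p_{l+1} weight entries plus sum_{l=1}^{L} p_l biases."""
    L = len(p) - 2
    weights = sum(p[l] * p[l + 1] for l in range(L + 1))
    biases = sum(p[l] for l in range(1, L + 1))
    return weights + biases

# Example widths (illustrative): d = 5 inputs, two hidden layers of 32, scalar output.
print(total_parameters([5, 32, 32, 1]))
```

Comparing this total with the sparsity level s makes concrete how small the set of active parameters in F(L, p, s) is relative to the fully connected count.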

Appendix B. Approximation by Polynomial Neural Networks
We construct a network, with all parameters bounded by 1, that approximately computes the product xy for given inputs x and y. For a positive integer k, let T_k : [0, 2^{2−2k}] → [0, 2^{−2k}] denote the corresponding tent-shaped map, and set R_k := T_k ∘ T_{k−1} ∘ · · · ∘ T_1.
Lemma A1. For any positive integer m, the sum ∑_{k=1}^m R_k interpolates the function g(x) = x(1 − x) at the grid points ℓ2^{−m} and approximates it uniformly on [0, 1] with an explicitly bounded error. This lemma can be seen as a variant of Lemma 2.4 in Telgarsky [14] and Proposition 2 in Yarotsky [15]. Compared with existing results, it allows us to construct networks with parameters bounded by 1 and provides an explicit bound on the approximation error.

Proof.
Step 1: We prove by induction that R_k is a triangular wave. More precisely, R_k is piecewise linear on the intervals [ℓ/2^k, (ℓ + 1)/2^k] for integers ℓ, with R_k(ℓ/2^k) = 2^{−2k} if ℓ is odd and R_k(ℓ/2^k) = 0 if ℓ is even.
When k = 1, the equality R_1 = T_1 obviously holds. Assume that the statement holds for k, and write ℓ ≡ r mod 4 when ℓ − r is divisible by 4. Consider x in the interval [ℓ/2^{k+1}, (ℓ + 1)/2^{k+1}] and treat the cases ℓ ≡ 0, 1, 2, 3 mod 4 separately; in each case, the claimed endpoint values follow. The statement thus holds for k + 1, and the induction is complete.
Thus, ∑_{k=1}^m R_k interpolates the function g at the points ℓ2^{−m} and is linear on each interval [ℓ/2^m, (ℓ + 1)/2^m]. Since g is Lipschitz with constant 1, for any x there exists an ℓ with |x − ℓ2^{−m}| ≤ 2^{−m}, from which the error bound follows, proving the lemma. Now let g(x) = x(1 − x). As proven above, to construct a network that takes inputs x and y and approximates the product xy, we use polarization-type identities.
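The triangular-wave construction can be checked numerically. The following sketch is in the spirit of Yarotsky's result [15]: we use the standard tent map, approximate x² by partial sums of rescaled triangular waves, and recover the product xy by polarization. The exact maps T_k used in the paper differ from this by rescaling, so the code is an illustration of the idea rather than the paper's construction:

```python
import numpy as np

def tent(x):
    """Tent map g: [0,1] -> [0,1]; its k-fold composition is a triangular wave."""
    return np.where(x < 0.5, 2 * x, 2 * (1 - x))

def approx_square(x, m):
    """Yarotsky-style approximation of x^2 on [0,1]:
    x^2 ≈ x - sum_{k=1}^m g^{∘k}(x) / 4^k, with uniform error <= 4^{-m-1}."""
    x = np.asarray(x, dtype=float)
    s, g = x.copy(), x.copy()
    for k in range(1, m + 1):
        g = tent(g)
        s = s - g / 4.0 ** k
    return s

def approx_mult(x, y, m):
    """Product via polarization: xy = (4*((x+y)/2)^2 - x^2 - y^2) / 2,
    with each square replaced by its triangular-wave approximation."""
    return (4 * approx_square((x + y) / 2, m)
            - approx_square(x, m) - approx_square(y, m)) / 2

grid = np.linspace(0, 1, 101)
err = np.max(np.abs(approx_square(grid, 6) - grid ** 2))
print(err)   # decays geometrically in m
```

Each additional term halves the dyadic grid spacing, so the error decays like 4^{-m}: depth buys approximation accuracy exponentially fast, which is the heart of the ReLU approximation theory used here.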
Here, T^+(x) := (x/2)_+. Step 1: We prove the existence of a network N_m with m hidden layers and width vector (3, 3, . . ., 3, 1) that computes this function for all u ∈ [0, 1], as shown in Figure A1; it is worth noting that all parameters of this network are bounded by 1.
Step 2: We prove the existence of a network with m hidden layers that computes the required functions. Given the input (x, y), the first layer of the network computes six intermediate values. Applying the network N_m to the first three and to the last three of these values, we obtain a network with m + 1 hidden layers and width vector (2, 6, . . ., 6, 2); applying a further two-hidden-layer network to its output, the composite network Mult_m(x, y) has m + 4 hidden layers, and its output always lies in the interval [0, 1]. By Equation (A4) and Lemma A1, the approximation error for the product xy is explicitly bounded. For all input pairs (0, y) and (x, 0), the output in Equation (A5) is zero.
Proof. Let q := ⌈log_2(r)⌉. We construct the network Mult^r_m and perform the calculations of the first hidden layer.
If a, b, c, d ∈ [0, 1], then Lemma A2 and the triangle inequality yield the corresponding bound; by iteration and induction, the general bound follows. By Lemma A2 and the construction above, if one of the components of x is 0, then Mult^r_m(x) = 0. We construct a sufficiently large network to approximate all monomials x^α for multi-indices of nonnegative integers α_i up to a specified degree. We use standard multi-index notation: x^α := x_1^{α_1} · · · x_r^{α_r} for x ∈ [0, 1]^r, where α = (α_1, . . ., α_r) and |α| := ∑_{ℓ=1}^r α_ℓ denotes the degree of the monomial.
Following the classical local Taylor approximation, previously used for network approximation by Yarotsky [15], we define, for a vector a ∈ [0, 1]^r, the local Taylor polynomial of f at a. According to Taylor's theorem for multivariable functions, with an appropriate ξ ∈ [0, 1], we obtain the expansion (A7), which can be expressed as a linear combination of monomials.
The number of elements in this set is (M + 1)^r. Let x^ℓ denote the elements of X, and define the corresponding hat-function expansion. Proof. For all x = (x_1, . . ., x_r) ∈ [0, 1]^r, we use mathematical induction, starting with M = 1: after summing, the left-hand side of (A11) agrees with the middle term, so the identity holds when M = 1.
Proof. The first hidden layer uses 2r(M + 1) units and 4r(M + 1) nonzero parameters to compute the functions (x_j − ℓ/M)_+ and (ℓ/M − x_j)_+. The second hidden layer uses r(M + 1) units and 3r(M + 1) nonzero parameters to compute the functions (1/M − |x_j − ℓ/M|)_+. These functions take values in the interval [0, 1], and the result holds when r = 1. For r > 1, we combine the obtained network with the network approximating the product ∏_{j=1}^r (1/M − |x_j − ℓ_j/M|)_+. According to Lemma A3, there exists a network Mult^r_m in the class F((m + 5)⌈log_2 r⌉, (r, 6r, 6r, . . ., 6r, 1)).
This network computes ∏_{j=1}^r (1/M − |x_j − x_j^ℓ|)_+ with an error bounded by r² 2^{−m}. From Equation (A3), the number of nonzero parameters of a Mult^r_m network is bounded accordingly, and since (M + 1)^r parallel instances of these networks are used, each hidden layer requires 6(M + 1)^r units and 42r²(M + 1)^r(1 + (m + 5)⌈log_2 r⌉) nonzero parameters for the multiplication operations. Adding the 7r(M + 1) nonzero parameters of the first two layers, the total number of nonzero parameters is bounded by 49r²(M + 1)^r(1 + (m + 5)⌈log_2 r⌉).
According to Lemma A3, if one of the components of x is zero, then Mult^r_m(x) = 0. This implies that, for any x^ℓ ∈ D(M), the support of the function x ↦ (Hat^r(x))_{x^ℓ} is contained in the support of the function x ↦ ∏_{j=1}^r (1/M − |x_j − x_j^ℓ|)_+.
Theorem A1. For any function f ∈ C^β_r([0, 1]^r, K) and any integers m ≥ 1 and N ≥ (β + 1)^r ∨ (K + 1)e^r, there exists a network f̃ ∈ F(L, (r, 6(r + ⌈β⌉)N, . . ., 6(r + ⌈β⌉)N, 1), s, ∞) whose number of parameters satisfies s ≤ 141(r + β + 1)^{3+r} N(m + 6).

Proof. In this proof, all constructed networks are of the form F(L, p, s) = F(L, p, s, ∞), where F = ∞. Let M be the largest integer such that (M + 1)^r ≤ N, and define L* := (m + 5)⌈log_2(β ∨ r)⌉. With the help of Equations (A9) and (A10) and Lemma A4, we can add a hidden layer to the network Mon^r_{m,β}, resulting in a new network Q1 that satisfies, for any x ∈ [0, 1]^r, the corresponding approximation bound, where B := 2Ke^r and e is the base of the natural logarithm. According to Equation (A3), the number of nonzero parameters of Q1 is bounded accordingly. According to Lemma A6, the network Hat^r computes the product ∏_{j=1}^r (1/M − |x_j − x_j^ℓ|)_+ with an error bounded by r² 2^{−m} and requires at most 49r²N(1 + L*) active parameters. Now consider the parallel network (Q1, Hat^r). By the definition of C_{r,β} and the assumption on N, we observe that C_{r,β} ≤ (β + 1)^r ≤ N. According to Lemma A6, the networks Q1 and Hat^r can be embedded into a joint network (Q1, Hat^r) with 2 + L* hidden layers, width vector (r, 6(r + ⌈β⌉)N, . . ., 6(r + ⌈β⌉)N, 2(M + 1)^r), and all parameters bounded by 1. Using C_{r,β} ∨ (M + 1)^r ≤ N, the number of nonzero parameters of the combined network can be bounded, where for the last inequality we use C_{r,β} ≤ (β + 1)^r, the definition of L*, and an elementary inequality valid for all x ≥ 1. Next, we pair the outputs of Q1 and Hat^r corresponding to the x^ℓ terms and apply the network Mult_m of Lemma A2 to each of the (M + 1)^r pairs; in the final layer, all terms are summed. According to Lemma A2, this requires at most 42(m + 5)(M + 1)^r + (M + 1)^r ≤ 43(m + 5)N active parameters for the (M + 1)^r multiplications in total. Using Lemmas A2 and A6, Equation (A12), and the triangle inequality, we can construct a network Q2 such that, for any x ∈ [0, 1]^r, the approximation error is controlled, where the first inequality follows from the fact that the support of (Hat^r(x))_{x^ℓ} is contained in the support of ∏_{j=1}^r (1/M − |x_j − x_j^ℓ|)_+, as stated in Lemma A6. By Equation (A3), the network Q2 has at most

98(r + β + 1)^{3+r} N(m + 5) + 43(m + 5)N ≤ 141(r + β + 1)^{3+r} N(m + 5) (A15)

nonzero parameters. To obtain the network reconstruction of the function f, it remains to apply scaling and shifting to the output, primarily because of the bounded parameter weights of the network. Recall that B = 2Ke^r. The network x ↦ BM^r x belongs to the class F(3, (1, M^r, 1, ⌈2Ke^r⌉, 1)), where all shift vectors v_j are zero and all entries of the weight matrices W_j equal 1. Since N ≥ (K + 1)e^r, the number of parameters of this network is bounded by 2M^r + 2⌈2Ke^r⌉ ≤ 6N. This implies the existence of a network in the class F(4, (1, 2, 2M^r, 2, 2⌈2Ke^r⌉, 1)) that computes a ↦ BM^r(a − c), where c := 1/(2M^r): it computes (a − c)_+ and (c − a)_+ in the first hidden layer and then applies the network x ↦ BM^r x to these two units; in the output layer, the first value is subtracted from the second. This requires at most 6 + 12N active parameters.
Due to Equations (A11) and (A14), there exists a network Q³ in the class stated above such that the claimed approximation bound holds for all x ∈ [0,1]^r. Under the condition of Equation (A15), the number of nonzero parameters of Q³ is bounded as in the statement of the theorem. Thus, the result is proven.
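Theorem A1 combines local polynomial approximation on the grid with the hat-function basis. A one-dimensional numerical sketch of this mechanism (illustrative only, not the network construction itself; the test function sin(2πx) and the pure hat-basis interpolation scheme are assumptions made here) exhibits the expected N^{−β/r}-type error decay for β = 2, r = 1:

```python
import numpy as np

def interp_error(f, M, n_test=2000):
    """Sup-norm error of the hat-basis reconstruction sum_a f(a) * M * (1/M - |x - a|)_+,
    i.e. piecewise-linear interpolation on the grid {0, 1/M, ..., 1}."""
    grid = np.arange(M + 1) / M
    x = np.linspace(0.0, 1.0, n_test)
    # M * (1/M - |x - a|)_+ are the classic piecewise-linear hat basis functions
    approx = sum(f(a) * M * np.maximum(1.0 / M - np.abs(x - a), 0.0) for a in grid)
    return float(np.max(np.abs(approx - f(x))))

f = lambda x: np.sin(2 * np.pi * x)
errs = [interp_error(f, M) for M in (4, 8, 16, 32)]
# Doubling M should roughly quarter the error for this C^2 function (rate M^{-2})
ratios = [errs[i] / errs[i + 1] for i in range(3)]
print(errs, ratios)
```

The additional 2^{−m} term in the theorem comes from the approximate ReLU multiplications, which the interpolation sketch above does not model.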
Based on Theorem A1, we can now construct a network that approximates f = g_q ∘ ⋯ ∘ g_0. In the first step, we show that f can always be represented as a composition of functions defined on hypercubes [0,1]^{t_i}. As in the previous theorem, let K_i denote the radii of the corresponding Hölder balls, and assume K_i ≥ 1 for i = 1, …, q−1. Define the rescaled component functions h_0, …, h_q accordingly. From the definition of the Hölder ball C^β_r(D, K), we can see that h_{0j} takes values in the interval [0,1].
where, for i = 1, …, q−1, we have h_{qj} ∈ C^{β_q}_{t_q}([0,1]^{t_q}, K_q(2K_{q−1})^{β_q}). Without loss of generality, we can always assume that the radius of each Hölder ball is at least 1, i.e., K_i ≥ 1.
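The rescaling that maps each component into [0,1] can be made concrete. The sketch below uses one standard choice, h_0 = g_0/(2K_0) + 1/2 with the inverse map absorbed into the next component; the specific functions g_0, g_1 and the constant K_0 are hypothetical and serve only to illustrate why the composition is unchanged.

```python
import numpy as np

# Hypothetical two-stage composition f = g1 ∘ g0 with |g0| <= K0 on [0, 1].
K0 = 3.0
g0 = lambda x: K0 * np.sin(x)          # maps [0, 1] into [-K0, K0]
g1 = lambda y: y ** 2

h0 = lambda x: g0(x) / (2 * K0) + 0.5  # rescaled version, maps into [0, 1]
h1 = lambda u: g1(2 * K0 * u - K0)     # undoes the rescaling before applying g1

x = np.linspace(0, 1, 101)
assert np.all((h0(x) >= 0) & (h0(x) <= 1))   # h0 takes values in [0, 1]
assert np.allclose(h1(h0(x)), g1(g0(x)))     # composition is unchanged
```

The affine rescaling costs only a constant factor in the Hölder radius, which is exactly the factor K_q(2K_{q−1})^{β_q} appearing above.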

Lemma A7. Let h_{ij} be as defined above, with K_i ≥ 1. Then, for any function h_i = (h_{ij})_j^T with components h_{ij} as above, the stated Hölder bound holds.

Proof. If Q_i denotes an upper bound on the Hölder seminorm of h_{ij} for j = 1, …, d_{i+1}, then, by the triangle inequality, we obtain the corresponding bound for h_i. Combining this with the inequality (y + z)^α ≤ y^α + z^α, which holds for all y, z ≥ 0 and all α ∈ [0,1], the lemma is proven.
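The elementary inequality (y + z)^α ≤ y^α + z^α used here follows from the concavity of t ↦ t^α on [0, ∞) together with 0^α = 0, which makes the map subadditive. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 10, 10_000)
z = rng.uniform(0, 10, 10_000)
alpha = rng.uniform(0, 1, 10_000)

# Subadditivity of t -> t^alpha on [0, inf) for alpha in [0, 1]
lhs = (y + z) ** alpha
rhs = y ** alpha + z ** alpha
assert np.all(lhs <= rhs + 1e-12)
```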

Proof of Theorem 1.
Here, all n are assumed to be sufficiently large. Throughout the entire proof, C′ is a constant depending only on (q, d, t, β, F), whose value may change from line to line. Combining Theorem 2 with the assumed bounds on the depth L and the network sparsity s, for n ≥ 3 we obtain the two-sided bound in Equation (A17), where, for the lower bound, we set ε = 1/2, and for the upper bound, we set ε = 1. We take C = 8C′; then, when Δ_n(m̂_n, m) ≥ Cφ_n L ln²n, we have (1/8)Δ_n(m̂_n, m) ≥ C′φ_n L ln²n. Substituting this into the left-hand side of Equation (A17), the lower bound in Equation (8) is established.
To obtain the upper bounds in Equations (7) and (8), it is necessary to control the approximation error. For this purpose, the regression function m is rewritten as in Equation (A16), i.e., m = h_q ∘ ⋯ ∘ h_0, where h_i = (h_{ij})_j^T and each h_{ij} is defined on [0,1]^{t_i} and maps into [0,1] for every i < q.
Here, we apply Theorem A1 to each function h_{ij} separately. Let m = ⌈log₂ n⌉ and consider the following.
where Q_i is the upper bound on the Hölder norm of h_{ij}. If i < q, two additional layers computing 1 − (1 − x)_+ are applied to the output, requiring four additional parameters. Denote the resulting network by ĥ_{ij}. According to the construction rules in Appendix A, the composed network can be realized in the class given in Equation (A20), where E := 3(q−1) + ∑_{i=0}^{q} L_i. By inspection, there exists an A_n bounded by n such that the required width condition holds. For all sufficiently large n, using the inequality ⌈x⌉ < x + 1, we have E ≤ ∑_{i=0}^{q}(log₂ 4 + log₂(t_i ∨ β_i)) log₂ n ≤ L, according to Equation (A1), and for sufficiently large n, the space defined in Equation (A20) can be embedded into F(L, p, s), where (L, p, s) satisfies the assumptions of the theorem. We choose N = c · max_{i=0,…,q} n^{t_i/(2β_i^* + t_i)} with a sufficiently small constant c > 0, depending only on (q, d, t, β). Combining Theorem A1 with Equations (A18) and (A19), we obtain the infimum bound on the approximation error. For the approximation error in Equation (A17), we need to find a network function bounded in sup-norm by F. According to the previous inequalities, there exists a sequence of functions (m̃_n)_n such that, for all sufficiently large n, m̃_n ∈ F(L, p, s) and the stated bound holds, where the last inequality is based on Assumption (1). Additionally, m*_n ∈ F(L, p, s, F). Writing m*_n − m = (m*_n − m̃_n) + (m̃_n − m), we see that if the infimum is taken over the smaller space F(L, p, s, F), then Equation (A21) still holds. Combining this with the upper bound in Equation (A17), we obtain R(m̂_n, m) ≤ C′φ_n L ln²n when Δ_n(m̂_n, m) ≤ Cφ_n L ln²n, and R(m̂_n, m) ≤ C′Δ_n(m̂_n, m) when Δ_n(m̂_n, m) ≥ Cφ_n L ln²n; therefore, the upper bounds in Equations (7) and (8) hold for any constant C > 0. This completes the proof.
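The choice of N balances approximation and estimation error and yields the rate φ_n = max_i n^{−2β_i^*/(2β_i^* + t_i)}. The helper below computes the dominating exponent; the definition of the effective smoothness β_i^* = β_i ∏_{ℓ>i}(β_ℓ ∧ 1) is the standard one for compositions and is taken as an assumption here, since the corresponding display is not reproduced above.

```python
import numpy as np

def rate_exponent(betas, ts):
    """Exponent e such that phi_n = n^{-e}, i.e. the minimum over i of
    2 beta_i* / (2 beta_i* + t_i), with beta_i* = beta_i * prod_{l>i} min(beta_l, 1)."""
    q = len(betas) - 1
    exps = []
    for i in range(q + 1):
        beta_star = betas[i] * np.prod([min(betas[l], 1.0) for l in range(i + 1, q + 1)])
        exps.append(2 * beta_star / (2 * beta_star + ts[i]))
    return min(exps)  # the slowest (smallest) exponent dominates the rate

# Example: q = 1 with (beta_0, t_0) = (2, 3) and (beta_1, t_1) = (1.5, 1)
e = rate_exponent([2.0, 1.5], [3, 1])
print(e)
```

In this example the inner layer dominates: β_0^* = 2 · min(1.5, 1) = 2, giving the exponent 4/7 rather than the outer layer's 3/4.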
We begin by utilizing several oracle inequalities for least squares estimators, as presented in Györfi et al. [16–20]. However, these inequalities assume bounded response variables, an assumption violated in the nonparametric regression model with Gaussian measurement noise. Additionally, we provide a lower bound for the risk and give a proof that can easily be generalized to any noise distribution. Let N(δ, F, ‖·‖_∞) be the covering number, i.e., the minimal number of ‖·‖_∞-balls of radius δ needed to cover F (the centers need not lie in F).
Lemma A8. We consider the nonparametric regression model in d variables given by Equation (2) with an unknown regression function m. Let m̂ be an arbitrary estimator taking values in F. We define the relevant quantities for F ≥ 1, assuming that the functions in {m} ∪ F are bounded in sup-norm by F. For any estimator m̂, we introduce the empirical risk R_n(m̂, m) := E[‖m̂ − m‖²_n].
Step 1: We show that the upper bound holds under the restriction ln N_n ≤ n. Since R(m̂, m) ≤ 4F², the upper bound holds trivially when ln N_n ≥ n. In this case, let m̃ ∈ argmin_{f∈F} ∑_{i∈Λ_n}(Y_i − f(X_i))² be a global empirical risk minimizer. We observe that Δ_n ≤ 8F²; together with R(m̂, m) ≤ 4F², this makes the claimed bound trivial whenever ln N_n ≥ n. Therefore, we assume that ln N_n ≤ n. The proof is divided into four parts, denoted as (I)-(IV).
(I) Establishing a connection between the risk R(m̂, m) = E[(m̂(X) − m(X))²] and its empirical counterpart R_n(m̂, m). (II) For any estimator m̂ taking values in F, we bound the empirical risk from both sides. Since F ≥ 1, the lower bound of the lemma can be obtained by combining (I) and (IV), while the upper bound can be obtained from (I) and (III).
(I) Given a minimal δ-covering of F, let f_j denote the centers of the balls. By construction, there exists a random index j* such that ‖m̂ − f_{j*}‖_∞ ≤ δ. Without loss of generality, we can assume that ‖f_j‖_∞ ≤ F. The random variables X′_i, i = 1, …, n, have the same distribution as X and are independent of (X_i)_{i=1,…,n}. We can then decompose the risk in terms of the variables g_j(X_i, X′_i). We replace f_j with f_{j*} and define g_{j*} in the same way. Similarly, we set r_j as above and define r* := r_j for j = j*.
In the last step, we use the triangle inequality and ‖f_{j*} − m̂‖_∞ ≤ δ.
For random variables U and T, the Cauchy-Schwarz inequality gives E[UT] ≤ (E[U²])^{1/2}(E[T²])^{1/2}. Observing that E[g_j(X_i, X′_i)] = 0, Bernstein's inequality provides an exponential tail bound for sums of independent and centered bounded random variables. Combining Bernstein's inequality with a union bound over j, we obtain the stated inequality, and since r_j ≥ n^{−1} ln N_n, for all t ≥ 6√(n ln N_n) the bound simplifies. Thus, for large values of t, the denominator in the exponential is dominated by the first term. According to the assumption, N_n ≥ 3; hence, ln N_n ≥ 1. By using a similar approach to the upper bound for E[T], we can obtain the quadratic case.
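Bernstein's inequality for bounded, centered, independent variables can be checked against simulation. The distribution, sample size, and thresholds in the sketch below are illustrative assumptions; the bound used is the standard form P(∑U_i ≥ t) ≤ exp(−t²/(2(∑Var(U_i) + bt/3))) for |U_i| ≤ b.

```python
import numpy as np

rng = np.random.default_rng(1)
n, b = 500, 1.0                        # U_i uniform on [-b, b], hence centered and bounded
trials = 20_000
U = rng.uniform(-b, b, size=(trials, n))
S = U.sum(axis=1)
var_sum = n * (2 * b) ** 2 / 12        # sum of Var(U_i) for Uniform(-b, b)

for t in (20.0, 40.0):
    emp = np.mean(S >= t)              # empirical tail probability
    bernstein = np.exp(-t ** 2 / (2 * (var_sum + b * t / 3)))
    assert emp <= bernstein            # the bound dominates the empirical tail
    print(t, emp, bernstein)
```

For moderate t the Bernstein bound is conservative but of the correct exponential order, which is all the proof requires.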
Since all parameters are bounded by 1 in absolute value, we can discretize the nonzero parameters by using a grid of size δ/(2(L+1)V), which yields the bound on the covering number; taking logarithms completes the proof.

Note 1: Similarly, applying Equation (A4) to Lemma A10 gives the corresponding bound.

Proof of Theorem 2. Let δ = 1/n. The proof follows directly from Lemmas A8 and A10 and Note 1.
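The discretization argument bounds the covering number by gridding the s nonzero parameters. The sketch below takes the resulting bound in its usual form, log N(δ) ≤ (s+1) log(2(L+1)V²/δ) with V = ∏_ℓ(p_ℓ + 1); this explicit form and the example architecture are assumptions, since the corresponding display is not reproduced above.

```python
import math

def log_covering_bound(L, p, s, delta):
    """log N(delta, F(L, p, s), sup-norm) <= (s + 1) * log(2 * (L + 1) * V^2 / delta),
    where V = prod_l (p_l + 1), via a parameter grid of size delta / (2 (L + 1) V)."""
    V = math.prod(pl + 1 for pl in p)
    return (s + 1) * math.log(2 * (L + 1) * V ** 2 / delta)

# A small architecture: input dim 3, two hidden layers of width 10, scalar output
L, p, s = 3, (3, 10, 10, 1), 80
lb = log_covering_bound(L, p, s, delta=1e-3)
print(lb)
```

Note that the bound grows only linearly in the sparsity s and logarithmically in 1/δ, which is why the choice δ = 1/n in the proof of Theorem 2 costs only a ln n factor.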
Proof of Theorem 3. In this proof, we assume that there exist positive constants γ ≤ Γ such that the Lebesgue density of X on [0,1]^d is bounded below by γ and above by Γ. For this particular design, we have R(m̃_n, m) ≥ γ‖m̃_n − m‖²₂. Let P_f denote the data-generating distribution in the nonparametric regression model given by Equation (13). The Kullback-Leibler divergence KL(P_f, P_g) is then bounded by a multiple of n‖f − g‖²₂. Theorem 2.7 in Tsybakov [22] states that if, for M ≥ 1 and κ > 0, there exist f^{(0)}, …, f^{(M)} ∈ G(q, d, t, β, K) that are sufficiently separated while their pairwise Kullback-Leibler divergences are bounded by M ln(M)/(9Γ), then the minimax lower bound follows.
Proof of Lemma 1. Let c₂ ≤ 1. Since ‖m‖_∞ ≤ K, we need to consider only the lower bound on F(L, p, s, F), where F = K + 1. Let f̂_n be the empirical risk minimizer, and recall that Δ_n(f̂_n, m) = 0. Due to the minimax lower bound in Theorem 3, there exists a constant c₃ such that, for all sufficiently large n, we have c₃ n^{−2β/(2β+d)} ≤ sup_{m∈C^β_1([0,1],K)} R(f̂_n, m).
The constants C₁ and C₂ depend only on K, β, and d. By using the condition s ≤ c₁ε^{−d/β}/(L ln(1/ε)) and choosing a sufficiently small c₁, the proof is completed.
Proof of Lemma 2. Let r be the smallest positive integer such that μ_r := ∫ x^r ψ(x) dx ≠ 0. Such an r exists because the span of {x^r : r = 0, 1, …} is dense in L²[0, A], and ψ cannot be a constant function. If h ∈ L²(ℝ), then, for the wavelet coefficients, we have the following.
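The existence of the smallest r with μ_r ≠ 0 can be illustrated numerically. The function ψ(x) = x − 1/2 on [0, 1] below is a hypothetical stand-in with vanishing zeroth moment (like a wavelet); the quadrature simply evaluates the moments μ_r.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# Hypothetical psi with vanishing zeroth moment: psi(x) = x - 1/2 on [0, A]
A = 1.0
psi = lambda x: x - 0.5

nodes, weights = leggauss(60)
x = 0.5 * A * (nodes + 1)          # map Gauss-Legendre nodes from [-1, 1] to [0, A]
w = 0.5 * A * weights

def mu(r):
    """mu_r = integral_0^A x^r psi(x) dx, computed by Gauss-Legendre quadrature."""
    return float(np.sum(w * x ** r * psi(x)))

r = 0
while abs(mu(r)) < 1e-12:
    r += 1
print(r, mu(r))   # the smallest r with mu_r != 0
```

Here μ_0 = 0 by construction, while μ_1 = 1/12, so the search stops at r = 1.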

Figure 1. MSEs of the local linear regression method (dashed blue line) and the neural network method (solid orange line) for dimensions 3, 5, 8, and 10. (a) Three dimensions. (b) Five dimensions. (c) Eight dimensions. (d) Ten dimensions.

Figure 2. Comparison of the MSEs of the two methods for Model 2.

Figure 3. Comparison of the MSEs of the two methods for this case.
where the smallest such L is the Lipschitz constant. The composition of two Lipschitz functions with Lipschitz constants L₁ and L₂ is again Lipschitz, with constant L₁L₂. Therefore, the Lipschitz constant of A_k f is bounded by ∏_{ℓ=k−1}^{L} p_ℓ. Given ε > 0, let f, f* ∈ F(L, p, s) be two network functions whose parameters differ from each other by at most ε. Let f have parameters (v_k, W_k) and f* have parameters (v*_k, W*_k). Then, for any α with |α| < β*, we have ‖∂^α ψ_u‖_∞ ≤ 1, by using the fact that K ∈ C^{β*}(ℝ, 1). For α = (α₁, …
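The product rule for Lipschitz constants under composition can be verified empirically. The maps f and g below, their constants, and the sampling range are illustrative assumptions.

```python
import numpy as np

# Composition of Lipschitz maps: Lip(g ∘ f) <= Lip(g) * Lip(f).
L1, L2 = 3.0, 0.5
f = lambda x: L1 * np.abs(x)       # Lipschitz constant L1
g = lambda y: L2 * np.sin(y)       # Lipschitz constant L2

rng = np.random.default_rng(2)
x = rng.uniform(-5, 5, 10_000)
y = rng.uniform(-5, 5, 10_000)
mask = np.abs(x - y) > 1e-6        # avoid numerically degenerate pairs
ratios = np.abs(g(f(x)) - g(f(y)))[mask] / np.abs(x - y)[mask]
assert np.all(ratios <= L1 * L2 + 1e-6)
```

Applied layer by layer, this is exactly why the Lipschitz constant of a ReLU network is controlled by the product of the per-layer bounds.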

Table 1. MSE values of the two methods for Model 1 across various dimensions.

Table 2. The MSE values of both methods for Model 2.

Table 3. The MSE values of both methods for this case.