Abstract
With high dimensionality and dependence in spatial data, traditional nonparametric methods suffer from the curse of dimensionality. The theoretical properties of deep neural network estimation methods for high-dimensional spatial models with dependence and heterogeneity have been investigated in only a few studies. In this paper, we propose a deep neural network with the ReLU activation function to estimate unknown trend components while accounting for both spatial dependence and heterogeneity. We prove the consistency of the estimated components under spatial dependence conditions and provide an upper bound for the mean squared error (MSE). Simulations and empirical studies demonstrate that the convergence speed of the neural network method is significantly better than that of local linear methods.
MSC:
62G05; 68T07
1. Introduction
Spatial data arise in many fields, including environmental science, econometrics, epidemiology, image analysis, oceanography, geography, geology, plant ecology, archaeology, agriculture, and psychology. Spatial correlation and spatial heterogeneity are two significant features of spatial data. Various spatial modeling methods have been applied to explore the effect of spatial heterogeneity. Notably, numerous local spatial techniques have been proposed to accommodate spatial heterogeneity. For example, Hallin et al. [1] and Biau and Cadre [2] proposed a local linear method for the modeling of spatial heterogeneity. Bentsen et al. [3] used a graph neural network architecture to extract spatial dependencies with different update functions to learn temporal correlations.
Spatial data often exhibit high dimensionality, a large scale, heterogeneity, and strong complexity. These challenges often make traditional statistical methods ineffective. Statistical machine learning methods can effectively address such challenges. Du et al. [4] pointed out the issues of traditional spatial data being large-scale and generally complex and summarized the effectiveness and application potential of four advanced machine learning methods—support vector machine (SVM)-based kernel learning, semi-supervised and active learning, ensemble learning, and deep learning—in handling complex spatial data. Farrell et al. [5] highlighted the challenges that high-dimensional spatial data, large volumes of data, and multicollinearity among covariates pose to traditional statistical models in variable selection. Three machine learning algorithms—maximum entropy (MaxEnt), random forests (RFs), and support vector machines (SVMs)—were employed to mitigate the issues of multicollinearity in high-dimensional spatial data. Nikparvar et al. [6] pointed out that the properties of spatially explicit data are often ignored or inadequately addressed in machine learning applications within spatial domains. They argued that the future prospects for spatial machine learning are very promising.
Statistical machine learning methods have advanced rapidly, but their theoretical foundations are not yet well established. Schmidt-Hieber [7] investigated the following nonparametric model:
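In the notation of [7] (restated here for concreteness; the symbols are those of that paper), the model reads
$$ Y_i = f(\mathbf{X}_i) + \varepsilon_i, \qquad \mathbf{X}_i \in [0,1]^d, \quad i = 1, \ldots, n, $$
with $f$ the unknown regression function,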
where the noise variables are assumed to be independent and identically distributed and independent of the covariates, which are themselves independently and identically distributed. It was shown that estimators based on sparsely connected deep neural networks with the ReLU activation function and a properly chosen network architecture achieve minimax rates of convergence (up to log n factors) under a general composition assumption on the regression function.
Considering the dependencies and heterogeneity of spatial models, we study nonparametric high-dimensional spatial models as follows:
where represents the trend function and satisfies the -mixing condition (the definition of this mixing condition is given at the beginning of Section 2.2). For example, Y denotes the hourly ozone concentration; is a vector consisting of the following explanatory variables observed at each station: wind speed, air pressure, air temperature, relative humidity, and elevation. The observation locations are recorded as longitude and latitude . In this case, and ; see [8].
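For orientation, one plausible form of such a model, in the spirit of [8] (this display is stated as an assumption, not necessarily the authors' exact specification), is
$$ Y(\mathbf{s}_i) = m\big(\mathbf{X}(\mathbf{s}_i), \mathbf{s}_i\big) + e(\mathbf{s}_i), \qquad i = 1, \ldots, n, $$
where $m$ is the unknown trend function and $e(\cdot)$ is a zero-mean error process satisfying the mixing condition of Section 2.2.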
Under general assumptions, we prove the consistency of the estimator and provide bounds for the mean squared error (MSE). In simulations, a comparison with the local linear regression method demonstrates that the neural network method converges much faster than the local linear regression. In the empirical study, which considers the air pollution index, air pollutants, and environmental factors, the effectiveness of the neural network is demonstrated through a comparison with the local linear regression method, especially in small-sample cases.
Throughout the rest of the paper, bold letters are used to represent vectors; for example, . We define , where represents the indicator function:
denotes the number of entries of that are not equal to zero, and we write for the norm on D, where D is a domain that may differ between settings. For two sequences and , we write if there exists a constant C such that for all n. Moreover, means and . denotes the logarithm to base 2, denotes the natural logarithm, represents the smallest integer , and represents the largest integer .
2. Nonparametric High-Dimensional Space Model Estimation
2.1. Mathematical Modeling of Deep Network Features
Definition 1.
Fitting a multilayer neural network requires the choice of an activation function and of the network architecture. Motivated by its importance in deep learning, we study the rectified linear unit (ReLU) activation function; see [7].
The ReLU activation function is $\sigma(x) = \max(x, 0)$. For $\mathbf{v} = (v_1, \ldots, v_r)^{\top} \in \mathbb{R}^r$, the shifted activation function $\sigma_{\mathbf{v}} \colon \mathbb{R}^r \to \mathbb{R}^r$ is defined as follows:
$$\sigma_{\mathbf{v}}(\mathbf{y}) = \big(\sigma(y_1 - v_1), \ldots, \sigma(y_r - v_r)\big)^{\top}, \qquad \mathbf{y} = (y_1, \ldots, y_r)^{\top} \in \mathbb{R}^r.$$
The neural network architecture consists of a positive integer L, known as the number of hidden layers or depth, and a width vector . A neural network with this network structure is any function of the following form:
where is a weight matrix and is a shift vector, with . The network function is therefore constructed by alternating matrix–vector multiplications with the action of the nonlinear activation function . In Equation (3), the shift vectors can also be omitted by considering the input as and augmenting the weight matrices with an additional row and column. To fit the network to data generated by a d-dimensional nonparametric regression model, it is required that and .
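To make this construction concrete, the following minimal NumPy sketch evaluates a network of the form in Equation (3) for a given width vector; the architecture, random parameters, and input are purely illustrative and are not the settings used later in the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def network(x, weights, shifts):
    """Evaluate f(x) = W_L sigma_{v_L}( ... W_1 sigma_{v_1}(W_0 x) ... ).

    weights: list of L + 1 weight matrices W_0, ..., W_L.
    shifts:  list of L shift vectors v_1, ..., v_L.
    """
    a = weights[0] @ x
    for W, v in zip(weights[1:], shifts):
        a = W @ relu(a - v)          # shifted ReLU applied componentwise, then next layer
    return a

# Illustrative width vector p = (3, 8, 8, 1): input dimension 3, two hidden layers, scalar output.
rng = np.random.default_rng(0)
p = (3, 8, 8, 1)
weights = [rng.uniform(-1, 1, size=(p[i + 1], p[i])) for i in range(len(p) - 1)]
shifts = [rng.uniform(-1, 1, size=p[i + 1]) for i in range(len(p) - 2)]
print(network(rng.uniform(0, 1, size=3), weights, shifts))
```

All parameters in this sketch are drawn from [−1, 1], consistent with the bounded-parameter convention adopted below.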
Given a network function , the network parameters are the elements of the matrices and the vectors . These parameters need to be estimated/learned from the data. In this context, “estimate” and “learn” can be used interchangeably, as the process of estimating the parameters from data is often referred to as learning in the context of neural networks and machine learning.
The purpose of this paper is to consider a framework that encompasses the fundamental characteristics of modern deep network architectures. In particular, in this paper, we allow for a large depth L and a significant number of potential network parameters without requiring an upper bound on the number of network parameters for the main results. Consequently, this approach deals with high-dimensional settings that have more parameters than training data. Another characteristic of trained networks is that the learned network parameters are typically not very large; see [7]. In practice, the weights of trained networks often do not differ significantly from the initialized weights. As all elements in orthogonal matrices are bounded by 1, the weights of trained networks also do not become excessively large. However, existing theoretical results often demand that the size of the network parameters tends to infinity. To be more consistent with what is observed in practice, all parameters considered in this paper are bounded by 1. By projecting the network parameters at each iteration onto the interval , this constraint can be easily incorporated into deep learning algorithms.
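A hedged sketch of how this projection could be added to a standard training loop in PyTorch: after each gradient step, all parameters are clipped back to [−1, 1]. The optimizer, loss, and function names here are illustrative choices, not the authors' implementation.

```python
import torch

def projected_train_step(model, x, y, optimizer, loss_fn=torch.nn.MSELoss()):
    """One gradient step followed by projection of every parameter onto [-1, 1]."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        for p in model.parameters():
            p.clamp_(-1.0, 1.0)      # enforce the bounded-parameter constraint from the theory
    return loss.item()
```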
Let denote the maximum element norm of , and let us consider the network function space with a given network structure and network parameters bounded by 1 as follows:
where is a vector with all components being 0.
In this work, we model the network sparsity assuming that there are only a few nonzero/active network parameters. If denotes the number of nonzero entries of and stands for the supnorm of the function , then the s-sparse networks are given by
where F is a constant; the upper bound on the uniform norm of the function f is often unnecessary and is thus omitted in the notation. Here, we consider cases where the number of network parameters s is very small compared to the total number of parameters in the network.
For any estimate that returns a network in class , the corresponding quantity is defined as follows:
The sequence measures the discrepancy between the expected empirical risk of and the global minimum of this class over all networks. The subscript m in indicates the sample expectation with respect to the nonparametric regression model generated by the regression function m. Notice that , and if is an empirical risk minimizer.
Therefore, is a critical quantity that, together with the minimax estimation rates, determines the convergence rate of .
To evaluate the statistical performance of under general assumptions, the mean squared error of the estimator is defined as
2.2. Estimation and Theoretical Properties
In order to obtain asymptotic results, we assume throughout this paper that satisfies the following -mixing condition: there exists a function with as , such that whenever , are finite sets of locations, we have
where denotes the Borel σ-field generated by , card denotes the cardinality of , and d(, ) = is the distance between and , where stands for the Euclidean norm and is a symmetric positive function that is nondecreasing in each variable; see [8].
The theoretical performance of neural networks depends on the underlying function class, and a classic approach in nonparametric statistics is to assume that the regression function is -smooth. In this paper, we assume that the regression function is a composition of multiple functions, i.e.,
where . We denote the components of by , and we let denote the maximal number of variables on which each depends. Thus, each is a function of at most variables.
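As a purely illustrative instance of such a composition (the functions below are hypothetical and not taken from the paper), a regression function on [0, 1]^5 can be written as a two-stage composition in which each inner component depends on only two of the five inputs, so the effective dimensions driving the convergence rate remain small even though d = 5.

```python
import numpy as np

# Hypothetical composite regression function m = g1 ∘ g0 on [0, 1]^5.
def g0(x):
    # Each component of g0 depends on at most two of the five inputs.
    return np.array([np.sin(np.pi * x[0] * x[1]),   # uses x1, x2 only
                     np.exp(-x[3]) * x[4]])         # uses x4, x5 only

def g1(z):
    # The outer function depends on two variables.
    return z[0] ** 2 + 0.5 * z[1]

def m(x):
    return g1(g0(x))

print(m(np.array([0.2, 0.4, 0.9, 0.1, 0.7])))
```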
A function is -Hölder if all partial derivatives up to order exist and are bounded and the -th order partial derivatives are -Hölder continuous, where denotes the largest integer strictly less than . The ball of -Hölder functions with radius K is then defined as follows:
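In the notation of [7], this ball can be written as (we restate the definition from that reference)
$$
\mathcal{C}_r^{\beta}(D, K) = \Bigg\{ f : D \subset \mathbb{R}^r \to \mathbb{R} \; : \; \sum_{\boldsymbol{\alpha} : |\boldsymbol{\alpha}| < \beta} \|\partial^{\boldsymbol{\alpha}} f\|_{\infty} + \sum_{\boldsymbol{\alpha} : |\boldsymbol{\alpha}| = \lfloor \beta \rfloor} \; \sup_{\substack{\mathbf{x}, \mathbf{y} \in D \\ \mathbf{x} \neq \mathbf{y}}} \frac{|\partial^{\boldsymbol{\alpha}} f(\mathbf{x}) - \partial^{\boldsymbol{\alpha}} f(\mathbf{y})|}{|\mathbf{x} - \mathbf{y}|_{\infty}^{\beta - \lfloor \beta \rfloor}} \le K \Bigg\},
$$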
where we use multi-index notation, i.e., , where ; see [7].
We assume that each function has Hölder smoothness . Since is also a function of variables, , the underlying function space is then defined as
where .
Theorem 1.
We consider the nonparametric regression model with d variables for the composite regression function in the class , as described in Equation (2). Let be an estimator from the function class satisfying the following conditions:
- (1) ,
- (2) ,
- (3) ,
- (4) ,
where is a positive sequence; then, there exist constants C and depending only on , such that if , then
if , then
To minimize , let ; then,
The convergence rate in Theorem 1 depends on and . The following reasoning shows that serves as a lower bound for the minimax estimation risk (the infimum over all estimators of the supremum risk) over this class. For any empirical risk minimizer, the term is zero by definition, so the following corollary holds.
Corollary 1.
Let be an empirical risk minimizer under the same conditions as in Theorem 1. There exists a constant , depending only on , such that
Condition (1) in Theorem 1 is very mild and states only that the network functions should have at least the same supremum norm as the regression function. From the other assumptions in Theorem 1, it becomes clear that there is a lot of flexibility in selecting a good network architecture as long as the number of active parameters s is taken to be in the right order.
In a fully connected network, the number of network parameters is . This implies that Theorem 1 requires a sparse network. More precisely, the network must have at least completely inactive nodes, meaning that all of their incoming signals are zero. Condition (4) chooses s to balance the approximation error and the estimation variance. From the proof of this theorem (Appendix B), convergence rates for other orders of s can also be derived.
Deep learning excels over other methods only in the large-sample regime. This suggests that the method adapts to the underlying structure in the data, which can produce fast convergence rates but with larger constants or remainder terms; this, in turn, can lead to relatively poor performance in small-sample scenarios.
The proof of the risk bounds in Theorem 1 is based on the following oracle-type inequality.
Theorem 2.
Let us consider the d-dimensional nonparametric regression model given by Equation (2) with an unknown regression function m, where and , let be an arbitrary estimator taking values in the class , and let
and for any , there exists a constant , depending only on ε, such that
and we have
In the context of oracle-type inequalities, an increase in the number of layers can lead to a deterioration in the upper bound on the risk. In practice, it has also been observed that having too many layers can result in a decline in performance. We refer to Section 4.4 in He et al. [9] and He and Sun [10] for more details.
The proof relies on two specific properties of the ReLU activation function that other activation functions do not share. The first is its projection property, $\sigma \circ \sigma = \sigma$, i.e., composing the ReLU activation with itself leaves its output unchanged. This matters because the foundation of the approximation theory lies in constructing smaller networks that perform simpler tasks, and these subnetworks need not all have the same depth. To combine the subnetworks, it is necessary to synchronize the network depth by adding hidden layers that do not alter the output. This can be achieved by suitably selecting weight matrices in the network (assuming an equal width for consecutive layers) and by exploiting the projection property $\sigma \circ \sigma = \sigma$. The property is beneficial not only theoretically but also in practice, as it greatly aids in passing a result to deeper layers through skip connections.
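A minimal numerical check (illustrative only) of the projection property and of depth synchronization with identity weight matrices: once activations are nonnegative, an extra hidden layer with identity weights and zero shifts leaves the output unchanged.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

z = np.array([-1.5, 0.0, 0.3, 2.0])
# Projection property: applying the ReLU twice equals applying it once.
assert np.allclose(relu(relu(z)), relu(z))

# Depth synchronization: for nonnegative activations, an extra hidden layer
# with identity weight matrix and zero shift vector is a no-op.
a = relu(z)
I = np.eye(len(a))
assert np.allclose(relu(I @ a), a)
print("projection and identity-layer checks passed")
```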
Next, we prove that serves as a lower bound for the minimax estimation risk over the class with . This means that in the composition of functions, no additional dimensions are added at deeper abstraction layers. In particular, this avoids the case where exceeds the input dimension .
Theorem 3.
Let us consider the nonparametric regression model (2), where is drawn from a distribution with a Lebesgue density on that is bounded above and below by positive constants. For any nonnegative integer q, arbitrary dimension vectors and such that for all i, any smoothness vector β, and all sufficiently large constants , there exists a positive constant c such that
where inf is taken over all estimators .
By combining the supremum infimum lower bound with the oracle-type inequality, we can easily obtain the following result.
Lemma 1.
Given , and , there exist constants depending only on , such that for , we have
and then, for any width vector , where and , we know that
2.3. Suboptimality of Wavelet Series Estimation
In this section, we show that wavelet series estimators are unable to take advantage of the underlying composition structure in the regression function and achieve, in some setups, much slower convergence rates. Wavelet estimation is susceptible to the curse of dimensionality, whereas neural networks can achieve faster convergence rates.
We consider a compactly supported wavelet system , restricted to from , as constructed in Cohen et al. [11]. Here, , and denotes the shifted and scaled wavelet. For any function , we have
and the convergence on entails wavelet coefficients.
To construct a counterexample, it is sufficient to consider the nonparametric regression model . The empirical wavelet coefficients are obtained as follows:
Since , this yields an unbiased estimate of the wavelet coefficients; furthermore,
We study the estimators of the following form
and for any subset , we have
The wavelet function possesses compact support; thus, without loss of generality, we assume that it is zero outside for some integer .
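A hedged one-dimensional sketch of a truncated wavelet series estimator of this general form, using the Haar wavelet for concreteness; the wavelet family, index set, and data-generating function are illustrative assumptions, not the construction used in the proofs.

```python
import numpy as np

def haar_psi(u):
    """Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    return np.where((u >= 0) & (u < 0.5), 1.0,
                    np.where((u >= 0.5) & (u < 1.0), -1.0, 0.0))

def psi_jk(x, j, k):
    return 2 ** (j / 2) * haar_psi(2 ** j * x - k)

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n)   # illustrative regression data

# Empirical wavelet coefficients d_hat(j, k) = (1/n) * sum_i Y_i * psi_jk(X_i)
# and a series estimator truncated to resolution levels j <= J.
J = 4
index_set = [(j, k) for j in range(J + 1) for k in range(2 ** j)]
d_hat = {jk: np.mean(y * psi_jk(x, *jk)) for jk in index_set}

def m_hat(t):
    return sum(d_hat[jk] * psi_jk(t, *jk) for jk in index_set)

grid = np.linspace(0, 1, 200)
print("max abs error on grid:", np.max(np.abs(m_hat(grid) - np.sin(2 * np.pi * grid))))
```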
Lemma 2.
For any integer , , and any and , there exists a nonzero constant , depending only on d and the wavelet function ψ, such that for any j, we can find a function , where , with the property that for all , we have
Theorem 4.
If denotes a wavelet estimator based on a compactly supported wavelet ψ and an arbitrary index set I,
then, for any and any Hölder radius , we have
As a result, the convergence rate of the wavelet series estimation is slower than . If d is large, this rate becomes significantly slower. Therefore, wavelet estimation is sensitive to the curse of dimensionality, while neural networks can achieve rapid convergence.
3. Simulation Experiments and Case Study
3.1. Simulation Experiments
In this section, we conduct a comparative study through 100 repeated experimental simulations, evaluating the mean squared error of the estimation using both the local linear regression method and deep neural networks with the ReLU activation function. We consider the following models.
Model 1:
where follows a uniform distribution on , is the coefficient, simulated from a uniform distribution on , and is the noise variable following a standard normal distribution.
A neural network with a depth of 3 and a width of 32 was created, where the first two layers are fully connected layers, and the output layer uses the ReLU activation function.
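One possible PyTorch implementation of this architecture, under the reading that "depth 3, width 32" means two hidden fully connected layers of width 32 with ReLU activations followed by a linear output; the training settings are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def make_net(input_dim, width=32):
    """Depth-3, width-32 ReLU network: one reading of the architecture described in the text."""
    return nn.Sequential(
        nn.Linear(input_dim, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, 1),
    )

def fit(net, x, y, epochs=500, lr=1e-3):
    """Plain least-squares training loop with illustrative settings."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
    return net

# Synthetic example with dimension 5 and n = 1000 observations.
torch.manual_seed(0)
x = torch.rand(1000, 5)
y = torch.sin(x.sum(dim=1, keepdim=True)) + 0.1 * torch.randn(1000, 1)
net = fit(make_net(5), x, y)
print(nn.MSELoss()(net(x), y).item())
```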
We consider four scenarios with the same sample size but different dimensions.
where dim represents the dimension and n is the sample size. For each scenario, the mean squared error is calculated to compare the performance of the local linear regression method and the deep neural network method. The MSE in this study is defined as follows:
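A standard form of this quantity, stated here under the assumption that the usual empirical definition is meant, is
$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( \hat{m}(\mathbf{X}_i) - m(\mathbf{X}_i) \big)^2 , $$
where $\hat{m}$ denotes the fitted trend function and $m$ the true one.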
Table 1 presents the MSE estimates at different values of n.
In this table, represents the MSE of the local linear (nonparametric) estimation method, and represents the MSE of the deep neural network method. The closer the MSE is to 0, the higher the estimation accuracy. From Table 1, we observe that for the same dimension, the MSE tends towards 0 as the sample size increases. For the same sample size, the MSE of the deep neural network method is significantly smaller than that of the local linear method, and as the dimension increases, the superiority of the neural network method over the local linear regression method becomes more pronounced. Therefore, the neural network method achieves much higher estimation accuracy, especially for large sample sizes and high dimensions.
As shown in Figure 1, as the dimension increases, the neural network fit increasingly outperforms the local linear regression method; here, the x-axis represents the sample size n, and the y-axis represents the mean squared error (MSE). In higher dimensions, the MSE of the neural network method approaches zero, which is attributed to deep neural networks avoiding the curse of dimensionality. This demonstrates that the convergence rate of the deep neural network is superior to that of the local linear regression method and approaches the optimal convergence rate.
Figure 1.
MSE of the local linear regression method (dashed blue line) and the neural network method (solid orange line) for dimensions 3, 5, 8, and 10. (a) Three dimensions. (b) Five dimensions. (c) Eight dimensions. (d) Ten dimensions.
Next, we consider high-dimensional spatial models with dependency structures to compare the MSE of the two methods mentioned above.
Model 2:
where , , and follows a zero-mean second-order stationary process, while follows a standard normal distribution. Similarly to Cressie and Wikle [12], high-dimensional spatial processes are generated using spectral methods.
In this case, follows a standard normal distribution for and is independent of for , where are independently and identically distributed from a uniform distribution on . As , converges to a Gaussian random process. We consider the case with dimension 5 and sample sizes [200, 600, 1000, 1400, 1800, 2200]. The network structure is the same as in Model 1.
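One common spectral construction in the spirit of Cressie and Wikle [12], written here purely as an assumed, illustrative form (the exact formula used in the simulation is not reproduced above), builds the process as a scaled sum of cosine waves with random frequencies and phases:

```python
import numpy as np

def spectral_spatial_process(coords, K=500, seed=0):
    """Illustrative spectral simulation of a zero-mean, second-order stationary process:
    Z(s) = sqrt(2/K) * sum_k cos(omega_k . s + phi_k).
    As K grows, Z converges to a Gaussian random process (CLT across the K terms)."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(0.0, 1.0, size=(K, coords.shape[1]))   # random frequencies (assumed spectral density)
    phi = rng.uniform(0.0, 2.0 * np.pi, size=K)               # i.i.d. uniform random phases
    return np.sqrt(2.0 / K) * np.cos(coords @ omega.T + phi).sum(axis=1)

coords = np.random.default_rng(1).uniform(0, 10, size=(1000, 2))   # 1000 locations in the plane
z = spectral_spatial_process(coords)
print(round(z.mean(), 3), round(z.var(), 3))                       # approximately 0 and 1
```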
As shown in Table 2, the convergence rate for the high-dimensional spatial model with dependence is worse than that for the high-dimensional spatial model without dependence. However, compared with the local linear regression method, the MSE values of the neural network are much smaller, indicating that the neural network achieves better convergence performance.
As shown in Figure 2, for large sample sizes and high-dimensional spatial models, the neural network achieves superior convergence compared with the local linear regression method; here, the x-axis represents the sample size n, and the y-axis represents the mean squared error (MSE).
3.2. Case Study
To compare the consistency of the local linear regression method and the deep neural network for high-dimensional spatial models, we consider the relationship between air pollution and respiratory diseases in the New Territories East of Hong Kong from 1 January 2000 to 15 January 2001, as studied by Wang et al. [13]. The dataset consists of 821 observations, and we mainly consider the air pollution index and five pollutants, namely sulfur dioxide (SO2), inhalable particulate matter (PM10), nitrogen oxides (NOx), nitrogen dioxide (NO2), and ozone (O3), as well as two environmental factors: temperature (°C) and relative humidity (%). In this section, we examine the relationship between the levels of chemical pollutants in the New Territories East of Hong Kong and the daily hospital admissions for respiratory diseases (Y). The specific parameter settings are the same as those in the numerical simulation part.
After nondimensionalization and standardization, we use 397 data points for the case study. Among these, 80% of the data are used to train the model, and the remaining 20% are used to evaluate the quality of the trained model. The MSE values are shown in Table 3. The MSE values of the deep neural network method are closer to 0, indicating that in this real-world case, the mean squared error of the deep neural network method is much smaller than that of the local linear regression method. Therefore, the deep neural network shows better convergence performance.
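A minimal sketch of the evaluation protocol described above (shuffle, 80/20 split, fit on the training part, report the test MSE); the helper name and the fit/predict callables are placeholders for either the neural network or the local linear procedure.

```python
import numpy as np

def train_test_mse(x, y, fit, predict, train_frac=0.8, seed=0):
    """Shuffle the data, split 80/20, fit on the training part, and return the test MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, test = idx[:n_train], idx[n_train:]
    model = fit(x[train], y[train])
    resid = predict(model, x[test]) - y[test]
    return float(np.mean(resid ** 2))
```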
Figure 3 presents a visual representation of the MSE values for both methods, where the x-axis represents the sample size n and the y-axis represents the mean squared error (MSE). From the graph, it is evident that the deep neural network method exhibits a faster convergence rate than the local linear regression method.
4. Conclusions
In this study, we employ neural networks with ReLU activation functions for nonparametric estimation in high-dimensional spaces. By constructing suitable network architectures, we estimate unknown trend functions and prove the consistency of the estimators while also comparing and analyzing the deep neural network approach with traditional nonparametric methods.
The focus is on high-dimensional spatial models with unknown error distributions. Considering the spatial dependence and heterogeneity in these models, a deep neural network with the ReLU activation function is used to estimate the unknown trend functions. Under general assumptions, the consistency of the estimators is established, and bounds for the mean squared error (MSE) are provided. The estimators exhibit a convergence rate that is related to the sample size but independent of the dimensionality d, thereby avoiding the curse of dimensionality. Moreover, the proposed estimators achieve convergence rates close to the optimum.
Considering the spatial dependencies in high-dimensional settings with large sample sizes, the deep neural network method outperforms traditional nonparametric estimation methods.
Author Contributions
Methodology, H.W.; software and writing, X.J.; editing, H.H.; review, J.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Social Science Fund of China, grant number 22BTJ021.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. The Embedding Property of Network Function Classes
To approximate functions by using neural networks, we first construct smaller networks to compute simpler objects. Let and . To merge networks, the following rules are commonly used in this paper.
Enlargement: , and .
Composition: Let and , where . For a vector , in the space , i.e., , the composition network is defined.
Additional Layer/Depth Synchronization: To synchronize the number of hidden layers in two networks, an additional layer can be added with a unit weight matrix, such that
Parallelization: Let f and g be two networks with the same number of hidden layers and the same input dimension, i.e., and , where . The parallel network computes both f and g simultaneously in the joint network class .
Removing Inactive Nodes: We have
In this context, we have . Let . If all entries in the j-th column of are zero, we can remove this column along with the j-th row of and the j-th element of without changing the function. This implies that . Since there are s active parameters, for any , we need to iterate at least times. This proves that .
In this paper, we often utilize the following fact. For a fully connected network in , there are weight matrix parameters and network parameters from bias vectors. Therefore, the total number of parameters is
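Presumably this count is $\sum_{\ell=0}^{L} p_\ell\, p_{\ell+1} + \sum_{\ell=1}^{L} p_\ell$ under the convention of Section 2.1 (L + 1 weight matrices and L shift vectors); the small helper below computes it for a given width vector, with that convention taken as an assumption.

```python
def count_parameters(p):
    """Total parameter count of a fully connected network with width vector p = (p_0, ..., p_{L+1})."""
    weights = sum(p[l] * p[l + 1] for l in range(len(p) - 1))   # entries of W_0, ..., W_L
    shifts = sum(p[l] for l in range(1, len(p) - 1))            # entries of v_1, ..., v_L
    return weights + shifts

print(count_parameters((3, 8, 8, 1)))   # 24 + 64 + 8 weights plus 8 + 8 shifts = 112
```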
Appendix B. Approximation by Polynomial Neural Networks
We construct a network with all parameters bounded by 1 to approximate the product of given inputs x and y. Let , where k is a positive integer,
and , where
Next, we prove that converges exponentially fast to as m increases, uniformly in . This lemma can be seen as a variation of Lemma 2.4 in Telgarsky's work [14] and Proposition 2 in Yarotsky's work [15]. Compared to existing results, this result allows us to construct networks with all parameters bounded by 1 and provides an explicit bound on the approximation error.
Lemma A1.
For any positive integer m,
Proof.
Step 1: We prove by induction that is a triangular wave. More precisely, is piecewise linear on the intervals , where ℓ is an integer. If ℓ is odd, the endpoints are , and if ℓ is even, the endpoints are .
When , the equality holds obviously.
We assume that the statement holds for k, and we let be divisible by 4, denoted as . Consider x in the interval . When , then . When , then . When , where , then . When , then . Then, for , the statement also holds.
The statement holds for , and the induction is complete.
Step 2: For convenience, let us denote . We now prove that for any and , the following holds:
To prove this, we use mathematical induction on m. For , when , we have and . When , we have and . When , we have and , so the base case holds. For the inductive step, assuming that the claim holds for m, consider : if ℓ is even, then , which implies . If ℓ is odd, then the function is linear over the interval . Furthermore, for any t, we have
Since and , and considering ℓ such that , we can deduce
and the result also holds for , completing the induction.
Thus, the interpolation of at the point with function g has been proven, and it is linear over the interval . Let ; g is a Lipschitz function with Lipschitz constant 1. Therefore, for any x, there exists an ℓ determined by
and we have
which implies
thus proving the lemma.
Let . As proven above, to construct a network that takes inputs x and y and approximates their product, we use a polarization-type identity such as $xy = \tfrac{1}{4}\big\{(x+y)^2 - (x-y)^2\big\}$.
□
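As a numerical sanity check of the exponential rate behind Lemma A1, the sketch below implements the classical tent-map construction of Telgarsky [14] and Yarotsky [15] (whether it coincides exactly with the network of the lemma is an assumption) and verifies that the error of approximating the square function decays roughly like 4^{-m}.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def tent(x):
    """Triangular (tent) map on [0, 1] built from ReLU units: 2x on [0, 1/2], 2(1 - x) on [1/2, 1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def square_approx(x, m):
    """Piecewise-linear approximation of x**2 obtained by subtracting scaled tent-map iterates."""
    out, t = x.copy(), x.copy()
    for s in range(1, m + 1):
        t = tent(t)
        out = out - t / 4.0 ** s
    return out

x = np.linspace(0.0, 1.0, 10001)
for m in (2, 4, 6, 8):
    print(m, np.max(np.abs(square_approx(x, m) - x ** 2)))   # error shrinks roughly like 4 ** (-m)
```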
Lemma A2.
For any positive integer m, there exists a network such that for all , and
and .
Proof.
Let , where and . Consider a nonnegative function .
Step 1: We prove the existence of a network with m hidden layers and width vector to compute this function:
for all , as shown in Figure A1; it is worth noting that all parameters in this network are bounded by 1.
Step 2: We prove the existence of a network with m hidden layers to compute the following functions:
Given the input , the computation of this network in the first layer is as follows:
Applying the network on the first three elements and the last three elements, we obtain a network with hidden layers and a width vector of , and we compute
applying the two-hidden-layer network to the output. Therefore, the composite network has hidden layers and computes
and this implies that the output is always in the interval . According to Equation (A4) and Lemma A1, we can obtain the following:
For all , we have and . Therefore, when k is odd, ; when k is even, . For all input pairs and , the output in Equation (A5) becomes zero. □
Lemma A3.
For any positive integer m, there exists a network
such that , for all , and we have
Furthermore, if one of the components of is 0, then .
Proof.
Let . We now construct the network and perform calculations in the first hidden layer.
We apply the network from Lemma A2 to each pair to compute , , and Mult. Now, we pair adjacent terms and apply Mult again. We continue this process until only one term remains. The resulting network is denoted as , which has hidden layers, and all parameters are bounded by 1.
If , then by Lemma A2 and the triangle inequality, we have
therefore, by iteration and induction, we obtain
By using Lemma A2 and the construction described above, it is evident that if one of the components of is 0, then
We construct a sufficiently large network to approximate all monomials for nonnegative integers up to a certain specified degree. Typically, we use multi-index notation: , where and represents the degree of the monomial.
The number of monomials with degrees satisfying is denoted by , and since each takes values in , we have . □
Lemma A4.
For and positive integers m, there exists a network
such that , and for all , we have
Proof.
For , the monomials are either linear or constant functions. There exists a shallow network in the class that precisely represents the monomial .
Considering the multiplicities in Equation (A6), Lemma A3 can be directly extended to monomials. For , this implies that in the class
there exists a network in the class that takes values within the interval and approximates to a supnorm error of . By utilizing the parallelization and depth synchronization properties discussed in Appendix B, the proof of Lemma A4 can be established.
Following the classical local Taylor approximation, previously used for network approximation by Yarotsky [15], for a vector , we define
According to Taylor’s theorem for multivariable functions, for an appropriate , we have
We have . Therefore, for , we have
We can express Equation (A7) as a linear combination of monomials.
for suitable coefficients , and, for convenience, the dependence on in is omitted here. Since , it follows that
and since and , we have
We consider the grid points . The number of elements in this set is . Let represent the elements of . We define
□
Lemma A5.
If , then .
Proof.
For all , we have
and we use mathematical induction, assuming . The left-hand side of (A11) is
and after summing, we obtain
while for the middle term, we have
and therefore the equation holds when .
Assuming , we have with ; then, the left-hand side is
and after summation, we obtain
In the middle, we have
and, therefore, when , the equation holds true.
Next, we calculate the second equation in (A11); when , we have
and when ℓ takes values , the sum above is zero, resulting in 1. By analogy, for , where , the same holds true. Therefore, we can deduce that .
By using and Equation (A8), we obtain
Then, we describe how to construct a network that approximates . □
Lemma A6.
For any positive integers M and m, there exists a network
where , such that , and for any , we have
For any , the support of the function is contained within the support of the function .
Proof.
The first hidden layer uses units and nonzero parameters to compute the functions and . The second hidden layer uses units and nonzero parameters to compute the function . These functions take values in the interval , and the result holds when .
For , we combine the obtained network with the network approximating the product . According to Lemma A3, there exists a network Mult in the following class:
We compute with an error bounded by . From Equation (A3), it follows that a Mult network has nonzero parameters
as a bound, and since these networks have parallel instances, each hidden layer requires units and nonzero parameters for multiplication operations. Adding the nonzero parameters from the first two layers, the total bound on the number of nonzero parameters is
According to Lemma A3, if one of the components of is zero, then Mult . This implies that for any , the support of the function is contained within the support of the function . □
Theorem A1.
For any function and any integers and , there exists a network
with a depth of
and the number of parameters
such that
Proof.
In this proof, all constructed networks take the form , where . Let M be the largest integer such that , and we define . With the help of Equations (A9) and (A10), and Lemma A4, we can add a hidden layer to the network , resulting in a new network
such that and for any , we have
where and e is the base of the natural logarithm. According to Equation (A3), the number of nonzero parameters in network is bounded by .
According to Lemma A6, the network calculates the product with an error bounded by . It requires at most active parameters. Now, consider the parallel network . Based on the definition of and the assumption on N, we observe that . According to Lemma A6, networks and can be embedded into a joint network with hidden layers. The weight vector and all parameters are bounded by 1. By using , the bound on the number of nonzero parameters in the combined network is
where, for the last inequality, we use , the definition of , and the property that for any , we have .
Next, we pair the outputs of and corresponding to the term and apply the Mult network described in Lemma A2 to each of the pairs. In the final layer, we sum all the terms together. According to Lemma A2, this requires at most active parameters for the total multiplications. By using Lemmas A2 and A6, Equation (A12), and the triangle inequality, we can construct a network
such that, for any , we have
where the first inequality follows from the fact that the support of is contained within the support of , as stated in Lemma A6. Due to Equation (A3), the network has at most
To obtain the network reconstruction of the function f, it is necessary to apply scaling and shifting to the output terms. This is primarily due to the finite parameter weights in the network. We recall that . The network belongs to the class , where the shift vectors are all zero and all entries of the weight matrices are equal to 1. Since , the number of parameters in this network is bounded by . This implies the existence of a network in the class , which computes , where . This network computes and in the first hidden layer, and then applies the network to these two units. In the output layer, the first value is subtracted from the second value. This requires at most active parameters.
Due to Equations (A11) and (A14), there exists a network in the following class
and for all , we have
Under the condition of Equation (A15), the bound for the nonzero parameters of is
By constructing , it follows that . Combined with Lemma A5, we have
Thus, the result is proven.
Based on Theorem A1, we can now construct a network that approximates . In the first step, we show that f can always be represented as a composition of functions defined on hypercubes . As in the previous theorem, let , and we assume for . Define
where means applying the transformation to all j. It is evident that
From the definition of the Hölder ball , we can see that takes values in the interval .
where, for , we have . Without loss of generality, we can always assume that the radius of the Hölder ball is at least 1, i.e., . □
Lemma A7.
Let be as defined above, with . Then, for any function , where , we have
Proof.
Let and . If is an upper bound of the Hölder seminorm of for , then, by the triangle inequality, we have
Combining this with the inequality , which holds for all and all , the lemma is proven. □
Proof of Theorem 1.
Here, all n are assumed to be sufficiently large. Throughout the entire proof, is a constant that depends only on the variation of . Combining Theorem 2 with the bounds on the depth L and network sparsity s assumed, for , we have
where, for the lower bound, we set , and for the upper bound, we set . We take ; then, when , we have . Substituting this into the left-hand side of Equation (A17), we obtain
that is,
Thus, the lower bound for Equation (8) is established.
To obtain upper bounds for Equations (7) and (8), it is necessary to constrain the approximation error. For this purpose, the regression function m is rewritten as Equation (A16), i.e., , where and is defined on and maps to for any .
Here, we apply Theorem A1 to each function separately. Let and consider the following.
this means that there exists a network
where , such that
where is the Hölder norm upper bound of . If , two additional layers are applied to the output, requiring four additional parameters. The resulting network is denoted as
and it is observed that . Since , we have
If the network is parallelized, belongs to the class
where . Finally, constructing the composite network , according to the construction rules in Appendix A, we can realize it in the following class:
where . By observation, there exists an bounded by n such that
For all sufficiently large n, utilizing the inequality , we have , according to Equation (A1), and for sufficiently large n, the space defined in Equation (A20) can be embedded into , where satisfies the assumptions of the theorem. We choose with a sufficiently small constant , depending only on . Combining Theorem A1 with Equations (A18) and (A19), we have
For the approximation error in Equation (A17), we need to find a network function bounded by the supnorm of F. According to the previous inequalities, there exists a sequence of functions such that, for all sufficiently large n, , and . Let us define . Then, , where the last inequality is based on Assumption (1). Additionally, . We can denote , and we have , which implies . This shows that if we take the lower bound on a smaller space , then Equation (A21) also holds. Combining this with the upper bound of Equation (A17), we obtain when , that
and when , that
therefore, the upper bounds in Equations (7) and (8) hold for any constant . This completes the proof.
We begin by utilizing several oracle inequalities for least squares estimators, as presented in Györfi et al. [16] and in [17,18,19,20]. However, these inequalities assume bounded response variables, an assumption that is violated in the nonparametric regression model with Gaussian measurement noise. Additionally, we provide a lower bound for the risk and offer a proof that can be easily generalized to any noise distribution. Let be the covering number, which represents the minimum number of balls with radius needed to cover (where the centers do not necessarily have to lie within ). □
Lemma A8.
We consider the nonparametric regression model in d-dimensional variables given by Equation (2) with an unknown regression function m. Let be an arbitrary estimator taking values in . Let us define
For and assuming . If , for any , then
Proof.
Throughout the proof, let . Define . For any estimator , we introduce the empirical risk .
Step 1: We show that the upper bound holds under the restriction . Since , the upper bound naturally holds when . In this case, let be a global risk minimizer. We observe that
From this equation, we see that , which implies a lower bound on the logarithm in the argument.
Therefore, we assume that . The proof is divided into four parts, denoted as (I)–(IV).
- (I)
- Establishing a connection between risk and its empirical counterpart through inequalities
- (II)
- For any estimate taking values in , we know that
- (III)
- We have
- (IV)
- We have
Since , the lower bound of the lemma can be obtained by combining (I) and (IV), while the upper bound can be obtained from (I) and (III).
(I) Given a minimum -covering of , let represent the centers of the balls. According to the construction, there exists a random such that . Without loss of generality, we can assume that . The random variables , have the same distribution as and are independent of . We can use
where . We replace with , and we define by using the same method. Similarly, we set , and define as when .
In the last part, we use the triangle inequality and
For random variables U and T, the Cauchy–Schwarz inequality states that . Let
and
By using , we have
Observing that ,
and
Bernstein’s inequality states that for independent and centered random variables , if then it holds true that [21]
Combining Bernstein’s inequality and the bound argument, we obtain
and since , for all , we have
Thus, for large values of t, the denominator in the exponential is dominated by the first term. We have
According to the assumption, ; hence, . By using a similar approach to the upper bound for , we can obtain the quadratic case.
Step 2: The identity . In this case, we set and to obtain the aforementioned inequality.
Let be positive real numbers such that . We have
Consequently, for any , we have the following.
According to Equation (A24), we take , , and . Substituting into Equation (A25) (denoted as (I)), we complete the proof of (I).
(II) Given an estimation that takes values in , let be such that . Then, . Since , we have
where
Under the condition , . According to Lemma A9, we obtain . By using Cauchy–Schwarz, we have
Since , we have . Combining Equations (A26) and (A27), we have
(II) is proven.
(III) For any fixed , we have . Since and f are deterministic, we have
. Since , we have
By setting in Equation (A25), we obtain the result for (III).
(IV) Let be the empirical risk minimizer. By using Equation (A22), (II), and , we have
After rearranging, we have , which completes the proof of (IV). □
Lemma A9.
Let , and then .
Proof.
Let . Since , and we have . For , it is evident that holds. Therefore, we consider the case when . By using Mill’s ratio, we obtain . For any T, we have
For and , we have
Since , we can deduce that . Considering that the function is monotonically increasing with respect to M, it holds for all . □
Lemma A10.
If , then for any , we have
Proof.
Given a network
define ,
and ,
Let , and we note that for , we have . A multivariate function is said to be Lipschitz if, for all in its domain, , where the smallest such constant is the Lipschitz constant. The composition of two Lipschitz functions with Lipschitz constants and is again Lipschitz, with Lipschitz constant . Therefore, the Lipschitz constant of is bounded by . Given , let be two network functions whose parameters differ from each other by at most . Let f have parameters and have parameters . Then, we have
The final step uses . Therefore, according to Equation (A3), the total number of parameters is bounded by , and there are combinations to select s nonzero parameters.
Since all parameters are bounded by 1 in absolute value, we can discretize the nonzero parameters by using a grid size of , and the covering number
Taking the logarithm yields the proof.
Note 1: Similarly, applying Equation (A4) to Lemma A10 gives
□
Proof of Theorem 2.
Let . The proof follows directly from Lemmas A8 and A10, and Note 1. □
Proof of Theorem 3.
In this proof, we define . Let us assume that there exist positive constants such that the Lebesgue density of over is bounded below by and above by . For this particular design, we have . Let represent the data-generating mechanism in the nonparametric regression model given by Equation (13). For the Kullback–Leibler divergence, we have . Theorem 2.7 in Tsybakov [22] states that if, for and , we have , then
(i) , where ;
(ii) .
Then, there exists a positive constant such that
In the next step, we construct functions satisfying (i) and (ii). We define
The exponent determines the rate of estimation, i.e., . For convenience, we denote , and . We note the distinction between and . Let be supported on . It is easy to see that such a function K exists. Furthermore, we define and , where is a constant chosen so that , with . For any , we define
and for any and , we have , by using the fact that .
For with , the triangle inequality and the property give
therefore, . For a vector , we define
By constructing and for such that are mutually disjoint, we ensure that .
For , let . For , define . For , let . Here, . We frequently use . Since and the are mutually disjoint, this ensures
by choosing a sufficiently large K, we ensure that .
For all , . Let denote the Hamming distance; then,
According to the Varshamov–Gilbert bound (see [22], Lemma 2.9) and , there exists a subset with a cardinality of . For all such that , it holds that . This implies that for all , we have
According to the definitions of and , we have
This indicates that the functions with satisfy (i) and (ii), and, thus, the lemma is proven. □
Proof of Lemma 1.
Let . Since , we need to consider only the lower bound on , where . Let be the empirical risk minimizer. Recall that . Due to the minimax lower bound in Theorem 3, there exists a constant such that for all sufficiently large n, we have . Since and , by Theorem 2, we can conclude that
where C is a constant. Given , let . If , and , and , then, for sufficiently small and all , we can insert into the previous inequalities, and we have
The constants and depend only on and d. By using the condition and choosing a sufficiently small , the proof is completed. □
Proof of Lemma 2.
Let r be the smallest positive integer such that . Such an r exists because the span of is dense in , and cannot be a constant function. If , then, for the wavelet coefficients, we have
For a real number u, let denote the fractional part of u.
We separately consider the cases when and . If , we define . We note that g is a Lipschitz function with Lipschitz constant 1. Let , where , . For a V-periodic function , the -Hölder condition can be expressed as
Since g is a 1-Lipschitz function, for any u and v such that , we have
Therefore, and . Let . The support of is contained in and . Based on the definition of wavelet coefficients, Equation (A28), the definition of , and using , for , we have
In the last equation, according to the definition of r, .
In the case of , we take . Following the same reasoning as above and by using the binomial theorem, we obtain
therefore, the lemma is proven. □
Proof of Theorem 4.
We define as in Lemma 2. We choose an integer such that
This implies that . According to Lemma 2, there exists a function of the form , where , such that
therefore, the theorem is proven. □
References
- Hallin, M.; Lu, Z.; Tran, L.T. Local linear spatial regression. Ann. Stat. 2004, 32, 2469–2500. [Google Scholar] [CrossRef]
- Biau, G.; Cadre, B. Nonparametric spatial prediction. Stat. Inference Stoch. Process. 2004, 7, 327–349. [Google Scholar] [CrossRef]
- Bentsen, L.; Warakagoda, N.D.; Stenbro, R.; Engelstad, P. Spatio-temporal wind speed forecasting using graph networks and novel Transformer architectures. Appl. Energy 2023, 333, 120565. [Google Scholar] [CrossRef]
- Du, P.; Bai, X.; Tan, K.; Xue, Z.; Samat, A.; Xia, J.; Li, E.; Su, H.; Liu, W. Advances of four machine learning methods for spatial data handling: A review. J. Geovis. Spat. Anal. 2020, 4, 13. [Google Scholar] [CrossRef]
- Farrell, A.; Wang, G.; Rush, S.A.; Martin, J.A.; Belant, J.L.; Butler, A.B.; Godwin, D. Machine learning of large-scale spatial distributions of wild turkeys with high-dimensional environmental data. Ecol. Evol. 2019, 9, 5938–5949. [Google Scholar] [CrossRef] [PubMed]
- Nikparvar, B.; Thill, J.C. Machine learning of spatial data. ISPRS Int. J. Geo-Inf. 2021, 10, 600. [Google Scholar] [CrossRef]
- Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 2020, 48, 1875–1897. [Google Scholar]
- Wang, H.; Wu, Y.; Chan, E. Efficient estimation of nonparametric spatial models with general correlation structures. Aust. N. Z. J. Stat. 2017, 59, 215–233. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5353–5360. [Google Scholar]
- Cohen, A.; Daubechies, I.; Vial, P. Wavelets on the interval and fast wavelet transforms. Appl. Comput. Harmon. Anal. 1993, 1, 54–81. [Google Scholar] [CrossRef]
- Cressie, N.; Wikle, C.K. Statistics for Spatio-Temporal Data; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
- Wang, H.X.; Lin, J.G.; Huang, X.F. Local modal regression for the spatio-temporal model. Sci. Sin. Math. 2021, 51, 615–630. (In Chinese) [Google Scholar]
- Telgarsky, M. Benefits of depth in neural networks. In Proceedings of the Conference on Learning Theory, PMLR, Hamilton, New Zealand, 16–18 November 2016; pp. 1517–1539. [Google Scholar]
- Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114. [Google Scholar] [CrossRef] [PubMed]
- Györfi, L.; Kohler, M.; Krzyzak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: New York, NY, USA, 2002. [Google Scholar]
- Giné, E.; Koltchinskii, V.; Wellner, J.A. Ratio limit theorems for empirical processes. In Stochastic Inequalities and Applications; Birkhäuser: Basel, Switzerland, 2003; pp. 249–278. [Google Scholar]
- Hamers, M.; Kohler, M. Nonasymptotic bounds on the L2 error of neural network regression estimates. Ann. Inst. Stat. Math. 2006, 58, 131–151. [Google Scholar] [CrossRef]
- Koltchinskii, V. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 2006, 34, 2593–2656. [Google Scholar] [CrossRef]
- Massart, P. Concentration Inequalities and Model Selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII-2003; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
- Wellner, J. Weak Convergence and Empirical Processes: With Applications to Statistics; Springer Science and Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
- Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer Series in Statistics; Springer: New York, NY, USA, 2009. [Google Scholar]