Smooth Function Approximation by Deep Neural Networks with General Activation Functions

Ohn, Ilsang; Kim, Yongdai

doi:10.3390/e21070627


by Ilsang Ohn and Yongdai Kim *

Department of Statistics, Seoul National University, Seoul 08826, Korea

* Author to whom correspondence should be addressed.

Entropy 2019, 21(7), 627; https://doi.org/10.3390/e21070627

Submission received: 4 June 2019 / Revised: 21 June 2019 / Accepted: 25 June 2019 / Published: 26 June 2019

(This article belongs to the Special Issue Information Theoretic Learning and Kernel Methods)


Abstract

There has been growing interest in the expressivity of deep neural networks. However, most existing work on this topic focuses only on a specific activation function such as ReLU or sigmoid. In this paper, we investigate the approximation ability of deep neural networks with a broad class of activation functions, which includes most of the frequently used activation functions. We derive the required depth, width and sparsity of a deep neural network to approximate any Hölder smooth function up to a given approximation error for this large class of activation functions. Based on our approximation error analysis, we derive the minimax optimality of deep neural network estimators with general activation functions in both regression and classification problems.

Keywords:

function approximation; deep neural networks; activation functions; Hölder continuity; convergence rates

1. Introduction

Neural networks are learning machines motivated by the architecture of the human brain. Neural networks are comprised of multiple hidden layers, and each hidden layer has multiple hidden nodes, each of which consists of an affine map of the outputs from the previous layer followed by a nonlinear map called an activation function. Deep neural networks have led to tremendous success in various pattern recognition and machine learning tasks such as object recognition, image segmentation, machine translation and others. For an overview of the empirical success of deep neural networks, we refer to the review paper [1] and the recent book [2].

Inspired by the success of deep neural networks, many researchers have tried to give theoretical support for this success. Much of the work to date has focused on the expressivity of deep neural networks, i.e., the ability to approximate a rich class of functions efficiently. The well-known classical result on this topic is the universal approximation theorem, which states that every continuous function can be approximated arbitrarily well by a neural network [3,4,5,6,7]. But these results do not specify the numbers of layers and nodes that a neural network requires to achieve a given approximation accuracy.

Recently, several results on the effects of the numbers of layers and nodes of a deep neural network on its expressivity have been reported. They provide upper bounds on the numbers of layers and nodes required for neural networks to uniformly approximate all functions of interest. Examples of such function classes include the space of rational functions of polynomials [8], the Hölder space [9,10,11,12], Besov and mixed Besov spaces [13] and even a class of discontinuous functions [14,15].

The nonlinear activation function is the central component that distinguishes neural networks from linear models; that is, a neural network becomes a linear function if a linear activation function is used. Therefore, the choice of an activation function substantially influences performance and computational efficiency. Numerous activation functions have been suggested to improve neural network learning [16,17,18,19,20,21]. We refer to the papers [21,22] for an overview of this topic.

There are also many recent theoretical studies about the approximation ability of deep neural networks. However, most of the studies focus on a specific type of activation function such as ReLU [9,10,13,14,15], or on small classes of activation functions such as sigmoidal functions with additional monotonicity, continuity, and/or boundedness conditions [23,24,25,26,27] and m-admissible functions, which are sufficiently smooth and bounded [11]. For definitions of sigmoidal and m-admissible functions, see [24] and [11], respectively. Thus, a unified theoretical framework is still lacking.

In this paper, we investigate the approximation ability of deep neural networks with a quite general class of activation functions. We derive the required numbers of layers and nodes of a deep neural network to approximate any Hölder smooth function up to a given approximation error for this large class of activation functions. Our specified class of activation functions and the corresponding approximation ability of deep neural networks include most of the previous results [9,10,11,23] as special cases.

Our general theoretical results on the approximation ability of deep neural networks enable us to study statistical properties of deep neural networks. Schmidt-Hieber [10] and Kim et al. [28] proved the minimax optimality of a deep neural network estimator with the ReLU activation function in regression and classification problems, respectively. In this paper, we derive similar results for general activation functions.

This paper is structured as follows. In Section 2, we introduce some notions about deep neural networks. In Section 3, we introduce two large classes of activation functions. In Section 4, we present our main result on the approximation ability of a deep neural network with the general activation functions considered in Section 3. In Section 5, we apply the result in Section 4 to the supervised learning problems of regression and classification. Conclusions are given in Section 6. The proofs of all results are given in the appendices.

Notation

We denote by

𝟙 (\cdot)

the indicator function. Let

R

be the set of real numbers and

N

be the set of natural numbers. For a real valued vector

x \equiv (x_{1}, \dots, x_{d})

, we let

{| x |}_{0} : = \sum_{j = 1}^{d} 𝟙 (x_{j} \neq 0)

,

{| x |}_{p} : = {(\sum_{j = 1}^{d} {| x_{j} |}^{p})}^{1 / p}

for

p \in [1, \infty)

and

{| x |}_{\infty} : = {max}_{1 \leq j \leq d} | x_{j} |

. For simplicity, we let

| x | : = {| x |}_{1} .

For a real valued function

f (x) : R \to R

, we let

f^{'} (a), f^{″} (a)

and

f^{‴} (a)

be the first, second and third order derivatives of f at a, respectively. We let

f^{'} (a +) : = {lim}_{ϵ ↓ 0} (f (a + ϵ) - f (a)) / ϵ

and

f^{'} (a -) : = {lim}_{ϵ ↓ 0} (f (a) - f (a - ϵ)) / ϵ

. For

x \in R

, we write

{(x)}_{+} : = \max \{x, 0\}

.

2. Deep Neural Networks

In this section we provide a mathematical representation of deep neural networks. A neural network with

L \in N

layers,

n_{l} \in N

many nodes at the l-th hidden layer for

l = 1, \dots, L

, input of dimension

n_{0}

, output of dimension

n_{L + 1}

and nonlinear activation function

σ : R \to R

is expressed as

N_{σ} (x | θ) : = A_{L + 1} \circ σ_{L} \circ A_{L} \circ \dots \circ σ_{1} \circ A_{1} (x),

(1)

where

A_{l} : R^{n_{l - 1}} \to R^{n_{l}}

is an affine linear map defined by

A_{l} (x) = W_{l} x + b_{l}

for given

n_{l} \times n_{l - 1}

dimensional weight matrix

W_{l}

and

n_{l}

dimensional bias vector

b_{l}

and

σ_{l} : R^{n_{l}} \to R^{n_{l}}

is an element-wise nonlinear activation map defined by

σ_{l} (z) : = {(σ (z_{1}), \dots, σ (z_{n_{l}}))}^{⊤}

. Here,

θ

denotes the set of all weight matrices and bias vectors

θ : = ((W_{1}, b_{1}), (W_{2}, b_{2}), \dots, (W_{L + 1}, b_{L + 1})),

which we call

θ

the parameter of the neural network, or simply, a network parameter.
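As an illustration of the composition in (1), the following minimal Python sketch evaluates a network from a list of weight and bias pairs. The function names `affine` and `forward`, the list-of-rows matrix layout and the example parameter `theta_abs` are assumptions made for this sketch and are not notation from the paper.

```python
def affine(W, b, x):
    """Affine map A(x) = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(theta, sigma, x):
    """Evaluate N_sigma(x | theta) for theta = [(W_1, b_1), ..., (W_{L+1}, b_{L+1})]."""
    z = list(x)
    # Hidden layers: affine map followed by the element-wise activation.
    for W, b in theta[:-1]:
        z = [sigma(v) for v in affine(W, b, z)]
    # Output layer: affine map only, no activation.
    W_out, b_out = theta[-1]
    return affine(W_out, b_out, z)


def relu(v):
    return max(v, 0.0)


# Hypothetical example: |x| = relu(x) + relu(-x), one hidden layer with two nodes.
theta_abs = [
    ([[1.0], [-1.0]], [0.0, 0.0]),  # (W_1, b_1): maps R^1 to R^2
    ([[1.0, 1.0]], [0.0]),          # (W_2, b_2): maps R^2 to R^1
]
print(forward(theta_abs, relu, [-3.0]))  # [3.0]
```

Here L = 1, the input dimension is 1 and the output dimension is 1, so the parameter consists of the L + 1 = 2 pairs listed in `theta_abs`.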

We introduce some notations related to the network parameter. For a network parameter

θ

, we write

L (θ)

for the number of hidden layers of the corresponding neural network, and write

n_{max} (θ)

for the maximum of the numbers of hidden nodes at each layer. Following a standard convention, we say that

L (θ)

is the depth of the deep neural network and

n_{max} (θ)

is the width of the deep neural network. We let

{| θ |}_{0}

be the number of nonzero elements of

θ

, i.e.,

{| θ |}_{0} : = \sum_{l = 1}^{L + 1} (| vec (W_{l}) |_{0} + {| b_{l} |}_{0}),

where

vec (W_{l})

transforms the matrix

W_{l}

into the corresponding vector by concatenating the column vectors. We call

{| θ |}_{0}

the sparsity of the deep neural network. Let

{| θ |}_{\infty}

be the largest absolute value of elements of

θ

, i.e.,

{| θ |}_{\infty} : = max \{max_{1 \leq l \leq L + 1} | vec (W_{l}) |_{\infty}, max_{1 \leq l \leq L + 1} {| b_{l} |}_{\infty}\} .

We call

{| θ |}_{\infty}

the magnitude of the deep neural network. We let

in (θ)

and

out (θ)

be the input and output dimensions of the deep neural network, respectively. We denote by

Θ_{d, o} (L, N)

the set of network parameters with depth L, width N, input dimension d and output dimension o, that is,

Θ_{d, o} (L, N) : = \{θ : L (θ) \leq L, n_{max} (θ) \leq N, in (θ) = d, out (θ) = o\} .

We further define a subset of

Θ_{d, o} (L, N)

with restrictions on sparsity and magnitude as

Θ_{d, o} (L, N, S, B) : = \{θ \in Θ_{d, o} (L, N) : {| θ |}_{0} \leq S, {| θ |}_{\infty} \leq B\} .
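The quantities defined in this section can be computed directly from a parameter. The sketch below is a minimal illustration under the assumption that a parameter is stored as a list of (weight matrix, bias vector) pairs with matrices given as lists of rows; the names `network_stats` and `in_class` and the example `theta_abs` are hypothetical, and the check of the input and output dimensions is omitted for brevity.

```python
def network_stats(theta):
    """Return (depth, width, sparsity, magnitude) of theta = [(W_1, b_1), ..., (W_{L+1}, b_{L+1})]."""
    depth = len(theta) - 1  # L(theta): number of hidden layers
    # n_max(theta): largest number of hidden nodes (the output layer is excluded)
    width = max((len(b) for _, b in theta[:-1]), default=0)
    entries = [w for W, _ in theta for row in W for w in row]
    entries += [v for _, b in theta for v in b]
    sparsity = sum(1 for v in entries if v != 0)  # |theta|_0: nonzero weights and biases
    magnitude = max(abs(v) for v in entries)      # |theta|_inf: largest absolute entry
    return depth, width, sparsity, magnitude


def in_class(theta, L, N, S, B):
    """Check the depth, width, sparsity and magnitude constraints of Theta(L, N, S, B)."""
    depth, width, sparsity, magnitude = network_stats(theta)
    return depth <= L and width <= N and sparsity <= S and magnitude <= B


# Hypothetical example: the two-node network computing |x| with the ReLU activation.
theta_abs = [
    ([[1.0], [-1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
print(network_stats(theta_abs))  # (1, 2, 4, 1.0)
```

The example has four nonzero weights and no nonzero bias, so its sparsity is 4 even though the parameter has seven entries in total.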

3. Classes of Activation Functions

In this section, we consider two classes of activation functions. These two classes include most of the commonly used activation functions. Definitions and examples of each class are provided in the following two subsections.

3.1. Piecewise Linear Activation Functions

We first consider piecewise linear activation functions.

Definition 1.

A function

σ : R \to R

is continuous piecewise linear if it is continuous and there exist a finite number of break points

a_{1} \leq a_{2} \leq \dots \leq a_{K} \in R

with

K \in N

such that

σ^{'} (a_{k} -) \neq σ^{'} (a_{k} +)

for every

k = 1, \dots, K

and

σ (x)

is linear on

(- \infty, a_{1}], [a_{1}, a_{2}], \dots, [a_{K - 1}, a_{K}], [a_{K}, \infty)

.

Throughout this paper, we write “piecewise linear” instead of “continuous piecewise linear” for notational simplicity unless this causes confusion. Representative examples of piecewise linear activation functions are as follows:

ReLU: $σ (x) = \max \{x, 0\}$ .
Leaky ReLU: $σ (x) = \max \{x, a x\}$ for $a \in (0, 1)$ .

The ReLU activation function is the most popular choice in practical applications due to better gradient propagation and efficient computation [22]. For this reason, most of the recent results on function approximation by deep neural networks are based on the ReLU activation function [9,10,13,14,15]. In Section 4, as Yarotsky [9] did, we extend these results to any continuous piecewise linear activation function by showing that the ReLU activation function can be exactly represented by a linear combination of piecewise linear activation functions. A formal proof of this argument is presented in Appendix A.1.

3.2. Locally Quadratic Activation Functions

One of the basic building blocks in approximation by deep neural networks is the square function, which should be approximated precisely. Piecewise linear activation functions have zero curvature (i.e., constant first-order derivative) inside each interval divided by its break points, which makes it relatively difficult to approximate the square function efficiently. But if there is an interval on which the activation function has nonzero curvature, the square function can be approximated more efficiently, which is a main motivation of considering a new class of activation functions that we call locally quadratic.

Definition 2.

A function

σ : R \to R

is locally quadratic if there exists an open interval

(a, b) \subset R

on which σ is three times continuously differentiable with bounded derivatives and there exists

t \in (a, b)

such that

σ^{'} (t) \neq 0

and

σ^{″} (t) \neq 0

.
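One standard way to see why nonzero curvature helps is a second difference: for small h, σ(t + hx) - 2σ(t) + σ(t - hx) ≈ σ″(t) h² x², so three evaluations of σ near t already reproduce the square function on a bounded interval. The Python sketch below illustrates this for the sigmoid; it is an illustration under our own choice of t = 1 and h = 0.01 and is not claimed to be the construction used in the proofs of this paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_second_derivative(t):
    s = sigmoid(t)
    return s * (1.0 - s) * (1.0 - 2.0 * s)


def approx_square(x, t=1.0, h=0.01):
    """Approximate x**2 by a second difference of the sigmoid around t.

    t and h are illustrative choices; t must satisfy sigma''(t) != 0.
    """
    num = sigmoid(t + h * x) - 2.0 * sigmoid(t) + sigmoid(t - h * x)
    return num / (h * h * sigmoid_second_derivative(t))


# Largest error on a grid of [-1, 1]; it shrinks proportionally to h**2.
worst = max(abs(approx_square(k / 10.0) - (k / 10.0) ** 2) for k in range(-10, 11))
print(worst)
```

The expression is a one-hidden-layer network with three nodes and output weights of order 1 / h², which shows that a higher accuracy is paid for with larger parameter magnitudes in this particular sketch.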

We now give examples of locally quadratic activation functions. First of all, any nonlinear smooth activation function with nonzero second derivative is locally quadratic. Examples are:

Sigmoid: $σ (x) = \frac{1}{1 + e^{- x}} .$
Tangent hyperbolic: $σ (x) = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}} .$
Inverse square root unit (ISRU) [18]: $σ (x) = \frac{x}{\sqrt{1 + a x^{2}}}$ for $a > 0$ .
Soft clipping [19]: $σ (x) = \frac{1}{a} log (\frac{1 + e^{a x}}{1 + e^{a (x - 1)}})$ for $a > 0$ .
SoftPlus [22]: $σ (x) = log (1 + e^{x})$ .
Swish [21]: $σ (x) = \frac{x}{1 + e^{- x}}$ .

In addition, any piecewise smooth function having nonzero second derivative on at least one piece is also locally quadratic. Examples are:

Rectified power unit (RePU) [12]: $σ (x) = {(\max \{x, 0\})}^{k}$ for $k \in N \setminus \{1\}$ .
Exponential linear unit (ELU) [17]: $σ (x) = a (e^{x} - 1) 𝟙 (x \leq 0) + x 𝟙 (x > 0)$ for $a > 0$ .
Inverse square root linear unit (ISRLU) [18]: $σ (x) = \frac{x}{\sqrt{1 + a x^{2}}} 𝟙 (x \leq 0) + x 𝟙 (x > 0)$ for $a > 0$ .
Softsign [16]: $σ (x) = \frac{x}{1 + | x |} .$
Square nonlinearity [20]:
$σ (x) = 𝟙 (x > 2) + (x - x^{2} / 4) 𝟙 (0 \leq x \leq 2) + (x + x^{2} / 4) 𝟙 (- 2 \leq x < 0) - 𝟙 (x < - 2)$ .

4. Approximation of Smooth Functions by Deep Neural Networks

In this section we introduce the function class we consider and show the approximation ability of deep neural networks with an activation function considered in Section 3.

4.1. Hölder Smooth Functions

We recall the definition of Hölder smooth functions. For a d-dimensional multi-index

m \equiv (m_{1}, \dots, m_{d}) \in N_{0}^{d}

where

N_{0} : = N \cup \{0\}

, we let

x^{m} : = x_{1}^{m_{1}} \dots x_{d}^{m_{d}}

for

x \in R^{d}

. For a function

f : X \to R

, where

X

denotes the domain of the function, we let

{∥ f ∥}_{\infty} : = {sup}_{x \in X} | f (x) |

. We use notation

\partial^{m} f : = \frac{\partial^{| m |} f}{\partial x^{m}} = \frac{\partial^{| m |} f}{\partial x_{1}^{m_{1}} \dots \partial x_{d}^{m_{d}}},

for

m \in N_{0}^{d}

to denote a derivative of f of order

m

. We denote by

C^{m} (X)

, the space of m times differentiable functions on

X

whose partial derivatives of order

m

with

| m | \leq m

are continuous. We define the Hölder coefficient of order

s \in (0, 1]

as

{[f]}_{s} : = sup_{x_{1}, x_{2} \in X, x_{1} \neq x_{2}} \frac{| f (x_{1}) - f (x_{2}) |}{| x_{1} - x_{2} |^{s}} .

For a positive real value

α

, the Hölder space of order

α

is defined as

H^{α} (X) : = \{f \in C^{⌊α⌋} (X) : {∥ f ∥}_{H^{α} (X)} < \infty\},

where

{∥ f ∥}_{H^{α} (X)}

denotes the Hölder norm defined by

{∥ f ∥}_{H^{α} (X)} : = \sum_{m \in N_{0}^{d} : | m | \leq ⌊α⌋} {∥ \partial^{m} f ∥}_{\infty} + \sum_{m \in N_{0}^{d} : | m | = ⌊α⌋} {[\partial^{m} f]}_{α - ⌊α⌋} .

We denote by

H^{α, R} (X)

the closed ball in the Hölder space of radius R with respect to the Hölder norm, i.e.,

H^{α, R} (X) : = \{f \in H^{α} (X) : {∥ f ∥}_{H^{α} (X)} \leq R\} .
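As a small worked example of these definitions (our own illustration, with d = 1 and the function chosen for convenience), consider f(x) = x^{3/2} on X = [0, 1] and α = 3/2. The derivative f′(x) = (3/2) x^{1/2} is continuous on [0, 1], and the Hölder norm can be computed term by term:

```latex
\alpha = \tfrac{3}{2}, \qquad \lfloor \alpha \rfloor = 1, \qquad \alpha - \lfloor \alpha \rfloor = \tfrac{1}{2},

\| f \|_{\infty} = 1, \qquad
\| f' \|_{\infty} = \sup_{x \in [0,1]} \tfrac{3}{2} x^{1/2} = \tfrac{3}{2},

[f']_{1/2} = \tfrac{3}{2} \sup_{x_1 \neq x_2} \frac{| x_1^{1/2} - x_2^{1/2} |}{| x_1 - x_2 |^{1/2}} = \tfrac{3}{2},

\| f \|_{H^{3/2}([0,1])} = 1 + \tfrac{3}{2} + \tfrac{3}{2} = 4 .
```

The supremum in the third line equals 1 because |x_1^{1/2} - x_2^{1/2}| ≤ |x_1 - x_2|^{1/2}, with equality when x_2 = 0. Hence f belongs to H^{3/2, R}([0, 1]) for every R ≥ 4, whereas it belongs to no Hölder space of order larger than 3/2, since the ratio |f′(x) - f′(0)| / x^{s} is unbounded near zero for any s > 1/2.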

4.2. Approximation of Hölder Smooth Functions

We present our main theorem in this section.

Theorem 1.

Let

d \in N

,

α > 0

and

R > 0

. Let the activation function σ be either continuous piecewise linear or locally quadratic. Let

f \in H^{α, R} ({[0, 1]}^{d})

. Then there exist positive constants

L_{0}

,

N_{0}

,

S_{0}

and

B_{0}

depending only on d, α, R and σ such that, for any

ϵ > 0

, there is a neural network

θ_{ϵ} \in Θ_{d, 1} (L_{0} log (1 / ϵ), N_{0} ϵ^{- d / α}, S_{0} ϵ^{- d / α} log (1 / ϵ), B_{0} ϵ^{- 4 (d / α + 1)})

(2)

satisfying

sup_{x \in {[0, 1]}^{d}} | f (x) - N_{σ} (x | θ_{ϵ}) | \leq ϵ .

(3)

The result of Theorem 1 is equivalent to the results on approximation by ReLU neural networks [9,10] in the sense that the upper bounds of the depth, width and sparsity are of the same order as those for ReLU, namely, depth

= O (log (1 / ϵ))

, width

= O (ϵ^{- d / α})

and sparsity

= O (ϵ^{- d / α} log (1 / ϵ))

. We remark that each upper bound is equivalent to the corresponding lower bound established by [9] up to logarithmic factor.
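To make the scaling in Theorem 1 concrete, the following Python sketch evaluates the three orders for given ϵ, d and α. The constants L_0, N_0 and S_0 are not specified numerically in the theorem, so they are set to 1 here purely as placeholders; the function name `network_size_orders` is hypothetical.

```python
import math


def network_size_orders(eps, d, alpha, L0=1.0, N0=1.0, S0=1.0):
    """Orders of depth, width and sparsity from Theorem 1 for 0 < eps < 1.

    L0, N0 and S0 are placeholder constants (assumed to be 1), so only the
    growth in eps, d and alpha is meaningful, not the absolute numbers.
    """
    log_term = math.log(1.0 / eps)
    poly_term = eps ** (-d / alpha)
    return {
        "depth": L0 * log_term,
        "width": N0 * poly_term,
        "sparsity": S0 * poly_term * log_term,
    }


# Example: d = 2 and alpha = 2, so the width grows like 1 / eps.
print(network_size_orders(0.01, d=2, alpha=2))
```

For fixed ϵ, a larger ratio d / α inflates the width and sparsity polynomially, whereas the depth only depends on ϵ through log(1 / ϵ).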

For piecewise linear activation functions, Yarotsky [9] derived similar results to ours. For locally quadratic activation functions, only special classes of activation functions were considered in previous work. Li et al. [12] considered the RePU activation function, and Bauer and Kohler [11] considered sufficiently smooth and bounded activation functions, which include the sigmoid, tangent hyperbolic, ISRU and soft clipping activation functions. The SoftPlus, Swish, ELU, ISRLU, softsign and square nonlinearity activation functions, however, are covered only by our results.

Even though the orders of the depth, width and sparsity are the same for both piecewise linear and locally quadratic activation functions, the ways of approximating a smooth function with these two activation function classes are quite different. To describe this point, let us provide an outline of the proof. We first consider equally spaced grid points with spacing

1 / M

inside the d-dimensional unit hypercube

{[0, 1]}^{d}

. Let

G_{d, M}

be the set of such grid points, namely,

G_{d, M} : = \{\frac{1}{M} (m_{1}, \dots, m_{d}) : m_{j} \in \{0, 1, \dots, M\}, j = 1, \dots, d\} .

For a given Hölder smooth function f of order

α

, we first find a “local” function for each grid point that approximates the target function near the grid point but vanishes away from it. To be more specific, we construct the local functions

g_{z, M}

,

z \in G_{d, M}

, which satisfy

sup_{x \in {[0, 1]}^{d}} |f (x) - \sum_{z \in G_{d, M}} g_{z, M} (x)| \leq C | G_{d, M} |^{- α / d},

(4)

for some universal constant

C > 0

. The inequality (4) implies that the more grid points we use, the more accurate the approximation becomes. Moreover, the quality of approximation improves when the target function is smoother (i.e., large

α

) and low dimensional (i.e., small d). In fact,

g_{z, M} (x)

is given by a product of the Taylor polynomial

P_{z, M} (x) : = \sum_{m \in N_{0}^{d} : | m | \leq α} \partial^{m} f (z) \frac{{(x - z)}^{m}}{m!}

at

z

and the local basis function

ϕ_{z, M} (x) : = \prod_{j = 1}^{d} {(1 / M - | x_{j} - z_{j} |)}_{+}

, where

m! : = \prod_{j = 1}^{d} m_{j}!

. By simple algebra, we have

\begin{matrix} P_{M} (x) : = \sum_{z \in G_{d, M}} g_{z, M} (x) & = \sum_{z \in G_{d, M}} P_{z, M} (x) ϕ_{z, M} (x) \\ & = \sum_{z \in G_{d, M}} \sum_{m : | m | \leq α} β_{z, m} x^{m} ϕ_{z, M} (x), \end{matrix}

where

β_{z, m} : = \sum_{\tilde{m} : \tilde{m} \geq m, | \tilde{m} | \leq α} \partial^{\tilde{m}} f (z) \frac{{(- z)}^{\tilde{m} - m}}{m! (\tilde{m} - m)!}

.
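The first stage of the outline can be checked numerically in one dimension. The Python sketch below sums, over the grid points z = m / M, a first-order Taylor polynomial of f = sin multiplied by a hat function centred at z. Two points are assumptions of this sketch rather than statements from the paper: the hat function is normalized as (1 - M |x - z|)_+, i.e., M times the one-dimensional local basis function written above, so that the hats sum to one on [0, 1]; and the target function, the Taylor degree and all function names are chosen only for illustration.

```python
import math


def hat(x, z, M):
    """Normalized hat function (1 - M|x - z|)_+ centred at the grid point z."""
    return max(1.0 - M * abs(x - z), 0.0)


def local_taylor_approx(x, M, f=math.sin, df=math.cos):
    """Sum over grid points z = m / M of (degree-1 Taylor polynomial at z) * hat."""
    total = 0.0
    for m in range(M + 1):
        z = m / M
        total += (f(z) + df(z) * (x - z)) * hat(x, z, M)
    return total


def sup_error(M, n_eval=2001):
    """Maximum error against sin on an evaluation grid of [0, 1]."""
    xs = [i / (n_eval - 1) for i in range(n_eval)]
    return max(abs(math.sin(x) - local_taylor_approx(x, M)) for x in xs)


for M in (5, 10, 20, 40):
    print(M, sup_error(M))
```

Since the second derivative of sin is bounded by 1, each local Taylor remainder is at most (x - z)² / 2, and the printed errors should decay roughly like M^{-2}, in line with the form of the bound (4) for a function with two bounded derivatives in dimension d = 1.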

The second stage is to approximate each monomial

x^{m}

and each local basis function

ϕ_{z, M} (x)

by deep neural networks. Each monomial can be approximated more efficiently by a deep neural network with a locally quadratic activation function than a piecewise linear activation function since each monomial has nonzero curvature. On the other hand, the local basis function can be approximated more efficiently by a deep neural network with a piecewise linear activation than a locally quadratic activation function since the local basis function is piecewise linear itself. That is, there is a trade-off in using either a piecewise linear or a locally quadratic activation function.

We close this section by comparing our result with the approximation error analysis of [11]. Bauer and Kohler [11] studied approximation of Hölder smooth functions of order

α

by a two layer neural network with m-admissible activation functions with

m \geq α

, where a function

σ

is called m-admissible if (1)

σ

is at least

m + 1

times continuously differentiable with bounded derivatives; (2) a point

t \in R

exists, where all derivatives up to the order m of

σ

are different from zero; and (3)

| σ (x) - 1 | \leq 1 / x

for

x > 0

and

| σ (x) | \leq 1 / | x |

for

x < 0

. Our notion of locally quadratic activation functions is a generalized version of the m-admissibility.

In the proof of [11], the condition

m \geq α

is necessary because they approximate any monomial of order

m

with

| m | \leq α

with a two layer neural network, which is impossible when

m < α

. We drop the condition

m \geq α

by showing that any monomial of order

m

with

| m | \leq α

can be approximated by a deep neural network with a finite number of layers, which depends on

α

.

5. Application to Statistical Learning Theory

In this section, we apply our results about the approximation error of neural networks to the supervised learning problems of regression and classification. Let

X

be the input space and

Y

the output space. Let

F

be a given class of measurable functions from

X

to

Y

. Let

P_{0}

be the true but unknown data generating distribution on

X \times Y

. The aim of supervised learning is to find a predictive function that minimizes the population risk

R (f) : = E_{(X, Y) \sim P_{0}} ℓ (Y, f (X))

with respect to a given loss function ℓ. Since

P_{0}

is unknown, we cannot directly minimize the population risk, and thus any estimator

\hat{f}

inevitably has the excess risk which is defined as

R (\hat{f}) - {inf}_{f \in F} R (f) .

For a given sample of size n, let

F_{n}

be a given subset of

F

called a sieve and let

(x_{1}, y_{1}), \dots, (x_{n}, y_{n})

be observed (training) data of input–output pairs assumed to be independent realizations of

(X, Y)

following

P_{0} .

Let

{\hat{f}}_{n}

be an estimated function among

F_{n}

based on the training data

(x_{1}, y_{1}), \dots, (x_{n}, y_{n}) .

The excess risk of

{\hat{f}}_{n}

is decomposed into approximation and estimation errors as

R ({\hat{f}}_{n}) - inf_{f \in F} R (f) = \underbrace{[R ({\hat{f}}_{n}) - inf_{f \in F_{n}} R (f)]}_{\text{Estimation error}} + \underbrace{[inf_{f \in F_{n}} R (f) - inf_{f \in F} R (f)]}_{\text{Approximation error}} .

(5)

There is a trade-off between approximation and estimation errors. If the function class

F_{n}

is sufficiently large to approximate the optimal estimator

f^{*} : = {argmin}_{f \in F} R (f)

well, then the estimation error becomes large due to high variance. In contrast, if

F_{n}

is small, it leads to low estimation error but it suffers from large approximation error.

One of the advantages of deep neural networks is that we can construct a sieve which has good approximation ability as well as low complexity. Schmidt-Hieber [10] and Kim et al. [28] proved that a neural network estimator can achieve the optimal balance between the approximation and estimation errors to obtain the minimax optimal convergence rates in regression and classification problems, respectively. But they only considered the ReLU activation function. Based on the results of Theorem 1, we can easily extend their results to general activation functions.

The main tool to derive the minimax optimal convergence rate is that the complexity of a class of functions generated by a deep neural network is not affected much by a choice of an activation function, provided that the activation function is Lipschitz continuous. The function

σ : R \to R

is Lipschitz continuous if there is a constant

C_{σ} > 0

such that

| σ (x_{1}) - σ (x_{2}) | \leq C_{σ} | x_{1} - x_{2} |,

(6)

for any

x_{1}, x_{2} \in R

. Here,

C_{σ}

is called the Lipschitz constant. We use the covering number with respect to the

L_{\infty}

norm

{∥ \cdot ∥}_{\infty}

as a measure of complexity of function classes. We recall the definition of the covering number. Let

F

be a given class of real-valued functions defined on

X

. Let

δ > 0

. A collection

\{f_{j} \in F : j = 1, \dots, J\}

is called a

δ

-covering set of

F

with respect to the

L_{\infty}

norm if for all

f \in F

, there exists

f_{j}

in the collection such that

∥ f - f_{j} ∥_{\infty} \leq δ

. The cardinality of the minimal

δ

-covering set is called the

δ

-covering number of

F

with respect to the

L_{\infty}

norm which is denoted by

N (δ, F, ∥ \cdot ∥_{\infty}) .

That is,

N (δ, F, {∥ \cdot ∥}_{\infty}) : = inf \{J \in N : \exists f_{1}, \dots, f_{J} \text{ such that } F \subset ⋃_{j = 1}^{J} B_{\infty} (f_{j}, δ)\},

where

B_{\infty} (f_{j}, δ) : = \{f \in F : ∥ f - f_{j} ∥_{\infty} \leq δ\}

. The following proposition provides the covering number of a class of functions generated by neural networks.

Proposition 1.

Assume that the activation function σ is Lipschitz continuous with the Lipschitz constant

C_{σ}

. Consider a class of functions generated by a deep neural network

F_{d, 1} (L, N, S, B) : = \{N_{σ} (\cdot | θ) : θ \in Θ_{d, 1} (L, N, S, B)\} .

For any

δ > 0

,

log N (δ, F_{d, 1} (L, N, S, B), {∥ \cdot ∥}_{\infty}) \leq 2 L (S + 1) log (δ^{- 1} C_{σ} L (N + 1) (B \lor 1)),

(7)

where

B \lor 1 : = \max \{B, 1\}

.
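The right-hand side of (7) is easy to evaluate, which helps to see how each architecture parameter enters the bound. The following Python sketch is a direct transcription of that expression; the function name `log_covering_bound` and the example values are hypothetical.

```python
import math


def log_covering_bound(delta, L, N, S, B, C_sigma=1.0):
    """Right-hand side of (7): 2 L (S + 1) log(C_sigma L (N + 1) max(B, 1) / delta)."""
    return 2.0 * L * (S + 1) * math.log(C_sigma * L * (N + 1) * max(B, 1.0) / delta)


# Example values chosen only for illustration.
print(log_covering_bound(0.1, L=2, N=3, S=10, B=1.0))
# Doubling the Lipschitz constant adds only 2 L (S + 1) log 2 to the bound.
print(log_covering_bound(0.1, L=2, N=3, S=10, B=1.0, C_sigma=2.0))
```

The bound is linear in the depth and sparsity outside the logarithm, while the width, the magnitude and the Lipschitz constant of the activation function only appear inside the logarithm.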

The result in Proposition 1 is very similar to the existing results in literature, e.g., Theorem 14.5 of [29], Lemma 5 of [10] and Lemma 3 of [13]. We employ similar techniques used in [10,13,29] to obtain the version presented here. We give the proof of this proposition in Appendix B.

All of the activation functions considered in Section 3 except RePU satisfy the Lipschitz condition (6), and hence Proposition 1 can be applied. An interesting implication of Proposition 1 is that the complexity of the function class generated by deep neural networks depends on the choice of an activation function only through its Lipschitz constant, which enters the bound (7) inside the logarithm. Hence, the remaining step in deriving the convergence rate of a neural network estimator is to show that the approximation accuracy with various activation functions is the same as that of the ReLU neural network.

5.1. Application to Regression

First we consider the regression problem. For simplicity, we let

X = {[0, 1]}^{d} .

Suppose that the data generating model is

Y | X = x \sim N (f_{0} (x), 1)

for some

f_{0} : {[0, 1]}^{d} \to R .

The performance of an estimator is measured by the

L_{2}

risk

R_{2, f_{0}} (f)

, which is defined by

R_{2, f_{0}} (f) : = E_{f_{0}, P_{x}} {(Y - f (X))}^{2} : = E_{Y | X \sim N (f_{0} (X), 1), X \sim P_{x}} {(Y - f (X))}^{2},

where

P_{x}

is the marginal distribution of

X .

The following theorem proves that the optimal convergence rate is obtained by the deep neural network estimator of the regression function

f_{0}

for a general activation function.

Theorem 2.

Suppose that the activation function σ is either piecewise linear or locally quadratic satisfying the Lipschitz condition (6). Then there are universal positive constants

L_{0}

,

N_{0}

,

S_{0}

and

B_{0}

such that the deep neural network estimator obtained by

{\hat{f}}_{n} \in \underset{f \in F_{σ, n}}{argmin} \sum_{i = 1}^{n} {(y_{i} - f (x_{i}))}^{2},

with

\begin{matrix} F_{σ, n} : = \{N_{σ} (\cdot | θ) : {∥ N_{σ} (\cdot | θ) ∥}_{\infty} \leq 2 R, θ \in Θ_{d, 1} (L_{0} log n, N_{0} n^{\frac{d}{2 α + d}}, S_{0} n^{\frac{d}{2 α + d}} log n, B_{0} n^{κ})\} \end{matrix}

for some

κ > 0

satisfies

sup_{f_{0} \in H^{α, R} ({[0, 1]}^{d})} E [R_{2, f_{0}} ({\hat{f}}_{n}) - inf_{f \in F} R_{2, f_{0}} (f)] \leq C n^{- \frac{2 α}{2 α + d}} {log}^{3} n,

for some universal constant

C > 0

, where the expectation is taken over the training data.
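The sizes of the sieve in Theorem 2 can be read off from Theorem 1 by a heuristic balancing argument, sketched below; this is an informal reading and not the proof. Up to logarithmic factors, the squared approximation error is of order ϵ², while by Proposition 1 the estimation error is of order ϵ^{-d/α} / n, since the sparsity is of order ϵ^{-d/α}. Equating the two gives:

```latex
\epsilon^{2} \asymp \frac{\epsilon^{-d/\alpha}}{n}
\quad \Longleftrightarrow \quad
\epsilon \asymp n^{-\frac{\alpha}{2\alpha + d}},

\epsilon^{-d/\alpha} \asymp n^{\frac{d}{2\alpha + d}}, \qquad
\log (1/\epsilon) \asymp \log n, \qquad
\epsilon^{2} \asymp n^{-\frac{2\alpha}{2\alpha + d}} .
```

With this choice of ϵ, the depth L_0 log(1/ϵ), width N_0 ϵ^{-d/α} and sparsity S_0 ϵ^{-d/α} log(1/ϵ) of Theorem 1 become the depth, width and sparsity of the sieve in Theorem 2, and ϵ² matches the polynomial part of the stated rate.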

5.2. Application to Binary Classification

The aim of the binary classification is to find a classifier that predicts the label

y \in \{- 1, 1\}

for any input

x \in {[0, 1]}^{d}

. A usual assumption on the data generating process is that

Y | X = x \sim 2 Bern (η (x)) - 1

for some

η : {[0, 1]}^{d} \to [0, 1]

, where

Bern (p)

denotes the Bernoulli distribution with parameter p. Note that

η (x)

is the conditional probability function

P_{0} (Y = 1 | X = x) .

A common approach is, instead of finding a classifier directly, to construct a real valued function f, a so-called classification function, and predict the label y based on the sign of

f (x)

. The performance of a classification function is measured by the misclassification error

R_{01, η} (f)

, which is defined by

R_{01, η} (f) : = E_{η, P_{x}} 𝟙 (Y f (X) < 0) : = E_{Y | X \sim 2 Bern (η (X)) - 1, X \sim P_{x}} 𝟙 (Y f (X) < 0) .

It is well known that the convergence rate of the excess risk for classification is faster than that of regression when the conditional probability function

η (x)

satisfies the following condition: there is a constant

q \in [0, \infty]

such that for any sufficiently small

u > 0

, we have

P_{x} (| η (X) - 1 / 2 | < u) \leq u^{q} .

(8)

This condition is called the Tsybakov noise condition and q is called the noise exponent [30,31]. When q is larger, the classification task is easier since the probability of generating vague samples becomes smaller. The following theorem proves that the optimal convergence rate can be obtained by the deep neural network estimator with an activation function considered in Section 3. As is done in [28], we consider the hinge loss

ℓ_{hinge} (z) : = \max \{1 - z, 0\} .

Theorem 3.

Assume the Tsybakov noise condition (8) with the noise exponent

q \in [0, \infty]

. Suppose that the activation function σ, which is either piecewise linear or locally quadratic satisfying the Lipschitz condition (6), is used for all hidden layers except the last one and the ReLU activation function is used for the last hidden layer. Then there are universal positive constants

L_{0}

,

N_{0}

,

S_{0}

and

B_{0}

such that the deep neural network estimator obtained by

{\hat{f}}_{n} \in \underset{f \in F_{σ, n}}{argmin} \sum_{i = 1}^{n} ℓ_{hinge} (y_{i} f (x_{i})),

with

\begin{matrix} F_{σ, n} : = \{N_{σ} (\cdot | θ) : {∥ N_{σ} (\cdot | θ) ∥}_{\infty} \leq 1, θ \in Θ_{d, 1} (L_{0} log n, N_{0} n^{ν} {log}^{- 3 ν} n, S_{0} n^{ν} {log}^{- 3 ν + 1} n, B_{0} n^{κ})\}, \end{matrix}

for

ν : = d / (α (q + 2) + d)

and some

κ > 0

satisfies

sup_{η \in H^{α, R} ({[0, 1]}^{d})} E [R_{01, η} ({\hat{f}}_{n}) - inf_{f \in F} R_{01, η} (f)] \leq C {(\frac{{log}^{3} n}{n})}^{\frac{α (q + 1)}{α (q + 2) + d}},

for some universal constant

C > 0

, where the expectation is taken over the training data.

Note that the Bayes classifier

f^{*} : = {argmin}_{f \in F} R_{01, η} (f)

is given by

f^{*} (x) = 2 𝟙 (2 η (x) - 1 \geq 0) - 1,

which is an indicator function. Since a neural network with the ReLU activation function can approximate indicator functions well [14,15,28], we use the ReLU activation function in the last layer in order to approximate the Bayes classifier more precisely and thus to achieve the optimal convergence rate.

6. Conclusions

In this study, we established upper bounds on the depth, width and sparsity that deep neural networks require to approximate any Hölder continuous function for general classes of activation functions. These classes include most of the popularly used activation functions. The derived upper bounds of the depth, width and sparsity are optimal in the sense that they are equivalent to the lower bounds up to logarithmic factors. We used this generalization of the approximation error analysis to extend the statistical optimality of the deep neural network estimator in regression and classification problems to activation functions other than ReLU.

Our construction of neural networks for approximation reveals that piecewise linear activation functions are more efficient in approximating local basis functions, while locally quadratic activation functions are more efficient in approximating polynomials. Hence, if the activation function has both a piecewise linear region and a locally quadratic region, we could have a better approximation result. We leave the development of such activation functions as future work.

Author Contributions

Conceptualization, Y.K.; methodology, I.O. and Y.K.; investigation, I.O.; writing—original draft preparation, I.O.; writing—review and editing, Y.K.; funding acquisition, Y.K.

Funding

This work was supported by Samsung Science and Technology Foundation under Project Number SSTF-BA1601-02.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorem 1

Appendix A.1. Proof of Theorem 1 for Piecewise Linear Activation Functions

The main idea of the proof is that any deep neural network with the ReLU activation function can be exactly reconstructed by a neural network with a continuous piecewise linear activation function. This is shown in the next lemma, which is a slight modification of Proposition 1 (b) of [9].

Lemma A1.

Let σ be any continuous piecewise linear activation function, and ρ be the ReLU activation function. Let

θ \in Θ_{d, 1} (L, N, S, B)

. Then there exists

θ^{*} \in Θ_{d, 1} (L, 2 N, 4 S + 2 L N + 1, C_{1} B)

such that

sup_{x \in {[0, 1]}^{d}} | N_{σ} (x | θ^{*}) - N_{ρ} (x | θ) | = 0,

where

C_{1} > 0

is a constant depending on the activation function σ.

Proof.

Let a be any break point of σ. Note that

σ^{'} (a -) \neq σ^{'} (a +)

. Let

r_{0}

be the distance between a and the closest other break point. Then σ is linear on

[a - r_{0}, a]

and

[a, a + r_{0}]

. Then for any

r > 0

, the ReLU activation function

ρ (x) : = {(x)}_{+}

is expressed as

\begin{matrix} ρ (x) & = \frac{σ (a + \frac{r_{0}}{2 r} x) - σ (a - \frac{r_{0}}{2} + \frac{r_{0}}{2 r} x) - σ (a) + σ (a - \frac{r_{0}}{2})}{(σ^{'} (a +) - σ^{'} (a -)) \frac{r_{0}}{2 r}} \\ = : u_{1} σ (a + \frac{r_{0}}{2 r} x) + u_{2} σ (a - \frac{r_{0}}{2} + \frac{r_{0}}{2 r} x) + v \end{matrix}

(A1)

for any

x \in [- r, r]

, where we define

u_{1} : = 1 / ((σ^{'} (a +) - σ^{'} (a -)) \frac{r_{0}}{2 r})

,

u_{2} : = - 1 / ((σ^{'} (a +) - σ^{'} (a -)) \frac{r_{0}}{2 r})

and

v : = (- σ (a) + σ (a - r_{0} / 2)) / ((σ^{'} (a +) - σ^{'} (a -)) \frac{r_{0}}{2 r})

.
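Identity (A1) can be checked numerically. The sketch below is only an illustration under stated assumptions: it picks the leaky ReLU with slope 0.1 as the piecewise linear σ, takes the break point a = 0, and, since this σ has no second break point, uses an arbitrary r_0 = 1. The function names are hypothetical and not part of the paper.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Continuous piecewise linear activation with a single break point at 0."""
    return np.where(x >= 0.0, x, slope * x)

def relu_via_sigma(x, sigma, a, r0, r, d_plus, d_minus):
    """Evaluate the right-hand side of (A1) for |x| <= r.

    d_plus and d_minus are the right and left derivatives of sigma at a.
    """
    scale = r0 / (2.0 * r)
    numerator = (sigma(a + scale * x) - sigma(a - r0 / 2.0 + scale * x)
                 - sigma(a) + sigma(a - r0 / 2.0))
    return numerator / ((d_plus - d_minus) * scale)

r = 5.0
x = np.linspace(-r, r, 1001)
# Assumption: r0 = 1 is arbitrary because leaky ReLU has only one break point.
rebuilt = relu_via_sigma(x, leaky_relu, a=0.0, r0=1.0, r=r, d_plus=1.0, d_minus=0.1)
max_err = float(np.max(np.abs(rebuilt - np.maximum(x, 0.0))))
print(max_err)
```

Up to floating-point rounding, the reconstruction agrees with the ReLU on the whole interval [-r, r], which is the exactness claimed in Lemma A1.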

Let

θ \equiv ((W_{1}, b_{1}), \dots, (W_{L + 1}, b_{L + 1})) \in Θ_{d, 1} (L, N, S, B)

be given. Since both input

x \in {[0, 1]}^{d}

and the network parameter

θ

are bounded, we can take a sufficiently large r so that Equation (A1) holds for any hidden nodes of the network

θ

. We consider the deep neural network

θ^{*} \equiv ((W_{1}^{*}, b_{1}^{*}), \dots, (W_{L + 1}^{*}, b_{L + 1}^{*})) \in Θ_{d, 1} (L, 2 N),

where we set

\begin{matrix} W_{l}^{*} & : = \frac{r_{0}}{2 r} (\begin{matrix} u_{1} W_{l} & u_{2} W_{l} \\ u_{1} W_{l} & u_{2} W_{l} \end{matrix}) \in R^{2 n_{l} \times 2 n_{l - 1}}, \\ b_{l}^{*} & : = (\begin{matrix} a 1_{n_{l}} + \frac{r_{0}}{2 r} (v W_{l} 1_{n_{l - 1}} + b_{l}) \\ (a - \frac{r_{0}}{2}) 1_{n_{l}} + \frac{r_{0}}{2 r} (v W_{l} 1_{n_{l - 1}} + b_{l}) \end{matrix}) \in R^{2 n_{l}}, \end{matrix}

for

l = 1, \dots, L

and

W_{L + 1}^{*} : = (\begin{matrix} u_{1} W_{L + 1} & u_{2} W_{L + 1} \end{matrix}), b_{L + 1}^{*} : = v .

Here,

1_{n}

denotes the n-dimensional vector of ones

. Then by Equation (A1) and some algebra, we have that

N_{σ} (x | θ^{*}) = N_{ρ} (x | θ)

for any

x \in {[0, 1]}^{d}

. For the sparsity of

θ^{*}

, we note that

| vec (W_{l}^{*}) |_{0} + | b_{l}^{*} |_{0} \leq 4 {| vec (W_{l}) |}_{0} + 2 n_{l}

which implies that

| θ^{*} |_{0} \leq 4 {| θ |}_{0} + 2 L (θ) n_{max} (θ) + 1

. □

Thanks to Lemma A1, to prove Theorem 1 for piecewise linear activation functions, it suffices to show the approximation ability of ReLU networks, which was established in [10] and is restated in the next lemma.

Lemma A2

(Theorem 5 of [10]). Let ρ be the ReLU activation function. For any

f \in H^{α, R} ({[0, 1]}^{d})

and any integers

m \geq 1

and

M \geq max \{{(α + 1)}^{d}, (R + 1) e^{d}\}

, there exists a network parameter

θ \in Θ_{d, 1} (L, N, S, 1)

such that

sup_{x \in {[0, 1]}^{d}} | N_{ρ} (x | θ) - f (x) | \leq (2 R + 1) (1 + d^{2} + α^{2}) 6^{d} M 2^{- m} + R 3^{α} M^{- α / d},

(A2)

where

L = 8 + (m + 5) (1 + ⌈ {log}_{2} (d \lor α) ⌉)

,

N = 6 (d + ⌈α⌉) M

, and

S = 141 {(d + α + 1)}^{3 + d} M (m + 6)

.

Theorem 1 for piecewise linear activation functions is a direct consequence of Lemmas A1 and A2, which is summarized as follows.

Proof of Theorem 1 for piecewise linear activation functions.

Let ρ be the ReLU activation function. By letting

M = 3^{d} {(2 R)}^{d / α} ϵ^{- d / α}

and

m = {log}_{2} (2 (2 R + 1) (1 + d^{2} + α^{2}) 18^{d} {(2 R)}^{d / α} ϵ^{- d / α - 1}),

Lemma A2 implies that there exists a network parameter

θ^{'}

such that

{sup}_{x \in {[0, 1]}^{d}} | N_{ρ} (x | θ^{'}) - f (x) | \leq ϵ

with

L (θ^{'}) \leq L_{0}^{'} log (1 / ϵ)

,

n_{max} (θ^{'}) \leq N_{0}^{'} ϵ^{- d / α}

and

| θ^{'} |_{0} \leq S_{0}^{'} ϵ^{- d / α} log (1 / ϵ)

for some positive constants

L_{0}^{'}

,

N_{0}^{'}

, and

S_{0}^{'}

depending only on α, d and R. Hence by Lemma A1, there is a network parameter

θ

producing the same output of the ReLU neural network

N_{ρ} (\cdot | θ^{'})

with

L (θ) = L (θ^{'})

,

n_{max} (θ) = 2 n_{max} (θ^{'})

,

{| θ |}_{0} \leq 4 {| θ^{'} |}_{0} + 2 L (θ^{'}) n_{max} (θ^{'}) + 1 \leq S_{0} ϵ^{- d / α} log (1 / ϵ)

and

{| θ |}_{\infty} \leq B_{0} {| θ^{'} |}_{\infty}

for some

S_{0} > 0

depending only on α, d, R and σ, and some

B_{0} > 0

depending only on σ, which completes the proof. □
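The choice of M and m above can be sanity-checked by plugging it into the bound (A2). The following is a minimal sketch, assuming illustrative values d = 2, α = 1.5, R = 1 and ϵ = 0.01 that do not come from the paper; it also ignores the rounding of M and m to integers. The function name is hypothetical.

```python
import math

def relu_network_budget(eps, d, alpha, R):
    """Return M, m and the two terms of the bound (A2) for the stated choice."""
    M = 3 ** d * (2 * R) ** (d / alpha) * eps ** (-d / alpha)
    m = math.log2(2 * (2 * R + 1) * (1 + d ** 2 + alpha ** 2)
                  * 18 ** d * (2 * R) ** (d / alpha) * eps ** (-d / alpha - 1))
    # First term of (A2): (2R + 1)(1 + d^2 + alpha^2) 6^d M 2^{-m}
    term_depth = (2 * R + 1) * (1 + d ** 2 + alpha ** 2) * 6 ** d * M * 2.0 ** (-m)
    # Second term of (A2): R 3^alpha M^{-alpha / d}
    term_width = R * 3 ** alpha * M ** (-alpha / d)
    return M, m, term_depth, term_width

M, m, term_depth, term_width = relu_network_budget(eps=0.01, d=2, alpha=1.5, R=1.0)
total = term_depth + term_width
print(M, m, total)
```

Each of the two terms contributes ϵ/2, so the total error is ϵ, while m grows like log(1/ϵ) and M like ϵ^{-d/α}, matching the stated depth and width orders.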

Appendix A.2. Proof of Theorem 1 for Locally Quadratic Activation Functions

Lemma A3.

Assume that an activation function σ is locally quadratic. There is a constant

K_{0}

depending only on the activation function such that for any

K > K_{0}

the following results hold.

(a): There is a neural network $θ_{2} \in Θ_{1, 1} (1, 3)$ with $| θ_{2} |_{\infty} \leq K^{2}$ such that

$sup_{x \in [- 1, 1]} | N_{σ} (x | θ_{2}) - x^{2} | \leq \frac{C_{1}}{K},$

where $C_{1} > 0$ is a constant depending only on σ.
(b): Let $A > 0$ . There is a neural network parameter $θ_{\times, A} \in Θ_{2, 1} (1, 9)$ with $| θ_{\times, A} |_{\infty} \leq max \{K^{2}, 2 A^{2}\}$ such that

$sup_{x \in {[- A, A]}^{2}} | N_{σ} (x | θ_{\times, A}) - x_{1} x_{2} | \leq \frac{6 A^{2} C_{1}}{K} .$
(c): Let α be a positive integer. For any multi-index $m \in N_{0}^{d}$ with $| m | \leq α$ , there is a network parameter $θ_{m} \in Θ_{d, 1} (⌈{log}_{2} α⌉, 9 α)$ with $| θ_{m} |_{\infty} \leq max \{K^{2}, C_{2}\}$ such that

$sup_{x \in {[0, 1]}^{d}} | N_{σ} (x | θ_{m}) - x^{m} | \leq \frac{C_{3}}{K},$

for some positive constants $C_{2}$ and $C_{3}$ depending only on σ and α.
(d): There is a network parameter $θ_{1 / 2} \in Θ_{1, 1} (⌈log K⌉, 15)$ with $| θ_{1 / 2} |_{\infty} \leq max \{K^{2}, C_{4}\}$ such that

$sup_{x \in [0, 2]} | N_{σ} (x | θ_{1 / 2}) - \sqrt{x} | \leq C_{5} \frac{log K}{K}$

for some positive constants $C_{4}$ and $C_{5}$ depending only on σ.
(e): There is a network parameter $θ_{a b s} \in Θ_{1, 1} (⌈log K⌉, 15)$ with $| θ_{a b s} |_{\infty} \leq max \{K^{2}, C_{6}\}$ such that

$sup_{x \in [- 1, 1]} | N_{σ} (x | θ_{abs}) - | x | | \leq \frac{C_{7}}{\sqrt{K}},$

for some positive constants $C_{6}$ and $C_{7}$ depending only on σ.

Proof.

Recall that there is an interval

(a, b)

on which

σ (x)

is three times continuously differentiable with bounded derivatives and there is

t \in (a, b)

such that

σ^{'} (t) \neq 0

and

σ^{″} (t) \neq 0

.

Proof of (a). Take K large so that

2 / K < min \{| t - b |, | t - a |\}

. Consider a neural network

N_{σ} (x | θ_{2}) : = \sum_{k = 0}^{2} {(- 1)}^{k} \frac{K^{2}}{σ^{″} (t)} (\binom{2}{k}) σ (\frac{k}{K} x + t) .

(A3)

Since σ is three times continuously differentiable on

(a, b)

and

k x / K + t \in (a, b)

for k = 0, 1, 2 if

x \in [- 1, 1]

, it can be expanded in the Taylor series with Lagrange remainder around t to have

\begin{matrix} N_{σ} (x | θ_{2}) & = \frac{K^{2}}{σ^{″} (t)} \sum_{k = 0}^{2} {(- 1)}^{k} (\binom{2}{k}) \{σ (t) + σ^{'} (t) \frac{k x}{K} + \frac{σ^{″} (t)}{2} \frac{{(k x)}^{2}}{K^{2}} + \frac{σ^{‴} (ξ_{k})}{6} \frac{{(k x)}^{3}}{K^{3}}\} \\ = \frac{K^{2}}{σ^{″} (t)} \{σ^{″} (t) \frac{x^{2}}{K^{2}} + \sum_{k = 1}^{2} {(- 1)}^{k} (\binom{2}{k}) \frac{σ^{‴} (ξ_{k})}{6} \frac{{(k x)}^{3}}{K^{3}}\} \\ = x^{2} + \frac{x^{3}}{6 K σ^{″} (t)} \sum_{k = 1}^{2} {(- 1)}^{k} k^{3} (\binom{2}{k}) σ^{‴} (ξ_{k}), \end{matrix}

where

ξ_{k} \in [t - k | x | / K, t + k | x | / K] \subset (a, b) .

Since the third order derivative is bounded on

(a, b)

, we get the desired assertion by retaking

K \leftarrow \sqrt{2 / σ^{″} (t)} K

.
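The Taylor argument can be illustrated numerically. The sketch below is not the authors' code: it assumes the softplus activation and the expansion point t = 1 (where the first and second derivatives are both nonzero), and it combines the three hidden nodes σ(kx/K + t), k = 0, 1, 2, with the second-difference weights (1, -2, 1) that appear in the expansion above. All function names are hypothetical.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)  # log(1 + exp(x)), computed stably

def softplus_second_derivative(t):
    s = 1.0 / (1.0 + np.exp(-t))  # sigmoid(t) is the first derivative of softplus
    return s * (1.0 - s)

def square_net(x, K, t=1.0):
    """One hidden layer with three nodes, rescaled by K^2 / sigma''(t)."""
    weights = np.array([1.0, -2.0, 1.0])  # (-1)^k * binom(2, k), k = 0, 1, 2
    hidden = np.stack([softplus(k * x / K + t) for k in range(3)])
    return (K ** 2 / softplus_second_derivative(t)) * (weights @ hidden)

x = np.linspace(-1.0, 1.0, 401)
err = {K: float(np.max(np.abs(square_net(x, K) - x ** 2))) for K in (100, 1000)}
print(err)
```

In this setting the sup-norm error shrinks roughly tenfold when K grows tenfold, in line with the C_1 / K rate of part (a).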

Proof of (b). The proof can be done straightforwardly by the polarization type identity:

x_{1} x_{2} = 2 A^{2} \{{(\frac{x_{1} + x_{2}}{2 A})}^{2} - {(\frac{x_{1}}{2 A})}^{2} - {(\frac{x_{2}}{2 A})}^{2}\} .

We construct the network as

N_{σ} (x | θ_{\times, A}) : = 2 A^{2} \{N_{σ} (\frac{x_{1} + x_{2}}{2 A} | θ_{2}) - N_{σ} (\frac{x_{1}}{2 A} | θ_{2}) - N_{σ} (\frac{x_{2}}{2 A} | θ_{2})\},

(A4)

where

θ_{2}

is defined in (A3). Since

(x_{1} + x_{2}) / 2 A, x_{1} / 2 A, x_{2} / 2 A \in [- 1, 1]

for

x \in {[- A, A]}^{2}

, the triangle inequality implies that

| N_{σ} (x | θ_{\times, A}) - x_{1} x_{2} | \leq 6 A^{2} C_{1} / K

.
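As a rough illustration of construction (A4), the sketch below builds the product approximation from three copies of a squaring network. It assumes the softplus activation with expansion point t = 1 and the illustrative values A = 2 and K = 1000; these choices and the function names are assumptions, not taken from the paper.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def square_net(x, K, t=1.0):
    """Approximate x^2 on [-1, 1] by a rescaled second difference of softplus."""
    s = 1.0 / (1.0 + np.exp(-t))
    second_derivative = s * (1.0 - s)
    second_difference = (softplus(t) - 2.0 * softplus(x / K + t)
                         + softplus(2.0 * x / K + t))
    return K ** 2 * second_difference / second_derivative

def product_net(x1, x2, A, K):
    """Polarization: x1 x2 = 2A^2 {((x1+x2)/2A)^2 - (x1/2A)^2 - (x2/2A)^2}."""
    return 2.0 * A ** 2 * (square_net((x1 + x2) / (2.0 * A), K)
                           - square_net(x1 / (2.0 * A), K)
                           - square_net(x2 / (2.0 * A), K))

A, K = 2.0, 1000
grid = np.linspace(-A, A, 81)
x1, x2 = np.meshgrid(grid, grid)
prod_err = float(np.max(np.abs(product_net(x1, x2, A, K) - x1 * x2)))
print(prod_err)
```

The three squaring errors add up through the triangle inequality, which is where the factor 6 A^2 in the bound comes from.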

Proof of (c). Let

q : = ⌈{log}_{2} α⌉

. We construct

θ_{m}

as follows. Fix

x \equiv (x_{1}, \dots, x_{d}) \in {[0, 1]}^{d}

. We first consider the affine map that transforms

(x_{1}, \dots, x_{d})

to

z \in {[0, 1]}^{2^{q}}

which is given by

z : = (\underset{m_{1} times}{\underset{︸}{x_{1}, \dots, x_{1}}}, \underset{m_{2} times}{\underset{︸}{x_{2}, \dots, x_{2}}}, \dots, \underset{m_{d} times}{\underset{︸}{x_{d}, \dots, x_{d}}}, \underset{2^{q} - | m | times}{\underset{︸}{1, \dots, 1}}) .

The first hidden layer of

θ_{m}

pairs neighboring entries in

z

and applies the network

θ_{\times, A_{1}}

defined in (b) with

A_{1} = 1

to each pair. That is, the first hidden layer of

θ_{m}

produces

\{g_{1, j} : = N_{σ} ((z_{2 j - 1}, z_{2 j}) | θ_{\times, 1}) : j = 1, \dots, 2^{q - 1}\} .

Note that

{sup}_{1 \leq j \leq 2^{q - 1}} | g_{1, j} - z_{2 j - 1} z_{2 j} | \leq 6 C_{1} / K

and

{sup}_{1 \leq j \leq 2^{q - 1}} | g_{1, j} | \leq 6 C_{1} / K + 1

, where

6 C_{1} / K + 1

can be bounded by some constant

A_{2} > 1

depending only on

C_{1}

and

K_{0}

. Then the second hidden layer of

θ_{m}

pairs neighboring entries of

\{g_{1, j} : j = 1, \dots, 2^{q - 1}\}

and applies

θ_{\times, A_{2}}

to each pair to have

\{g_{2, j} : = N_{σ} ((g_{1, 2 j - 1}, g_{1, 2 j}) | θ_{\times, A_{2}}) : j = 1, \dots, 2^{q - 2}\} .

Note that

{sup}_{1 \leq j \leq 2^{q - 2}} | g_{2, j} - g_{1, 2 j - 1} g_{1, 2 j} | \leq 6 C_{1} A_{2}^{2} / K

and

{sup}_{1 \leq j \leq 2^{q - 2}} | g_{2, j} | \leq 6 C_{1} A_{2}^{2} / K + 1 \leq A_{3}

for some

A_{3} > 1

depending only on

C_{1}

and

K_{0}

. We repeat this procedure to produce

\{g_{k, j} : j = 1, \dots, 2^{q - k}\}

for

k = 3, \dots, q

with

sup_{1 \leq j \leq 2^{q - k}} | g_{k, j} - g_{k - 1, 2 j - 1} g_{k - 1, 2 j} | \leq \frac{6 C_{1} A_{k}^{2}}{K}, sup_{1 \leq j \leq 2^{q - k}} | g_{k, j} | \leq A_{k + 1},

for some

A_{k + 1} > 1

, and we set

N_{σ} (x | θ_{m})

equal to

g_{q, 1}

.

By applying the triangle inequality repeatedly, we have

\begin{matrix} | g_{q, 1} - x^{m} | & \leq | g_{q, 1} - g_{q - 1, 1} g_{q - 1, 2} | + |g_{q - 1, 1} - \prod_{j = 1}^{2^{q - 1}} z_{j}| |g_{q - 1, 2}| + |g_{q - 1, 2} - \prod_{j = 2^{q - 1} + 1}^{2^{q}} z_{j}| |\prod_{j = 1}^{2^{q - 1}} z_{j}| \\ \leq \frac{6 C_{1} A_{q}^{2}}{K} + A_{q} |g_{q - 1, 1} - \prod_{j = 1}^{2^{q - 1}} z_{j}| + |g_{q - 1, 2} - \prod_{j = 2^{q - 1} + 1}^{2^{q}} z_{j}| \\ \leq \frac{6 C_{1} A_{q}^{2}}{K} + (A_{q} + 1) \frac{6 C_{1} A_{q - 1}^{2}}{K} + A_{q} A_{q - 1} |g_{q - 2, 1} - \prod_{j = 1}^{2^{q - 2}} z_{j}| + A_{q} |g_{q - 2, 2} - \prod_{j = 2^{q - 2} + 1}^{2 \times 2^{q - 2}} z_{j}| \\ + A_{q - 1} |g_{q - 2, 3} - \prod_{j = 2 \times 2^{q - 2} + 1}^{3 \times 2^{q - 2}} z_{j}| + |g_{q - 2, 4} - \prod_{j = 3 \times 2^{q - 2} + 1}^{4 \times 2^{q - 2}} z_{j}| \\ \leq \dots \leq \sum_{k = 0}^{q - 1} \{A_{q - k}^{2} \prod_{h = q - k + 1}^{q} (A_{h} + 1)\} \frac{6 C_{1}}{K} \leq C_{1}^{'} \frac{1}{K}, \end{matrix}

for some

C_{1}^{'} > 0

depending only on

C_{1}

,

K_{0}

and q. Since we set

x

arbitrary, the proof is done.
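The pairing scheme in the proof of (c) can be sketched as follows. The argument `mul` is a hypothetical stand-in for the product network of part (b); here it is replaced either by exact multiplication or by multiplication with a fixed additive error of 1e-3, which is an assumption made purely for illustration.

```python
def monomial_by_pairing(x, m, alpha, mul):
    """Approximate x^m with q = ceil(log2(alpha)) rounds of pairwise products."""
    assert sum(m) <= alpha
    q = (alpha - 1).bit_length()  # equals ceil(log2(alpha)) for integers alpha >= 1
    # Repeat x_j exactly m_j times, then pad with ones up to length 2^q.
    z = [x[j] for j in range(len(m)) for _ in range(m[j])]
    z += [1.0] * (2 ** q - len(z))
    for _ in range(q):  # each round multiplies neighboring entries and halves the list
        z = [mul(z[2 * j], z[2 * j + 1]) for j in range(len(z) // 2)]
    return z[0]

x = (0.5, 0.8, 0.3)
m = (2, 1, 0)  # the monomial x_1^2 x_2
exact_val = monomial_by_pairing(x, m, alpha=3, mul=lambda u, v: u * v)
noisy_val = monomial_by_pairing(x, m, alpha=3, mul=lambda u, v: u * v + 1e-3)
print(exact_val, noisy_val)
```

With exact multiplication the tree returns x^m; with a perturbed multiplier the error stays within a constant multiple of the per-node error, mirroring the repeated triangle inequality above.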

Proof of (d). By (b), it is easy to verify that there is a network

θ_{1} \in Θ_{1, 1} (1, 6)

with

| θ_{1} |_{\infty} \leq max \{K^{2}, 2\}

such that

| N_{σ} (x | θ_{1}) - x | \leq C_{1}^{'} / K

for any

x \in [- 1, 1]

and some constant

C_{1}^{'} > 0

. The Taylor series with Lagrange remainder around 1 of

\sqrt{x}

is given by

\sqrt{x} = \sum_{k = 0}^{J} \frac{{(x - 1)}^{k}}{k!} + \frac{1}{(J + 1)!} \frac{d^{J + 1} \sqrt{x}}{d x^{J + 1}} |_{x = ξ} {(x - 1)}^{J + 1},

where

ξ \in [0, 2],

and thus

sup_{x \in [0, 2]} | \sqrt{x} - \sum_{k = 0}^{J} \frac{{(x - 1)}^{k}}{k!} | \leq C_{1}^{'} \frac{1}{(J + 1)!} \leq e {(\frac{e}{J + 1})}^{J + 1},

for some

C_{1}^{'} > 0,

where the last inequality is because

n! \geq {(n / e)}^{n} e

.

Now, we will construct a neural network

θ_{p, J}

that approximates the polynomial

\sum_{k = 0}^{J} \frac{{(x - 1)}^{k}}{k!}

as follows. The first hidden layer computes

(N_{σ} (x - 1 | θ_{2}) / 2, N_{σ} (x - 1 | θ_{1}))

from the input x. Then

| (N_{σ} (x - 1 | θ_{2}) / 2, N_{σ} (x - 1 | θ_{1})) - ({(x - 1)}^{2} / 2, (x - 1)) |_{\infty} \leq C_{2}^{'} \frac{1}{K},

for any

x \in [0, 2]

and some constant

C_{2}^{'} > 0

. The next hidden layer computes

(N_{σ} ((u, v) | θ_{\times, 1 + C_{2}^{'} / K}) / 3, N_{σ} (u + v | θ_{1}))

from the input

(u, v)

from the first hidden layer. Using the triangle inequality, we have that the second hidden layer approximates the vector

({(x - 1)}^{3} / 3!, {(x - 1)}^{2} / 2 + (x - 1))

by error

\leq 2 C_{3}^{'} / K

for some

C_{3}^{'} > 0

. Repeating this procedure, we construct the network

θ_{p, J} \in Θ_{1, 1} (J, 15)

which approximates

\sum_{k = 0}^{J} \frac{{(x - 1)}^{k}}{k!}

by error

\leq C_{4}^{'} J / K

for some

C_{4}^{'} > 0

. Taking

J = ⌈log K⌉

, we observe that

{(e / (J + 1))}^{J + 1} \leq {(e / log K)}^{log K + 1} \leq e K / {(log K)}^{log K} \leq 1 / K

for all sufficiently large K, which implies the desired result.

Proof of (e). Let

ζ \in (0, 1)

. Since for any

x \in R

,

\sqrt{x^{2} + ζ^{2}} - | x | \leq \frac{ζ^{2}}{\sqrt{x^{2} + ζ^{2}} + | x |} \leq \frac{ζ^{2}}{ζ} = ζ,

the function

\sqrt{x^{2} + ζ^{2}}

approximates the absolute value function

| x |

by error ζ. For

θ_{2}

in (a) and

θ_{1 / 2}

in (d), we have that

\begin{matrix} | N_{σ} (N_{σ} (x | θ_{2}) + ζ^{2} | θ_{1 / 2}) - | x | | & \leq | N_{σ} (N_{σ} (x | θ_{2}) + ζ^{2} | θ_{1 / 2}) - \sqrt{x^{2} + ζ^{2}} | + ζ \\ \leq | N_{σ} (N_{σ} (x | θ_{2}) + ζ^{2} | θ_{1 / 2}) - \sqrt{N_{σ} (x | θ_{2}) + ζ^{2}} | \\ + | \sqrt{N_{σ} (x | θ_{2}) + ζ^{2}} - \sqrt{x^{2} + ζ^{2}} | + ζ \\ \leq C_{1}^{'} (\frac{log K}{K} + \frac{1}{K ζ}) + ζ \end{matrix}

for some constant

C_{1}^{'} > 0 .

We now set

ζ = 1 / \sqrt{K}

and

N_{σ} (x | θ_{a b s}) : = N_{σ} (N_{σ} (x | θ_{2}) + K^{- 1} | θ_{1 / 2})

. Since

(log K) / K = o (1 / \sqrt{K})

, the proof is done. □
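The smoothing step behind part (e) is easy to verify numerically. The sketch below only checks the elementary inequality that the gap between sqrt(x^2 + ζ^2) and |x| is at most ζ, with ζ = 1/sqrt(K); it does not build the networks themselves, and the function name is hypothetical.

```python
import numpy as np

def smoothed_abs(x, zeta):
    """Smooth surrogate sqrt(x^2 + zeta^2) for the absolute value."""
    return np.sqrt(x ** 2 + zeta ** 2)

x = np.linspace(-1.0, 1.0, 2001)
gaps = {}
for K in (10, 100, 1000):
    zeta = 1.0 / np.sqrt(K)
    # Largest gap between the surrogate and |x| on the grid.
    gaps[K] = float(np.max(smoothed_abs(x, zeta) - np.abs(x)))
print(gaps)
```

The gap is largest at x = 0, where it equals ζ, which is why the final rate in (e) is of order 1/sqrt(K) rather than (log K)/K.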

Proof of Theorem 1 for locally quadratic activation functions.

Recall that

P_{M} (x) = \sum_{z \in G_{d, M}} \sum_{m \in N_{0}^{d} : | m | \leq α} β_{z, m} x^{m} ϕ_{z, M} (x) .

Then by Lemma B.1 of [10],

sup_{x \in {[0, 1]}^{d}} | P_{M} (x) - f (x) | \leq R M^{- α} .

From the equivalent representation of the ReLU function

{(x)}_{+} = (x + | x |) / 2

, we can easily check that the neural network

N_{σ} (x | θ_{r e l u}) : = (N_{σ} (x | θ_{a b s}) + N_{σ} (x | θ_{1})) / 2

with

θ_{r e l u} \in Θ_{1, 1} (⌈log K⌉, 21)

approximates the ReLU function by error

\leq C_{1}^{'} / \sqrt{K}

for some

C_{1}^{'} > 0

, where

θ_{1} \in Θ_{1, 1} (1, 6)

is defined in the proof of (d) of Lemma A3 and

θ_{a b s} \in Θ_{1, 1} (⌈log K⌉, 15)

is defined in (e) of Lemma A3. For

z \in (0, 1)

and

M \in N

, we define

N_{σ} (x | θ_{ϕ, z, M}) : = N_{σ} (1 / M - N_{σ} ((x - z) | θ_{abs}) | θ_{relu}) .

Then it approximates the function

{(1 / M - | x - z |)}_{+}

by error

\leq C_{2}^{'} / \sqrt{K}

for some

C_{2}^{'} > 0 .

In turn, for

z \in G_{d, M}

, by invoking a construction similar to the one used in (c) of Lemma A3 to approximate the product of d components, we can construct the network

θ_{ϕ, z, M} \in Θ_{d, 1} (⌈log K⌉ + ⌈{log}_{2} d⌉, 21 d)

with

| θ_{ϕ, z, M} |_{\infty} \leq C_{3}^{'} K^{2}

for some

C_{3}^{'} > 0

such that

sup_{x \in {[0, 1]}^{d}} | N (x | θ_{ϕ, z, M}) - \prod_{j = 1}^{d} {(\frac{1}{M} - | x_{j} - z_{j} |)}_{+} | \leq C_{4}^{'} \frac{1}{\sqrt{K}},

for some

C_{4}^{'} > 0

. For each

m \in N_{0}^{d}

with

| m | \leq α

, we have the neural network

θ_{m}

in (c) of Lemma A3 that approximates

x^{m}

. The number of these networks is

(\binom{d + α}{α})

, which is denoted by

A_{α}

. Also there are

| G_{d, M} | = {(M + 1)}^{d}

networks

θ_{ϕ, z, M}

for

z \in G_{d, M}

. We need approximation of each product

x^{m} ϕ_{z, M}

, which requires additional

A_{α} {(M + 1)}^{d}

many networks

θ_{\times, A} \in Θ_{2, 1} (1, 9)

, where

θ_{\times, A}

is defined as in (A4) for some

A > 1

not depending on M and K. Finally we construct the output layer which computes the weighted sum of

\{N_{σ} ((N_{σ} (x | θ_{m}), N_{σ} (x | θ_{ϕ, z, M})) | θ_{\times, A}) : m \in N_{0}^{d}, | m | \leq α, z \in G_{d, M}\}

. Letting

θ_{f, K, M}

be the network constructed above, we can check that

sup_{x \in {[0, 1]}^{d}} | N (x | θ_{f, K, M}) - P_{M} (x) | \leq C_{5}^{'} A_{α} {(M + 1)}^{d} (\frac{1}{K} + \frac{1}{\sqrt{K}}) \leq C_{6}^{'} \frac{{(M + 1)}^{d}}{\sqrt{K}},

for some positive constants

C_{5}^{'}

and

C_{6}^{'} .

In addition, we have

L (θ_{f, K, M}) \leq 1 + ⌈log K⌉ + ⌈{log}_{2} (α \lor d)⌉ \leq C_{7}^{'} ⌈log K⌉

and

n_{max} (θ_{f, K, M}) \leq C_{8}^{'} A_{α} {(M + 1)}^{d}

for some positive constants

C_{7}^{'}

and

C_{8}^{'}

. For sparsity of the network, we have

\begin{matrix} | θ_{f, K, M} |_{0} & \leq A_{α} {(M + 1)}^{d} | θ_{\times, A} |_{0} + {(M + 1)}^{d} | θ_{ϕ, z, M} |_{0} + A_{α} {| θ_{m} |}_{0} \\ \leq C_{9}^{'} ⌈log K⌉ {(M + 1)}^{d}, \end{matrix}

for some

C_{9}^{'} > 0 .

Taking

M + 1 = ϵ^{- 1 / α}

and

K = ϵ^{- 2 d / α - 2}

, we have

θ_{f, K, M} \in Θ (L_{0} log (1 / ϵ), N_{0} ϵ^{- d / α}, S_{0} ϵ^{- d / α} log (1 / ϵ), B_{0} ϵ^{- 4 (d / α + 1)}),

so that

{‖ P_{M} - N_{σ} (\cdot | θ_{f, K, M}) ‖}_{\infty} \leq C_{10}^{'} ϵ

for some

C_{10}^{'} > 0 .

Since

{‖ f - P_{M} ‖}_{\infty} \leq R M^{- α} \leq C_{11}^{'} ϵ

for some

C_{11}^{'} > 0,

the proof is done. □
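For the reader's convenience, the arithmetic behind the choice of M and K can be spelled out. This is a short worked check, with C denoting a generic constant that is introduced here only for illustration:

```latex
\frac{(M+1)^{d}}{\sqrt{K}}
  = \epsilon^{-d/\alpha}\,\epsilon^{\,d/\alpha+1}
  = \epsilon,
\qquad
K^{2} = \epsilon^{-4(d/\alpha+1)},
\qquad
\lceil \log K \rceil\,(M+1)^{d}
  \le C\,\epsilon^{-d/\alpha}\log(1/\epsilon).
```

The first identity gives the approximation error, the second gives the bound on the parameter magnitude, and the third gives the sparsity order, since log K = (2d/α + 2) log(1/ϵ).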

Appendix B. Proof of Proposition 1

Proof.

Given a deep neural network

θ = ((W_{1}, b_{1}), \dots, (W_{L + 1}, b_{L + 1})) \in Θ_{d, 1} (L, N, S, B)

, we define

{\overset{ˇ}{N}}_{l, σ, θ} : R^{d} \to R^{n_{l - 1}}

and

{\hat{N}}_{l, σ, θ} : R^{n_{l}} \to R

as

\begin{matrix} {\overset{ˇ}{N}}_{l, σ, θ} (x) & : = σ_{l - 1} \circ A_{l - 1} \circ \dots \circ σ_{1} \circ A_{1} (x), \\ {\hat{N}}_{l, σ, θ} (x) & : = A_{L + 1} \circ σ_{L} \circ A_{L} \circ \dots σ_{l} \circ A_{l} \circ σ_{l - 1} (x), \end{matrix}

for

l \in \{2, \dots, L\}

, where

A_{l} x = W_{l} x + b_{l}

. Corresponding to the first and last layers, we define

{\overset{ˇ}{N}}_{1, σ, θ} (x) = x

and

{\hat{N}}_{L + 1, σ, θ} (x) = x

. Note that

N_{σ} (x | θ) = {\hat{N}}_{l + 1, σ, θ} \circ A_{l} \circ {\overset{ˇ}{N}}_{l, σ, θ} (x)

. For given

δ > 0,

let

θ = ((W_{1}, b_{1}), \dots, (W_{L + 1}, b_{L + 1})) \in Θ_{d, 1} (L, N, S, B)

and

θ^{*} = ((W_{1}^{*}, b_{1}^{*}), \dots, (W_{L + 1}^{*}, b_{L + 1}^{*})) \in Θ_{d, 1} (L, N, S, B)

be two neural network parameters such that

| vec (W_{l} - W_{l}^{*}) |_{\infty} \leq δ

and

| b_{l} - b_{l}^{*} |_{\infty} \leq δ

for

l = 1, \dots, L + 1

. Let

C_{σ}

be the Lipschitz constant of σ. We observe that

\begin{matrix} {‖ {\overset{ˇ}{N}}_{l, σ, θ} ‖}_{\infty} & \leq C_{σ} (N B {‖ {\overset{ˇ}{N}}_{l - 1, σ, θ} ‖}_{\infty} + B) \\ \leq C_{σ} (B \lor 1) (N + 1) {‖ {\overset{ˇ}{N}}_{l - 1, σ, θ} ‖}_{\infty} \\ \leq {(C_{σ} (B \lor 1) (N + 1))}^{l - 1}, \end{matrix}

and similarly,

{‖ {\hat{N}}_{l, σ, θ} ‖}_{\infty} \leq {(C_{σ} B N)}^{L - l + 1}

. Letting

A_{l}^{*} x = W_{l}^{*} x + b_{l}^{*},

we have

\begin{matrix} {‖ N_{σ} (\cdot | θ) - N_{σ} (\cdot | θ^{*}) ‖}_{\infty} & \leq {‖ \sum_{l = 1}^{L} [{\hat{N}}_{l + 1, σ, θ^{*}} \circ A_{l} \circ {\overset{ˇ}{N}}_{l, σ, θ} (\cdot) - {\hat{N}}_{l + 1, σ, θ^{*}} \circ A_{l}^{*} \circ {\overset{ˇ}{N}}_{l, σ, θ} (\cdot)] ‖}_{\infty} \\ \leq \sum_{l = 1}^{L} {(C_{σ} B N)}^{L - l} {‖ (A_{l} - A_{l}^{*}) \circ {\overset{ˇ}{N}}_{l, σ, θ} (\cdot) ‖}_{\infty} \\ \leq \sum_{l = 1}^{L} {(C_{σ} B N)}^{L - l} δ {(C_{σ} (B \lor 1) (N + 1))}^{l - 1} \\ \leq δ L {(C_{σ} (B \lor 1) (N + 1))}^{L} . \end{matrix}

Thus, for a fixed sparsity pattern (i.e., the location of nonzero elements in

θ

), the covering number is bounded by

{[δ / (L {(C_{σ} (B \lor 1) (N + 1))}^{L})]}^{- S}

. Since the number of the sparsity patterns is bounded by

(\binom{{(N + 1)}^{L}}{S}) \leq {(N + 1)}^{L S}

, the log of covering number is bounded above by

log ({(N + 1)}^{L S} {[\frac{L {(C_{σ} (B \lor 1) (N + 1))}^{L}}{δ}]}^{S}) \leq 2 L S log (\frac{C_{σ} L (B \lor 1) (N + 1)}{δ}),

which completes the proof. □
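The last inequality can be checked on concrete numbers. The sketch below evaluates both sides of the final display for illustrative values of L, N, S, B and δ that are assumptions, not values from the paper; it takes a Lipschitz constant of 1 (as for the ReLU), and the function names are hypothetical.

```python
import math

def log_covering_exact(L, N, S, B, delta, C_sigma=1.0):
    """log((N+1)^{LS} * [L * (C (B v 1) (N+1))^L / delta]^S), expanded term by term."""
    base = C_sigma * max(B, 1.0) * (N + 1)
    return (L * S * math.log(N + 1)
            + S * (math.log(L) + L * math.log(base) - math.log(delta)))

def log_covering_bound(L, N, S, B, delta, C_sigma=1.0):
    """Simplified upper bound 2 L S log(C L (B v 1) (N+1) / delta)."""
    return 2 * L * S * math.log(C_sigma * L * max(B, 1.0) * (N + 1) / delta)

exact = log_covering_exact(L=5, N=50, S=500, B=2.0, delta=1e-3)
bound = log_covering_bound(L=5, N=50, S=500, B=2.0, delta=1e-3)
print(exact, bound)
```

The simplified bound is looser but has the convenient form L S log(·), which is what enters the entropy terms in the proofs of Theorems 2 and 3.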

Appendix C. Proof of Theorem 2

The proof of Theorem 2 is based on the following oracle inequality.

Lemma A4

(Lemma 4 of [10]). Assume that

Y | X = x \sim N (f_{0} (x), 1)

for some

f_{0}

with

{‖ f_{0} ‖}_{\infty} \leq R

. Let

F^{†}

be a given function class from

{[0, 1]}^{d}

to

[- 2 R, 2 R]

, and let

\hat{f}

be any estimator in

F^{†}

. Then for any

δ \in (0, 1]

, we have

\begin{matrix} E [E_{X \sim P_{x}} {(\hat{f} (X) - f_{0} (X))}^{2}] \leq 4 [ & inf_{f \in F^{†}} E_{X \sim P_{x}} {(f (X) - f_{0} (X))}^{2} \\ + {(4 R)}^{2} \frac{18 log N (δ, F^{†}, ∥ \cdot ∥_{\infty}) + 72}{n} + 32 δ (4 R) + Δ_{n}], \end{matrix}

with

Δ_{n} : = E [\frac{1}{n} \sum_{i = 1}^{n} {(Y_{i} - \hat{f} (X_{i}))}^{2} - inf_{f \in F^{†}} \frac{1}{n} \sum_{i = 1}^{n} {(Y_{i} - f (X_{i}))}^{2}],

where the expectations are taken over the training data.

Proof of Theorem 2.

We apply Lemma A4 to

F^{†} = F_{σ, n}

and

\hat{f} = {\hat{f}}_{n} \in {argmin}_{f \in F_{σ, n}} \sum_{i = 1}^{n} {(y_{i} - f (x_{i}))}^{2}

. By definition of

{\hat{f}}_{n}

, we have

Δ_{n} = 0

. Also it can be easily verified that

f_{0} = {argmin}_{f \in F} R_{2, f_{0}} (f)

and

E_{f_{0}, P_{x}} {({\hat{f}}_{n} (X) - f_{0} (X))}^{2} = R_{2, f_{0}} ({\hat{f}}_{n}) - R_{2, f_{0}} (f_{0}) .

Set

δ = 1 / n

. By Proposition 1,

log N (\frac{1}{n}, F_{σ, n}, {∥ \cdot ∥}_{\infty}) \leq C_{1}^{'} n^{\frac{d}{2 α + d}} {log}^{3} n,

for some

C_{1}^{'} > 0

. If a function

f_{n}

approximates

f_{0}

with a sufficiently small error ϵ, then

{∥ f_{n} ∥}_{\infty} \leq 2 R

since

{∥ f_{0} ∥}_{\infty} \leq R

. Now, Theorem 1 implies that there is

f_{n} \in F_{σ, n}

such that

\begin{matrix} E_{f_{0}, P_{x}} {(f_{n} (X) - f_{0} (X))}^{2} & \leq C_{2}^{'} sup_{x \in {[0, 1]}^{d}} {| f_{n} (x) - f_{0} (x) |}^{2} \\ \leq C_{3}^{'} {({(n^{\frac{d}{2 α + d}})}^{- α / d})}^{2} = C_{3}^{'} n^{- \frac{2 α}{2 α + d}}, \end{matrix}

which completes the proof. □

Appendix D. Proof of Theorem 3

For a given real-valued function f, let

R_{hinge, η} (f) : = E_{Y | X \sim 2 Bern (η (X)) - 1, X \sim P_{x}} ℓ_{hinge} (Y f (X))

, which we call the hinge risk. The proof of Theorem 3 is based on the following theorem, which is given in [28].

Lemma A5

(Theorem 6 of [28]). Assume that

η (x)

satisfies the Tsybakov noise condition (8) with the noise exponent

q \in [0, \infty]

. Assume that there exists a sequence

{(δ_{n})}_{n \in N}

such that

there exists a sequence of classes of functions ${F_{n}}_{n \in N}$ with ${sup}_{n \in N} {sup}_{f \in F_{n}} {∥ f ∥}_{\infty} \leq F$ for some $F > 0$ such that there is $f_{n} \in F_{n}$ with $R_{hinge, η} (f_{n}) - {min}_{f \in F} R_{hinge, η} (f) \leq C_{1} δ_{n}$ for some universal constant $C_{1} > 0$ ;
$log N (δ_{n}, F_{n}, {∥ \cdot ∥}_{\infty}) \leq C_{2} n δ_{n}^{(q + 2) / (q + 1)}$ for some universal constant $C_{2} > 0$ .

Then the estimator

{\hat{f}}_{n}

obtained by

{\hat{f}}_{n} \in \underset{f \in F_{n}}{argmin} \sum_{i = 1}^{n} ℓ_{hinge} (y_{i} f (x_{i}))

satisfies

E [R_{01, η} ({\hat{f}}_{n}) - min_{f \in F} R_{01, η} (f)] \leq C_{3} δ_{n},

for some universal constant

C_{3} > 0

, where the expectation is taken over the training data.

Proof of Theorem 3.

It is well known that

f^{*} = 2 𝟙 (η (\cdot) \geq 1 / 2) - 1 = {argmin}_{f \in F} R_{hinge, η} (f)

, i.e., the hinge risk minimizer is equal to the Bayes classifier [32]. The first step is to find a function

f_{n} \in F_{σ, n}

which approximates the Bayes classifier

f^{*}

well. Let

{(ξ_{n})}_{n \in N}

be a given sequence of positive numbers. Since

η \in H^{α, R} ({[0, 1]}^{d})

, by Theorem 1, for each

ξ_{n}

there exists

θ_{n}

such that

∥ N_{σ} (\cdot | θ_{n}) - η (\cdot) ∥_{\infty} \leq ξ_{n}

with at most

O (log (1 / ξ_{n}))

layers,

O (ξ_{n}^{- d / α})

nodes at each layer and

O (ξ_{n}^{- d / α} log (1 / ξ_{n}))

nonzero parameters. We construct the neural network

f_{n}

by adding one ReLU layer to

N_{σ} (\cdot | θ_{n})

to have

f_{n} (x) = 2 \{ρ (\frac{1}{ξ_{n}} (N_{σ} (x | θ_{n}) - \frac{1}{2})) - ρ (\frac{1}{ξ_{n}} (N_{σ} (x | θ_{n}) - \frac{1}{2}) - 1)\} - 1,

where ρ is the ReLU activation function. Note that

f_{n} (x)

is equal to 1 if

N_{σ} (x | θ_{n}) \geq 1 / 2 + ξ_{n}

,

2 (N_{σ} (x | θ_{n}) - 1 / 2) / ξ_{n} - 1

if

1 / 2 \leq N_{σ} (x | θ_{n}) < 1 / 2 + ξ_{n}

and

- 1

otherwise. Let

B (4 ξ_{n}) = \{x : | 2 η (x) - 1 | > 4 ξ_{n}\} .

Then on

B (4 ξ_{n})

,

| f_{n} (x) - f^{*} (x) | = 0

, since

N_{σ} (x | θ_{n}) - 1 / 2 = (η (x) - 1 / 2) + (N_{σ} (x | θ_{n}) - η (x)) \geq ξ_{n}

when

2 η (x) - 1 > 4 ξ_{n} .

Similarly we can show that

N_{σ} (x | θ_{n}) - 1 / 2 < - ξ_{n}

when

2 η (x) - 1 < - 4 ξ_{n}

. Therefore the Tsybakov noise condition (8) implies

\begin{matrix} R_{hinge, η} (f_{n}) - R_{hinge, η} (f^{*}) & = \int | f_{n} (x) - f^{*} (x) | | 2 η (x) - 1 | d P_{x} (x) \\ = \int_{B {(4 ξ_{n})}^{c}} | f_{n} (x) - f^{*} (x) | | 2 η (x) - 1 | d P_{x} (x) \\ \leq 8 ξ_{n} Pr (| 2 η (x) - 1 | \leq 4 ξ_{n}) \leq C_{1}^{'} ξ_{n}^{q + 1}, \end{matrix}

for some

C_{1}^{'} > 0,

where the first equality follows from Theorem 2.31 of [33].
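The behavior of the added ReLU layer can be illustrated with a small numerical sketch. It assumes a hypothetical network output that deviates from η by at most ξ_n (simulated here with a sine perturbation), the value ξ_n = 0.01, and a one-dimensional grid of η values; none of these choices come from the paper.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def ramp_classifier(net_out, xi):
    """f_n = 2 {rho(u) - rho(u - 1)} - 1 with u = (net_out - 1/2) / xi."""
    u = (net_out - 0.5) / xi
    return 2.0 * (relu(u) - relu(u - 1.0)) - 1.0

xi = 0.01
eta = np.linspace(0.0, 1.0, 1001)
# Hypothetical network output with |net_out - eta| <= xi.
net_out = eta + xi * np.sin(50.0 * eta)
bayes = np.where(eta >= 0.5, 1.0, -1.0)
margin = np.abs(2.0 * eta - 1.0) > 4.0 * xi  # the set B(4 xi)
f_n = ramp_classifier(net_out, xi)
agree_on_margin = bool(np.allclose(f_n[margin], bayes[margin]))
print(agree_on_margin)
```

On the set B(4 ξ_n) the clipped output coincides with the Bayes classifier, so only the thin region near η = 1/2 contributes to the excess hinge risk.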

We take

δ_{n} = C_{1}^{'} ξ_{n}^{q + 1}

. Then there are positive constants

L_{0}

,

N_{0}

,

S_{0}

and

B_{0}

such that

f_{n} \in F_{σ, n}

where

\begin{matrix} F_{σ, n} : = \{N_{σ} (\cdot | θ) : & {∥ N_{σ} (\cdot | θ) ∥}_{\infty} \leq 1, \\ θ \in Θ_{d, 1} (L_{0} log (δ_{n}^{- 1}), N_{0} δ_{n}^{- d / (α (q + 1))}, S_{0} δ_{n}^{- d / (α (q + 1))} log (δ_{n}^{- 1}), B_{0} δ_{n}^{- κ^{'}})\}, \end{matrix}

for some

κ^{'} > 0

. Proposition 1 implies that the log covering number of

F_{σ, n}

is bounded above by

\begin{matrix} log N (δ_{n}, F_{σ, n}, {∥ \cdot ∥}_{\infty}) \leq δ_{n}^{- d / (α (q + 1))} {log}^{3} (δ_{n}^{- 1}) . \end{matrix}

Note that to satisfy the entropy condition of Lemma A5,

δ_{n}

should satisfy

{(δ_{n})}^{\frac{d}{α (q + 1)} + \frac{q + 2}{q + 1}} \geq C_{2}^{'} n^{- 1} {log}^{3} (δ_{n}^{- 1})

(A5)

for some

C_{2}^{'} > 0

. If we let

δ_{n} = {({log}^{3} n / n)}^{α (q + 1) / (α (q + 2) + d)}

, the condition (A5) holds and so the proof is done. □
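The verification of (A5) for this choice of δ_n is a short computation. The following worked check assumes a finite noise exponent q and n ≥ 3 (so that log n ≥ 1); these restrictions are added here for the illustration only:

```latex
\delta_n^{\frac{d}{\alpha(q+1)}+\frac{q+2}{q+1}}
  = \delta_n^{\frac{\alpha(q+2)+d}{\alpha(q+1)}}
  = \frac{\log^{3} n}{n},
\qquad
\log(\delta_n^{-1})
  = \frac{\alpha(q+1)}{\alpha(q+2)+d}\,
    \log\!\Big(\frac{n}{\log^{3} n}\Big)
  \le \log n .
```

Hence the left-hand side of (A5) equals n^{-1} log^3 n, which dominates n^{-1} log^3(δ_n^{-1}), so (A5) holds with C_2' = 1 under these assumptions.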

References

1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436.
2. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
3. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 1989, 2, 303–314.
4. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
5. Funahashi, K.I. On the approximate realization of continuous mappings by neural networks. Neural Netw. 1989, 2, 183–192.
6. Chui, C.K.; Li, X. Approximation by ridge functions and neural networks with one hidden layer. J. Approx. Theory 1992, 70, 131–141.
7. Leshno, M.; Lin, V.Y.; Pinkus, A.; Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 1993, 6, 861–867.
8. Telgarsky, M. Neural networks and rational functions. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3387–3393.
9. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114.
10. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. arXiv 2017, arXiv:1708.06633.
11. Bauer, B.; Kohler, M. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Stat. 2019. accepted.
12. Li, B.; Tang, S.; Yu, H. Better Approximations of High Dimensional Smooth Functions by Deep Neural Networks with Rectified Power Units. arXiv 2019, arXiv:1903.05858.
13. Suzuki, T. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: Optimal rate and curse of dimensionality. arXiv 2018, arXiv:1810.08033.
14. Petersen, P.; Voigtlaender, F. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 2018, 108, 296–330.
15. Imaizumi, M.; Fukumizu, K. Deep Neural Networks Learn Non-Smooth Functions Effectively. arXiv 2018, arXiv:1802.04474.
16. Bergstra, J.; Desjardins, G.; Lamblin, P.; Bengio, Y. Quadratic Polynomials Learn Better Image Features; Technical Report 1337; Département d’Informatique et de Recherche Operationnelle, Université de Montréal: Montréal, QC, Canada, 2009.
17. Clevert, D.A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). arXiv 2015, arXiv:1511.07289.
18. Carlile, B.; Delamarter, G.; Kinney, P.; Marti, A.; Whitney, B. Improving deep learning by inverse square root linear units (ISRLUs). arXiv 2017, arXiv:1710.09967.
19. Klimek, M.D.; Perelstein, M. Neural Network-Based Approach to Phase Space Integration. arXiv 2018, arXiv:1810.11509.
20. Wuraola, A.; Patel, N. SQNL: A New Computationally Efficient Activation Function. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–7.
21. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941.
22. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
23. Mhaskar, H.N. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math. 1993, 1, 61–80.
24. Costarelli, D.; Vinti, G. Saturation classes for max-product neural network operators activated by sigmoidal functions. Results Math. 2017, 72, 1555–1569.
25. Costarelli, D.; Spigler, R. Solving numerically nonlinear systems of balance laws by multivariate sigmoidal functions approximation. Comput. Appl. Math. 2018, 37, 99–133.
26. Costarelli, D.; Vinti, G. Estimates for the neural network operators of the max-product type with continuous and p-integrable functions. Results Math. 2018, 73, 12.
27. Costarelli, D.; Sambucini, A.R. Approximation results in Orlicz spaces for sequences of Kantorovich max-product neural network operators. Results Math. 2018, 73, 15.
28. Kim, Y.; Ohn, I.; Kim, D. Fast convergence rates of deep neural networks for classification. arXiv 2018, arXiv:1812.03599.
29. Anthony, M.; Bartlett, P.L. Neural Network Learning: Theoretical Foundations; Cambridge University Press: Cambridge, UK, 2001.
30. Mammen, E.; Tsybakov, A.B. Smooth discrimination analysis. Ann. Stat. 1999, 27, 1808–1829.
31. Tsybakov, A.B. Optimal aggregation of classifiers in statistical learning. Ann. Stat. 2004, 32, 135–166.
32. Lin, Y. A note on margin-based loss functions in classification. Stat. Probab. Lett. 2004, 68, 73–82.
33. Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science & Business Media: New York, NY, USA, 2008.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

