## 1. Introduction

Over the past several years, neural nets, particularly deep nets, have become the state-of-the-art in a remarkable number of machine learning problems, from mastering Go to image recognition/segmentation and machine translation (see the review article [1] for more background). Despite all their practical successes, a robust theory of why they work so well is in its infancy. Much of the work to date has focused on the problem of explaining and quantifying the expressivity (the ability to approximate a rich class of functions) of deep neural nets [2,3,4,5,6,7,8,9,10,11]. Expressivity can be seen as an effect of both depth and width. It has been known since at least the work of Cybenko [12] and Hornik-Stinchcombe-White [13] that if no constraint is placed on the width of a hidden layer, then a single hidden layer is enough to approximate essentially any function. The purpose of this article, in contrast, is to investigate the “effect of depth without the aid of width.” More precisely, for each $d\ge 1$, we would like to estimate:

$$w_{\min}(d) := \min\left\{ w\in\mathbb{N} \;:\; \text{ReLU nets of width } w \text{ suffice to approximate every continuous } f:[0,1]^d\to\mathbb{R} \text{ to arbitrary uniform accuracy} \right\}.$$

Here, $\mathbb{N}=\{0,1,2,\dots \}$ denotes the natural numbers, and ReLU is the so-called “rectified linear unit,” $\mathrm{ReLU}(t)=\max\{0,t\},$ which is the most popular non-linearity used in practice (see (4) for the exact definition). In Theorem 1, we prove that $w_{\min}(d)\le d+2.$ This raises two questions:

**Q1.** Is the estimate $w_{\min}(d)\le d+2$ sharp?

**Q2.** How efficiently can ReLU nets of a given width $w\ge w_{\min}(d)$ approximate a given continuous function of $d$ variables?

A priori, it is not clear how to estimate $w_{\min}(d)$, or even whether it is finite. One of the contributions of this article is to provide reasonable bounds on $w_{\min}(d)$ (see Theorem 1). Moreover, we also provide quantitative estimates on the corresponding rate of approximation. On the subject of Q1, we will prove in forthcoming work with M. Sellke [14] that in fact $w_{\min}(d)=d+1.$ When $d=1$, the lower bound is simple to check, and the upper bound follows, for example, from Theorem 3.1 in [5]. The main results in this article, however, concern Q1 and Q2 for convex functions. For instance, we prove in Theorem 1 that:

$$w_{\min}^{\mathrm{conv}}(d) \le d+1,$$

where:

$$w_{\min}^{\mathrm{conv}}(d) := \min\left\{ w\in\mathbb{N} \;:\; \text{ReLU nets of width } w \text{ suffice to approximate every convex } f:[0,1]^d\to\mathbb{R} \text{ to arbitrary uniform accuracy} \right\}.$$

This illustrates a central point of the present paper: the convexity of the ReLU activation makes ReLU nets well-adapted to representing convex functions on ${[0,1]}^{d}.$

Theorem 1 also addresses Q2 by providing quantitative estimates on the depth of a ReLU net with width $d+1$ that approximates a given convex function. We provide similar depth estimates for arbitrary continuous functions on ${[0,1]}^{d},$ but this time for nets of width $d+3.$ Several of our depth estimates are based on the work of Balázs-György-Szepesvári [15] on max-affine estimators in convex regression.
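For intuition, a convex function always lies above its tangent (supporting) hyperplanes, so the maximum of finitely many tangents gives a max-affine under-approximation. The sketch below is our own illustration of this idea, not the least-squares estimator of Balázs-György-Szepesvári; the helper names `max_affine` and `centers` are ours.

```python
import numpy as np

# Max-affine approximation of a convex function from supporting hyperplanes:
# sample points c and form g(x) = max_c [ f(c) + f'(c) (x - c) ].
f = lambda x: x ** 2            # a convex function on [0, 1]
df = lambda x: 2.0 * x          # its derivative

centers = np.linspace(0.0, 1.0, 5)

def max_affine(x):
    # maximum over the 5 tangent lines, evaluated pointwise
    return np.max([f(c) + df(c) * (x - c) for c in centers], axis=0)

xs = np.linspace(0.0, 1.0, 1001)
err = np.max(np.abs(f(xs) - max_affine(xs)))
# tangents lie below f (convexity), and the error shrinks as pieces are added
assert np.all(max_affine(xs) <= f(xs) + 1e-12) and err < 0.02
```

With tangent points spaced $h$ apart, the error for this example is at most $(h/2)^2$, illustrating how the number of affine pieces controls approximation quality.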

In order to prove Theorem 1, we must understand what functions can be exactly computed by a ReLU net. Such functions are always piecewise affine, and we prove in Theorem 2 the converse: every piecewise affine function on ${[0,1]}^{d}$ can be exactly represented by a ReLU net with hidden layer width at most $d+3$. Moreover, we prove that the depth of the network that computes such a function is bounded by the number of affine pieces it contains. This extends the results of Arora-Basu-Mianjy-Mukherjee (e.g., Theorem 2.1 and Corollary 2.2 in [2]).

Convex functions again play a special role. We show that every convex function on ${[0,1]}^{d}$ that is piecewise affine with N pieces can be represented exactly by a ReLU net with width $d+1$ and depth $N.$
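The mechanism behind such narrow representations can be illustrated numerically with the identity $\max\{a,b\} = a + \mathrm{ReLU}(b-a)$: one affine piece is folded into a running maximum per layer, while only the $d$ inputs and that running value are carried along. The following is a sketch of ours, not the paper's construction; in particular, a genuine width-$(d+1)$ ReLU net must also keep the carried values positive so they pass through the activations unchanged, a bookkeeping point we ignore here.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def convex_pwa_net(x, pieces):
    """Evaluate max_alpha (w_alpha . x + b_alpha) by folding in one affine
    piece per 'layer', carrying only the d inputs plus a running maximum --
    a width-(d+1) computation. Illustrative sketch only."""
    (w0, b0) = pieces[0]
    running = w0 @ x + b0                       # start with the first piece
    for (w, b) in pieces[1:]:
        a = w @ x + b                           # value of the next piece
        running = running + relu(a - running)   # = max(running, a)
    return running

# check against a direct maximum on a random convex piecewise-affine function
rng = np.random.default_rng(0)
d, N = 3, 5
pieces = [(rng.standard_normal(d), rng.standard_normal()) for _ in range(N)]
x = rng.random(d)
direct = max(w @ x + b for (w, b) in pieces)
assert np.isclose(convex_pwa_net(x, pieces), direct)
```

The loop has exactly $N$ stages for $N$ affine pieces, mirroring the depth-$N$ statement above.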

## 2. Statement of Results

To state our results precisely, we set notation and recall several definitions. For $d\ge 1$ and a continuous function $f:{[0,1]}^{d}\to \mathbb{R},$ write:

$$\|f\|_{C^0} := \sup_{x\in[0,1]^d} |f(x)|.$$

Further, denote by:

$$\omega_f(\epsilon) := \sup\left\{ |f(x)-f(y)| \;:\; x,y\in[0,1]^d,\ |x-y|\le \epsilon \right\}$$

the modulus of continuity of $f,$ whose value at $\epsilon$ is the maximum that $f$ can change when its argument moves by at most $\epsilon.$ Note that, by the definition of a continuous function, $\omega_f(\epsilon)\to 0$ as

$\epsilon \to 0.$ Next, given $d_{\mathrm{in}}, d_{\mathrm{out}},$ and $w\ge 1,$ we define a feed-forward neural net with ReLU activations, input dimension $d_{\mathrm{in}},$ hidden layer width $w,$ depth $n,$ and output dimension $d_{\mathrm{out}}$ to be any member of the finite-dimensional family of functions:

$$\mathrm{ReLU}\circ A_n\circ \mathrm{ReLU}\circ A_{n-1}\circ \cdots \circ \mathrm{ReLU}\circ A_1 \tag{4}$$

that map $\mathbb{R}^{d_{\mathrm{in}}}$ to $\mathbb{R}_{+}^{d_{\mathrm{out}}}=\{x=(x_1,\dots,x_{d_{\mathrm{out}}})\in \mathbb{R}^{d_{\mathrm{out}}}\;|\;x_i\ge 0\}.$ In (4),

$$A_1:\mathbb{R}^{d_{\mathrm{in}}}\to\mathbb{R}^{w},\qquad A_j:\mathbb{R}^{w}\to\mathbb{R}^{w}\quad(2\le j\le n-1),\qquad A_n:\mathbb{R}^{w}\to\mathbb{R}^{d_{\mathrm{out}}}$$

are affine transformations, and for every $m\ge 1$:

$$\mathrm{ReLU}(x_1,\dots,x_m)=\left(\max\{0,x_1\},\dots,\max\{0,x_m\}\right).$$

We often denote such a net by $\mathcal{N}$ and write:

$$f_{\mathcal{N}}:\mathbb{R}^{d_{\mathrm{in}}}\to\mathbb{R}_{+}^{d_{\mathrm{out}}}$$

for the function it computes. Our first result contrasts both the width and depth required to approximate continuous, convex, and smooth functions by ReLU nets.
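To make the definition concrete, here is a minimal NumPy sketch of the function computed by such a net: affine maps interleaved with the componentwise ReLU, with the final ReLU landing the output in the nonnegative orthant. The helper name `feedforward` and the `(W, b)` layer format are our own conventions, not the paper's notation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feedforward(x, affine_layers):
    """Evaluate a ReLU net: alternately apply an affine map A_j (given as a
    weight matrix W and bias b) and the componentwise ReLU."""
    for (W, b) in affine_layers:
        x = relu(W @ x + b)
    return x

# a net with input dimension 2, hidden layer width 3, depth 3, output dimension 1
rng = np.random.default_rng(1)
layers = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
          (rng.standard_normal((3, 3)), rng.standard_normal(3)),
          (rng.standard_normal((1, 3)), rng.standard_normal(1))]
y = feedforward(rng.random(2), layers)
assert y.shape == (1,) and (y >= 0).all()   # output lies in R_+^{d_out}
```

Note that the hidden widths are all equal (here $3$), matching the fixed-width regime studied in this article.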

**Theorem 1.** Let $d\ge 1$ and $f:{[0,1]}^{d}\to {\mathbb{R}}_{+}$ be a positive function with $\|f\|_{C^0}=1$. We have the following three cases:

**1.** **(f is continuous)** There exists a sequence of feed-forward neural nets ${\mathcal{N}}_{k}$ with ReLU activations, input dimension $d,$ hidden layer width $d+2,$ and output dimension $1,$ such that:

$$\lim_{k\to\infty}\|f-f_{\mathcal{N}_k}\|_{C^0}=0.\tag{5}$$

In particular, $w_{\min}(d)\le d+2.$ Moreover, write ${\omega}_{f}$ for the modulus of continuity of $f,$ and fix $\epsilon >0.$ There exists a feed-forward neural net ${\mathcal{N}}_{\epsilon}$ with ReLU activations, input dimension $d,$ hidden layer width $d+3,$ output dimension $1,$ and depth satisfying the estimate (6), such that $f_{\mathcal{N}_\epsilon}$ approximates $f$ in the sense of the bound (7).

**2.** **(f is convex)** There exists a sequence of feed-forward neural nets ${\mathcal{N}}_{k}$ with ReLU activations, input dimension $d,$ hidden layer width $d+1,$ and output dimension $1,$ such that:

$$\lim_{k\to\infty}\|f-f_{\mathcal{N}_k}\|_{C^0}=0.\tag{8}$$

Hence, $w_{\min}^{\mathrm{conv}}(d)\le d+1.$ Further, there exists $C>0$ such that if $f$ is both convex and Lipschitz with Lipschitz constant $L,$ then the nets ${\mathcal{N}}_{k}$ in (8) can be taken to satisfy the quantitative depth estimate (9).

**3.** **(f is smooth)** There exists a constant $K$ depending only on $d$ and a constant $C$ depending only on the maximum of the first $K$ derivatives of $f$ such that, for every $k\ge 3,$ the width $d+2$ nets ${\mathcal{N}}_{k}$ in (5) can be chosen to satisfy the approximation rate (10).

The main novelty of Theorem 1 is the width estimate $w_{\min}^{\mathrm{conv}}(d)\le d+1$ and the quantitative depth estimates (9) for convex functions, as well as the analogous estimates (6) and (7) for continuous functions. Let us briefly explain the origin of the other estimates. The relation (5) and the corresponding estimate $w_{\min}(d)\le d+2$ are a combination of the well-known fact that ReLU nets with one hidden layer can approximate any continuous function and a simple procedure by which a ReLU net with input dimension $d$ and a single hidden layer of width $n$ can be replaced by another ReLU net that computes the same function but has depth $n+2$ and width $d+2.$ For these width $d+2$ nets, we are unaware of how to obtain quantitative estimates on the depth required to approximate a fixed continuous function to a given precision. At the expense of changing the width of our ReLU nets from $d+2$ to $d+3,$ however, we furnish the estimates (6) and (7). On the other hand, using Theorem 3.1 in [5], when $f$ is sufficiently smooth, we obtain the depth estimates (10) for width $d+2$ ReLU nets. Indeed, since we are working on a compact set ${[0,1]}^{d}$, the smoothness classes ${W}_{w,q,\gamma}$ from [5] reduce to classes of functions that have sufficiently many bounded derivatives.
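The width-for-depth trade just mentioned can be sketched in code (our own schematic with hypothetical names; as elsewhere, we ignore the positivity bookkeeping needed to pass values through ReLU layers unchanged): a single hidden layer of width $n$ is processed one unit at a time over $n$ successive steps, each step carrying the $d$ inputs plus a running partial sum, for a carried width of $d+2$.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def shallow(x, units, c0):
    """One hidden layer of width n: sum_i c_i * ReLU(w_i . x + b_i) + c0."""
    return sum(c * relu(w @ x + b) for (w, b, c) in units) + c0

def deepened(x, units, c0):
    """Same function, computed one hidden unit per step while carrying the
    d inputs plus a running partial sum -- a width-(d+2) computation."""
    acc = c0
    for (w, b, c) in units:
        acc = acc + c * relu(w @ x + b)   # fold in one unit per layer
    return acc

rng = np.random.default_rng(2)
d, n = 2, 6
units = [(rng.standard_normal(d), rng.standard_normal(), rng.standard_normal())
         for _ in range(n)]
x = rng.random(d)
assert np.isclose(shallow(x, units, 0.3), deepened(x, units, 0.3))
```

The $n$ accumulation steps (plus input and output handling) correspond to the depth $n+2$ in the procedure described above.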

Our next result concerns the exact representation of piecewise affine functions by ReLU nets. Rather than measuring the complexity of such a function by its Lipschitz constant or modulus of continuity, one can measure the complexity of a piecewise affine function by the minimal number of affine pieces needed to define it.

**Theorem 2.** Let $d\ge 1$ and $f:{[0,1]}^{d}\to {\mathbb{R}}_{+}$ be the function computed by some ReLU net with input dimension $d,$ output dimension $1,$ and arbitrary width. There exist affine functions ${g}_{\alpha},{h}_{\beta}:{[0,1]}^{d}\to \mathbb{R}$ such that $f$ can be written as the difference of positive convex functions:

$$f=g-h,\qquad g:=\max_{1\le \alpha \le N}g_{\alpha},\qquad h:=\max_{1\le \beta \le M}h_{\beta}.\tag{11}$$

Moreover, there exists a feed-forward neural net $\mathcal{N}$ with ReLU activations, input dimension $d,$ hidden layer width $d+3,$ output dimension $1,$ and depth bounded in terms of the number of affine pieces of $f,$ that computes $f$ exactly. Finally, if $f$ is convex (and hence, $h$ vanishes), then the width of $\mathcal{N}$ can be taken to be $d+1$, and the depth can be taken to be $N.$

The fact that the function computed by a ReLU net can be written as (11) follows from Theorem 2.1 in [2]. The novelty in Theorem 2 is therefore the uniform width estimate $d+3$ in the representation of any function computed by a ReLU net and the $d+1$ width estimate for convex functions. Theorem 2 will be used in the proof of Theorem 1.
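A small numerical illustration of the difference-of-convex decomposition (our own concrete example, not a general algorithm): the saturating ramp $f(x)=\mathrm{ReLU}(x)-\mathrm{ReLU}(x-1/2)$ is computed by a one-hidden-layer ReLU net and is not convex, yet it is the difference of the convex max-affine functions $g(x)=\max\{0,x\}$ and $h(x)=\max\{0,x-1/2\}$.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

# A non-convex piecewise affine function computed by a ReLU net:
# f(x) = ReLU(x) - ReLU(x - 0.5), a ramp that saturates at 0.5
def f(x):
    return relu(x) - relu(x - 0.5)

# Its decomposition f = g - h as a difference of convex max-affine functions:
def g(x):
    return np.maximum(0.0, x)          # max of the affine pieces 0 and x

def h(x):
    return np.maximum(0.0, x - 0.5)    # max of the affine pieces 0 and x - 0.5

xs = np.linspace(0.0, 1.0, 101)
assert np.allclose(f(xs), g(xs) - h(xs))
```

Here $g$ has $N=2$ affine pieces and $h$ has $M=2$, matching the shape of the decomposition in Theorem 2.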