Analytic Function Approximation by Path-Norm-Regularized Deep Neural Networks

We show that neural networks with the absolute value activation function, whose path norm, network size and network weights all have logarithmic dependence on 1/ε, can ε-approximate functions that are analytic on certain regions of C^d.


Introduction
Deep neural networks have found broad applications in many areas and disciplines, such as computer vision, speech and audio recognition and natural language processing. Two of the main characteristics of a given class of neural networks are its complexity and approximating capability. Once the activation function is selected, a class of networks is determined by the specification of the network architecture (namely, its depth and width) and the choice of network weights. Hence, the estimation of the complexity of a given class is carried out by regularizing (one of) those parameters, and the approximation properties of obtained regularized classes of networks are then investigated.
The capability of shallow networks of depth 1 to approximate continuous functions is shown in the universal approximation theorem ([1]), and approximations of integrable functions by networks with fixed width are presented in [2]. Network-architecture-constrained approximations of analytic functions are given in [3], where it is shown that ReLU networks with depth depending logarithmically on 1/ε and width d + 4 can ε-approximate analytic functions on closed subcubes of (−1, 1)^d.
The weight regularization of networks is usually carried out by imposing an ℓ_p-related constraint on the network weights, p ≥ 0. The most popular types of such constraints include the ℓ_0, ℓ_1 and path norm regularizations (see, respectively, [4-6] and references therein). Approximations of β-smooth functions on [0, 1]^d by ℓ_0-regularized sparse ReLU networks are given in [5,7], and exponential rates of approximation of analytic functions by ℓ_0-regularized networks are derived in [8].
Path-norm-regularized classes of deep ReLU networks are considered in [4], where, together with other characteristics, the Rademacher complexities of those classes are estimated. The network size independence of those estimates makes the path norm regularization particularly remarkable. As the estimation only uses the Lipschitz continuity (with Lipschitz constant 1), the idempotency and the non-negative homogeneity of the ReLU function, it can be extended to networks with the absolute value activation function. Network characteristics similar to the path norm are also considered in the works [9,10], where they are called, respectively, a variation and a basis-path norm, and statistical features of classes of networks are described in terms of those characteristics.
The objective of the present paper is the construction of path-norm-regularized networks that approximate analytic functions exponentially fast. Our goal is to achieve such convergence rates with activations that are idempotent, non-negatively homogeneous and Lipschitz continuous with Lipschitz constant 1, so that the constructed path-norm-regularized networks fall within the scope of the network classes studied in [4]. It turns out that networks with the absolute value activation function may suit this goal better than networks with the ReLU activation function. More precisely, we show that analytic functions can be ε-approximated by networks with the absolute value activation function a(x) = |x| and with the path norm, the depth, the width and the weights all depending logarithmically on 1/ε. Such an approximation holds (i) on any subset (0, 1 − δ]^d ⊂ (0, 1)^d for functions analytic on (0, 1)^d with absolutely convergent power series; (ii) on the whole hypercube [0, 1]^d for functions that can be analytically continued to certain subsets of C^d. Note that, as the network weights, as well as the total number of weights, depend logarithmically on 1/ε, the ℓ_1 weight norms of the constructed approximating deep networks also have logarithmic dependence on 1/ε.
Note that the absolute value activation function considered in this paper is among the common built-in activation functions of the software-based neural network evolving method NEAT-Python ( [11]). Training algorithms for networks with an absolute value activation function are developed in the works [12,13]. In addition, the VC-dimensions and the structures of the loss surfaces of neural networks with piecewise linear activation functions, including the absolute value function, are described in the works [14,15].
Notation: For a matrix W ∈ R^{d_1×d_2}, we denote by |W| ∈ R^{d_1×d_2} the matrix obtained by taking the absolute values of the entries of W: |W|_{ij} = |W_{ij}|. For brevity of presentation, we will say that the matrix |W| is the absolute value of the matrix W (note that, in the literature, there are also other definitions of the notion of an absolute value of a matrix). The path norm of a neural network f is denoted by ‖f‖_×. To ensure that the matrix-vector multiplications are well defined, vectors from R^d may, according to the context, be treated as matrices either from R^{d×1} or from R^{1×d}.

The Class of Approximant Networks
Neural networks are constituted of weight matrices, biases and nonlinear activation functions acting neuron-wise in the hidden layers. The biases, also called shift vectors, can be omitted by adding a fixed coordinate 1 to the input vector and correspondingly modifying the weight matrices. As the definition of the path norm of networks does not assume the presence of shift vectors, we will add a coordinate 1 to the input vector x and will consider classes of neural networks of the form

F_α(L, p) = { f : f(x) = W_L α(W_{L−1} α( · · · α(W_0 x) · · · )) },

where W_i ∈ R^{p_{i+1}×p_i} are the weight matrices, i = 0, . . . , L, and p = (p_0, p_1, . . . , p_{L+1}) is the width vector, with p_0 = p + 1. The number of hidden layers L determines the depth of the networks from F_α(L, p) and, in each layer, the activation function α : R → R acts element-wise on the input vector. For f ∈ F_α(L, p) given as above, let

‖f‖_× = ‖ |W_L| |W_{L−1}| · · · |W_0| ‖_1

be the path norm of f, where ‖·‖_1 denotes the ℓ_1 norm of the p_0 (= p + 1)-dimensional vector ∏_{i=0}^{L} |W_i| obtained as the product of the absolute values of the weight matrices of f.
For B > 0, let

F_α(L, p, B) = { f ∈ F_α(L, p) : ‖f‖_× ≤ B }

be a path-norm-regularized subclass of F_α(L, p). As the results obtained in [4] indicate, path norm regularization is particularly well suited for networks whose activation function α is idempotent, non-negatively homogeneous and Lipschitz continuous with Lipschitz constant 1. We therefore aim to choose an activation α possessing those properties such that analytic functions can be approximated by networks from F_α(L, p, B) with a small path norm constraint B. The most popular activation functions satisfying the above conditions are the ReLU function σ(x) = max{0, x} and the absolute value function a(x) = |x|. Below, we show that, with the absolute value activation function, the path norms of the approximant networks may be significantly smaller than the path norms of the corresponding ReLU networks.
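For concreteness, the path norm of a bias-free network can be computed directly from its weight matrices. The following minimal NumPy sketch follows the definition above (the helper name path_norm is ours, not the paper's):

```python
import numpy as np

def path_norm(weights):
    """Path norm of a bias-free network: the l1 norm of the product
    of entrywise absolute values |W_L| |W_{L-1}| ... |W_0|.
    `weights` lists W_0, ..., W_L, with W_i of shape (p_{i+1}, p_i)."""
    prod = np.abs(weights[-1])
    for W in reversed(weights[:-1]):
        prod = prod @ np.abs(W)
    return prod.sum()  # entries of the product are already non-negative
```

For a network with output dimension p_{L+1} = 1, the product is a row vector of length p_0, matching the definition of ‖f‖_× above.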
The standard technique of neural network function approximation relies on approximating the product function (x, y) ↦ xy, which then allows us to approximate monomials and polynomials of any desired degree. In [7], the approximation of the product xy = ((x + y)^2 − x^2 − y^2)/2 is achieved by approximating the function x ↦ x^2. The latter is based on the observation that, for the triangle wave

g(x) = 2x, x ∈ [0, 1/2];  g(x) = 2(1 − x), x ∈ (1/2, 1];  g(x) = 0 otherwise,   (3)

where g_s := g ∘ g ∘ · · · ∘ g denotes its s-fold composition, and for any positive integer m,

|f_m(x) − x^2| ≤ 2^{−2m−2},  x ∈ [0, 1],  where  f_m(x) := x − ∑_{s=1}^{m} g_s(x)/2^{2s}.   (4)

The approximation of x^2 by networks with the ReLU activation function σ(x) then follows from the representation

g(x) = 2σ(x) − 4σ(x − 1/2) + 2σ(x − 1).

Thus, in this case, we will obtain matrices containing the weights 2 and 4, which will make the path norm of the approximant networks big. Note that the same approach is also used in [3] for constructing ReLU network approximations of analytic functions. In [5], the approximation of the product is achieved by approximating the function h(x) := x(1 − x), which, in turn, is based on the observation that, for the triangle wave in (5) and for any positive integer m, h can be approximated by the sum ∑_{k=1}^{m} R_k(x) as in (6). Although in the representation (6) the coefficients (weights) are all in [−1, 1], the approximant ∑_{k=1}^{m} R_k(x) in this case does not contain the factors 2^{−2s} present in the approximant f_m(x) in (4), which, again, will result in big values of the path norms. Therefore, in order to take advantage of the presence of those reducing weights, we would like to represent the function g(x) in (5) by a linear combination of activation functions with smaller coefficients. This is possible if, instead of σ(x), we deploy the absolute value activation function a(x). Indeed, in this case, g(x) can be represented on [0, 1] as

g(x) = 1 − 2a(x − 1/2).   (7)

In the next section, we use the representation (7) to show that analytic functions can be ε-approximated by networks from F_a(L, p, B) with each of L, ‖p‖_∞ and B, as well as the network weights, having logarithmic dependence on 1/ε. As all networks will have the same activation function a(x) = |x|, the subscript a will be omitted in the following.
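As a numerical sanity check, the triangle-wave construction above can be sketched in a few lines of Python. The names g and f_m mirror the notation of (3) and (4); this evaluates the functions the network computes, it is not a weight-matrix implementation:

```python
def g(x):
    # triangle wave (3) on [0, 1], written with the absolute value
    # activation a(x) = |x|, cf. (7): g(x) = 1 - 2*a(x - 1/2)
    return 1.0 - 2.0 * abs(x - 0.5)

def f_m(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 2^(2s), the approximant (4) of x^2
    gs, out = x, x
    for s in range(1, m + 1):
        gs = g(gs)          # g_s = g composed s times
        out -= gs / 4.0**s
    return out
```

On a fine grid of [0, 1], the error of f_m with m = 5 indeed stays below 2^{−12}, in accordance with the bound in (4).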

Results
We first construct a neural network with activation function a(x) that, for given γ, m ∈ N, simultaneously approximates all d-dimensional monomials of degree less than γ up to an error of γ^2 4^{−m}. The depth of this network has order m log_2 γ and its width is of order mγ^{d+1}. Moreover, the entries of the product of the absolute values of the matrices of the network have order at most γ^5 (note the independence of m).
Lemma 1. For γ > 0, let C_{d,γ} denote the number of d-dimensional monomials x^k with degree ‖k‖_1 < γ. Then, C_{d,γ} < (γ + 1)^d and, for any positive integer m, there exists a neural network Mon^d_{m,γ} that simultaneously approximates all C_{d,γ} monomials x^k, ‖k‖_1 < γ, on [0, 1]^d up to an error of γ^2 4^{−m}. Moreover, the entries of the C_{d,γ} × (d + 1)-dimensional matrix obtained by multiplying the absolute values of the matrices presented in Mon^d_{m,γ} are all bounded by 144(γ + 1)^5.
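For small d and γ, the count C_{d,γ} and the bound C_{d,γ} < (γ + 1)^d can be checked directly by brute force (count_monomials is our helper name, used only for illustration):

```python
from itertools import product

def count_monomials(d, gamma):
    # C_{d,gamma}: number of exponent vectors k in N^d with |k|_1 < gamma;
    # each exponent is < gamma, so ranging over {0, ..., gamma-1}^d suffices
    return sum(1 for k in product(range(gamma), repeat=d) if sum(k) < gamma)
```

For instance, count_monomials(2, 3) counts the six exponent pairs (0,0), (0,1), (0,2), (1,0), (1,1), (2,0), and indeed 6 < (3 + 1)^2.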
Taking γ = m = ⌈log_2(1/ε)⌉ in the above lemma, we obtain a neural network from F(L, p), with L and ‖p‖_∞ having logarithmic dependence on 1/ε, which simultaneously approximates the monomials of degree at most γ with error ε (up to a logarithmic factor). Moreover, the entries of the product of the absolute values of the matrices of this network will also have logarithmic dependence on 1/ε. Below, we use this property to construct neural network approximations of analytic and analytically continuable functions with an approximation error ε and with network parameters of logarithmic order.
Note that an exponential convergence rate of deep ReLU network approximants on subcubes (0, 1 − δ]^d is also given in [3]. In our case, however, not only the depth and the width but also the path norm ‖F_ε‖_× of the constructed network F_ε has logarithmic dependence on 1/ε. Note that, in the above theorem, as δ approaches 0, both ‖p‖_∞ and B, as well as the approximation error, grow polynomially in 1/δ. In the next theorem, we use the properties of Chebyshev series to derive an exponential convergence rate on the whole hypercube [0, 1]^d. Recall that the Chebyshev polynomials are defined by

T_0(x) = 1,  T_1(x) = x,  T_{k+1}(x) = 2xT_k(x) − T_{k−1}(x),  k ≥ 1.

Chebyshev polynomials play an important role in approximation theory ([16]) and, in particular, it is known ([17], Theorem 3.1) that, if f is Lipschitz continuous on [−1, 1], then it has a unique representation as an absolutely and uniformly convergent Chebyshev series

f(x) = ∑_{k=0}^{∞} a_k T_k(x).

Moreover, in case f can be analytically continued to an ellipse E_ρ ⊂ C with foci −1 and 1 and with the sum of the semimajor and semiminor axes equal to ρ > 1, the partial sums of the above Chebyshev series converge to f at a geometric rate and the coefficients a_k also decay at a geometric rate. This result was first derived by Bernstein in [18], and its extension to the multivariate case was given in [19]. Note that the condition z ∈ E_ρ implies that |Re z| ≤ (ρ + ρ^{−1})/2 and |Im z| ≤ (ρ − ρ^{−1})/2. Combining Lemma 1 and Lemma 2, we obtain the following.

Theorem 2.
Let ε ∈ (0, 1) and let ρ ≥ 2√d. For f ∈ A_d(ρ, F), there is a constant C = C(d, ρ, F) and a neural network F_ε ∈ F(L, p, B), with L, ‖p‖_∞ and B all bounded by powers of C log_2(1/ε), such that sup_{x∈[0,1]^d} |F_ε(x) − f(x)| ≤ ε.

We conclude this part by estimating the ℓ_1 weight regularization of the networks constructed in Theorem 2. First, the total number of weights in those networks is bounded by (L + 1)‖p‖_∞^2 = O((log_2(1/ε))^{2d+6}). From (7), it follows that all of the weights of the network Mon^d_{m,γ} from Lemma 1 are in [−2, 2]. In Theorem 2, the network F_ε is obtained by adding to a network Mon^d_{m,γ}, with γ = m = O(log_2(1/ε)), a layer with the coefficients of partial sums of the power series of the approximated function. Thus, using (8), we obtain that the ℓ_1 weight norm of the network F_ε constructed in Theorem 2 has order O((log_2(1/ε))^{4d+6}).
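The geometric decay of Chebyshev coefficients underlying Theorem 2 is easy to observe numerically in dimension d = 1. The example function f(x) = 1/(2 − x) and the use of NumPy's chebinterpolate are our choices for illustration; f extends analytically to the Bernstein ellipse E_ρ with ρ = 2 + √3, since its pole x = 2 lies on the boundary of that ellipse:

```python
import numpy as np
from numpy.polynomial import chebyshev as Ch

# f(x) = 1/(2 - x) is analytic inside the Bernstein ellipse E_rho
# with rho = 2 + sqrt(3)
f = lambda x: 1.0 / (2.0 - x)
rho = 2.0 + np.sqrt(3.0)

# Chebyshev coefficients a_0, ..., a_40 via interpolation at Chebyshev points
coef = Ch.chebinterpolate(f, 40)

# geometric decay |a_k| <= C * rho**(-k); checked here for k up to 15,
# where the coefficients are still well above double-precision roundoff
decay_ok = all(abs(coef[k]) <= 5.0 * rho**(-k) for k in range(16))
```

For this f, the decay rate ρ^{−k} is sharp: the ratios |a_{k+1}|/|a_k| settle near 1/ρ ≈ 0.268.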

Proofs
In the following proofs, I k denotes an identity matrix of size k × k and all of the networks have activation a(x) = |x|. The proof of Lemma 1 is based on the following two lemmas.

Lemma 3.
For any positive integer m, there exists a neural network Mult_m ∈ F(2m + 3, p), with p_0 = 3, p_{L+1} = 1 and ‖p‖_∞ = 3m + 2, such that

|Mult_m(x, y) − xy| ≤ 3 · 2^{−2m−3},  x, y ∈ [0, 1],   (9)

and the entries of the product of the absolute values of the matrices presented in Mult_m are bounded by 2.

Proof. For k ≥ 2, let R_k denote a row of length k with first entry equal to −1/2, last entry equal to 1 and all other entries equal to 0. Let A_k be the matrix of size (k + 1) × k obtained by adding the (k + 1)-th row R_k to the identity matrix I_k. In addition, let B_k denote the k × k matrix combining the coordinates produced by the previous layer. It then follows from (7) that the layers built from the matrices A_k and B_k successively compute the functions g_s(x) defined in (3), s = 1, . . . , m. Thus, if S_{m+2} is a row of length m + 2 collecting the coefficients of the approximant f_m defined by (4), the resulting network computes f_m(x), and we have that |f_m(x) − x^2| ≤ 2^{−2m−2}. As xy = ((x + y)^2 − x^2 − y^2)/2, in the first layer of Mult_m we form the vector (1, x, y, x + y) and then apply the network from the first part of the proof in parallel to each of the pairs (1, x), (1, y) and (1, x + y). More precisely, for a given matrix M of size p × q, let M̃ be the matrix of size 3p × 3q whose diagonal blocks are three copies of M. Then, for the network obtained in this way, the output approximates xy, which, together with |f_m(x) − x^2| ≤ 2^{−2m−2} and the triangle inequality, implies (9). It remains to be noted that the entries of the product of the absolute values of the matrices presented in Mult_m are bounded by 2, which completes the proof of the lemma.
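The functional identity behind Mult_m can be checked numerically. The sketch below is ours, not the lemma's network: it squares the rescaled argument (x + y)/2 so that all squared quantities stay in [0, 1], which by the triangle inequality gives the slightly cruder error bound 3 · 2^{−2m−2}; it evaluates the same functions the network computes but is not a weight-matrix implementation:

```python
def g(x):
    # triangle wave written with the absolute value activation, cf. (7)
    return 1.0 - 2.0 * abs(x - 0.5)

def f_m(x, m):
    # approximant (4) of x^2 on [0, 1]; error at most 2^(-2m-2)
    gs, out = x, x
    for s in range(1, m + 1):
        gs = g(gs)
        out -= gs / 4.0**s
    return out

def mult(x, y, m):
    # xy = 2*((x+y)/2)^2 - x^2/2 - y^2/2; replacing each square by f_m
    # gives |mult(x, y, m) - x*y| <= 3 * 2^(-2m-2) for x, y in [0, 1]
    return 2.0 * f_m((x + y) / 2.0, m) - f_m(x, m) / 2.0 - f_m(y, m) / 2.0
```

At m = 6, the error over a grid of [0, 1]^2 stays below 3 · 2^{−14} ≈ 1.8 · 10^{−4}.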

Lemma 4.
For any positive integer m, there exists a neural network Mult^r_m ∈ F(L, p), with L = (2m + 5)⌈log_2 r⌉ + 1, p_0 = r + 1, p_{L+1} = 1 and ‖p‖_∞ ≤ 6r(m + 2) + 1, such that, for x = (x_1, . . . , x_r) ∈ [0, 1]^r,

|Mult^r_m(x) − x_1 · · · x_r| ≤ 3(r − 1)2^{−2m−3},

and, for the (r + 1)-dimensional vector J^r_m obtained by multiplication of the absolute values of the matrices presented in Mult^r_m, we have that ‖J^r_m‖_∞ ≤ 144r^4.
Proof. In the first layer, we obtain a vector whose first coordinate is 1, followed by the triples (1, x_{2l−1}, x_{2l}), l = 1, . . . , k, that is, the vector (1, 1, x_1, x_2, 1, x_3, x_4, . . . , 1, x_{2k−1}, x_{2k}). The network N^k_m is then obtained by applying the network Mult_m in parallel to each triple (1, x_{2l−1}, x_{2l}) while keeping the first coordinate equal to 1. The product of the absolute values of the matrices presented in this construction is a matrix of size (k + 1) × (2k + 1) in which each row has at most three nonzero entries, each of which is less than 2. Let q = ⌈log_2 r⌉. We then successively apply the networks N^{2^q}_m, N^{2^{q−1}}_m, . . . , N^2_m and, in the last layer, multiply the outcome by (0, 1). From Lemma 3 and the triangle inequality, we have that |Mult_m(x, y) − tz| ≤ 3 · 2^{−2m−3} + |x − t| + |y − z|, for x, y, t, z ∈ [0, 1]. Hence, by induction on q, we obtain the stated approximation bound for Mult^r_m. Note that the product of the absolute values of the matrices in each network N^k_m has the above form, that is, each of its rows has at most three nonzero entries, each of which is less than 2. As the matrices given in the first and the last layer of Mult^r_m also satisfy this property, each entry of the product of the absolute values of all matrices of Mult^r_m will not exceed 12^{q+2} ≤ 144r^4.
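The binary-tree composition in this proof can be emulated numerically as follows. This is an illustrative sketch (the helpers from the previous listing are repeated for self-containment): odd elements are carried forward instead of being padded with the constant coordinate 1, and intermediate values are clamped to [0, 1], both simplifications of ours:

```python
def g(x):
    return 1.0 - 2.0 * abs(x - 0.5)

def f_m(x, m):
    gs, out = x, x
    for s in range(1, m + 1):
        gs = g(gs)
        out -= gs / 4.0**s
    return out

def mult(x, y, m):
    return 2.0 * f_m((x + y) / 2.0, m) - f_m(x, m) / 2.0 - f_m(y, m) / 2.0

def mult_tree(xs, m):
    # approximate x_1 * ... * x_r by ceil(log2 r) rounds of pairwise
    # approximate products; clamping keeps each intermediate in [0, 1]
    xs = list(xs)
    while len(xs) > 1:
        nxt = [min(1.0, max(0.0, mult(xs[i], xs[i + 1], m)))
               for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:
            nxt.append(xs[-1])  # odd element passes to the next round
        xs = nxt
    return xs[0]
```

Each round contributes an error of at most 3 · 2^{−2m−2} per pairwise product, so the total error stays proportional to r · 2^{−2m}, mirroring the induction in the proof.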
Proof of Lemma 1. The first layer of Mon^d_{m,γ} computes the required coordinates by multiplying the input vector by a matrix Γ. In the following layers, we do not change the first d + 1 coordinates (multiplying them by I_{d+1}) and, to the coordinates corresponding to each monomial, we apply in parallel the network Mult^r_m of Lemma 4. As the matrix Γ only contains the entries 0 and 1, applying Lemma 4, we obtain that the entries of M are bounded by 144(γ + 1)^5.

Proof of Theorem 1. Applying Lemma 1 with m = ⌈log_2((4F + 16)/ε)⌉, we obtain the required accuracy for all monomials of degree less than γ, where we used the inequalities log_2(1 − δ)^{−1} ≥ δ, δ ∈ (0, 1), and (log_2 r)^2 ≤ r for r ≥ 16. In order to approximate the partial sum ∑_{‖k‖_1≤γ} a_k x^k, we add to the network Mon^d_{m,γ+1} one last layer with the coefficients of that partial sum. As the sum of the absolute values of those coefficients is bounded by F, combining (10) and (11), the obtained network approximates f with the stated error.

Let us now present the result from [19] that will be used to derive Lemma 2. First, if f ∈ A_d(ρ, F), then ([20], Theorem 4.1) f has a unique representation as an absolutely and uniformly convergent multivariate Chebyshev series

f(x) = ∑_k a_k T_{k_1}(x_1) · · · T_{k_d}(x_d).

Note that, for k := (k_1, . . . , k_d), the degree of the d-dimensional polynomial T_{k_1}(x_1) · · · T_{k_d}(x_d) is ‖k‖_1 = k_1 + · · · + k_d. Then, for any non-negative integers n_1, . . . , n_d, the partial sum

p(x) = ∑_{k_1≤n_1, . . . , k_d≤n_d} a_k T_{k_1}(x_1) · · · T_{k_d}(x_d)   (12)

is a polynomial truncation of the multivariate Chebyshev series of f of degree d(p) = n_1 + · · · + n_d. It is shown in [19] that the following holds.

Theorem 3. For f ∈ A_d(ρ, F), there is a constant C = C(d, ρ, F) such that the multivariate Chebyshev coefficients of f satisfy

|a_k| ≤ Cρ^{−‖k‖_2}   (13)

and the polynomial truncations p of the multivariate Chebyshev series of f converge to f at a geometric rate (14).

Proof of Lemma 2. Note that, from the recursive definition of the Chebyshev polynomials, it follows that, for any k ≥ 0, the coefficients of the Chebyshev polynomial T_k(x) are all bounded by 2^k. Let p now be a polynomial given by (12) with degree d(p) ≤ γ.
As the number of summands on the right-hand side of (12) is bounded by (γ + 1)^d, using (13), we obtain that p can be rewritten in the required form, where the last inequality in the corresponding estimate follows from the condition ρ ≥ 2√d.

Discussion
Although various activation functions, including the ReLU, the sigmoid and the Gaussian function, have already been used in the literature for neural network approximations of smooth and analytic functions (see [3,8,21]), the approximating properties of neural networks with the absolute value activation function, which is a built-in activation function of software-based neural network evolving methods (such as NEAT-Python, [11]), have barely been covered previously. Whereas the algorithms developed in [12,13] allow us to train neural networks with the absolute value activation function, in the present paper, we study the capability of those networks to approximate analytic functions. While the popular types of constraints imposed on approximating neural networks either control the ℓ_p norms of the network weights or adjust their architectures, in the present work, we study the approximating properties of neural networks with regularized path norms and show that networks with the absolute value activation function and with network path norms having logarithmic dependence on 1/ε can ε-approximate functions that are analytic on certain regions of C^d. The sizes and the weights of the constructed networks also have logarithmic dependence on 1/ε.