The Theory and Applications of Hölder Widths

Man Lu; Peixin Ye

doi:10.3390/axioms14010025

and

Department of Applied Mathematics, School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China

^*

Author to whom correspondence should be addressed.

Axioms2025, 14(1), 25;https://doi.org/10.3390/axioms14010025

This article belongs to the Special Issue Numerical Computation, Approximation of Functions and Applied Mathematics, 2nd Edition

Version Notes

Order Reprints

Abstract

We introduce the Hölder width, which measures the best error performance of some recent nonlinear approximation methods, such as deep neural network approximation. Then, we investigate the relationship between Hölder widths and other widths, showing that some Hölder widths are essentially smaller than

n

-Kolmogorov widths and linear widths. We also prove that, as the Hölder constants grow with n, the Hölder widths are much smaller than the entropy numbers. The fact that Hölder widths are smaller than the known widths implies that the nonlinear approximation represented by deep neural networks can provide a better approximation order than other existing approximation methods, such as adaptive finite elements and

n

-term wavelet approximation. In particular, we show that Hölder widths for Sobolev and Besov classes, induced by deep neural networks, are

O (n^{- 2 s / d})

and are much smaller than other known widths and entropy numbers, which are

O (n^{- s / d})

.

Keywords:

Hölder widths; deep neural networks; entropy numbers; nonlinear approximation; n-Kolmogorov widths; nonlinear (n, N)-widths; Sobolev classes; Besov classes

MSC:

41A46; 41A65; 68T07; 46N10

1. Introduction

Width theory is one of the most important topics in approximation theory because widths can be considered approximation standards that indicate the accuracy achievable for a given function class using an approximation method. They have been extensively studied and applied in various fields, providing a benchmark for the best performance of different approximation techniques. One of the earliest studies on widths was Kolmogorov’s work in 1936, where he introduced the concept of

n

-Kolmogorov widths [1]. With the development of modern science and engineering, the theory of widths has also developed rapidly, greatly promoting research into various linear and nonlinear approximation methods. Problems related to width theory have been and continue to be studied by many experts, including Pinkus, Lorentz, DeVore, Temlyakov, et al. [2,3,4,5,6,7,8,9,10,11,12]. In addition, nonlinear methods play a crucial role in understanding complex phenomena across various applications, such as compressed sensing, signal processing, and neural networks [13,14,15]. Widths such as manifold widths, nonlinear widths, and Lipschitz widths have been utilized as fundamental measures to assess the optimal convergence rate of these nonlinear methods [12,15,16,17,18].

It is known that neural networks can serve as powerful nonlinear tools. For instance, the ReLU (Rectified Linear Unit) activation function,

σ (t) : = ReLU (t) = max {0, t}, t \in R,

is characterized as a Lipschitz mapping, which has led to the introduction of stable manifold widths and Lipschitz widths. In [13,17], Cohen et al. and DeVore, et al. investigated stable manifold widths to quantify error performance in nonlinear approximation methods, such as compressed sensing and neural networks. They discussed the fundamental properties of these widths and established their connections with entropy numbers. In [18,19], Petrova and Wojtaszczyk introduced Lipschitz widths and showed their relationships with other widths and entropy numbers. However, not all mappings are Lipschitz; thus, it is essential to consider weaker conditions to understand the error performance of nonlinear approximation methods. One such condition is the Hölder mapping, which we explore in this paper. We will introduce the concept of Hölder widths and investigate their relationship with other widths and entropy numbers. Our results may provide a better understanding of the effects of such nonlinear approximation methods and their potential applications in deep neural networks.

Many authors have achieved profound results for the ReLU activation function, which acts as a

L i p 1

continuous function in feed-forward deep neural networks (DNNs) [14,17,18,20,21]. It is known that the mapping

Φ : (B_{l_{\infty}^{c n}} {, ∥ \cdot ∥}_{l_{\infty}^{c n}}) \to C ({[0, 1]}^{d})

in [18] with the ReLU activation function is a

C^{'} n W^{n}

Lipschitz mapping in DNNs, where the network has a width W and adepth n. Unlike the Lipschitz width, the terms ‘width’ and ‘depth’ here indicate the scale of the network. The performance of this Lipschitz mapping is discussed in [14].

We introduce a more flexible assumption. Let

(Y, d)

and

(Z, ρ)

be metric spaces. Moreover, we assume that the space Z is separable. Define

Φ : (Y, d) \to (Z, ρ)

as an

α

-Hölder mapping with coefficient

γ

if for any

x, y \in Y

,

ρ (Φ (x), Φ (y)) \leq γ d^{α} (x, y), γ > 0 and 0 < α \leq 1 .

We could also say that

Φ

satisfies the

H_{α} (γ)

condition [22] equivalent to

sup_{x, y \in Y} \frac{ρ (Φ (x), Φ (y))}{d^{α} (x, y)} \leq γ, γ > 0 and 0 < α \leq 1 .

We provide some remarks on Hölder mappings below.

Remark 1.

If

α = 1

, then Φ is Lipschitz continuous.

Remark 2.

If Y is bounded and Φ is an α-Hölder mapping, then for any

β \leq α

, Φ is a β-Hölder mapping.

Remark 3.

The minimum α-Hölder coefficient

γ = 0

if and only if Φ is constant.

Note that the RePU (Rectified Power Unit) activation function [23,24],

σ_{1} (t) : = RePU (t) = max {0, t}^{α}, t \in R,

with

α \in N, α \geq 2

, and the GELU (Gaussian Error Linear Unit) activation function [25],

σ_{2} (t) : = GELU (t) = 0.5 t (1 + tanh (\sqrt{\frac{2}{π}} (t + 0.044715 t^{3}))), t \in R .

can be considered 1-Hölder mappings in bounded spaces. The performance of these mappings can be found in [14,18]. Moreover, there are various

α

-Hölder activation functions with

0 < α < 1

. In [26], Forti, Grazzini, et al. obtained global convergence results, where the neuron activations were modeled by

α_{i}

-Hölder continuous functions with

α_{i} \in

(0, 1), such as

σ_{i} (t) = k_{i} sign (t) {| t |}^{α_{i}},

where

k_{i} > 0

and

i \in θ_{C}

defined in [26]. These activations can significantly increase the computational power [26,27,28]. Motivated by the above results, we mainly focus on the

α

-Hölder condition with

0 < α < 1

, which is weaker than the Lipschitz condition.

Now, we introduce Hölder widths, which measure the best error performance of some recent nonlinear approximation methods characterized by Hölder mappings. Throughout this paper, let X be a Banach space with a norm

{∥ \cdot ∥}_{X}

, and let

Y_{n}

be an

n

-dimensional Banach space with a norm

{∥ \cdot ∥}_{Y_{n}}

on

R^{n}

,

n \geq 1

. Denote the unit ball of

Y_{n}

by

B (Y_{n}) : = {y \in R^{n} {: ∥ y ∥}_{Y_{n}} \leq 1}

.

Let K be a bounded subset of X. For

γ \geq 0

,

0 < α < 1

, we define the fixed Hölder widths

δ_{n}^{γ, α} {(K, Y_{n})}_{X} : = inf_{Φ_{n}} sup_{f \in K} inf_{y \in B (Y_{n})} {∥ f - Φ_{n} (y) ∥}_{X},

(1)

where

Φ_{n} : (B (Y_{n}), {∥ \cdot ∥}_{Y_{n}}) \to X

satisfies

sup_{x, y \in B (Y_{n})} \frac{{∥ Φ (x) - Φ (y) ∥}_{X}}{{∥ x - y ∥}_{Y_{n}}^{α}} \leq γ .

Next, we define the Hölder width

δ_{n}^{γ, α} {(K)}_{X} : = inf_{{∥ \cdot ∥}_{Y_{n}}} δ_{n}^{γ, α} {(K, Y_{n})}_{X},

(2)

where the infimum is taken over all norms

{∥ \cdot ∥}_{Y_{n}}

on

R^{n}

. From definition (1), we see that the error of any numerical method based on Hölder mappings will not be smaller than the Hölder widths.

We propose the Hölder widths, exploring their properties and relationships with other known widths and entropy numbers. In Section 2, we establish the fundamental properties of Hölder widths. In Section 3, we compare Hölder widths with

n

-Kolmogorov widths, linear widths, and nonlinear

(n, N)

-widths. In Section 4, we investigate the relationship between Hölder widths and entropy numbers. In Section 5, we provide some specific applications and derive the asymptotic order of Hölder widths for Sobolev classes

B W_{p}^{s} ({[0, 1]}^{d})

and Besov classes

B B_{p, τ}^{s} ({[0, 1]}^{d})

using deep neural networks. In Section 6, we provide some concluding remarks. All detailed proofs for the results from Section 2, Section 3 and Section 4 are included in Appendix A, Appendix B and Appendix C, and the proofs for Theorems 10 and 15 in Section 5 are provided in Appendix D.

2. Fundamental Properties of Hölder Widths

Recall that the radius of a set

K \subset X

is defined as

rad (K) = : inf_{g \in X} sup_{f \in K} {∥ f - g ∥}_{X} .

(3)

It is known from Remark 3 that a function that satisfies the

H_{α} (0)

condition is a constant function. Then, for the

n

-dimensional space

Y_{n}

,

rad (K) = δ_{n}^{0, α} {(K, Y_{n})}_{X} = δ_{n}^{0, α} {(K)}_{X} .

(4)

Moreover, for a fixed constant

α

, it is known from (2) that

δ_{n}^{γ, α} {(K)}_{X}

is decreasing with respect to

γ

and n, that is, (i) if

γ_{1} \leq γ_{2}

, then

δ_{n}^{γ_{2}, α} {(K)}_{X} \leq δ_{n}^{γ_{1}, α} {(K)}_{X} \leq rad (K) < \infty

, and (ii) if

n_{1} \leq n_{2}

, then

δ_{n_{2}}^{γ, α} {(K)}_{X} \leq δ_{n_{1}}^{γ, α} {(K)}_{X}

.

In addition, it is easy to see that the space (

R^{n}

,

{∥ \cdot ∥}_{Y_{n}}

) in (1) and (2) can be replaced with any

n

-dimensional normed space (

Z_{n}

,

{∥ \cdot ∥}_{Z_{n}}

) such that

δ_{n}^{γ, α} {(K)}_{X} = inf_{{∥ \cdot ∥}_{Z_{n}}} δ_{n}^{γ, α} {(K, Z_{n})}_{X}, δ_{n}^{γ, α} {(K, Z_{n})}_{X} = inf_{Φ_{n}} sup_{f \in K} inf_{z \in B (Z_{n})} {∥ f - Φ_{n} (z) ∥}_{X},

where

B (Z_{n}) : = {z \in Z_{n} {: ∥ z ∥}_{Z_{n}} \leq 1} .

Denote by

ℓ_{p}^{n} : = (R^{n} {, ∥ \cdot ∥}_{ℓ_{p}})

the space

R^{n}

equipped with the

ℓ_{p}

norm, that is, for

y = (y_{1}, y_{2}, \dots, y_{n}) \in R^{n}

,

{∥ y ∥}_{ℓ_{\infty}^{n}} : = max_{j} | y_{j} {|, ∥ y ∥}_{ℓ_{p}^{n}} : = {(\sum_{j = 1}^{n} {|y_{j}|}^{p})}^{1 / p} .

Recall that an

ε

-covering of K is a collection

{g_{1}, \dots, g_{m}} \subset X

such that

K \subset ⋃_{j = 1}^{m} B (g_{j}, ε) .

The minimal

ε

-covering number

N_{ε} (K)

is the minimal cardinality of the

ε

-covering of K. We say that a set K is totally bounded if for every

ε > 0

,

N_{ε} (K) < \infty

.

We establish the following fundamental properties of Hölder widths.

Theorem 1.

Let K be a compact subset of X. For any

n \in N

,

γ > 0

, and

0 < α < 1

, there exists a norm

{∥ \cdot ∥}_{Y}

on

R^{n}

satisfying for

y \in R^{n}

,

{∥ y ∥}_{ℓ_{\infty}^{n}} \leq {∥ y ∥}_{Y} \leq {∥ y ∥}_{ℓ_{1}^{n}},

(5)

such that

δ_{n}^{γ, α} {(K)}_{X} = δ_{n}^{γ, α} {(K, Y)}_{X} .

Theorem 2.

Let K be a compact subset of X. As a function of γ, the Hölder width

δ_{n}^{γ, α} {(K)}_{X}

is continuous.

Theorem 3.

A subset K of X is totally bounded if and only if for every

n \in N

,

lim_{γ \to \infty} δ_{n}^{γ, α} {(K)}_{X} = 0 .

Theorem 4.

Let

K \subset X

and

0 < α < 1

. If for every

ε > 0

, there exist two numbers

n \in N

and

γ > 0

such that

δ_{n}^{γ, α} {(K)}_{X} \leq ε

, then K is totally bounded.

It is important to delve deeper into Hölder widths since the Hölder condition offers a more flexible framework that can accommodate a broader range of functions, making it particularly suitable for analyzing and approximating complicated functions and datasets. In the forthcoming sections, we will gain deeper insights into the approximation performance measured by Hölder widths.

3. The Relationship Between Hölder Widths and Other Widths

In this section, we will demonstrate that Hölder widths are smaller than other known widths, such as

n

-Kolmogorov, linear, and nonlinear

(n, N)

-widths. In the following sections, we assume that K is a compact subset of the Banach space X, which is our main concern.

3.1. The Relationship Between Hölder Widths, $n$ -Kolmogorov Widths, and Linear Widths

We recall the definition of the

n

-Kolmogorov width of K from [2], as follows:

d_{0} {(K)}_{X} : = sup_{f \in K} {∥ f ∥}_{X}, d_{n} {(K)}_{X} : = inf_{dim (X_{n}) = n} sup_{f \in K} inf_{g \in X_{n}} {∥ f - g ∥}_{X}, n \geq 1,

where the infimum is taken over all

n

-dimensional subspaces

X_{n} \subset X

.

It is known that the

n

-Kolmogorov width determines the optimal errors generated by approximating the ‘worst’ element of the set K using

n

-dimensional subspaces of X. We will show that some Hölder widths are essentially smaller than

n

-Kolmogorov widths.

Theorem 5.

For a compact set

K \subset X

and

n \geq 1

,

0 < α < 1

,

δ_{n}^{γ, α} {(K)}_{X} \leq d_{n} {(K)}_{X}, f o r γ = \frac{d_{n} {(K)}_{X} + rad (K)}{2^{α - 1}} .

Corollary 1.

If

K \subset X

is compact, then for each

n_{0} \in N

and each

γ \geq \frac{d_{n_{0}} {(K)}_{X} + rad (K)}{2^{α - 1}}

,

lim_{n \to \infty} δ_{n}^{γ, α} {(K)}_{X} = 0 .

Remark 4.

Recall that the linear width

d_{n}^{L} {(K)}_{X}

is defined as

d_{0}^{L} {(K)}_{X} : = sup_{f \in K} {∥ f ∥}_{X}, d_{n}^{L} {(K)}_{X} : = inf_{L \in L_{n}} sup_{f \in K} {∥ f - L (f) ∥}_{X}, n \geq 1,

where the infimum is taken over the class

L_{n}

of all continuous linear operators from X into itself with rank at most n. It follows from the definitions of the

n

-Kolmogorov width and linear width that

d_{n} {(K)}_{X} \leq d_{n}^{L} {(K)}_{X}

. Thus, Theorem 5 implies that

δ_{n}^{γ, α} {(K)}_{X} \leq d_{n}^{L} {(K)}_{X}, f o r γ = \frac{d_{n} {(K)}_{X} + rad (K)}{2^{α - 1}} .

3.2. The Relationship Between Hölder Widths and Nonlinear $(n, N)$ -Widths

To evaluate the performance of the best

n

-term approximation with respect to different systems, such as the trigonometric system and wavelet bases, Temlyakov introduced the nonlinear

(n, N)

-width in [16], which is defined as follows: for

N \geq 1

,

n \geq 1

,

d_{0} {(K, N)}_{X} : = sup_{f \in K} {∥ f ∥}_{X}, d_{n} {(K, N)}_{X} : = inf_{L_{N}} sup_{f \in K} inf_{X_{n} \in L_{N}} dist {(f, X_{n})}_{X} .

(6)

where the second infimum is taken over the sets

L_{N}

of N linear spaces

X_{n} \subset X

of dimension n. The nonlinear

(n, N)

-width can reflect the approximation performance of greedy algorithms.

It is clear that

d_{n} {(K, 1)}_{X} = d_{n} {(K)}_{X}

. The larger N is, the more flexibility we have in approximating f. Moreover, it is known from (6) and Theorem 5 that

d_{n} {(K, N)}_{X} \geq d_{n \cdot N} {(K)}_{X} \geq δ_{n \cdot N}^{γ, α} {(K)}_{X}, where γ = 2^{2 - α} rad (K) .

Moreover, we obtain the following inequalities, revealing the relationship between Hölder widths and nonlinear

(n, N)

-widths.

Theorem 6.

For any

n \geq 1

,

N > 1

,

0 < α < 1

, and any compact set

K \subset X

with

sup_{f \in K} {∥ f ∥}_{X} = 1

, we have

δ_{n + 1}^{4 (N + 1) \sqrt{n}, α} {(K)}_{X} \leq d_{n} {(K, N)}_{X}, and δ_{n + ⌈ {log}_{2} N ⌉}^{12 \sqrt{n}, α} {(K)}_{X} \leq d_{n} {(K, N)}_{X} .

4. Comparison Between Hölder Widths and Entropy Numbers

We first recall the definition of the entropy number from [29]. The entropy number

ε_{n} {(K)}_{X}

is defined as

ε_{n} {(K)}_{X} : = inf {ε > 0 : K \subset ⋃_{j = 1}^{2^{n}} B (g_{j}, ε), g_{j} \in X, j = 1, 2, \dots, 2^{n}},

which is the infimum of all

ε > 0

for which

2^{n}

balls of radius

ε

cover a compact set

K \subset X

, and

B (g_{j}, ε) : = {g \in X : ∥ g - g_{j} ∥_{X} \leq ε}

.

Entropy numbers have many applications in fields such as compressed sensing, statistics, and learning theory [30,31,32]. They can provide a benchmark for the best error performance of numerical recovery algorithms. Sometimes, estimating the entropy numbers

ε_{n} {(K)}_{X}

is more accessible than computing other known widths, such as

n

-Kolmogorov widths and nonlinear

(n, N)

-widths. For example, for some model classes K, such as unit balls in classical Sobolev and Besov spaces, the entropy numbers are known and can also be used to estimate the lower bound of these widths.

In this section, we compare the convergence rate of the Hölder widths with that of the entropy numbers. We first obtain the following general results.

Theorem 7.

For any

n \geq 1

and

0 < α < 1

, we have

δ_{n}^{2^{k} rad (K), α} {(K)}_{X} \leq ε_{k n} {(K)}_{X}, k = 1, 2, \dots .

Specifically, if

k = n

, then

δ_{n}^{2^{n} rad (K), α} {(K)}_{X} \leq ε_{n^{2}} {(K)}_{X} .

Remark 5.

It is known from Theorem 7 that if

k = 1

, then

δ_{n}^{2 \cdot rad (K), α} {(K)}_{X} \leq ε_{n} {(K)}_{X} .

So, by the decreasing property of

δ_{n}^{γ, α} {(K)}_{X}

with respect to γ, Hölder widths are smaller than entropy numbers for

γ \geq 2 \cdot rad (K)

.

It follows from Theorem 7 with

k = 1

that the following corollary holds.

Corollary 2.

Let

0 < α < 1

, and

γ \geq 2 \cdot rad (K)

.

(i) If the following inequality holds:

ε_{n} {(K)}_{X} \leq c_{1} \frac{{({log}_{2} n)}^{p}}{n^{q}}, n \geq 1,

where

c_{1}, p, q

are some constants such that

p \in R

,

c_{1}, q > 0

, then we have

δ_{n}^{γ, α} {(K)}_{X} \leq C \frac{{({log}_{2} n)}^{p}}{n^{q}}, n \geq 1,

where C is a positive constant.

(ii) If the following inequality holds:

ε_{n} {(K)}_{X} \leq c_{1} {({log}_{2} n)}^{- q}, n \geq 1,

where

c_{1}, q

are two positive constants, then we have

δ_{n}^{γ, α} {(K)}_{X} \leq C {({log}_{2} n)}^{- q}, n \geq 1,

where C is a positive constant.

Theorem 7 and Corollary 2 state that given an upper bound on the entropy number, it can be determined that the upper bound of the Hölder width is less than it. Conversely, if we know a lower bound of the entropy number, then we can obtain a lower bound of the Hölder width. To show this, we need the following theorem.

Theorem 8.

Let

γ > 0

,

0 < α < 1

. If there exists

n > n_{0}

, where

n_{0} = n_{0} (c_{0}, α, p, q)

satisfies

p \in R, q > 0

,

δ_{n}^{γ, α} {(K)}_{X} < c_{0} \frac{{({log}_{2} n)}^{p}}{n^{q}},

then for

m = c_{1} n {log}_{2} n

,

ε_{m} {(K)}_{X} < C \frac{{({log}_{2} m)}^{p + q}}{m^{q}},

where

c_{1}, C

are constants that depend on

γ, α, c_{0}, p

, and q.

Based on Theorem 8, we can obtain a lower bound of the Hölder width from that of the entropy number.

Theorem 9.

(i) If the following inequality holds:

ε_{n} {(K)}_{X} > c_{1} \frac{{({log}_{2} n)}^{p}}{n^{q}}, n \geq 1,

where

c_{1}, p, q

are constants such that

p \in R

,

c_{1}, q > 0

, then for each

γ > 0

and

0 < α < 1

, we have

δ_{n}^{γ, α} {(K)}_{X} \geq C \frac{{({log}_{2} n)}^{p - q}}{n^{q}}, n \geq 1,

(7)

where C is a positive constant.

(ii) If the following inequality holds:

ε_{n} {(K)}_{X} > c_{1} {({log}_{2} n)}^{- q}, n \geq 1,

where

c_{1}, q

are some positive constants, then for each

γ > 0

and

0 < α < 1

, we have

δ_{n}^{γ, α} {(K)}_{X} \geq C {({log}_{2} n)}^{- q}, n \geq 1,

(8)

where C is a positive constant.

(iii) If the following inequality holds:

ε_{n} {(K)}_{X} > c_{1} 2^{- c_{2} n^{p}}, n \geq 1,

where

c_{1}, c_{2}, p

are some constants such that

c_{1}, c_{2} > 0

, and

0 < p < 1

, then for each

γ > \frac{d_{n} {(K)}_{X} + rad (K)}{2^{α - 1}}, and 0 < α < 1,

we have

δ_{n}^{γ, α} {(K)}_{X} \geq C_{1} 2^{- C_{2} n^{\frac{p}{1 - p}}}, n \geq 1,

(9)

where

C_{1}, C_{2}

are two positive constants.

Combining Corollary 2 (ii) with Theorem 9 (ii), we derive the following corollary.

Corollary 3.

For any compact set

K \subset X

,

0 < α < 1

,

γ \geq max \{2 \cdot rad (K), \frac{d_{n_{0}} {(K)}_{X} + rad (K)}{2^{α - 1}}\}

, and

q > 0

, then

ε_{n} {(K)}_{X} ≍ {({log}_{2} n)}^{- q} ⟺ δ_{n}^{γ, α} {(K)}_{X} ≍ {({log}_{2} n)}^{- q}, for any n \geq 1 .

5. Some Applications

The importance of the Hölder width lies in its lower bound, which is independent of any specific algorithm. This bound not only reveals the limitations of certain approximation tools but also provides information on the order of the Hölder width without knowing the concrete algorithms. The insights from the lower bound can help us show the optimality of some existing algorithms or prompt us to design optimal algorithms that can achieve such a bound. In essence, the concept of width is independent of any specific algorithm but inspires us to design optimal algorithms.

We apply the above general theoretical results to some important function spaces and obtain the corresponding orders of Hölder widths.

First, we remark that some common neural networks are Hölder mappings. Therefore, we can obtain their asymptotic orders of Hölder widths characterized by these fully connected feed-forward neural networks. In the following discussion, we mainly consider the Banach spaces X of functions, where

C ({[0, 1]}^{d})

is continuously embedded in X. Let

σ : R \to R

be an activation function. Denote by

Ω : = {[0, 1]}^{d}

and

Φ_{σ} : (B_{l_{\infty}^{\tilde{n}}}, {∥ \cdot ∥}_{l_{\infty}^{\tilde{n}}}) \to C (Ω)

.

A feed-forward neural network with width W, depth n, and activation

σ

produces a family

Σ_{n, σ} \subset C (Ω)

:

Σ_{n, σ} : = \{Φ_{σ} (t) : t \in R^{\tilde{n}}, \tilde{n} = \tilde{n} (W, n) = C_{0} n\},

which generates an approximation to a target element

f \in K

. For every

t \in R^{\tilde{n}},

there exists a continuous function

Φ_{σ} (t) \in Σ_{n, σ}

on

Ω

Φ_{σ} (t) : = T^{(n)} \circ \bar{σ} \circ T^{(n - 1)} \circ \bar{σ} \circ \dots \circ \bar{σ} \circ T^{(0)},

(10)

where the affine mappings

T^{(0)} : R^{d} \to R^{W}

,

T^{(k)} : R^{W} \to R^{W}

,

k = 1, 2 \dots, n - 1

, and

T^{(n)} : R^{W} \to R

, and the function

\bar{σ} : R^{W} \to R^{W}

\bar{σ} (z_{j + 1}, \dots, z_{j + W}) : = (σ (z_{j + 1}), \dots, σ (z_{j + W})) .

Here, t is the vector whose coordinates are the entries of the matrices and biases of

T^{(k)}

,

k = 0, 1, \dots, n

. We note that the dimension of any hidden layer can naturally be expanded; thus, any fully connected network can be made to have a fixed width [13,33]. Our assumption about a fixed width W can simplify the computations and notations.

Proposition 1.

If σ is an

H_{α} (γ)

mapping, then

Φ_{σ} : (B_{l_{\infty}^{\tilde{n}}}, {∥ \cdot ∥}_{l_{\infty}^{\tilde{n}}})

, defined in (10), is an

H_{α} (Γ_{n})

mapping, which means that for

t, t^{'} \in B_{l_{\infty}^{\tilde{n}}}

,

∥ Φ_{σ} (t) - Φ_{σ} (t^{'}) ∥_{C (Ω)} \leq Γ_{n} {∥ t - t^{'} ∥}_{l_{\infty}^{\tilde{n}}}^{α},

where

Γ_{n} = M_{d, α} γ_{0}^{n} {(2 W)}^{n α}

,

γ_{0} = max {γ, 1}

, and

M_{d, α}

is a constant.

It follows that a specific neural network can be a Hölder mapping with coefficient

Γ_{n} = M_{d, α} {({(2 W)}^{α} γ_{0})}^{n}

to approximate the target element f, where

0 < α < 1

and

W > 1

. Then, we consider the lower bound for the Hölder width with coefficient

γ_{n} = M λ^{n}

,

M > 0

,

λ > 1

, which also implies the lower bound of the DNN approximation.

Theorem 10.

Let

0 < α < 1

and

γ_{n} = M λ^{n}

, where

M > 0

,

λ > 1

,

n \geq 1

.

(i) If the following inequality holds:

ε_{n} {(K)}_{X} > c_{1} \frac{{({log}_{2} n)}^{p}}{n^{q}}, n \geq 1,

where

c_{1}, p, q

are constants such that

p \in R

,

c_{1}, q > 0

, then

δ_{n}^{γ_{n}, α} {(K)}_{X} \geq C \frac{{({log}_{2} n)}^{p}}{n^{2 q}}, n \geq 1,

(11)

where C is a positive constant.

(ii) If the following inequality holds:

ε_{n} {(K)}_{X} > c_{1} {({log}_{2} n)}^{- q}, n \geq 1,

where

c_{1}, q

are some positive constants, then

δ_{n}^{γ_{n}, α} {(K)}_{X} \geq C {({log}_{2} n)}^{- q}, n \geq 1,

(12)

where C is a positive constant.

Theorems 7 and 10 imply the following corollary, which means that if we know the asymptotic orders of some entropy numbers, we can obtain the asymptotic orders of their Hölder widths.

Corollary 4.

Let

0 < α < 1

and

γ_{n} = M λ^{n}

, where

M > 0

,

λ > 1

,

n \geq 1

.

(i) If the following holds:

ε_{n} {(K)}_{X} ≍ \frac{{({log}_{2} n)}^{p}}{n^{q}}, n \geq 1,

where

p, q

are some constants such that

p \in R

,

q > 0

, then we have

δ_{n}^{γ_{n}, α} {(K)}_{X} ≍ \frac{{({log}_{2} n)}^{p}}{n^{2 q}}, n \geq 1 .

(ii) If the following holds:

ε_{n} {(K)}_{X} ≍ {({log}_{2} n)}^{- q}, n \geq 1,

where q is a positive constant, then we have

δ_{n}^{γ_{n}, α} {(K)}_{X} ≍ {({log}_{2} n)}^{- q}, n \geq 1 .

We point out that Corollary 4 provides a tool for giving lower bounds on how well a compact set K can be approximated by a DNN, where all weights and biases are from the unit ball of some norm

{∥ \cdot ∥}_{Y_{\tilde{n}}}

. The classical model classes K for multivariate functions can be the unit balls of smoothness spaces, such as classical Lipschitz, Hölder, Sobolev, and Besov spaces. For any model class K, denote the unit ball of K by

B K : = {f \in K : ∥ f ∥_{K} \leq 1} .

For any

K, A \subset X

, let

dist {(K, A)}_{X} : = sup_{f \in K} inf_{g \in A} {∥ f - g ∥}_{X} .

Many experts have investigated the performance of deep learning approximation for these function classes of the Lebesgue space

L_{p}

on the cube

Ω

[20,21,33,34].

First, we determine the exact order of the Hölder width for the classical Sobolev class. For

1 \leq p \leq \infty

, we denote by

L_{p} (Ω)

the usual Lebesgue space on

Ω

equipped with the

L_{p}

norm

{∥ f ∥}_{p} : = {∥ f ∥}_{L_{p} (Ω)} = \{\begin{matrix} {(\int_{Ω} {| f (x) |}^{p} d μ (x))}^{1 / p}, 1 \leq p < \infty, \\ ess sup_{x \in Ω} | f (x) |, p = \infty . \end{matrix}

For

s \in N

,

1 \leq p \leq \infty

, we say that function f belongs to the Sobolev space

W_{p}^{s} (Ω)

if

f \in L_{p} (Ω)

, and the norm

{∥ \cdot ∥}_{W_{p}^{s}}

of f is given by, for

| k | : = \sum_{j = 1}^{d} | k_{j} | = s

,

{∥ f ∥}_{W_{p}^{s}} : = {∥ f ∥}_{p} + max_{| k | = s} ∥ D^{k} f ∥_{p} < \infty .

It is known from [35] that when

s > d {(1 / p - 1 / q)}_{+}

,

1 < p, q < \infty

ε_{n} {(B W_{p}^{s} (Ω))}_{L_{q}} ≍ n^{- s / d} .

(13)

Moreover, the approximation error rates

O (n^{- s / d})

for Sobolev and Besov classes can be obtained using many classical methods of nonlinear approximation, such as adaptive finite elements or

n

-term wavelet approximation [13,15].

It follows from Theorem 10 and Corollary 4 that for some deep neural networks,

dist {(B W_{p}^{s} (Ω), Σ_{n, σ})}_{L_{q}} \geq δ_{\tilde{n}}^{γ_{n}, α} {(B W_{p}^{s} (Ω))}_{L_{q}} ≫ n^{- 2 s / d},

(14)

where

Σ_{n, σ}

is produced by some neural networks with depth n, fixed width, and

H_{α} (γ)

activation functions

σ

,

0 < α \leq 1

. Compared with classical methods, the factor 2 in the exponent leaves open the possibility of improved approximation rates when using deep neural networks. Indeed, it is known from [33] that there exists a neural network

Σ_{n, σ^{'}}

with depth n, width

25 d + 31

, and ReLU activation function

σ^{'}

such that, for

f \in B W_{p}^{s} (Ω),

inf_{f_{n} \in Σ_{n, σ^{'}}} ∥ f - f_{n} ∥_{q} ≪ {∥ f ∥}_{W_{p}^{s}} \cdot n^{- 2 s / d},

(15)

The author used a novel bit-extraction technique, which gives an optimal encoding of sparse vectors, to obtain the upper bounds

O (n^{- 2 s / d})

.

Based on the above discussion, we obtain the Hölder widths for the Sobolev classes

B W_{p}^{s} (Ω)

.

Theorem 11.

Let

1 < p, q < \infty

,

0 < α \leq 1

, and

s > d {(\frac{1}{p} - \frac{1}{q})}_{+}

. Then, there exist

M > 0

,

λ > 1

such that for

γ_{n} = M λ^{n}

,

δ_{n}^{γ_{n}, α} {(B W_{p}^{s} (Ω))}_{L_{q}} ≍ dist {(B W_{p}^{s} (Ω), Σ_{n, σ^{'}})}_{L_{q}} ≍ n^{- 2 s / d} .

Proof.

It is known from (13) and Corollary 4 that for any

M > 0

,

λ > 1

,

δ_{n}^{γ_{n}, α} {(B W_{p}^{s} (Ω))}_{L_{q}} ≫ n^{- 2 s / d} .

Moreover, it is known from (14) that

dist {(B W_{p}^{s} (Ω), Σ_{n, σ^{'}})}_{L_{q}} ≫ n^{- 2 s / d} .

It follows from (15) that

dist {(B W_{p}^{s} (Ω), Σ_{n, σ^{'}})}_{L_{q}} ≍ sup_{f \in B W_{p}^{s} (Ω)} inf_{f_{n} \in Σ_{n, σ^{'}}} {∥ f - f_{n} ∥}_{q} ≪ n^{- 2 s / d},

and for any

f \in B W_{p}^{s} (Ω),

inf_{f_{n} \in Σ_{n, σ^{'}}} {∥ f - f_{n} ∥}_{q} ≪ n^{- 2 s / d} .

Therefore, there exist constants c,

c_{0} > 0

such that

δ_{c_{0} n}^{c {(50 d + 62)}^{n}, 1} {(B W_{p}^{s} (Ω))}_{L_{q}} ≪ n^{- 2 s / d} .

Replacing

c_{0} n

with n, it follows from the decreasing property of the Hölder width that there exist

M = 2 c > 0

and

λ = {(50 d + 62)}^{1 / c_{0}} > 1

such that

δ_{n}^{γ_{n}, α} {(B W_{p}^{s} (Ω))}_{L_{q}} \leq δ_{n}^{c λ^{n}, 1} {(B W_{p}^{s} (Ω))}_{L_{q}} ≪ n^{- 2 s / d} .

Thus, we complete the proof of Theorem 11. □

Remark 6.

Theorem 11 implies that the upper bound in inequality (15) is sharp.

For a fixed

s \in N

and

d = 1

, Figure 1 compares the performance of the approximation error versus the number of elements or layers using different approximation methods, such as the classical methods represented by

n

-term wavelets, adaptive finite elements, and new tools represented by deep neural networks. The blue solid line shows the approximation error decreasing at a rate of

O (n^{- s})

with the number of elements n for

n

-term wavelets or adaptive finite elements, while the orange dashed line indicates a faster decay of

O (n^{- 2 s})

with depth n for deep neural networks. Overall, deep neural networks significantly outperform classical methods, such as

n

-term wavelets or adaptive finite elements, offering more rapid convergence and potentially higher accuracy in approximating functions. We call this phenomenon the super-convergence of deep neural networks, where the classical Hölder, Sobolev, and Besov classes on

{[0, 1]}^{d}

can achieve super-convergence [20,21].

Figure 1. Approximation error and the number of elements n and depth d: classical methods (

n

-term wavelets and adaptive finite elements) vs. new tools (deep neural networks).

The results of Theorem 11 can be extended to Besov spaces, which are much more general than Sobolev spaces. It is well known that functions from Besov spaces have been widely used in approximation theory, statistics, image processing, and machine learning (see [6,36,37,38], and the references therein). Recall that for

r \in N

, the modulus of smoothness of order r of

f \in L_{p} (Ω)

is

ω_{r} {(f, t)}_{p} : = sup_{| h | \leq t} {∥ ▵_{h}^{r} (f, \cdot, Ω) ∥}_{p},

where

▵_{h}^{r} (f, x)

is the r-th order difference of f with step h, and

▵_{h}^{r} (f, x, Ω) : = \{\begin{matrix} ▵_{h}^{r} (f, x), x, x + h, \dots, x + r h \in Ω, \\ 0, otherwise . \end{matrix}

For

0 < s < r

,

1 \leq p, τ \leq \infty

, we say that function f belongs to the Besov space

B_{p, τ}^{s} (Ω)

if

f \in L_{p} (Ω)

, and the norm

{∥ \cdot ∥}_{B_{p, τ}^{s}}

of f, given by

\begin{matrix} {∥ f ∥}_{B_{p, τ}^{s}} : = \{\begin{matrix} {∥ f ∥}_{p} + {(\int_{0}^{1} {(\frac{ω_{r} {(f, t)}_{p}}{t^{s}})}^{τ} \frac{d t}{t})}^{\frac{1}{τ}}, 1 \leq τ < \infty; \\ {∥ f ∥}_{p} + sup_{t > 0} \frac{ω_{r} {(f, t)}_{p}}{t^{s}}, τ = \infty, \end{matrix} \end{matrix}

is finite. It is known from [15] that when

s > d {(1 / p - 1 / q)}_{+}

,

q \neq \infty

ε_{n} {(B B_{p, τ}^{s} (Ω))}_{L_{q}} ≍ n^{- s / d} .

It follows from Theorem 10 and Corollary 4 that for some deep neural networks,

dist {(B B_{p, τ}^{s} (Ω), Σ_{n, σ})}_{L_{q}} \geq δ_{\tilde{n}}^{γ_{n}, α} {(B B_{p, τ}^{s} (Ω))}_{L_{q}} ≫ n^{- 2 s / d},

and it is also known from [33] that there exists a neural network

Σ_{n, σ^{'}}

with depth n, width

25 d + 31

, and ReLU activation function

σ^{'}

such that for

f \in B B_{p, τ}^{s} (Ω),

inf_{f_{n} \in Σ_{n, σ^{'}}} ∥ f - f_{n} ∥_{q} ≪ {∥ f ∥}_{B_{p, τ}^{s}} \cdot n^{- 2 s / d} .

Thus, we obtain the Hölder widths for the Besov classes

B B_{p, τ}^{s} (Ω)

.

Theorem 12.

Let

1 < p, q < \infty

,

1 \leq τ \leq \infty

,

0 < α \leq 1

, and

s > d {(\frac{1}{p} - \frac{1}{q})}_{+}

. Then, there exist

M > 0

,

λ > 1

such that for

γ_{n} = M λ^{n}

,

δ_{n}^{γ_{n}, α} {(B B_{p, τ}^{s} (Ω))}_{L_{q}} ≍ dist {(B B_{p, τ}^{s} (Ω), Σ_{n, σ^{'}})}_{L_{q}} ≍ n^{- 2 s / d} .

The proof of Theorem 12 is similar to that of Theorem 11, so we omit the details here.

Note that any numerical algorithm based on Hölder mappings will have a convergence rate that is not faster than that of the Hölder width. This characterizes the limitation of the approximation power of some deep neural networks.

Meanwhile, it is well known that spherical approximation has been widely applied in many fields, such as cosmic microwave background analysis, global ionospheric prediction for geomagnetic storms, climate change modeling, environmental governance, and other spherical signals [39].

We recall some concepts on the sphere from [40]. Let

S^{d - 1} : = \{x = (x_{1}, \dots, x_{d}) \in R^{d} : \sum_{j = 1}^{d} x_{j}^{2} = 1\}

be the unit sphere in

R^{d}

equipped with the rotation-invariant measure

d μ (x)

normalized by

\int_{S^{d - 1}} d μ (x) = 1

. For

1 \leq p < \infty

, denote by

L_{p} (S^{d - 1})

the usual Lebesgue space on

S^{d - 1}

endowed with the

L_{p}

norm

{∥ f ∥}_{L_{p} (S^{d - 1})} : = {(\int_{S^{d - 1}} {| f (x) |}^{p} d μ (x))}^{1 / p} for 1 \leq p < \infty .

In [41], Feng et al. studied the approximation of the Sobolev class on the sphere

S^{d - 1}

, denoted by

W_{p}^{s} (S^{d - 1})

, using convolutional neural networks with layer J. Recall that a function f belongs to the Sobolev class

W_{p}^{s} (S^{d - 1})

if

f \in L_{p} (S^{d - 1})

, and the norm

{∥ \cdot ∥}_{W_{p}^{s} (S^{d - 1})}

of f is given by

{∥ f ∥}_{W_{p}^{s} (S^{d - 1})} : = {∥ f ∥}_{L_{p} (S^{d - 1})} + ∥ {(- ▵_{0})}^{s / 2} f ∥_{L_{p} (S^{d - 1})} < \infty,

where

▵_{0}

is the Laplace–Beltrami operator on the sphere. The authors obtained the upper bound

O (J^{- \frac{s}{d - 1}})

of the error of such an approximation in

L_{p} (S^{d - 1})

.

However, it is known from [42] that when

s > (d - 1) {(1 / p - 1 / q)}_{+}

,

1 < p, q < \infty

ε_{n} {(B W_{p}^{s} (S^{d - 1}))}_{L_{q}} ≍ n^{- \frac{s}{d - 1}} .

Then, it follows from Corollary 4 that the lower bound of the Hölder width for

B W_{p}^{s} (S^{d - 1})

may be

O (n^{- \frac{2 s}{d - 1}})

.

Theorem 13.

Let

1 < p, q < \infty

,

0 < α \leq 1

, and

s > (d - 1) {(\frac{1}{p} - \frac{1}{q})}_{+}

. Then, there exist

M > 0

,

λ > 1

such that for

γ_{n} = M λ^{n}

,

dist {(B W_{p}^{s} (S^{d - 1}), Σ_{n, σ^{'}})}_{L_{q}} ≫ δ_{n}^{γ_{n}, α} {(B W_{p}^{s} (S^{d - 1}))}_{L_{q}} ≫ n^{- \frac{2 s}{d - 1}} .

With the development of spherical theory [40,43,44,45] and the relationship between Hölder widths and entropy numbers, we conjecture that the approximation order of

W_{p}^{s} (S^{d - 1})

using some fully connected feed-forward neural networks with the bit-extraction technique may be

O (n^{- \frac{2 s}{d - 1}})

. This conjecture can be formulated as follows.

Conjecture 1.

Let

1 < p, q < \infty

,

0 < α \leq 1

, and

s > (d - 1) {(\frac{1}{p} - \frac{1}{q})}_{+}

. Then, there exist

M > 0

,

λ > 1

such that for

γ_{n} = M λ^{n}

,

δ_{n}^{γ_{n}, α} {(B W_{p}^{s} (S^{d - 1}))}_{L_{q}} ≍ dist {(B W_{p}^{s} (S^{d - 1}), Σ_{n, σ^{'}})}_{L_{q}} ≍ n^{- \frac{2 s}{d - 1}} .

Our results show that networks modeled by both Hölder and Lipschitz (e.g., ReLU) functions can achieve the approximation error

O (n^{- 2 s / d})

, which is superior to classical approximation tools such as

n

-term wavelets and adaptive finite elements. We achieve the same approximation error for networks with a weaker condition,

0 < α < 1

, which gives us more options for neural network approximation tools. For example, if we need to solve numerically differential equations with a discontinuous right-hand side modeling neural network dynamics, we can choose networks with non-Lipschitz activation functions, using the fact that non-Lipschitz functions share the peculiar property that even small variations in the neuron state are able to produce significant changes in the neuron output [26,27,28]. The results of the Hölder widths would help us select suitable Hölder activation functions. Moreover, it is known from Appendix B.2 that the Hölder width is smaller than the Lipschitz width in the sense that

δ_{n}^{2 γ, α} {(K)}_{X} \leq δ_{n}^{γ, 1} {(K)}_{X} .

Thus, from the perspective of the approximation error, the Hölder activation function performs better than the Lipschitz activation function. However, it is currently unknown what magnitude this improvement can reach. It would be interesting to study this problem.

Next, we estimate the Hölder widths for discrete Lebesgue spaces. For

1 \leq q < \infty

, denote by

ℓ_{q}

the set of all sequences

{x_{k}}_{k = 1}^{\infty}

with

\begin{matrix} {∥ x ∥}_{q} : = {(\sum_{k = 1}^{\infty} {| x_{k} |}^{q})}^{1 / q} < \infty . \end{matrix}

Let

τ = {t_{k}}_{k = 1}^{\infty}

be a sequence such that

t_{k} = {({log}_{2} (k + 1))}^{- 1 / 2}

for

k \geq 1

. Denote by

A_{τ} : = \{y \in ℓ_{2} : y_{k} = t_{k} x_{k}, where \sum_{j = 1}^{\infty} | x_{k} | \leq 1\},

where

x = (x_{1}, x_{2}, \dots) \in ℓ_{1}

, and

y = (y_{1}, y_{2}, \dots) \in ℓ_{2}

.

Theorem 14.

For the space

X = ℓ_{2}

and the subset

A_{τ}

, its Hölder width satisfies

γ \geq 2 \cdot rad (A_{τ})

and

0 < α < 1

,

{(log n)}^{- 1 / 2} n^{- 1 / 2} ≪ δ_{n}^{γ, α} {(A_{τ})}_{X} ≪ n^{- 1 / 2} .

Proof.

It is known from [46] that its entropy number satisfies

ε_{n} {(A_{τ})}_{ℓ_{2}} ≍ n^{- 1 / 2} .

It follows from Corollary 2 that there exists

C_{1} > 0

such that its Hölder width satisfies

γ \geq 2 \cdot rad (A_{τ})

,

δ_{n}^{γ, α} {(A_{τ})}_{ℓ_{2}} \leq C_{1} n^{- 1 / 2} .

It follows from Theorem 9 (i) that there exists

C_{2} > 0

such that its Hölder width satisfies

δ_{n}^{γ, α} {(A_{τ})}_{ℓ_{2}} \geq C_{2} {(log n)}^{- 1 / 2} n^{- 1 / 2} .

The proof of Theorem 14 is completed. □

Remark 7.

It is known from [18] that its

n

-Kolmogorov width satisfies

d_{n} {(A_{τ})}_{ℓ_{2}} ≍ {({log}_{2} n)}^{- 1 / 2} .

Theorem 14 illustrates that the Hölder width of

A_{τ}

is smaller than that of the

n

-Kolmogorov width.

Finally, we obtain the asymptotic order of the Hölder width for

c_{0}

, which is the Banach space of all sequences converging to 0, equipped with the

ℓ_{\infty}

norm. Let

η = {η_{k}}_{k = 1}^{\infty}

be a sequence with

η_{k} = {({log}_{2} (k + 1))}^{- 1}

,

k \geq 1

. Denote a compact subset of X by

K_{η} : = {η_{k} e_{k}}_{k = 1}^{\infty} \cup {0},

where the sequence

{e_{k}}_{k = 1}^{\infty}

is the standard basis in X. It is known from [18] that its entropy number satisfies

ε_{n} {(K_{η})}_{X} ≍ \frac{1}{n} .

Theorem 15.

For the space

X = c_{0}

and the subset

K_{η}

, we obtain that its Hölder width satisfies

γ > 2

,

0 < α < 1

,

δ_{n}^{γ, α} {(K_{η})}_{X} ≍ \frac{1}{n {log}_{2} (n + 1)} .

Theorem 15 shows the sharpness of Theorem 9 (i).

6. Concluding Remarks

We introduce the Hölder width, which measures the best error performance of some recent nonlinear approximation methods. We investigate the relationship between Hölder widths and other known widths, demonstrating that some Hölder widths are essentially smaller than

n

-Kolmogorov widths and linear widths. Moreover, we obtain that as the Hölder constants grow with n, the Hölder widths are much smaller than the entropy numbers. The significance of Hölder widths being smaller than known widths is that it implies that some nonlinear approximations, such as deep neural network approximations, may yield a better approximation order than other known classical approximation methods. In fact, we show that the asymptotic orders of Hölder widths for the Sobolev classes

B W_{p}^{s} (Ω)

and the Besov classes

B B_{p, τ}^{s} (Ω)

are

O (n^{- 2 s / d})

for

s > d {(\frac{1}{p} - \frac{1}{q})}_{+}

using deep neural networks, while other known widths might be

O (n^{- s / d})

. This result shows that deep neural networks significantly outperform classical methods of approximation, such as adaptive finite elements and

n

-term wavelet approximation. Indeed, the Hölder width in neural networks serves two purposes. On the one hand, it demonstrates the superior approximation power of deep neural networks. On the other hand, it reveals the limitation in the approximating ability of some deep neural networks. These features are crucial for a deeper understanding and further exploration of the approximation power of deep neural networks. It would be interesting to calculate the Hölder widths for some important function classes.

Author Contributions

M.L. and P.Y. contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the National Natural Science Foundation of China (Grant No. 11671213).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proofs of Section 2

Appendix A.1. Proof of Theorem 1

To prove Theorem 1, we need the following lemmas.

Lemma A1

(Auerbach lemma [47,48]). Let

X_{n}

be an

n

-dimensional Banach space and

X_{n}^{*}

be its dual space. Then, there exist elements

x_{1}, \dots, x_{n} \in X_{n}

and functionals

f_{1}, \dots, f_{n} \in X_{n}^{*}

such that for

1 \leq i, j \leq n

,

∥ x_{j} ∥_{X_{n}} = {∥ f_{j} ∥}_{X_{n}^{*}} = 1, and f_{j} (x_{i}) = \{\begin{matrix} 1, i = j, \\ 0, i \neq j . \end{matrix}

Lemma A2.

Let K be a bounded subset of X. For any

n \in N

,

γ > 0

, and

0 < α < 1

, we can limit the infimum in (2) to normed spaces

(R^{n}, ∥ \cdot ∥_{Z_{n}})

with the norm

{∥ \cdot ∥}_{Z_{n}}

satisfying, for

y \in R^{n}

,

{∥ y ∥}_{ℓ_{\infty}^{n}} \leq {∥ y ∥}_{Z_{n}} \leq {∥ y ∥}_{ℓ_{1}^{n}} .

(A1)

Proof.

According to Lemma A1, we can find vectors

{v_{j}}_{j = 1}^{n} \subset R^{n}

and linear functionals

{f_{j}}_{j = 1}^{n}

on the space (

R^{n}

,

{∥ \cdot ∥}_{Y_{n}}

) satisfying

∥ v_{j} ∥_{Y_{n}} = {∥ f_{j} ∥}_{Y_{n}^{*}} = 1, and f_{i} (v_{j}) = \{\begin{matrix} 1, i = j, \\ 0, i \neq j, \end{matrix}

(A2)

where

Y_{n}^{*}

is the dual space of

Y_{n}

and

i, j = 1 . \dots, n

.

For

y = (y_{1}, y_{2}, \dots, y_{n}) \in R^{n}

, we can consider a new norm

{∥ \cdot ∥}_{Z_{n}}

on

R^{n}

and a mapping

ϕ_{0} : (B (Z_{n}) {, ∥ \cdot ∥}_{Z_{n}}) \to (B (Y_{n}) {, ∥ \cdot ∥}_{Y_{n}})

as

{∥ y ∥}_{Z_{n}} : = {∥\sum_{j = 1}^{n} y_{j} v_{j}∥}_{Y_{n}}, and ϕ_{0} (y) : = \sum_{j = 1}^{n} y_{j} v_{j} .

(A3)

It is clear that

ϕ_{0} (B (Z_{n})) = B (Y_{n})

.

In this case, we can construct an

H_{α} (γ)

mapping

{\tilde{Φ}}_{n} : (B (Z_{n}), {∥ \cdot ∥}_{Z_{n}}) \to X

. For any

H_{α} (γ)

mapping

Φ_{n} : (B (Y_{n}) {, ∥ \cdot ∥}_{Y_{n}}) \to X

, we can define

{\tilde{Φ}}_{n}

as

{\tilde{Φ}}_{n} : = Φ_{n} \circ ϕ_{0}

. Then,

∥ {\tilde{Φ}}_{n} (y^{'}) - {\tilde{Φ}}_{n} (y) ∥_{X} = ∥ Φ_{n} \circ ϕ_{0} (y^{'}) - Φ_{n} \circ ϕ_{0} (y) ∥_{X} \leq γ {∥ ϕ_{0} (y^{'}) - ϕ_{0} (y) ∥}_{Y_{n}}^{α} .

By (A2) and (A3),

∥ {\tilde{Φ}}_{n} (y^{'}) - {\tilde{Φ}}_{n} (y) ∥_{X} \leq γ {∥\sum_{j = 1}^{n} (y_{j}^{'} - y_{j}) v_{j}∥}_{Y_{n}}^{α} = γ {∥ y_{j}^{'} - y_{j} ∥}_{Z_{n}}^{α} .

Therefore,

{\tilde{Φ}}_{n}

satisfies the

H_{α} (γ)

condition, and

{\tilde{Φ}}_{n} (B (Z_{n})) = Φ_{n} (B (Y_{n}))

.

Moreover, we can verify that the construction (A3) satisfies the condition (A1). Indeed,

{∥ y ∥}_{Z_{n}} \leq \sum_{j = 1}^{n} |y_{j}| {∥v_{j}∥}_{Y_{n}} = \sum_{j = 1}^{n} |y_{j}| = {∥ y ∥}_{ℓ_{1}^{n}},

and for

f_{i} \in Y_{n}^{*}

,

{∥ y ∥}_{Z_{n}} = {∥\sum_{j = 1}^{n} y_{j} v_{j}∥}_{Y_{n}} = sup_{{∥f∥}_{Y_{n}^{*}} = 1} |f (\sum_{j = 1}^{n} y_{j} v_{j})| \geq |f_{i} (\sum_{j = 1}^{n} y_{j} v_{j})| = |y_{i}|, i = 1, \dots, n .

Then,

{∥ y ∥}_{Z_{n}} \geq max_{i} | y_{i} {| = ∥ y ∥}_{ℓ_{\infty}^{n}}

. Thus, the proof of Lemma A2 is completed. □

We also need the following

α

-Hölder version of Ascoli’s theorem.

Lemma A3.

For a separable metric space

(Z, ρ)

and a metric space

(Y, d)

where every closed ball is compact, let

Ψ_{j} : Y \to Z

be a sequence of α-Hölder mappings such that there exist

y \in Y

and

z \in Z

satisfying

Ψ_{j} (y) = z

for

j \geq 1

. Then, there exists a subsequence

{Ψ_{j_{k}}}_{k = 1}^{\infty}

, which converges pointwise to an α-Hölder function

Φ : Y \to Z

. If

(Y, d)

is also compact, then the convergence is uniform.

Proof.

For

f \in Y

, we have

ρ (Ψ_{j} (f), z) = ρ (Ψ_{j} (f), Ψ_{j} (y)) \leq γ d^{α} (f, y) .

Fix a countable dense subset

M = {f_{j}}_{j = 1}^{\infty} \in Y

, and define

B_{j} : = B (z, γ d^{α} (f_{j}, y))

as the closed ball in Z with radius

γ d^{α} (f_{j}, y)

centered at z for

j \geq 1

. It follows that the Cartesian product

B : = B_{1} \times B_{2} \times \dots,

is a compact metric space under the product topology.

Let

F_{n} : = {Ψ_{n} (f_{1}), Ψ_{n} (f_{2}), \dots, Ψ_{n} (f_{j}), \dots} \in B .

Then, there exists a subsequence

{F_{n_{s}}}_{s = 1}^{\infty}

and an element

F \in B

such that

F (j) = lim_{s \to \infty} F_{n_{s}} (j) = lim_{s \to \infty} Ψ_{n_{s}} (f_{j}), j = 1, 2, \dots .

Thus, we obtain a function

Φ : M \to Z

satisfying

Φ (f_{j}) = F (j) = lim_{s \to \infty} Ψ_{n_{s}} (f_{j}),

where

ρ (Φ (f_{j}), Φ (f_{i})) \leq γ d^{α} (f_{j}, f_{i})

for i,

j \geq 1

. Since

M

is dense,

Φ

can extend to an

α

-Hölder function on Y, that is,

Φ : Y \to Z

. Furthermore, for every

f \in Y

,

Φ (f) = lim_{s \to \infty} Ψ_{n_{s}} (f),

which implies that

Ψ_{n_{s}}

is pointwise convergent to

Φ

.

If

(Y, d)

is compact, then for

ε > 0

, we can cover Y by a finite number of

ε^{\frac{1}{α}}

-balls with centers

g_{1}, \dots g_{k} \in Y

. Thus, for sufficiently large s, we have

\begin{matrix} sup_{f \in Y} ρ (Ψ_{n_{s}} (f), Φ (f)) & \leq sup_{f \in Y} ρ (Ψ_{n_{s}} (f), Ψ_{n_{s}} (g_{i})) + ρ (Ψ_{n_{s}} (g_{i}), Φ (g_{i})) + sup_{f \in Y} ρ (Φ (g_{i}), Φ (f)) \\ \leq 2 γ ε + max_{i = 1, \dots, n} ρ (Ψ_{n_{s}} (g_{i}), Φ (g_{i})) \\ \leq (2 γ + 1) ε, \end{matrix}

which implies that the convergence is uniform. The proof of Lemma A3 is completed. □

Now, we are ready to prove all theorems in this section.

Proof of Theorem 1.

By (1) and (2), it is obvious that for any

n

-dimensional space Y with a norm

{∥ \cdot ∥}_{Y}

on

R^{n}

,

δ_{n}^{γ, α} {(K)}_{X} \leq δ_{n}^{γ, α} {(K, Y)}_{X} .

Thus, we only need to prove that

δ_{n}^{γ, α} {(K)}_{X} \geq δ_{n}^{γ, α} {(K, Y)}_{X} .

By Lemma A2, we can find a sequence

{Ψ_{j}}_{j = 1}^{\infty}

, where

Ψ_{j} : (B (Z_{j}) {, ∥ \cdot ∥}_{Z_{j}}) \to X

satisfies the

H_{α} (γ)

condition and the norm

{∥ \cdot ∥}_{Z_{j}}

on

R^{n}

satisfies (5), such that

d_{j} : = sup_{f \in K} inf_{y \in B (Z_{j})} {∥ f - Ψ_{j} (y) ∥}_{X} \to δ_{n}^{γ, α} {(K)}_{X}, as j \to \infty .

According to Lemma A3, there exists a subsequence

{∥ \cdot ∥}_{Z_{j_{k}}}

of the sequence of norms

{∥ \cdot ∥}_{Z_{j}}

that converges pointwise on

R^{n}

and uniformly on

B_{ℓ_{\infty}^{n}}

to a norm

{∥ \cdot ∥}_{Y}

on

R^{n}

satisfying (5).

Thus, there exists a number

j_{0} \in N

such that for every

j \geq j_{0}

, there is a corresponding value

ε_{j}

with

0 < ε_{j} < 1

,

lim_{j \to \infty} ε_{j} = 0

and

{∥ y ∥}_{Z_{j}} - ε_{j} \leq {∥ y ∥}_{Y} \leq {∥ y ∥}_{Z_{j}} + ε_{j}, for all {∥ y ∥}_{ℓ_{\infty}^{n}} \leq 1 .

(A4)

If

y \in (1 - ε_{j}) B (Z_{j})

, then we have

{∥ y ∥}_{Z_{j}} \leq 1 - ε_{j}

and

{∥ y ∥}_{Y} \leq 1 - ε_{j} + ε_{j} \leq 1 .

And if

y \in B (Y)

, then we have

{∥ y ∥}_{Z_{j}} \leq 1 + ε_{j}

. Thus,

(1 - ε_{j}) B (Z_{j}) \subset B (Y) \subset (1 + ε_{j}) B (Z_{j}), j \geq j_{0} .

Next, we define the mapping

{\tilde{Ψ}}_{j} : (1 + ε_{j}) B (Z_{j}) \to X

as

{\tilde{Ψ}}_{j} : = Ψ_{j} ({(1 + ε_{j})}^{- 1} y) .

For any

y, y^{'} \in (1 + ε_{j}) B (Z_{j})

, we have

\begin{matrix} ∥ {\tilde{Ψ}}_{j} (y^{'}) - {\tilde{Ψ}}_{j} (y) ∥_{X} & = ∥ Ψ_{j} ({(1 + ε_{j})}^{- 1} y) - Ψ_{j} ({(1 + ε_{j})}^{- 1} y^{'}) ∥_{X} \\ \leq \frac{γ}{{(1 + ε_{j})}^{α}} {∥ y^{'} - y ∥}_{Z_{j}}^{α} \\ < γ ∥ y^{'} {- y ∥}_{Z_{j}}^{α}, \end{matrix}

so

{\tilde{Ψ}}_{j}

also satisfies the

H_{α} (γ)

condition. We could write

Φ_{j}

as the restriction of

{\tilde{Ψ}}_{j}

on

B (Y)

.

Let

f \in K

and

j \geq j_{0}

. Then, for each

θ > 0

, we can find an element

y = y (f, j, θ) \in B (Z_{j})

such that

∥ f - Ψ_{j} {(y) ∥}_{X} < d_{j} + θ

. Taking

z : = (1 - ε_{j}) y \in (1 - ε_{j}) B (Z_{j}) \subset B (Y),

we have

\begin{matrix} inf_{x \in B_{Y}} {∥ f - Φ_{j} (x) ∥}_{X} & \leq ∥ f - Φ_{j} {(z) ∥}_{X} = {∥ f - {\tilde{Ψ}}_{j} (z) ∥}_{X} \\ = ∥ f - Ψ_{j} ({(1 + ε_{j})}^{- 1} z) ∥_{X} = {∥f - Ψ_{j} (\frac{1 - ε_{j}}{1 + ε_{j}} \cdot y)∥}_{X} \\ \leq ∥ f - Ψ_{j} {(y) ∥}_{X} + {∥Ψ_{j} (y) - Ψ_{j} (\frac{1 - ε_{j}}{1 + ε_{j}} \cdot y)∥}_{X} \\ < d_{j} + θ + γ {(1 - \frac{1 - ε_{j}}{1 + ε_{j}})}^{α} {∥ y ∥}_{Z_{j}}^{α} \\ \leq d_{j} + θ + γ {(\frac{2 ε_{j}}{1 + ε_{j}})}^{α} . \end{matrix}

As

θ \to 0

and by taking the supremum over

f \in K

, we have

δ_{n}^{γ, α} {(K, Y)}_{X} \leq sup_{f \in K} inf_{x \in B (Y)} {∥ f - Φ_{j} (x) ∥}_{X} \leq d_{j} + γ {(\frac{2 ε_{j}}{1 + ε_{j}})}^{α} .

As

j \to \infty

, we have

d_{j} \to δ_{n}^{γ, α} {(K)}_{X}

and

ε_{j} \to 0

. Thus,

δ_{n}^{γ, α} {(K, Y)}_{X} \leq δ_{n}^{γ, α} {(K)}_{X}

.

The proof of Theorem 1 is completed. □

Appendix A.2. Proofs of Theorems 2–4

Proof of Theorem 2.

We begin with the continuity of the Hölder widths

δ_{n}^{γ, α} {(K)}_{X}

at

γ = 0

. For any

H_{α} (γ)

mapping

Φ : (B (Y_{n}), ∥ \cdot ∥_{Y_{n}}) \to X

,

f \in K

and

y \in B (Y_{n})

, we have

\begin{matrix} {∥ f - Φ (y) ∥}_{X} & \geq {∥ f - Φ (0) ∥}_{X} - {∥ Φ (0) - Φ (y) ∥}_{X} \\ \geq {∥ f - Φ (0) ∥}_{X} - γ {∥ y ∥}_{Y_{n}}^{α} \\ \geq {∥ f - Φ (0) ∥}_{X} - γ . \end{matrix}

Thus, for any

H_{α} (γ)

mapping

Φ

,

sup_{f \in K} inf_{y \in B (Y_{n})} {∥ f - Φ (y) ∥}_{X} \geq sup_{f \in K} {∥ f - Φ (0) ∥}_{X} - γ \geq inf_{g \in X} sup_{f \in K} {∥ f - g ∥}_{X} - γ,

which implies that

δ_{n}^{γ, α} {(K, Y_{n})}_{X} \geq inf_{g \in X} sup_{f \in K} {∥ f - g ∥}_{X} - γ .

It follows from (3), (4), and the decreasing property of

δ_{n}^{γ, α} {(K)}_{X}

with respect to

γ

that

rad (K) - γ \leq δ_{n}^{γ, α} {(K)}_{X} \leq rad (K),

which implies that the continuity at

γ = 0

.

Next, we prove that

δ_{n}^{γ, α} {(K)}_{X}

is continuous for

γ > 0

by contradiction. For convenience, denote by

F (γ) : = δ_{n}^{γ, α} {(K)}_{X}

. We assume that F is not continuous at some

γ_{0} > 0

. Therefore, there exists

ξ_{0} > 0

and a sequence of real numbers

δ_{k} \to 0

,

k \geq 1

such that

| F (γ_{0} + δ_{k}) - F (γ_{0} - δ_{k}) | \geq ξ_{0} .

(A5)

It is known from the definition of Hölder widths that for fixed

δ_{k}

, there exists an

H_{α} (γ_{0} + δ_{k})

mapping

Φ_{n} : (B (Y_{n}) {, ∥ \cdot ∥}_{Y_{n}}) \to X

such that

F (γ_{0} + δ_{k}) \leq sup_{f \in K} inf_{y \in B (Y_{n})} {∥ f - Φ_{n} (y) ∥}_{X} \leq F (γ_{0} + δ_{k}) + δ_{k} .

Let

λ_{k} : = \frac{γ_{0} - δ_{k}}{γ_{0} + δ_{k}}

, and

{\tilde{Φ}}_{n} : = λ_{k} Φ_{n} .

Then,

{\tilde{Φ}}_{n}

is an

H_{α} (γ_{0} - δ_{k})

mapping, and

\begin{matrix} F (γ_{0} - δ_{k}) & \leq sup_{f \in K} inf_{y \in B (Y_{n})} {∥ f - {\tilde{Φ}}_{n} (y) ∥}_{X} \\ \leq sup_{f \in K} inf_{y \in B (Y_{n})} (λ_{k} ∥ f - Φ_{n} {(y) ∥}_{X} + (1 - λ_{k}) {∥ f ∥}_{X}) \\ \leq λ_{k} (F (γ_{0} + δ_{k}) + δ_{k}) + (1 - λ_{k}) sup_{f \in K} {∥ f ∥}_{X} . \end{matrix}

It follows from the compactness of K and (A5) that

F (γ_{0} + δ_{k}) + ξ_{0} \leq F (γ_{0} - δ_{k}) \leq λ_{k} (F (γ_{0} + δ_{k}) + δ_{k}) + (1 - λ_{k}) C,

where

C : = sup_{f \in K} {∥ f ∥}_{X} < \infty

. As

δ_{k} \to 0

, we have

λ_{k} \to 0

, and thus

F (γ_{0}) + ξ_{0} \leq F (γ_{0})

, which contradicts the assumption that

ξ_{0} > 0

. Thus,

δ_{n}^{γ, α} {(K)}_{X}

is continuous as a function of

γ \geq 0

. We complete the proof of Theorem 2. □

Proof of Theorem 3.

If

lim_{γ \to 0} δ_{n}^{γ, α} {(K)}_{X} = 0,

then for any

δ > 0

, there exists a norm

{∥ \cdot ∥}_{Y_{n}}

and an

H_{α} (γ)

mapping

Φ_{n}

such that

sup_{f \in K} inf_{y \in B (Y_{n})} {∥ f - Φ_{n} (y) ∥}_{X} < \frac{δ}{2} .

Thus, for a given

f \in K

, we can find

y_{f} \in B (Y_{n})

such that

∥ f - Φ_{n} (y_{f}) ∥_{X} \leq \frac{δ}{2} .

For a compact set

B (Y_{n})

, there is a finite collection

{y_{j}}_{j = 1}^{N} \subset B (Y_{n})

such that

B_{Y_{n}} \subset ⋃_{j = 1}^{N} B (y_{j}, {(\frac{δ}{2 γ})}^{1 / α}) .

Therefore, for

y_{f} \in B (Y_{n})

, there exists

j_{0} \in {1, \dots, N}

such that

y_{f} \in B (y_{j_{0}}, {(\frac{δ}{2 γ})}^{1 / α})

, and

∥ Φ (y_{f}) - Φ (y_{j_{0}}) ∥_{X} \leq γ {∥ y_{f} - y_{j_{0}} ∥}_{Y_{n}}^{α} \leq \frac{δ}{2} .

Thus, for any

f \in K

, there exists

j_{0} \in {1, \dots, N}

such that

∥ f - Φ (y_{j_{0}}) ∥_{X} \leq ∥ f - Φ (y_{f}) ∥_{X} + {∥ Φ (y_{f}) - Φ (y_{j_{0}}) ∥}_{X} \leq δ,

which implies that

K \subset ⋃_{j = 1}^{N} B (Φ (y_{j}), δ)

. By the arbitrariness of

δ

, the set K is totally bounded.

If K is totally bounded, then for

ε > 0

, we can find a minimal

δ

-covering

{f_{j}}_{j = 1}^{N_{ε} (K)}

and a suitable

γ > 0

such that

diam K \cdot (N_{ε} (K) - 1) \leq 2 γ,

(A6)

where

diam K : = sup_{f, g \in K} {∥ f - g ∥}_{X}

. We only consider the case

n = 1

since

δ_{n}^{γ, α} {(K)}_{X} \leq δ_{1}^{γ, α} {(K)}_{X}

for any

n \geq 1

. Set the points in

([- 1, 1], | \cdot |)

satisfying

t_{j} : = - 1 + {(\frac{2 j - 2}{N_{ε} (K) - 1})}^{\frac{1}{α}}, j = 1, \dots, N_{ε} (K),

and the continuous piecewise linear function

Φ : ([- 1, 1], | \cdot |) \to X

satisfying

Φ (t_{j}) : = f_{j}, j = 1, \dots, N_{ε} (K) .

Thus, it follows from (A6) that for

0 < α < 1

,

max_{j = 1, \dots, N_{ε} (K) - 1} \frac{∥ f_{j + 1} - f_{j} ∥_{X}}{| t_{j + 1} - t_{j} |^{α}} \leq \frac{diam K \cdot (N_{ε} (K) - 1)}{2 \cdot {|j^{1 / α} - {(j - 1)}^{1 / α}|}^{α}} \leq γ,

which implies that

Φ

satisfies the

H_{α} (γ)

condition. Therefore, we have

sup_{f \in K} inf_{y \in [- 1, 1]} {∥ f - Φ (y) ∥}_{X} \leq δ,

and thus

lim_{γ \to 0} δ_{1}^{γ, α} {(K)}_{X} = 0 .

The proof of Theorem 3 is completed. □

Proof of Theorem 4.

The proof is similar to that of Theorem 3. For any fixed

η > 0

, there exist

n_{0} \in N

and

γ_{0} > 0

such that we can find a norm

{∥ \cdot ∥}_{Y_{n_{0}}}

in

R^{n_{0}}

and an

H_{α} (γ_{0})

mapping

Φ

satisfying

sup_{f \in K} inf_{y \in B (Y_{n_{0}})} {∥ f - Φ (y) ∥}_{X} < \frac{η}{2} .

Since

B (Y_{n_{0}})

is compact, there is a finite collection

{y_{j}}_{j = 1}^{N} \subset B (Y_{n_{0}})

, which is a

{(\frac{η}{2 γ_{0}})}^{1 / α}

-covering of

B (Y_{n_{0}})

. Then, for any

y \in B (Y_{n_{0}})

, we can find

y_{j_{0}}

,

j_{0} = {1, \dots, N}

such that

∥ Φ (y) - Φ (y_{j_{0}}) ∥_{X} \leq γ_{0} {∥ y - y_{j_{0}} ∥}_{Y_{n_{0}}}^{α} \leq \frac{η}{2} .

Thus, for any

f \in K

, we can find

y \in B (Y_{n_{0}})

and

y_{j_{0}}

,

j = {1, \dots, N}

such that

∥ f - Φ (y_{j_{0}}) ∥_{X} \leq ∥ f - Φ (y) ∥_{X} + {∥ Φ (y) - Φ (y_{j_{0}}) ∥}_{X} \leq η,

which implies that

K \subset ⋃_{j = 1}^{N} B (Φ (y_{j}), η)

. The proof of Theorem 4 is completed. □

Appendix B. Proofs of Section 3

Appendix B.1. Proofs of Theorem 5 and Corollary 1

Proof of Theorem 5.

We begin with

γ > \frac{d_{n} {(K)}_{X} + rad (K)}{2^{α - 1}}, and η : = γ - \frac{d_{n} {(K)}_{X} + rad (K)}{2^{α - 1}} > 0 .

Choose a number

η_{1}

such that

0 < η_{1} < η

. Let

X_{n}

be an

n

-dimensional linear subspace of X that satisfies the inequality

sup_{f \in K} inf_{g \in X_{n}} {∥ f - g ∥}_{X} \leq d_{n} {(K)}_{X} + 2^{α - 1} η_{1} .

Then, for every

f \in K

, there is an element

h : = h (f)

in

X_{n}

such that

{∥ f - h ∥}_{X} \leq d_{n} {(K)}_{X} + 2^{α - 1} η_{1} .

(A7)

Denote the set of all such elements as

M : = {h (f) : f \in K} \subset X_{n} .

Next, fix

f_{0} \in X

such that

sup_{f \in K} ∥ f - f_{0} ∥ \leq rad (K) + 2^{α - 1} η - 2^{α - 1} η_{1} .

Then, for

f \in K

,

\begin{matrix} ∥ h - f_{0} ∥_{X} \\ \leq {∥ h (f) - f ∥}_{X} + {∥ f - f_{0} ∥}_{X} \\ < d_{n} {(K)}_{X} + rad (K) + 2^{α - 1} η = 2^{α - 1} γ . \end{matrix}

Thus, we have

rad (M) \leq 2^{α - 1} γ, and M \subset {g \in X_{n} : ∥ g - f_{0} ∥_{X} \leq 2^{α - 1} γ} = : B (f_{0}, 2^{α - 1} γ) .

In addition, denote the unit ball in

X_{n}

by

B (X_{n})

, which is defined by

B (X_{n}) : = {g \in X_{n} {: ∥ g ∥}_{X} = 1} .

Define the mapping

Φ : (B (X_{n}), ∥ \cdot ∥_{X}) \to X

such that

Φ (g) : = f_{0} + 2^{α - 1} γ g .

Then, for

g_{1}, g_{2} \in B (X_{n})

,

∥ Φ (g_{1}) - Φ (g_{2}) ∥_{X} = 2^{α - 1} γ ∥ g_{1} - g_{2} ∥_{X} \leq γ {∥ g_{1} - g_{2} ∥}_{X}^{α} .

Thus,

Φ

satisfies the

H_{α} (γ)

condition, and

Φ (B (X_{n})) = B (f_{0}, 2^{α - 1} γ)

.

For

M \subset Φ (B (X_{n}))

and (A7), we have

sup_{f \in K} inf_{g \in X_{n}} {∥ f - Φ (g) ∥}_{X} \leq sup_{f \in K} inf_{h \in M} {∥ f - h ∥}_{X} \leq d_{n} {(K)}_{X} + 2^{α - 1} η_{1} .

Thus,

δ_{n}^{γ, α} {(K)}_{X} \leq d_{n} {(K)}_{X} + 2^{α - 1} η_{1} .

Let

η_{1} \to 0

, then for any

γ > \frac{d_{n} {(K)}_{X} + rad (K)}{2^{α - 1}}

,

δ_{n}^{γ, α} {(K)}_{X} \leq d_{n} {(K)}_{X} .

By Theorem 3 as

γ \to \frac{d_{n} {(K)}_{X} + rad (K)}{2^{α - 1}}

, we complete the proof of Theorem 5. □

Proof of Corollary 1.

For a compact set

K \subset X

, it is known from [2] that the sequence

{d_{n} {(K)}_{X}}_{n = 1}^{\infty}

is decreasing and tends to zero. Denote

γ_{0} = \frac{d_{n_{0}} {(K)}_{X} + rad (K)}{2^{α - 1}} .

For

γ \geq γ_{0}

, it follows from Theorem 5 that

δ_{n}^{γ, α} {(K)}_{X} \leq δ_{n}^{γ_{0}, α} {(K)}_{X} .

Then, it is clear that Corollary 1 holds true. □

Appendix B.2. Proofs of Theorem 6

To prove Theorem 6, we use a result from [14].

Lemma A4.

For any

n \geq 1

,

N > 1

, and any compact set

K \subset X

with

sup_{f \in K} {∥ f ∥}_{X} = 1

, the following inequalities hold

δ_{n + 1}^{2 (N + 1) \sqrt{n}, 1} {(K)}_{X} \leq d_{n} {(K, N)}_{X}, a n d δ_{n + ⌈ {log}_{2} N ⌉}^{6 \sqrt{n}, 1} {(K)}_{X} \leq d_{n} {(K, N)}_{X} .

Proof of Theorem 6.

By the definition of

δ_{n}^{γ, 1} {(K)}_{X}

, there is an

ε > 0

and a

L i p γ

mapping

Ψ : (B (Y_{n}), ∥ \cdot ∥_{Y_{n}}) \to X

satisfying

∥ f - Ψ_{n} {(y) ∥}_{X} \leq δ_{n}^{γ, 1} {(K, Y_{n})}_{X} + ε,

(A8)

where, for any

y, y^{'} \in B (Y_{n})

, the mapping

Ψ

satisfies

∥ Ψ (y) - Ψ (y^{'}) ∥_{X} \leq γ {∥ y - y^{'} ∥}_{Y_{n}} .

Then, we have

∥ Ψ (y) - Ψ (y^{'}) ∥_{X} \leq 2^{1 - α} γ ∥ y - y^{'} ∥_{Y_{n}}^{α} \leq 2 γ {∥ y - y^{'} ∥}_{Y_{n}}^{α} .

Thus,

Ψ

satisfies the

α

-Hölder condition, and

Ψ

is an

H_{α} (2 γ)

mapping.

Therefore, it is known from (A8) and the definition of Hölder widths that

δ_{n}^{2 γ, α} {(K, Y_{n})}_{X} \leq sup_{f \in K} inf_{y \in B (Y_{n})} {∥ f - Ψ_{n} (y) ∥}_{X} \leq δ_{n}^{γ, 1} {(K, Y_{n})}_{X} + ε .

Taking

ε \to 0

, we obtain

δ_{n}^{2 γ, α} {(K)}_{X} \leq δ_{n}^{γ, 1} {(K)}_{X} .

Thus, by Lemma A4, we complete the proof of Theorem 6. □

Appendix C. Proofs of Section 4

Appendix C.1. Proof of Theorem 7

To prove Theorem 7, we recall a result from [22].

Proposition A1

([22]). If Φ satisfies the

H_{α_{1}} (γ_{1})

condition, and Ψ satisfies the

H_{α_{1}} (γ_{2})

condition, the composition

Φ \circ Ψ

satisfies the

H_{α_{1} α_{2}} (γ_{1} γ_{2}^{α_{2}})

condition. In addition, if

α_{1} = α_{2} = α

, then

Φ + Ψ

satisfies the

H_{α} (γ_{1} + γ_{2})

condition.

Proof of Theorem 7.

Note that entropy numbers and Hölder widths are invariant in terms of translation, that is, for any

g \in X

,

ε_{n} {(K)}_{X} = ε_{n} {(K - g)}_{X}, δ_{n}^{γ, α} {(K)}_{X} = δ_{n}^{γ, α} {(K - g)}_{X} .

We only need to consider the compact set

K \subset B_{X} (0, r) : = {{g \in X : ∥ g ∥}_{X} \leq r, where r > 0}

.

For any

ε > 0

, it is known from the definition of Hölder widths that there is an element

g \in X

satisfying

sup_{f \in K} {∥ f - g ∥}_{X} < rad (K) + ε 2^{- k} .

Thus, we could set

r = rad (K) + ε 2^{- k}

.

Let

η > 0

and the set

S_{k n} : = {h_{1}, h_{2}, \dots h_{2^{k n}}} \subset K

, satisfying, for any

f \in K

, that there is an element

h_{j} \in S_{k n}

such that

∥ f - h_{j} ∥_{X} \leq ε_{k n} {(K)}_{X} + η .

Next, we split the unit ball

(B_{n} {, ∥ \cdot ∥}_{ℓ_{\infty}^{n}} {) = [- 1, 1]}^{n} \subset R^{n}

into

2^{k n}

non-overlapping open balls

B_{j}

with side length

2^{1 - k}

. Denote by

y_{j}

the center of

B_{j}

. Let

ϕ_{j} : R^{n} \to R

be the mapping such that

ϕ_{j} (y) = max {0, 1 - 2^{k} ∥ y_{j} - y ∥_{ℓ_{\infty}^{n}}^{α}}, j = 1, \dots, 2^{k n} .

(A9)

It is clear that

z = max {0, x}

,

x \in R

satisfies the

H_{1} (1)

condition, and

z = x^{α}

,

x \geq 0

satisfies the

H_{α} (1)

condition. Indeed, for a constant

0 < α < 1

, the function

ζ (t) : = \frac{{(1 + t)}^{α}}{1 + t^{α}}

on

t \geq 0

is non-negative and has maximum 1. Set

t = \frac{x_{1}}{x_{2}}

for any

x_{1}, x_{2} > 0

. Then, we have

{(1 + \frac{x_{1}}{x_{2}})}^{α} \leq 1 + {(\frac{x_{1}}{x_{2}})}^{α},

which implies that

| x_{1} + x_{2} |^{α} \leq x_{1}^{α} + x_{2}^{α}

, and thus

| x_{1}^{α} - x_{2}^{α} | \leq | x_{1} - x_{2} |^{α} .

(A10)

So,

z = x^{α}

,

x \geq 0

satisfies the

H_{α} (1)

condition. It follows from Proposition A1 that

ϕ_{j}

satisfies the

H_{α} (2^{k})

condition. Denote by

Y = (R^{n}, ∥ \cdot ∥_{ℓ_{\infty}^{n}})

. Let the mapping

Φ : Y \to X

satisfy

Φ (y) : = \sum_{j = 1}^{2^{k n}} h_{j} ϕ_{j} (y) .

Then, we prove that

Φ

satisfies the

H_{α} (2^{k} r)

condition and

Φ (y_{j}) = h_{j}

.

It is known from (A9) that if

y = y_{j}

, then

ϕ_{j} (y) = 1

, and if

y \notin B_{j}

, then

ϕ_{j} (y) = 0

. Thus,

Φ (y_{j}) = h_{j}

. Moreover, for any

y, y^{'} \in Y

, we consider the following three cases.

Case 1: If

y, y^{'} \notin B_{n}

, then

∥ Φ (y) - Φ (y^{'}) ∥_{X} = 0 \leq 2^{k} {max}_{j} ∥ h_{j} ∥_{X} {∥ y - y^{'} ∥}_{Y}^{α}

.

Case 2: If

y^{'} \notin B_{n}

and

y \in B_{n}

, then there is a

j_{0}

such that

y \in B_{j_{0}}

. Thus,

\begin{matrix} ∥ Φ (y) - Φ (y^{'}) ∥_{X} = {∥ h_{j_{0}} ϕ_{j_{0}} (y) - 0 ∥}_{X} \\ \leq max_{j} {∥ h_{j} ∥}_{X} \cdot |ϕ_{j_{0}} (y) - ϕ_{j_{0}} (y^{'})| \\ \leq 2^{k} max_{j} ∥ h_{j} ∥_{X} \cdot {∥ y - y^{'} ∥}_{Y}^{α} . \end{matrix}

Case 3: If

y, y^{'} \in B_{n}

, then there exist two numbers

j_{0}

,

j_{1}

such that

y \in B_{j_{0}}

and

y^{'} \in B_{j_{1}}

. Therefore,

∥ Φ (y) - Φ (y^{'}) ∥_{X} \leq max_{j} ∥ h_{j} ∥_{X} \cdot | ϕ_{j_{0}} (y) - ϕ_{j_{1}} (y^{'}) | .

(A11)

We divide it into the following cases.

Case 3.1: If

j_{0} = j_{1}

, it follows from

ϕ_{j_{0}}

satisfying the

H_{α} (2^{k})

condition that

∥ Φ (y) - Φ (y^{'}) ∥_{X} \leq max_{j} ∥ h_{j} ∥_{X} \cdot | ϕ_{j_{0}} (y) - ϕ_{j_{0}} (y^{'}) | \leq 2^{k} max_{j} ∥ h_{j} ∥_{X} \cdot {∥ y - y^{'} ∥}_{Y}^{α} .

Case 3.2: If

j_{0} \neq j_{1}

, we have

| ϕ_{j_{0}} (y) - ϕ_{j_{1}} (y^{'}) | = 2^{k} \cdot |∥ y_{j_{0}} {- y ∥}_{ℓ_{\infty}^{n}}^{α} - {∥ y_{j_{1}} - y^{'} ∥}_{ℓ_{\infty}^{n}}^{α}| .

(A12)

Due to the arbitrariness of

y, y^{'}

and

j_{0}, j_{1}

, we can assume that

∥ y_{j_{0}} {- y ∥}_{ℓ_{\infty}^{n}}^{α} \geq {∥ y_{j_{1}} - y^{'} ∥}_{ℓ_{\infty}^{n}}^{α}

. Then,

\begin{matrix} |∥ y_{j_{0}} {- y ∥}_{ℓ_{\infty}^{n}}^{α} - {∥ y_{j_{1}} - y^{'} ∥}_{ℓ_{\infty}^{n}}^{α}| \\ = ∥ y_{j_{0}} {- y ∥}_{ℓ_{\infty}^{n}}^{α} - {∥ y_{j_{1}} - y^{'} ∥}_{ℓ_{\infty}^{n}}^{α} \\ \leq ∥ y_{j_{1}} {- y ∥}_{ℓ_{\infty}^{n}}^{α} - {∥ y_{j_{1}} - y^{'} ∥}_{ℓ_{\infty}^{n}}^{α} \\ \leq ∥ y - y^{'} ∥_{ℓ_{\infty}^{n}}^{α}, \end{matrix}

where the last inequality uses (A10). Otherwise,

\begin{matrix} |∥ y_{j_{0}} {- y ∥}_{ℓ_{\infty}^{n}}^{α} - {∥ y_{j_{1}} - y^{'} ∥}_{ℓ_{\infty}^{n}}^{α}| & = ∥ y_{j_{1}} - y^{'} ∥_{ℓ_{\infty}^{n}}^{α} - {∥ y_{j_{0}} - y ∥}_{ℓ_{\infty}^{n}}^{α} \\ \leq ∥ y_{j_{0}} - y^{'} ∥_{ℓ_{\infty}^{n}}^{α} - {∥ y_{j_{0}} - y ∥}_{ℓ_{\infty}^{n}}^{α} \\ \leq ∥ y - y^{'} ∥_{ℓ_{\infty}^{n}}^{α} . \end{matrix}

Thus, by combining (A11) with (A12), we have

∥ Φ (y) - Φ (y^{'}) ∥_{X} \leq 2^{k} max_{j} ∥ h_{j} ∥_{X} \cdot {∥ y - y^{'} ∥}_{Y}^{α} .

Therefore,

Φ

satisfies the

H_{α} (2^{k} r)

condition, where

r = rad (K) + ε 2^{- k} > max_{j} {∥ h_{j} ∥}_{X}

.

Then, we have

δ_{n}^{2^{k} r, α} {(K)}_{X} = δ_{n}^{2^{k} rad (K) + ε, α} {(K)}_{X} \leq ε_{k n} {(K)}_{X} + η .

Let

η \to 0

and

ε \to 0

. We obtain

δ_{n}^{2^{k} rad (K), α} {(K)}_{X} \leq ε_{k n} {(K)}_{X} .

We complete the proof of Theorem 7. □

Appendix C.2. Proof of Theorem 8

To prove Theorem 8, we recall some definitions and lemmas.

An

ε

-packing of K is a collection

{f_{1}, \dots, f_{l}} \subset K

, there is

min_{i \neq j} {∥ f_{i} - f_{j} ∥}_{X} > ε .

The maximal

ε

-packing number

P_{ε} (K)

is the cardinality of the largest

ε

-packing of K.

It is known ([3], Chapter 15) that

P_{2 ε} (K) \leq N_{ε} (K) \leq P_{ε} (K)

, and Lemma A5 holds.

Lemma A5

([3]). For the ball

B_{r} : = {x \in Y_{n} {: ∥ x ∥}_{Y_{n}} \leq r}

and

0 \leq ε \leq r

, we have

2^{- n} {(\frac{r}{ε})}^{n} \leq P_{ε} (B_{r}) \leq 3^{n} {(\frac{r}{ε})}^{n},

(A13)

and

2^{- n} {(\frac{r}{ε})}^{n} \leq N_{ε} (B_{r}) \leq 3^{n} {(\frac{r}{ε})}^{n},

(A14)

Lemma A6.

If

δ_{n}^{γ, α} {(K)}_{X} < ε

, then

γ \geq 3^{- α} ε N_{2 ε}^{\frac{α}{n}} (K) .

(A15)

Specifically, if

δ_{n}^{γ, α} {(B_{Z_{m}})}_{X} \leq ε

, then

γ \geq 3^{- α} 2^{- \frac{2 m α}{n}} ε^{1 - \frac{m α}{n}},

where

B_{Z_{m}}

is an m-dimensional unit ball of X.

Proof.

For

δ_{n}^{γ, α} {(K)}_{X} < ε

, there exists an

H_{α} (γ)

mapping

Φ

and a norm

{∥ \cdot ∥}_{Y_{n}}

such that

Φ (B (Y_{n}))

can approximate K with an accuracy of

ε

. That is, for any

y \in B (Y_{n})

and

z \in K

,

{∥ z - Φ (y) ∥}_{X} < ε .

Then, we consider a collection

{y_{1}, y_{2}, \dots, y_{N}} \subset B (Y_{n})

such that

{Φ (y_{1}), Φ (y_{2}), \dots, Φ (y_{N})}

is a maximal

ε

-packing of

Φ_{B (Y_{n})}

. By the definition of the maximal packing number and Hölder widths, for any

i \neq j, i, j = 1, \dots, N

, we obtain

ε < ∥ Φ (y_{i}) - Φ (y_{j}) ∥_{X} \leq γ {∥ y_{i} - y_{j} ∥}_{Y_{n}}^{α},

and thus

∥ y_{i} - y_{j} ∥_{Y_{n}} > {(\frac{ε}{γ})}^{\frac{1}{α}} .

(A16)

In addition, if we add an element

y \in B (Y_{n})

, then the set

{Φ (y), Φ (y_{1}), Φ (y_{2}), \dots, Φ (y_{N})}

is not an

ε

-packing of

Φ_{B (Y_{n})}

. Therefore, there exists a

j_{0} \in [1, N]

satisfying

∥ Φ (y) - Φ (y_{j_{0}}) ∥_{X} < ε,

which implies that

∥ z - Φ (y_{j_{0}}) ∥_{X} \leq ∥ z - Φ (y) ∥_{X} + {∥ Φ (y) - Φ (y_{j_{0}}) ∥}_{X} < 2 ε .

(A17)

By the definition of the maximal packing number and (A16),

N \leq P_{{(ε γ^{- 1})}^{1 / α}} (B (Y_{n})) .

Thus, by (A13), we have

N \leq {(3^{α} γ ε^{- 1})}^{\frac{n}{α}} .

It follows from (A17) that

{Φ (y_{1}), Φ (y_{2}), \dots, Φ (y_{N})}

is a

2 ε

-covering of K. Therefore,

{(3^{α} γ ε^{- 1})}^{\frac{n}{α}} \geq N \geq N_{2 ε} (K) .

Then, we obtain

γ^{\frac{n}{α}} \geq 3^{- n} ε^{\frac{n}{α}} N_{2 ε} (K),

(A18)

which means that (A15) holds true.

When

K = B_{Z_{m}}

, by (A14), we have

N_{2 ε} (K) \geq 2^{- 2 m} ε^{- m} .

It follows from (A18) that

γ \geq 3^{- α} ε N_{2 ε}^{\frac{α}{n}} (K) \geq 3^{- α} 2^{- \frac{2 m α}{n}} ε^{1 - \frac{m α}{n}} .

The proof of Lemma A6 is completed. □

Proof of Theorem 8.

By Lemma A6 with

ε = c_{0} \frac{{({log}_{2} n)}^{p}}{n^{q}}

, we have

\begin{matrix} N_{2 ε} (K) & \leq {(3^{α} γ ε^{- 1})}^{\frac{n}{α}} = {(3^{α} γ c_{0}^{- 1} {({log}_{2} n)}^{- p} n^{q})}^{\frac{n}{α}} \\ \leq 2^{\frac{n}{α} {{log}_{2} (3^{α} γ c_{0}^{- 1}) - p {log}_{2} ({log}_{2} n) + q {log}_{2} n}} \\ < 2^{c_{1} n {log}_{2} n}, \end{matrix}

where

c_{1}

is a constant depending on

γ, α, c_{0}, p

, and q. It follows from the definitions of the minimal covering number and entropy number that

ε_{c_{1} n {log}_{2} n} {(K)}_{X} \leq 2 ε = 2 c_{0} \frac{{({log}_{2} n)}^{p}}{n^{q}} .

(A19)

If we take

m = c_{1} n {log}_{2} n

, then

{log}_{2} m = {log}_{2} c_{1} + {log}_{2} n + {log}_{2} ({log}_{2} n) .

Thus, for sufficiently large n, we obtain

2^{- 1} {log}_{2} n < {log}_{2} m < 3 {log}_{2} n

. By using (A19) and

n = \frac{m}{c_{1} {log}_{2} n}

, we obtain

ε_{m} {(K)}_{X} \leq 2 c_{0} \frac{{({log}_{2} n)}^{p}}{{(\frac{m}{c_{1} {log}_{2} n})}^{q}} = 2 c_{0} c_{1}^{q} \frac{{({log}_{2} n)}^{p + q}}{m^{q}} \leq C \frac{{({log}_{2} m)}^{p + q}}{m^{q}} .

We complete the proof of Theorem 8. □

Appendix C.3. Proof of Theorem 9

To prove Theorem 9, we need the following lemma.

Lemma A7.

Suppose that the sequence

{η_{n}}_{n = 1}^{\infty}

of real numbers is decreasing to zero. Moreover, if

ε_{n} (K) \geq η_{n}, n \geq 1,

(A20)

and there exist

m \in N

and

θ > 0

satisfying

δ_{m}^{γ, α} {(K)}_{X} < θ,

then

η_{\frac{m}{α} {log}_{2} (3^{α} γ θ^{- 1})} < 2 θ .

Proof.

By using Lemma A6 with

ε = θ

, we have

N_{2 θ} (K) \leq {(3^{α} γ θ^{- 1})}^{\frac{m}{α}} = 2^{\frac{m}{α} {log}_{2} (3^{α} γ θ^{- 1})} .

Then, by the definition of the entropy number and (A20), we obtain

2 θ \geq ε_{α^{- 1} m {log}_{2} (3^{α} γ θ^{- 1})} > η_{α^{- 1} m {log}_{2} (3^{α} γ θ^{- 1})} .

The proof of Lemma A7 is completed. □

Proof of Theorem 9.

We prove all statements of the theorem by contradiction.

To prove Theorem 9 (i), we assume that (7) is false, meaning there exists a strictly increasing sequence of integers

{n_{k}}_{k = 1}^{\infty}

such that

a_{k} : = δ_{n_{k}}^{γ, α} {(K)}_{X} n_{k}^{q} {({log}_{2} n_{k})}^{q - p} \to 0, as k \to \infty .

Thus, we have

δ_{n_{k}}^{γ, α} {(K)}_{X} = \frac{a_{k} {({log}_{2} n_{k})}^{p - q}}{n_{k}^{q}} \leq \frac{2 a_{k} {({log}_{2} n_{k})}^{p - q}}{n_{k}^{q}} = : θ_{k} .

(A21)

Set

η_{n} : = c_{1} \frac{{({log}_{2} n)}^{p}}{n^{q}}

. It is known that

η_{n}

is decreasing to zero as

n \to \infty

. Therefore, by Lemma A7,

η_{α^{- 1} n_{k} {log}_{2} (3^{α} γ θ_{k}^{- 1})} \leq 2 θ_{k} .

Thus,

α^{q} c_{1} \frac{{[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p}}{n_{k}^{q} {[{log}_{2} (3^{α} γ θ_{k}^{- 1})]}^{q}} \leq \frac{4 a_{k} {({log}_{2} n_{k})}^{p - q}}{n_{k}^{q}},

that is,

{[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p} \leq \frac{4 a_{k}}{c_{1} α^{q}} \cdot {({log}_{2} n_{k})}^{p - q} {[{log}_{2} (3^{α} γ θ_{k}^{- 1})]}^{q} .

(A22)

It is known that for sufficiently large k, we have

{log}_{2} n_{k} < n_{k} and a_{k}^{- 1} \to \infty .

(A23)

Then, by (A21), we have

\begin{matrix} {log}_{2} (3^{α} γ θ_{k}^{- 1}) & = {log}_{2} (\frac{3^{α} γ n_{k}^{q}}{2 a_{k} {({log}_{2} n_{k})}^{p - q}}) \\ < α {log}_{2} 1.5 γ + q {log}_{2} n_{k} + {log}_{2} a_{k}^{- 1} + (q - p) {log}_{2} ({log}_{2} n_{k}) \\ \leq 2 q {log}_{2} n_{k} + 2 {log}_{2} a_{k}^{- 1} . \end{matrix}

(A24)

Thus, by combining (A22) with (A24), we obtain

\begin{matrix} {[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p} & \leq \frac{2^{q + 2} a_{k}}{c_{1} α^{q}} {({log}_{2} n_{k})}^{p - q} {({log}_{2} a_{k}^{- 1} + q {log}_{2} n_{k})}^{q} \\ = C_{0} a_{k} {({log}_{2} n_{k})}^{p} {(\frac{{log}_{2} a_{k}^{- 1}}{{log}_{2} n_{k}} + q)}^{q}, \end{matrix}

(A25)

where

C_{0} = 2^{q + 2} c_{1}^{- 1} α^{- q}

.

Next, we use the property of

y (x) = x^{p}, x > 0

. When

p < 0

,

y (x)

is decreasing on

x > 0

; when

p > 0

,

y (x)

is increasing on

x > 0

. Then, we divide it into the following three cases.

Case 1:

p = 0

. For sufficiently large k, it follows from (A25) and

n_{k} \to \infty

that

1 \leq C_{0} a_{k} {(\frac{{log}_{2} a_{k}^{- 1}}{{log}_{2} n_{k}} + q)}^{q} \leq C a_{k} {({log}_{2} a_{k}^{- 1})}^{q} .

Therefore,

a_{k}^{- 1} \leq C {({log}_{2} a_{k}^{- 1})}^{q}

, which implies that

a_{k}^{- 1} \to 0

, contradicting (A23).

Case 2:

p > 0

. For sufficiently large k and

0 < α < 1

, we have

{({log}_{2} n_{k})}^{p} \leq {[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p} .

Then, it follows from (A25) that

1 \leq C_{0} a_{k} {(\frac{{log}_{2} a_{k}^{- 1}}{{log}_{2} n_{k}} + q)}^{q} \leq C a_{k} {({log}_{2} a_{k}^{- 1})}^{q},

which also implies that

a_{k}^{- 1} \to 0

, contradicting (A23).

Case 3:

p < 0

. For sufficiently large k and

0 < α < 1

, by (A23) and (A24), we obtain

\begin{matrix} {[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p} \\ > {[{log}_{2} n_{k} + {log}_{2} (3^{α} γ θ_{k}^{- 1}) + {log}_{2} (α^{- 1})]}^{p} \\ \geq {[2 {log}_{2} n_{k} + {log}_{2} (3^{α} γ θ_{k}^{- 1})]}^{p} \\ \geq {[(2 q + 2) {log}_{2} n_{k} + 2 {log}_{2} a_{k}^{- 1}]}^{p} . \end{matrix}

Thus, by combining these results with (A25) and multiplying both sides by

{({log}_{2} n_{k})}^{- p}

, we obtain

{[(2 q + 2) + \frac{2 {log}_{2} a_{k}^{- 1}}{{log}_{2} n_{k}}]}^{p} \leq C_{0} a_{k} {(\frac{{log}_{2} a_{k}^{- 1}}{{log}_{2} n_{k}} + q)}^{q},

which implies that

a_{k}^{- 1} \leq C_{0} {(\frac{{log}_{2} a_{k}^{- 1}}{{log}_{2} n_{k}} + q)}^{q} {[(2 q + 2) + \frac{2 {log}_{2} a_{k}^{- 1}}{{log}_{2} n_{k}}]}^{- p} \leq C {({log}_{2} a_{k}^{- 1})}^{q - p} .

Therefore,

a_{k}^{- 1} \to 0

as

k \to \infty

, which contradicts (A23). We complete the proof of Theorem 9 (i).

To prove Theorem 9 (ii), we assume that (8) is false. Then, there is an increasing sequence of

{n_{k}}_{k = 1}^{\infty}

satisfying, for

q > 0

,

b_{k} : = δ_{n_{k}}^{γ, α} {(K)}_{X} {({log}_{2} n_{k})}^{q} \to 0, as k \to \infty .

(A26)

Set

η_{n} : = c_{1} {({log}_{2} n)}^{- q}

and

δ_{n_{k}}^{γ, α} {(K)}_{X} = b_{k} {({log}_{2} n_{k})}^{- q} \leq 2 b_{k} {({log}_{2} n_{k})}^{- q} = : ζ_{k} .

Then by Lemma A7, we have

{[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ ζ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{- q} \leq 2 ζ_{k} = \frac{4 b_{k}}{c_{1}} {({log}_{2} n_{k})}^{- q} .

(A27)

When

q > 0

, the function

y = x^{- q}, x > 0

is decreasing. For sufficiently large k, it follows from (A23) and

0 < α < 1

that

{[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ ζ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{- q} \geq {[2 {log}_{2} n_{k} + {log}_{2} (3^{α} γ ζ_{k}^{- 1})]}^{- q} .

(A28)

Moreover, it is known that

\begin{matrix} {log}_{2} (3^{α} γ ζ_{k}^{- 1}) & = {log}_{2} (\frac{3^{α} γ {({log}_{2} n_{k})}^{q}}{2 b_{k}}) \\ < α {log}_{2} 1.5 γ + {log}_{2} b_{k}^{- 1} + q {log}_{2} ({log}_{2} n_{k}) \\ \leq q {log}_{2} n_{k} + 2 {log}_{2} b_{k}^{- 1} . \end{matrix}

By combining (A27) with (A28), we obtain

\frac{4 b_{k}}{c_{1}} {({log}_{2} n_{k})}^{- q} \geq {[(q + 2) {log}_{2} n_{k} + 2 {log}_{2} b_{k}^{- 1}]}^{- q} .

Thus,

b_{k}^{- 1} \leq \frac{4}{c_{1}} {((q + 2) + \frac{2 {log}_{2} b_{k}^{- 1}}{{log}_{2} n_{k}})}^{q} \leq C {({log}_{2} b_{k}^{- 1})}^{q},

which implies that

b_{k}^{- 1} \to 0

as

k \to \infty

, contradicting (A26). We complete the proof of Theorem 9 (ii).

Finally, we prove Theorem 9 (iii). The proof is similar to those above. It is known from Corollary 1 that if

γ \geq \frac{d_{n_{0}} {(K)}_{X} + rad (K)}{2^{α - 1}},

then

δ_{n} : = δ_{n}^{γ, α} {(K)}_{X} \to 0, as n \to \infty .

By using Lemma A7 with

θ = 2 δ_{n_{k}}^{γ, α} {(K)}_{X}

and

η_{n} = c_{1} 2^{- c_{2} n^{p}}

, we have

c_{1} 2^{- c_{2} {[\frac{n_{k}}{α} {log}_{2} (2^{- 1} 3^{α} γ δ_{n_{k}}^{- 1})]}^{p}} \leq 4 δ_{n_{k}} .

(A29)

Taking the logarithm on both sides of (A29), and using the fact that

δ_{n_{k}}^{- 1} \to \infty

as

k \to \infty

, we have

{log}_{2} (\frac{c_{1}}{4} δ_{n_{k}}^{- 1}) \leq c_{2} \frac{n_{k}^{p}}{α^{p}} {[{log}_{2} (2^{- 1} 3^{α} γ δ_{n_{k}}^{- 1}))]}^{p},

which implies that

{log}_{2} (δ_{n_{k}}^{- 1}) \leq C n_{k}^{p} {[{log}_{2} (δ_{n_{k}}^{- 1})]}^{p} .

Thus,

{log}_{2} (δ_{n_{k}}^{- 1}) \leq C^{\frac{1}{1 - p}} {n_{k}}^{\frac{p}{1 - p}} .

Therefore, we obtain

δ_{n_{k}} \geq C_{1} 2^{- C^{\frac{1}{1 - p}} n_{k}^{\frac{p}{1 - p}}} \geq C_{1} 2^{- C_{2} n_{k}^{\frac{p}{1 - p}}},

which contradicts

δ_{n_{k}} \to 0

as

k \to \infty

. The proof of Theorem 9 is completed. □

Appendix D. Proofs of Section 5

Appendix D.1. Proof of Theorem 10

Proof of Theorem 10.

The proofs are similar to those of Theorem 9. To prove Theorem 10 (i), we assume that (11) is false, that is, there is an increasing sequence of

{n_{k}}_{k = 1}^{\infty}

satisfying

a_{k} : = δ_{n_{k}}^{γ_{n_{k}}, α} {(K)}_{X} n_{k}^{2 q} {({log}_{2} n_{k})}^{- p} \to 0, as k \to \infty .

Thus, we have

δ_{n_{k}}^{γ_{n_{k}}, α} {(K)}_{X} = \frac{a_{k} {({log}_{2} n_{k})}^{p}}{n_{k}^{2 q}} \leq \frac{2 a_{k} {({log}_{2} n_{k})}^{p}}{n_{k}^{2 q}} = : θ_{k} .

(A30)

It follows from Lemma A7 with

η_{n} : = c_{1} \frac{{({log}_{2} n)}^{p}}{n^{q}}

that

{[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ_{n_{k}} θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p} \leq \frac{4 a_{k}}{c_{1} α^{q}} \cdot {({log}_{2} n_{k})}^{p} {[{log}_{2} (3^{α} γ_{n_{k}} θ_{k}^{- 1})]}^{q} n_{k}^{- q} .

(A31)

Recall that for sufficiently large k,

{log}_{2} n_{k} < n_{k} and a_{k}^{- 1} \to \infty .

(A32)

It is known from (A30) and

γ_{n_{k}} = M λ^{n_{k}}

that there exists a constant

C_{0} > 0

such that

\begin{matrix} {log}_{2} (3^{α} γ_{n_{k}} θ_{k}^{- 1}) & = {log}_{2} (\frac{3^{α} γ_{n_{k}} n_{k}^{2 q}}{2 a_{k} {({log}_{2} n_{k})}^{p}}) \\ < α {log}_{2} (1.5 \cdot M λ^{n_{k}}) + 2 q {log}_{2} n_{k} + {log}_{2} a_{k}^{- 1} - p {log}_{2} ({log}_{2} n_{k}) \\ \leq 2 C_{0} n_{k} + 2 {log}_{2} a_{k}^{- 1} . \end{matrix}

(A33)

Thus, by combining (A31) with (A33), we obtain

\begin{matrix} {[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ_{n_{k}} θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p} & \leq \frac{2^{q + 2} a_{k}}{c_{1} α^{q}} {({log}_{2} n_{k})}^{p} {({log}_{2} a_{k}^{- 1} + C_{0} n_{k})}^{q} n_{k}^{- q} \\ = C_{1} a_{k} {({log}_{2} n_{k})}^{p} {(\frac{{log}_{2} a_{k}^{- 1}}{n_{k}} + C_{0})}^{q}, \end{matrix}

(A34)

where

C_{1} = 2^{q + 2} c_{1}^{- 1} α^{- q}

. Then, we divide it into the following two cases.

Case 1:

p \geq 0

. For sufficiently large k and

0 < α < 1

, we have

{({log}_{2} n_{k})}^{p} \leq {[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ_{n_{k}} θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p} .

It follows from (A34) and

n_{k} \to \infty

that

1 \leq C_{1} a_{k} {(\frac{{log}_{2} a_{k}^{- 1}}{n_{k}} + C_{0})}^{q} < C a_{k} {({log}_{2} a_{k}^{- 1})}^{q} .

Therefore,

a_{k}^{- 1} \leq C {({log}_{2} a_{k}^{- 1})}^{q}

, which implies that

a_{k}^{- 1} \to 0

, contradicting (A32).

Case 2:

p < 0

. For sufficiently large k and

0 < α < 1

, by (A32) and (A33), we obtain

\begin{matrix} {[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ_{n_{k}} θ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{p} \geq {[2 {log}_{2} n_{k} + {log}_{2} (2 C_{0} n_{k} + 2 {log}_{2} a_{k}^{- 1})]}^{p} . \end{matrix}

Thus, by combining these results with (A34) and multiplying both sides by

{({log}_{2} n_{k})}^{- p}

, we obtain

{[2 + \frac{{log}_{2} (2 C_{0} n_{k} + 2 {log}_{2} a_{k}^{- 1})}{{log}_{2} n_{k}}]}^{p} \leq C_{1} a_{k} {(\frac{{log}_{2} a_{k}^{- 1}}{n_{k}} + C_{0})}^{q},

which implies that

a_{k}^{- 1} \leq C_{1} {(\frac{{log}_{2} a_{k}^{- 1}}{n_{k}} + C_{0})}^{q} {[2 + \frac{{log}_{2} (2 C_{0} n_{k} + 2 {log}_{2} a_{k}^{- 1})}{{log}_{2} n_{k}}]}^{- p} .

(A35)

Case 2.1: If

a_{k}^{- 1} \leq c n_{k}

, it follows from (A35) that

a_{k}^{- 1} \leq C,

where c, C are constants, which contradicts (A32).

Case 2.2: If

a_{k}^{- 1} > c n_{k}

, it follows from (A35) that

a_{k}^{- 1} \leq C {({log}_{2} a_{k}^{- 1})}^{q} {({log}_{2} ({log}_{2} a_{k}^{- 1}))}^{- p} \leq C {({log}_{2} a_{k}^{- 1})}^{q - p} .

Therefore,

a_{k}^{- 1} \to 0

as

k \to \infty

, which contradicts (A32). We complete the proof of Theorem 10 (i).

To prove Theorem 10 (ii), we assume that (12) is false. Then, there is an increasing sequence of

{n_{k}}_{k = 1}^{\infty}

satisfying, for

q > 0

,

b_{k} : = δ_{n_{k}}^{γ_{n_{k}}, α} {(K)}_{X} {({log}_{2} n_{k})}^{q} \to 0, as k \to \infty .

(A36)

Set

η_{n} : = c_{1} {({log}_{2} n)}^{- q}

and

δ_{n_{k}}^{γ_{n_{k}}, α} {(K)}_{X} = b_{k} {({log}_{2} n_{k})}^{- q} \leq 2 b_{k} {({log}_{2} n_{k})}^{- q} = : ζ_{k} .

Then, by Lemma A7, we have

{[{log}_{2} n_{k} + {log}_{2} ({log}_{2} (3^{α} γ_{n_{k}} ζ_{k}^{- 1})) + {log}_{2} (α^{- 1})]}^{- q} \leq 2 ζ_{k} = \frac{4 b_{k}}{c_{1}} {({log}_{2} n_{k})}^{- q} .

(A37)

For sufficiently large k, it follows from (A32) and

0 < α < 1

that there exists a constant

C_{0} > 0

such that

\begin{matrix} {log}_{2} (3^{α} γ_{n_{k}} ζ_{k}^{- 1}) = {log}_{2} (\frac{3^{α} M λ^{n_{k}} \cdot {({log}_{2} n_{k})}^{q}}{2 b_{k}}) < C_{0} n_{k} + 2 {log}_{2} b_{k}^{- 1} . \end{matrix}

By combining these results with (A37), we obtain

\frac{4 b_{k}}{c_{1}} {({log}_{2} n_{k})}^{- q} \geq {[2 {log}_{2} n_{k} + {log}_{2} (C_{0} n_{k} + 2 {log}_{2} b_{k}^{- 1})]}^{- q} .

Thus, we obtain

b_{k}^{- 1} \leq \frac{4}{c_{1}} {(2 + \frac{{log}_{2} (C_{0} n_{k} + 2 {log}_{2} b_{k}^{- 1})}{{log}_{2} n_{k}})}^{q} .

Case 2.1: If

b_{k}^{- 1} \leq c n_{k}

, then

b_{k}^{- 1} \leq C,

where c, C are constants, which contradicts (A36).

Case 2.2: If

b_{k}^{- 1} > c n_{k}

, it follows from (A35) that

b_{k}^{- 1} \leq C {({log}_{2} b_{k}^{- 1})}^{q} .

Therefore,

b_{k}^{- 1} \to 0

as

k \to \infty

, which contradicts (A36). We complete the proof of Theorem 10 (ii). □

Appendix D.2. Proof of Theorem 15

Proof of Theorem 15.

By Theorem 9 (i), it is clear that

δ_{n}^{γ, α} {(K_{η})}_{X} ≫ \frac{1}{n {log}_{2} (n + 1)} .

The proof of the upper bound is similar to the proof in [18]. We give the detailed proof, using the method from the proof of Theorem 7. For

j = 1, \dots, N

, define

l_{j} \in N \cup {0}

as

2^{- l_{j} - 1} < \frac{2 η_{j}}{γ} \leq 2^{- l_{j}} .

(A38)

Set

N = {(n + 1)}^{n}

. We could divide the unit ball

{[- 1, 1]}^{n} \subset R^{n}

into

k_{1}

non-overlapping open balls with side length

2^{- l_{k_{1}}}

,

(k_{2} - k_{1})

non-overlapping open balls with side length

2^{- l_{k_{2}}}

, ⋯, or

(k_{s} - k_{s - 1})

non-overlapping open balls with side length

2^{- l_{k_{s}}}

, where

k_{s} = N

and

l_{1} = \dots = l_{k_{1}} < l_{k_{1} + 1} = \dots = l_{k_{2}} < \dots < l_{k_{s - 1} + 1} = \dots = l_{k_{s}} .

Hence, there is a sequence of non-overlapping open balls

B_{j} \subset {[- 1, 1]}^{n}

with side length

2^{- l_{j}}

,

B_{j} : = B_{j} (y_{j}, 2^{- l_{j} - 1}), j = 1, \dots, N .

Let

ϕ_{j} : R^{n} \to R

be a mapping such that

ϕ_{j} (y) = η_{j} max {0, 1 - 2^{l_{j} + 1} ∥ y_{j} - y ∥_{ℓ_{\infty}^{n}}^{α}}, j = 1, \dots, N,

and the mapping

Φ

be such that

Φ (y) : = \sum_{j = 1}^{N} ϕ_{j} (y) \cdot e_{j} .

It is known from the proof of Theorem 7 and (A38) that

Φ

satisfies the

H_{α} (sup_{j = 1, \dots, N} {2^{l_{j} + 1} η_{j}})

condition; thus, it satisfies the

H_{α} (γ)

condition. Moreover,

Φ (y_{j}) = η_{j} e_{j}

,

j = 1, \dots, N

.

For

K_{η} = {η_{k} e_{k}}_{k = 1}^{\infty} \cup {0}

, by the decreasing property of

{η_{k}}_{k = 1}^{\infty}

, we have

\begin{matrix} sup_{k \geq 1} inf_{y \in B (Y_{n})} {∥ η_{k} e_{k} - Φ (y) ∥}_{X} \\ \leq sup_{k \geq 1} inf_{j = 1, \dots, N} {∥ η_{k} e_{k} - Φ (y_{j}) ∥}_{ℓ_{\infty}^{n}} \\ = sup_{k \geq 1} inf_{j = 1, \dots, N} {∥ η_{k} e_{k} - η_{j} e_{j} ∥}_{ℓ_{\infty}^{n}} \\ = η_{N}, \end{matrix}

and

inf_{y \in B (Y_{n})} {∥ 0 - Φ (y) ∥}_{X} \leq inf_{j = 1, \dots, N} ∥ Φ (y_{j}) ∥_{ℓ_{\infty}^{n}} = inf_{j = 1, \dots, N} {∥ η_{j} e_{j} ∥}_{ℓ_{\infty}^{n}} = η_{N} .

Thus,

δ_{n}^{γ, α} {(K_{η})}_{X} \leq sup_{f \in K_{η}} inf_{y \in B (Y_{n})} {∥ f - Φ (y) ∥}_{X} \leq η_{N} < \frac{1}{n {log}_{2} (n + 1)} .

The proof of Theorem 15 is completed. □

References

Kolmogoroff, A. Uber die beste Annaherung von Funktionen einer gegebenen Funktionenklasse. Ann. Math. 1936, 37, 107–110. [Google Scholar] [CrossRef]
Pinkus, A. n-Widths in Approximation Theory; Springer Science & Business Media: Berlin, Germany, 2012. [Google Scholar]
Lorentz, G.G.; Golitschek, M.; Makovoz, Y. Constructive Approximation: Advanced Problems; Springer: Berlin, Germany, 1996. [Google Scholar]
Fang, G.; Ye, P. Probabilistic and average linear widths of Sobolev space with Gaussian measure. J. Complex. 2003, 19, 73–84. [Google Scholar]
Fang, G.; Ye, P. Probabilistic and average linear widths of Sobolev space with Gaussian measure in L_∞-Norm. Constr. Approx. 2004, 20, 159–172. [Google Scholar]
Duan, L.; Ye, P. Exact asymptotic orders of various randomized widths on Besov classes. Commun. Pure Appl. Anal. 2020, 19, 3957–3971. [Google Scholar] [CrossRef]
Duan, L.; Ye, P. Randomized approximation numbers on Besov classes with mixed smoothness. Int. J. Wavelets Multiresolut. Inf. Process. 2020, 18, 2050023. [Google Scholar] [CrossRef]
Liu, Y.; Li, X.; Li, H. n-Widths of Multivariate Sobolev Spaces with Common Smoothness in Probabilistic and Average Settings in the S_q Norm. Axioms 2023, 12, 698. [Google Scholar] [CrossRef]
Liu, Y.; Li, H.; Li, X. Approximation Characteristics of Gel’fand Type in Multivariate Sobolev Spaces with Mixed Derivative Equipped with Gaussian Measure. Axioms 2023, 12, 804. [Google Scholar] [CrossRef]
Wu, R.; Liu, Y.; Li, H. Probabilistic and Average Gel’fand Widths of Sobolev Space Equipped with Gaussian Measure in the S_q-Norm. Axioms 2024, 13, 492. [Google Scholar] [CrossRef]
Liu, Y.; Lu, M. Approximation problems on the smoothness classes. Acta Math. Sci. 2024, 44, 1721–1734. [Google Scholar] [CrossRef]
DeVore, R.; Howard, R.; Micchelli, C. Optimal nonlinear approximation. Manuscr. Math. 1989, 63, 469–478. [Google Scholar] [CrossRef]
DeVore, R.; Hanin, B.; Petrova, G. Neural network approximation. Acta Numer. 2021, 30, 327–444. [Google Scholar] [CrossRef]
Petrova, G.; Wojtaszczyk, P. Limitations on approximation by deep and shallow neural networks. J. Mach. Learn. Res. 2023, 24, 1–38. [Google Scholar]
DeVore, R.; Kyriazis, G.; Leviatan, D.; Tichomirov, V. Wavelet compression and nonlinear-widths. Adv. Comput. Math. 1993, 1, 197–214. [Google Scholar] [CrossRef]
Temlyakov, V. Nonlinear Kolmogorov widths. Math. Notes 1998, 63, 785–795. [Google Scholar] [CrossRef]
Cohen, A.; DeVore, R.; Petrova, G.; Wojtaszczyk, P. Optimal stable nonlinear approximation. Found. Comput. Math. 2022, 22, 607–648. [Google Scholar] [CrossRef]
Petrova, G.; Wojtaszczyk, P. Lipschitz widths. Constr. Approx. 2023, 57, 759–805. [Google Scholar] [CrossRef]
Petrova, G.; Wojtaszczyk, P. On the entropy numbers and the Kolmogorov widths. arXiv 2022, arXiv:2203.00605. [Google Scholar]
Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114. [Google Scholar] [CrossRef]
Shen, Z.; Yang, H.; Zhang, S. Optimal approximation rate of ReLU networks in terms of width and depth. J. Math. Pures Appl. 2022, 157, 101–135. [Google Scholar] [CrossRef]
Fiorenza, R. Hölder and Locally Hölder Continuous Functions, and Open Sets of Class C^k, C^k,λ; Birkhäuser: Basel, Switzerland, 2017. [Google Scholar]
Opschoor, J.; Schwab, C.; Zech, J. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constr. Approx. 2021, 55, 537–582. [Google Scholar] [CrossRef]
Yang, Y.; Zhou, D. Optimal Rates of Approximation by Shallow ReLU^k Neural Networks and Applications to Nonparametric Regression. Constr. Approx. 2024, 1–32. [Google Scholar]
Lee, M. Mathematical Analysis and Performance Evaluation of the GELU Activation Function in Deep Learning. J. Math. 2023, 2023, 4229924. [Google Scholar] [CrossRef]
Forti, M.; Grazzini, M.; Nistri, P.; Pancioni, L. Generalized Lyapunov approach for convergence of neural networks with discontinuous or non-Lipschitz activations. Phys. D 2006, 214, 88–99. [Google Scholar] [CrossRef]
Gavalda, R.; Siegelmann, H. Discontinuities in recurrent neural networks. Neural Comput. 1999, 11, 715–745. [Google Scholar] [CrossRef]
Tatar, N. Hölder continuous activation functions in neural networks. Adv. Differ. Equ. Control Process. 2015, 15, 93–106. [Google Scholar]
Carl, B. Entropy numbers, s-numbers, and eigenvalue problems. J. Funct. Anal. 1981, 41, 290–306. [Google Scholar] [CrossRef]
Konyagin, S.; Temlyakov, V. The Entropy in Learning Theory. Error Estimates. Constr. Approx. 2007, 25, 1–27. [Google Scholar] [CrossRef]
Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019. [Google Scholar]
Donoho, D.L. Compressed sensing. IEEE Trans. Inform. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
Siegel, J.W. Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. J. Mach. Learn. Res. 2023, 24, 1–52. [Google Scholar]
Lu, J.; Shen, Z.; Yang, H.; Zhang, S. Deep network approximation for smooth functions. SIAM J. Math. Anal. 2021, 53, 5465–5506. [Google Scholar] [CrossRef]
Birman, M.; Solomyak, M. Piecewise polynomial approximations of functions of the class $W_{p}^{α}$ . Mat. Sb. 1967, 73, 331–355. (In Russian) [Google Scholar]
DeVore, R.; Sharpley, R. Besov spaces on domains in $R^{d}$ . Trans. Am. Math. Soc. 1993, 335, 843–864. [Google Scholar]
Mazzucato, A. Besov-Morrey spaces: Function space theory and applications to non-linear PDE. Trans. Am. Math. Soc. 2003, 355, 1297–1364. [Google Scholar] [CrossRef]
Garnett, J.; Le, T.; Meyer, Y.; Vese, A. Image decompositions using bounded variation and generalized homogeneous Besov spaces. Appl. Comput. Harmon. Anal. 2007, 23, 25–56. [Google Scholar] [CrossRef]
Marinucci, D.; Pietrobon, D.; Balbi, A.; Baldi, P.; Cabella, P.; Kerkyacharian, G.; Natoli, P.; Picard, D.; Vittorio, N. Spherical needlets for cosmic microwave background data analysis. Mon. Not. R. Astron. Soc. 2008, 383, 539–545. [Google Scholar] [CrossRef]
Dai, F.; Xu, Y. Approximation Theory and Harmonic Analysis on Spheres and Balls; Springer Monographs in Mathematics; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
Feng, H.; Huang, S.; Zhou, D.X. Generalization analysis of CNNs for classification on spheres. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 6200–6213. [Google Scholar] [CrossRef] [PubMed]
Kushpel, A.; Tozoni, S. Entropy numbers of Sobolev and Besov classes on homogeneous spaces. In Advances in Analysis; World Scientific Publishing: Hackensack, NJ, USA, 2005; pp. 89–98. [Google Scholar]
Zhou, D.X. Theory of deep convolutional neural networks: Downsampling. Neural Netw. 2020, 124, 319–327. [Google Scholar] [CrossRef]
Zhou, D.X. Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 2020, 48, 787–794. [Google Scholar] [CrossRef]
Mao, T.; Shi, Z.; Zhou, D.X. Theory of deep convolutional neural networks III: Approximating radial functions. Neural Netw. 2021, 144, 778–790. [Google Scholar] [CrossRef]
Kühn, T. Entropy Numbers of General Diagonal Operators. Rev. Mat. Complut. 2005, 18, 479–491. [Google Scholar] [CrossRef]
Carl, B.; Stephani, I. Entropy, Compactness and the Approximation of Operators; Cambridge University Press: Cambridge, UK, 1990. [Google Scholar]
Wojtaszczyk, P. Banach Spaces for Analysts; Cambridge University Press: Cambridge, UK, 1991. [Google Scholar]

Figure 1. Approximation error and the number of elements n and depth d: classical methods (

n

-term wavelets and adaptive finite elements) vs. new tools (deep neural networks).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The Theory and Applications of Hölder Widths

Abstract

1. Introduction

2. Fundamental Properties of Hölder Widths

3. The Relationship Between Hölder Widths and Other Widths

3.1. The Relationship Between Hölder Widths, $n$ -Kolmogorov Widths, and Linear Widths

3.2. The Relationship Between Hölder Widths and Nonlinear $(n, N)$ -Widths

4. Comparison Between Hölder Widths and Entropy Numbers

5. Some Applications

6. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Proofs of Section 2

Appendix A.1. Proof of Theorem 1

Appendix A.2. Proofs of Theorems 2–4

Appendix B. Proofs of Section 3

Appendix B.1. Proofs of Theorem 5 and Corollary 1

Appendix B.2. Proofs of Theorem 6

Appendix C. Proofs of Section 4

Appendix C.1. Proof of Theorem 7

Appendix C.2. Proof of Theorem 8

Appendix C.3. Proof of Theorem 9

Appendix D. Proofs of Section 5

Appendix D.1. Proof of Theorem 10

Appendix D.2. Proof of Theorem 15

References

Article Metrics

Citations

Article Access Statistics

The Theory and Applications of Hölder Widths

Abstract

1. Introduction

2. Fundamental Properties of Hölder Widths

3. The Relationship Between Hölder Widths and Other Widths

3.1. The Relationship Between Hölder Widths, n -Kolmogorov Widths, and Linear Widths

3.2. The Relationship Between Hölder Widths and Nonlinear ( n , N ) -Widths

4. Comparison Between Hölder Widths and Entropy Numbers

5. Some Applications

6. Concluding Remarks

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Proofs of Section 2

Appendix A.1. Proof of Theorem 1

Appendix A.2. Proofs of Theorems 2–4

Appendix B. Proofs of Section 3

Appendix B.1. Proofs of Theorem 5 and Corollary 1

Appendix B.2. Proofs of Theorem 6

Appendix C. Proofs of Section 4

Appendix C.1. Proof of Theorem 7

Appendix C.2. Proof of Theorem 8

Appendix C.3. Proof of Theorem 9

Appendix D. Proofs of Section 5

Appendix D.1. Proof of Theorem 10

Appendix D.2. Proof of Theorem 15

References

Article Metrics

Citations

Article Access Statistics

3.1. The Relationship Between Hölder Widths, $n$ -Kolmogorov Widths, and Linear Widths

3.2. The Relationship Between Hölder Widths and Nonlinear $(n, N)$ -Widths