Article

The Theory and Applications of Hölder Widths

Department of Applied Mathematics, School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
* Author to whom correspondence should be addressed.
Axioms 2025, 14(1), 25; https://doi.org/10.3390/axioms14010025
Submission received: 4 December 2024 / Revised: 23 December 2024 / Accepted: 27 December 2024 / Published: 31 December 2024

Abstract

We introduce the Hölder width, which measures the best error performance of some recent nonlinear approximation methods, such as deep neural network approximation. We then investigate the relationship between Hölder widths and other widths, showing that some Hölder widths are essentially smaller than n-Kolmogorov widths and linear widths. We also prove that, as the Hölder constants grow with n, the Hölder widths are much smaller than the entropy numbers. The fact that Hölder widths are smaller than the known widths implies that the nonlinear approximation represented by deep neural networks can provide a better approximation order than other existing approximation methods, such as adaptive finite elements and n-term wavelet approximation. In particular, we show that the Hölder widths for Sobolev and Besov classes, induced by deep neural networks, are $O(n^{-2s/d})$ and are much smaller than the other known widths and entropy numbers, which are of order $O(n^{-s/d})$.

1. Introduction

Width theory is one of the most important topics in approximation theory because widths can be considered approximation standards that indicate the accuracy achievable for a given function class using an approximation method. They have been extensively studied and applied in various fields, providing a benchmark for the best performance of different approximation techniques. One of the earliest studies on widths was Kolmogorov’s work in 1936, where he introduced the concept of n -Kolmogorov widths [1]. With the development of modern science and engineering, the theory of widths has also developed rapidly, greatly promoting research into various linear and nonlinear approximation methods. Problems related to width theory have been and continue to be studied by many experts, including Pinkus, Lorentz, DeVore, Temlyakov, et al. [2,3,4,5,6,7,8,9,10,11,12]. In addition, nonlinear methods play a crucial role in understanding complex phenomena across various applications, such as compressed sensing, signal processing, and neural networks [13,14,15]. Widths such as manifold widths, nonlinear widths, and Lipschitz widths have been utilized as fundamental measures to assess the optimal convergence rate of these nonlinear methods [12,15,16,17,18].
It is known that neural networks can serve as powerful nonlinear tools. For instance, the ReLU (Rectified Linear Unit) activation function,
$$\sigma(t) := \mathrm{ReLU}(t) = \max\{0, t\}, \qquad t \in \mathbb{R},$$
is characterized as a Lipschitz mapping, which has led to the introduction of stable manifold widths and Lipschitz widths. In [13,17], Cohen et al. and DeVore, et al. investigated stable manifold widths to quantify error performance in nonlinear approximation methods, such as compressed sensing and neural networks. They discussed the fundamental properties of these widths and established their connections with entropy numbers. In [18,19], Petrova and Wojtaszczyk introduced Lipschitz widths and showed their relationships with other widths and entropy numbers. However, not all mappings are Lipschitz; thus, it is essential to consider weaker conditions to understand the error performance of nonlinear approximation methods. One such condition is the Hölder mapping, which we explore in this paper. We will introduce the concept of Hölder widths and investigate their relationship with other widths and entropy numbers. Our results may provide a better understanding of the effects of such nonlinear approximation methods and their potential applications in deep neural networks.
Many authors have achieved profound results for the ReLU activation function, which acts as a Lip 1 continuous function in feed-forward deep neural networks (DNNs) [14,17,18,20,21]. It is known that the mapping $\Phi : (B_{\ell_\infty^{cn}}, \|\cdot\|_{\ell_\infty^{cn}}) \to C([0,1]^d)$ in [18] with the ReLU activation function is a $C^n W^n$-Lipschitz mapping in DNNs, where the network has width W and depth n. Here, unlike in the term "Lipschitz width", the words "width" and "depth" refer to the scale of the network. The performance of this Lipschitz mapping is discussed in [14].
We introduce a more flexible assumption. Let $(Y, d)$ and $(Z, \rho)$ be metric spaces, and assume that the space Z is separable. We call $\Phi : (Y, d) \to (Z, \rho)$ an α-Hölder mapping with coefficient γ if for any $x, y \in Y$,
$$\rho\big(\Phi(x), \Phi(y)\big) \le \gamma\, d^{\alpha}(x, y), \qquad \gamma > 0 \ \text{and} \ 0 < \alpha \le 1.$$
Equivalently, we say that Φ satisfies the $H_\alpha(\gamma)$ condition [22]:
$$\sup_{x, y \in Y,\ x \ne y} \frac{\rho\big(\Phi(x), \Phi(y)\big)}{d^{\alpha}(x, y)} \le \gamma, \qquad \gamma > 0 \ \text{and} \ 0 < \alpha \le 1.$$
We provide some remarks on Hölder mappings below.
Remark 1.
If α = 1 , then Φ is Lipschitz continuous.
Remark 2.
If Y is bounded and Φ is an α-Hölder mapping, then for any $\beta \le \alpha$, Φ is a β-Hölder mapping.
Remark 3.
The minimum α-Hölder coefficient γ = 0 if and only if Φ is constant.
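For completeness, here is a short verification of Remark 2 (a sketch under the assumption $D := \mathrm{diam}(Y) < \infty$; the symbol D is introduced only for this computation): for $\beta \le \alpha$ and all $x, y \in Y$,
$$\rho\big(\Phi(x), \Phi(y)\big) \le \gamma\, d^{\alpha}(x, y) = \gamma\, d^{\beta}(x, y)\, d^{\alpha-\beta}(x, y) \le \gamma D^{\alpha-\beta}\, d^{\beta}(x, y),$$
so Φ satisfies the $H_\beta(\gamma D^{\alpha-\beta})$ condition.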
Note that the RePU (Rectified Power Unit) activation function [23,24],
$$\sigma_1(t) := \mathrm{RePU}(t) = \big(\max\{0, t\}\big)^{\alpha}, \qquad t \in \mathbb{R},$$
with $\alpha \in \mathbb{N}$, $\alpha \ge 2$, and the GELU (Gaussian Error Linear Unit) activation function [25],
$$\sigma_2(t) := \mathrm{GELU}(t) = 0.5\, t \left( 1 + \tanh\!\Big( \sqrt{2/\pi}\, \big( t + 0.044715\, t^3 \big) \Big) \right), \qquad t \in \mathbb{R},$$
can be considered 1-Hölder mappings on bounded sets. The performance of these mappings can be found in [14,18]. Moreover, there are various α-Hölder activation functions with $0 < \alpha < 1$. In [26], Forti, Grazzini, et al. obtained global convergence results in which the neuron activations are modeled by $\alpha_i$-Hölder continuous functions with $\alpha_i \in (0, 1)$, such as
$$\sigma_i(t) = k_i\, \mathrm{sign}(t)\, |t|^{\alpha_i},$$
where $k_i > 0$ and the index i ranges over the set defined in [26]. These activations can significantly increase the computational power [26,27,28]. Motivated by the above results, we mainly focus on the α-Hölder condition with $0 < \alpha < 1$, which is weaker than the Lipschitz condition.
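As a quick illustration (not part of the paper's argument), the following Python snippet numerically checks the $H_\alpha(\gamma)$ condition for an activation of the above form on $[-1, 1]$; the values of k, α, and the sample size are arbitrary choices made for the demonstration.

```python
# Numerical sanity check (illustrative): the non-Lipschitz activation
# sigma(t) = k * sign(t) * |t|**alpha, of the form used in [26], satisfies the
# H_alpha(gamma) condition on [-1, 1], while its Lipschitz ratio is unbounded.
import numpy as np

k, alpha = 1.0, 0.5

def sigma(t):
    return k * np.sign(t) * np.abs(t) ** alpha

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = rng.uniform(-1.0, 1.0, size=100_000)
keep = x != y

# Hoelder ratio |sigma(x) - sigma(y)| / |x - y|^alpha: stays bounded
# (close to sqrt(2) * k for this sigma).
hoelder_ratio = np.abs(sigma(x[keep]) - sigma(y[keep])) / np.abs(x[keep] - y[keep]) ** alpha
# Lipschitz ratio |sigma(x) - sigma(y)| / |x - y|: grows without bound
# as pairs of points approach the origin.
lipschitz_ratio = np.abs(sigma(x[keep]) - sigma(y[keep])) / np.abs(x[keep] - y[keep])

print("empirical Hoelder constant  :", hoelder_ratio.max())
print("empirical Lipschitz 'constant':", lipschitz_ratio.max())
```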
Now, we introduce Hölder widths, which measure the best error performance of some recent nonlinear approximation methods characterized by Hölder mappings. Throughout this paper, let X be a Banach space with norm $\|\cdot\|_X$, and let $Y_n$ be an n-dimensional Banach space with norm $\|\cdot\|_{Y_n}$ on $\mathbb{R}^n$, $n \ge 1$. Denote the unit ball of $Y_n$ by $B(Y_n) := \{ y \in \mathbb{R}^n : \|y\|_{Y_n} \le 1 \}$.
Let K be a bounded subset of X. For $\gamma \ge 0$ and $0 < \alpha < 1$, we define the fixed Hölder width
$$\delta_n^{\gamma,\alpha}(K, Y_n)_X := \inf_{\Phi_n} \sup_{f \in K} \inf_{y \in B(Y_n)} \| f - \Phi_n(y) \|_X, \qquad (1)$$
where the infimum is taken over all mappings $\Phi_n : (B(Y_n), \|\cdot\|_{Y_n}) \to X$ satisfying
$$\sup_{x, y \in B(Y_n),\ x \ne y} \frac{\| \Phi_n(x) - \Phi_n(y) \|_X}{\| x - y \|_{Y_n}^{\alpha}} \le \gamma.$$
Next, we define the Hölder width
$$\delta_n^{\gamma,\alpha}(K)_X := \inf_{\|\cdot\|_{Y_n}} \delta_n^{\gamma,\alpha}(K, Y_n)_X, \qquad (2)$$
where the infimum is taken over all norms $\|\cdot\|_{Y_n}$ on $\mathbb{R}^n$. From definition (1), we see that the error of any numerical method based on Hölder mappings will not be smaller than the Hölder widths.
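To make the quantity in (1) concrete, here is a small, purely illustrative Python sketch: it evaluates $\sup_{f \in K} \inf_{y \in B(Y_1)} \| f - \Phi(y) \|_X$ for one fixed Hölder map Φ and a toy finite set K in $\mathbb{R}^2$ (all choices below are assumptions made for the demo), which gives an upper bound on the corresponding fixed Hölder width.

```python
# Toy evaluation of the quantity inside definition (1) for ONE candidate map.
# Here X = R^2 with the Euclidean norm, n = 1, B(Y_1) = [-1, 1], and K is a
# small finite set; the true width takes an infimum over all admissible maps.
import numpy as np

K = np.array([[0.0, 0.0], [1.0, 0.5], [-1.0, 0.5], [0.5, 1.0]])   # toy compact set
ys = np.linspace(-1.0, 1.0, 2001)                                  # discretized B(Y_1)

def Phi(y, alpha=0.5):
    # a 1/2-Hoelder curve in R^2 (one admissible candidate map)
    return np.stack([np.sign(y) * np.abs(y) ** alpha, np.abs(y) ** alpha], axis=-1)

dists = np.linalg.norm(K[:, None, :] - Phi(ys)[None, :, :], axis=2)
err = dists.min(axis=1).max()            # sup over f in K of inf over y in B(Y_1)
print("error of this candidate Hoelder map:", err)
# Taking the infimum over all admissible maps Phi_n and all norms ||.||_{Y_n}
# gives the Hoelder width itself; the value above is only an upper bound.
```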
We propose the Hölder widths, exploring their properties and relationships with other known widths and entropy numbers. In Section 2, we establish the fundamental properties of Hölder widths. In Section 3, we compare Hölder widths with n -Kolmogorov widths, linear widths, and nonlinear ( n , N ) -widths. In Section 4, we investigate the relationship between Hölder widths and entropy numbers. In Section 5, we provide some specific applications and derive the asymptotic order of Hölder widths for Sobolev classes B W p s ( [ 0 , 1 ] d ) and Besov classes B B p , τ s ( [ 0 , 1 ] d ) using deep neural networks. In Section 6, we provide some concluding remarks. All detailed proofs for the results from Section 2, Section 3 and Section 4 are included in Appendix A, Appendix B and Appendix C, and the proofs for Theorems 10 and 15 in Section 5 are provided in Appendix D.

2. Fundamental Properties of Hölder Widths

Recall that the radius of a set $K \subset X$ is defined as
$$\mathrm{rad}(K) := \inf_{g \in X} \sup_{f \in K} \| f - g \|_X.$$
It is known from Remark 3 that a function that satisfies the H α ( 0 ) condition is a constant function. Then, for the n -dimensional space Y n ,
$$\mathrm{rad}(K) = \delta_n^{0,\alpha}(K, Y_n)_X = \delta_n^{0,\alpha}(K)_X.$$
Moreover, for a fixed constant α, it follows from (2) that $\delta_n^{\gamma,\alpha}(K)_X$ is decreasing with respect to both γ and n; that is, (i) if $\gamma_1 \le \gamma_2$, then $\delta_n^{\gamma_2,\alpha}(K)_X \le \delta_n^{\gamma_1,\alpha}(K)_X \le \mathrm{rad}(K) < \infty$, and (ii) if $n_1 \le n_2$, then $\delta_{n_2}^{\gamma,\alpha}(K)_X \le \delta_{n_1}^{\gamma,\alpha}(K)_X$.
In addition, it is easy to see that the space $(\mathbb{R}^n, \|\cdot\|_{Y_n})$ in (1) and (2) can be replaced with any n-dimensional normed space $(Z_n, \|\cdot\|_{Z_n})$, so that
$$\delta_n^{\gamma,\alpha}(K)_X = \inf_{\|\cdot\|_{Z_n}} \delta_n^{\gamma,\alpha}(K, Z_n)_X, \qquad \delta_n^{\gamma,\alpha}(K, Z_n)_X = \inf_{\Phi_n} \sup_{f \in K} \inf_{z \in B(Z_n)} \| f - \Phi_n(z) \|_X,$$
where $B(Z_n) := \{ z \in Z_n : \| z \|_{Z_n} \le 1 \}$.
Denote by
$$\ell_p^n := (\mathbb{R}^n, \|\cdot\|_p)$$
the space $\mathbb{R}^n$ equipped with the $\ell_p$ norm; that is, for $y = (y_1, y_2, \ldots, y_n) \in \mathbb{R}^n$,
$$\| y \|_{\ell_\infty^n} := \max_j |y_j|, \qquad \| y \|_{\ell_p^n} := \Big( \sum_{j=1}^n |y_j|^p \Big)^{1/p}.$$
Recall that an ε-covering of K is a collection $\{ g_1, \ldots, g_m \} \subset X$ such that
$$K \subset \bigcup_{j=1}^m B(g_j, \varepsilon).$$
The minimal ε-covering number $N_\varepsilon(K)$ is the minimal cardinality of an ε-covering of K. We say that a set K is totally bounded if for every $\varepsilon > 0$, $N_\varepsilon(K) < \infty$.
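The covering number can be bounded from above constructively. The following Python sketch (an illustration under the assumption that K is represented by a finite sample of points) builds a greedy ε-net; its size is an upper bound on the covering number of that sample.

```python
# Greedy construction of an eps-covering of a finite point cloud; its size is
# an upper bound on the minimal covering number N_eps of the sample.
import numpy as np

def greedy_eps_net(points, eps):
    """Return centers g_1, ..., g_m from `points` whose eps-balls cover `points`."""
    centers = []
    uncovered = points
    while len(uncovered) > 0:
        g = uncovered[0]                       # pick any uncovered point as a center
        centers.append(g)
        d = np.linalg.norm(uncovered - g, axis=1)
        uncovered = uncovered[d > eps]         # drop everything the new ball covers
    return np.array(centers)

rng = np.random.default_rng(1)
K_sample = rng.uniform(-1.0, 1.0, size=(2000, 2))    # sample of a compact set in R^2
net = greedy_eps_net(K_sample, eps=0.25)
print("covering number upper bound:", len(net))      # N_eps(K_sample) <= len(net)
```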
We establish the following fundamental properties of Hölder widths.
Theorem 1.
Let K be a compact subset of X. For any $n \in \mathbb{N}$, $\gamma > 0$, and $0 < \alpha < 1$, there exists a norm $\|\cdot\|_Y$ on $\mathbb{R}^n$ satisfying, for $y \in \mathbb{R}^n$,
$$\| y \|_{\ell_\infty^n} \le \| y \|_Y \le \| y \|_{\ell_1^n},$$
such that
$$\delta_n^{\gamma,\alpha}(K)_X = \delta_n^{\gamma,\alpha}(K, Y)_X.$$
Theorem 2.
Let K be a compact subset of X. As a function of γ, the Hölder width δ n γ , α ( K ) X is continuous.
Theorem 3.
A subset K of X is totally bounded if and only if for every $n \in \mathbb{N}$,
$$\lim_{\gamma \to \infty} \delta_n^{\gamma,\alpha}(K)_X = 0.$$
Theorem 4.
Let $K \subset X$ and $0 < \alpha < 1$. If for every $\varepsilon > 0$ there exist two numbers $n \in \mathbb{N}$ and $\gamma > 0$ such that $\delta_n^{\gamma,\alpha}(K)_X \le \varepsilon$, then K is totally bounded.
It is important to delve deeper into Hölder widths since the Hölder condition offers a more flexible framework that can accommodate a broader range of functions, making it particularly suitable for analyzing and approximating complicated functions and datasets. In the forthcoming sections, we will gain deeper insights into the approximation performance measured by Hölder widths.

3. The Relationship Between Hölder Widths and Other Widths

In this section, we will demonstrate that Hölder widths are smaller than other known widths, such as n -Kolmogorov, linear, and nonlinear ( n , N ) -widths. In the following sections, we assume that K is a compact subset of the Banach space X, which is our main concern.

3.1. The Relationship Between Hölder Widths, n -Kolmogorov Widths, and Linear Widths

We recall the definition of the n -Kolmogorov width of K from [2], as follows:
$$d_0(K)_X := \sup_{f \in K} \| f \|_X, \qquad d_n(K)_X := \inf_{\dim(X_n) = n} \sup_{f \in K} \inf_{g \in X_n} \| f - g \|_X, \quad n \ge 1,$$
where the infimum is taken over all n-dimensional subspaces $X_n \subset X$.
It is known that the n -Kolmogorov width determines the optimal errors generated by approximating the ‘worst’ element of the set K using n -dimensional subspaces of X. We will show that some Hölder widths are essentially smaller than n -Kolmogorov widths.
Theorem 5.
For a compact set $K \subset X$, $n \ge 1$, and $0 < \alpha < 1$,
$$\delta_n^{\gamma,\alpha}(K)_X \le d_n(K)_X, \qquad \text{for } \gamma = \frac{d_n(K)_X + \mathrm{rad}(K)}{2^{\alpha-1}}.$$
Corollary 1.
If $K \subset X$ is compact, then for each $n_0 \in \mathbb{N}$ and each $\gamma \ge \dfrac{d_{n_0}(K)_X + \mathrm{rad}(K)}{2^{\alpha-1}}$,
$$\lim_{n \to \infty} \delta_n^{\gamma,\alpha}(K)_X = 0.$$
Remark 4.
Recall that the linear width $d_n^L(K)_X$ is defined as
$$d_0^L(K)_X := \sup_{f \in K} \| f \|_X, \qquad d_n^L(K)_X := \inf_{L \in \mathcal{L}_n} \sup_{f \in K} \| f - L(f) \|_X, \quad n \ge 1,$$
where the infimum is taken over the class $\mathcal{L}_n$ of all continuous linear operators from X into itself with rank at most n. It follows from the definitions of the n-Kolmogorov width and the linear width that $d_n(K)_X \le d_n^L(K)_X$. Thus, Theorem 5 implies that
$$\delta_n^{\gamma,\alpha}(K)_X \le d_n^L(K)_X, \qquad \text{for } \gamma = \frac{d_n(K)_X + \mathrm{rad}(K)}{2^{\alpha-1}}.$$

3.2. The Relationship Between Hölder Widths and Nonlinear ( n , N ) -Widths

To evaluate the performance of the best n-term approximation with respect to different systems, such as the trigonometric system and wavelet bases, Temlyakov introduced the nonlinear (n, N)-width in [16], which is defined as follows: for $N \ge 1$, $n \ge 1$,
$$d_0(K, N)_X := \sup_{f \in K} \| f \|_X, \qquad d_n(K, N)_X := \inf_{\mathcal{L}_N} \sup_{f \in K} \inf_{X_n \in \mathcal{L}_N} \mathrm{dist}(f, X_n)_X,$$
where the second infimum is taken over collections $\mathcal{L}_N$ of N linear subspaces $X_n \subset X$ of dimension n. The nonlinear (n, N)-width reflects the approximation performance of greedy algorithms.
It is clear that $d_n(K, 1)_X = d_n(K)_X$. The larger N is, the more flexibility we have in approximating f. Moreover, it is known from (6) and Theorem 5 that
$$d_n(K, N)_X \ge d_{n \cdot N}(K)_X \ge \delta_{n \cdot N}^{\gamma,\alpha}(K)_X, \qquad \text{where } \gamma = 2^{2-\alpha}\, \mathrm{rad}(K).$$
Moreover, we obtain the following inequalities, revealing the relationship between Hölder widths and nonlinear ( n , N ) -widths.
Theorem 6.
For any $n \ge 1$, $N > 1$, $0 < \alpha < 1$, and any compact set $K \subset X$ with $\sup_{f \in K} \| f \|_X = 1$, we have
$$\delta_{n+1}^{4(N+1)^n,\,\alpha}(K)_X \le d_n(K, N)_X, \qquad \text{and} \qquad \delta_{n+\log_2 N}^{12^n,\,\alpha}(K)_X \le d_n(K, N)_X.$$

4. Comparison Between Hölder Widths and Entropy Numbers

We first recall the definition of the entropy number from [29]. The entropy number ε n ( K ) X is defined as
$$\varepsilon_n(K)_X := \inf\Big\{ \varepsilon > 0 : K \subset \bigcup_{j=1}^{2^n} B(g_j, \varepsilon),\ g_j \in X,\ j = 1, 2, \ldots, 2^n \Big\},$$
which is the infimum of all $\varepsilon > 0$ for which $2^n$ balls of radius ε cover the compact set $K \subset X$, where $B(g_j, \varepsilon) := \{ g \in X : \| g - g_j \|_X \le \varepsilon \}$.
Entropy numbers have many applications in fields such as compressed sensing, statistics, and learning theory [30,31,32]. They can provide a benchmark for the best error performance of numerical recovery algorithms. Sometimes, estimating the entropy numbers ε n ( K ) X is more accessible than computing other known widths, such as n -Kolmogorov widths and nonlinear ( n , N ) -widths. For example, for some model classes K, such as unit balls in classical Sobolev and Besov spaces, the entropy numbers are known and can also be used to estimate the lower bound of these widths.
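As an illustration of how such entropy numbers can be estimated in practice, the sketch below (an assumption-laden toy, not a method from the paper) bisects on ε until a greedy ε-net of a finite sample needs at most $2^n$ centers; since the greedy net size is only approximately monotone in ε, the result is a rough upper bound on the entropy number of the sample.

```python
# Rough numerical upper bound on the entropy number eps_n of a finite sample:
# bisect on eps until a greedy eps-net uses at most 2^n centers.
import numpy as np

def greedy_net_size(points, eps):
    count, uncovered = 0, points
    while len(uncovered) > 0:
        count += 1
        d = np.linalg.norm(uncovered - uncovered[0], axis=1)
        uncovered = uncovered[d > eps]
    return count

def entropy_number_upper_bound(points, n, iters=30):
    lo, hi = 0.0, float(np.linalg.norm(points.max(0) - points.min(0)))
    for _ in range(iters):                       # bisect on eps
        mid = 0.5 * (lo + hi)
        if greedy_net_size(points, mid) <= 2 ** n:
            hi = mid                             # 2^n balls of radius mid suffice
        else:
            lo = mid
    return hi

rng = np.random.default_rng(2)
K_sample = rng.uniform(-1.0, 1.0, size=(1500, 2))
print([round(entropy_number_upper_bound(K_sample, n), 3) for n in range(1, 8)])
```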
In this section, we compare the convergence rate of the Hölder widths with that of the entropy numbers. We first obtain the following general results.
Theorem 7.
For any $n \ge 1$ and $0 < \alpha < 1$, we have
$$\delta_n^{2^k \mathrm{rad}(K),\,\alpha}(K)_X \le \varepsilon_{kn}(K)_X, \qquad k = 1, 2, \ldots.$$
In particular, if $k = n$, then
$$\delta_n^{2^n \mathrm{rad}(K),\,\alpha}(K)_X \le \varepsilon_{n^2}(K)_X.$$
Remark 5.
It is known from Theorem 7 that if $k = 1$, then
$$\delta_n^{2\,\mathrm{rad}(K),\,\alpha}(K)_X \le \varepsilon_n(K)_X.$$
So, by the decreasing property of $\delta_n^{\gamma,\alpha}(K)_X$ with respect to γ, Hölder widths are smaller than entropy numbers for $\gamma \ge 2\,\mathrm{rad}(K)$.
It follows from Theorem 7 with k = 1 that the following corollary holds.
Corollary 2.
Let $0 < \alpha < 1$ and $\gamma \ge 2\,\mathrm{rad}(K)$.
(i) If the following inequality holds:
$$\varepsilon_n(K)_X \le c_1 (\log_2 n)^{p} n^{-q}, \qquad n \ge 1,$$
where $c_1, p, q$ are constants such that $p \in \mathbb{R}$ and $c_1, q > 0$, then we have
$$\delta_n^{\gamma,\alpha}(K)_X \le C (\log_2 n)^{p} n^{-q}, \qquad n \ge 1,$$
where C is a positive constant.
(ii) If the following inequality holds:
$$\varepsilon_n(K)_X \le c_1 (\log_2 n)^{-q}, \qquad n \ge 1,$$
where $c_1, q$ are two positive constants, then we have
$$\delta_n^{\gamma,\alpha}(K)_X \le C (\log_2 n)^{-q}, \qquad n \ge 1,$$
where C is a positive constant.
Theorem 7 and Corollary 2 show that an upper bound on the entropy numbers yields an upper bound of the same order on the Hölder widths. Conversely, if we know a lower bound on the entropy numbers, then we can obtain a lower bound on the Hölder widths. To show this, we need the following theorem.
Theorem 8.
Let $\gamma > 0$, $0 < \alpha < 1$, $p \in \mathbb{R}$, and $q > 0$. If there exists $n > n_0$, where $n_0 = n_0(c_0, \alpha, p, q)$, such that
$$\delta_n^{\gamma,\alpha}(K)_X < c_0 (\log_2 n)^{p} n^{-q},$$
then for $m = c_1 n \log_2 n$,
$$\varepsilon_m(K)_X < C (\log_2 m)^{p+q} m^{-q},$$
where $c_1, C$ are constants that depend on γ, α, $c_0$, p, and q.
Based on Theorem 8, we can obtain a lower bound of the Hölder width from that of the entropy number.
Theorem 9.
(i) If the following inequality holds:
$$\varepsilon_n(K)_X > c_1 (\log_2 n)^{p} n^{-q}, \qquad n \ge 1,$$
where $c_1, p, q$ are constants such that $p \in \mathbb{R}$ and $c_1, q > 0$, then for each $\gamma > 0$ and $0 < \alpha < 1$, we have
$$\delta_n^{\gamma,\alpha}(K)_X \ge C (\log_2 n)^{p-q} n^{-q}, \qquad n \ge 1, \qquad (7)$$
where C is a positive constant.
(ii) If the following inequality holds:
$$\varepsilon_n(K)_X > c_1 (\log_2 n)^{-q}, \qquad n \ge 1,$$
where $c_1, q$ are some positive constants, then for each $\gamma > 0$ and $0 < \alpha < 1$, we have
$$\delta_n^{\gamma,\alpha}(K)_X \ge C (\log_2 n)^{-q}, \qquad n \ge 1, \qquad (8)$$
where C is a positive constant.
(iii) If the following inequality holds:
$$\varepsilon_n(K)_X > c_1 2^{-c_2 n^{p}}, \qquad n \ge 1,$$
where $c_1, c_2, p$ are constants such that $c_1, c_2 > 0$ and $0 < p < 1$, then for each
$$\gamma > \frac{d_n(K)_X + \mathrm{rad}(K)}{2^{\alpha-1}}, \qquad \text{and} \qquad 0 < \alpha < 1,$$
we have
$$\delta_n^{\gamma,\alpha}(K)_X \ge C_1 2^{-C_2 n^{\frac{p}{1-p}}}, \qquad n \ge 1, \qquad (9)$$
where $C_1, C_2$ are two positive constants.
Combining Corollary 2 (ii) with Theorem 9 (ii), we derive the following corollary.
Corollary 3.
For any compact set $K \subset X$, $0 < \alpha < 1$, $\gamma \ge \max\!\Big\{ 2\,\mathrm{rad}(K),\ \dfrac{d_{n_0}(K)_X + \mathrm{rad}(K)}{2^{\alpha-1}} \Big\}$, and $q > 0$,
$$\varepsilon_n(K)_X \asymp (\log_2 n)^{-q} \quad \Longrightarrow \quad \delta_n^{\gamma,\alpha}(K)_X \asymp (\log_2 n)^{-q}, \qquad \text{for any } n \ge 1.$$

5. Some Applications

The importance of the Hölder width lies in its lower bound, which is independent of any specific algorithm. This bound not only reveals the limitations of certain approximation tools but also provides information on the order of the Hölder width without knowing the concrete algorithms. The insights from the lower bound can help us show the optimality of some existing algorithms or prompt us to design optimal algorithms that can achieve such a bound. In essence, the concept of width is independent of any specific algorithm but inspires us to design optimal algorithms.
We apply the above general theoretical results to some important function spaces and obtain the corresponding orders of Hölder widths.
First, we remark that some common neural networks are Hölder mappings. Therefore, we can obtain the asymptotic orders of the Hölder widths characterized by these fully connected feed-forward neural networks. In the following discussion, we mainly consider Banach spaces X of functions into which $C([0,1]^d)$ is continuously embedded. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an activation function, denote $\Omega := [0,1]^d$, and let $\Phi_\sigma : (B_{\ell_\infty^{\tilde n}}, \|\cdot\|_{\ell_\infty^{\tilde n}}) \to C(\Omega)$ denote the mapping defined below.
A feed-forward neural network with width W, depth n, and activation σ produces a family $\Sigma_{n,\sigma} \subset C(\Omega)$:
$$\Sigma_{n,\sigma} := \big\{ \Phi_\sigma(t) : t \in \mathbb{R}^{\tilde n} \big\}, \qquad \tilde n = \tilde n(W, n) = C_0 n,$$
which generates an approximation to a target element $f \in K$. For every $t \in \mathbb{R}^{\tilde n}$, the corresponding continuous function $\Phi_\sigma(t) \in \Sigma_{n,\sigma}$ on Ω is
$$\Phi_\sigma(t) := T^{(n)} \circ \bar\sigma \circ T^{(n-1)} \circ \bar\sigma \circ \cdots \circ \bar\sigma \circ T^{(0)}, \qquad (10)$$
where $T^{(0)} : \mathbb{R}^d \to \mathbb{R}^W$, $T^{(k)} : \mathbb{R}^W \to \mathbb{R}^W$, $k = 1, 2, \ldots, n-1$, and $T^{(n)} : \mathbb{R}^W \to \mathbb{R}$ are affine mappings, and the function $\bar\sigma : \mathbb{R}^W \to \mathbb{R}^W$ is given by
$$\bar\sigma(z_{j+1}, \ldots, z_{j+W}) := \big( \sigma(z_{j+1}), \ldots, \sigma(z_{j+W}) \big).$$
Here, t is the vector whose coordinates are the entries of the matrices and biases of $T^{(k)}$, $k = 0, 1, \ldots, n$. We note that the dimension of any hidden layer can naturally be expanded; thus, any fully connected network can be made to have a fixed width [13,33]. Our assumption of a fixed width W simplifies the computations and notation.
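A minimal numpy sketch of the fixed-width architecture just described may help fix ideas; the parameter shapes, the random initialization, and the two example activations below are illustrative assumptions, not the construction used in the proofs.

```python
# Fixed-width feed-forward network Phi_sigma(t) = T^(n) o sigma_bar o ... o T^(0):
# an illustrative numpy implementation with width W and depth n.
import numpy as np

def phi_sigma(x, params, sigma):
    """Evaluate the network at input x in [0,1]^d for one parameter set."""
    (W0, b0), hidden, (Wn, bn) = params
    z = sigma(W0 @ x + b0)                 # T^(0): R^d -> R^W, then sigma_bar
    for Wk, bk in hidden:                  # T^(k): R^W -> R^W, k = 1, ..., n-1
        z = sigma(Wk @ z + bk)
    return Wn @ z + bn                     # T^(n): R^W -> R (no activation)

d, W, n = 3, 8, 4
rng = np.random.default_rng(0)
make = lambda rows, cols: (rng.standard_normal((rows, cols)), rng.standard_normal(rows))
params = (make(W, d), [make(W, W) for _ in range(n - 1)], make(1, W))

relu = lambda t: np.maximum(0.0, t)                   # 1-Lipschitz activation
hoelder = lambda t: np.sign(t) * np.abs(t) ** 0.5     # 1/2-Hoelder activation
x = rng.uniform(0.0, 1.0, size=d)
print(phi_sigma(x, params, relu), phi_sigma(x, params, hoelder))
```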
Proposition 1.
If σ is an $H_\alpha(\gamma)$ mapping, then $\Phi_\sigma : (B_{\ell_\infty^{\tilde n}}, \|\cdot\|_{\ell_\infty^{\tilde n}}) \to C(\Omega)$, defined in (10), is an $H_\alpha(\Gamma_n)$ mapping, which means that for $t, t' \in B_{\ell_\infty^{\tilde n}}$,
$$\| \Phi_\sigma(t) - \Phi_\sigma(t') \|_{C(\Omega)} \le \Gamma_n \| t - t' \|_{\ell_\infty^{\tilde n}}^{\alpha},$$
where $\Gamma_n = M_{d,\alpha}\, \gamma_0^n (2W)^{n\alpha}$, $\gamma_0 = \max\{\gamma, 1\}$, and $M_{d,\alpha}$ is a constant.
It follows that such a neural network is a Hölder mapping with coefficient $\Gamma_n = M_{d,\alpha} \big( (2W)^{\alpha} \gamma_0 \big)^n$ when approximating the target element f, where $0 < \alpha < 1$ and $W > 1$. We therefore consider the lower bound for the Hölder width with coefficient $\gamma_n = M\lambda^n$, $M > 0$, $\lambda > 1$, which also implies a lower bound for DNN approximation.
Theorem 10.
Let $0 < \alpha < 1$ and $\gamma_n = M\lambda^n$, where $M > 0$, $\lambda > 1$, $n \ge 1$.
(i) If the following inequality holds:
$$\varepsilon_n(K)_X > c_1 (\log_2 n)^{p} n^{-q}, \qquad n \ge 1,$$
where $c_1, p, q$ are constants such that $p \in \mathbb{R}$ and $c_1, q > 0$, then
$$\delta_n^{\gamma_n,\alpha}(K)_X \ge C (\log_2 n)^{p} n^{-2q}, \qquad n \ge 1, \qquad (11)$$
where C is a positive constant.
(ii) If the following inequality holds:
$$\varepsilon_n(K)_X > c_1 (\log_2 n)^{-q}, \qquad n \ge 1,$$
where $c_1, q$ are some positive constants, then
$$\delta_n^{\gamma_n,\alpha}(K)_X \ge C (\log_2 n)^{-q}, \qquad n \ge 1, \qquad (12)$$
where C is a positive constant.
Theorems 7 and 10 imply the following corollary, which means that if we know the asymptotic orders of some entropy numbers, we can obtain the asymptotic orders of their Hölder widths.
Corollary 4.
Let $0 < \alpha < 1$ and $\gamma_n = M\lambda^n$, where $M > 0$, $\lambda > 1$, $n \ge 1$.
(i) If the following holds:
$$\varepsilon_n(K)_X \asymp (\log_2 n)^{p} n^{-q}, \qquad n \ge 1,$$
where $p, q$ are constants such that $p \in \mathbb{R}$ and $q > 0$, then we have
$$\delta_n^{\gamma_n,\alpha}(K)_X \asymp (\log_2 n)^{p} n^{-2q}, \qquad n \ge 1.$$
(ii) If the following holds:
$$\varepsilon_n(K)_X \asymp (\log_2 n)^{-q}, \qquad n \ge 1,$$
where q is a positive constant, then we have
$$\delta_n^{\gamma_n,\alpha}(K)_X \asymp (\log_2 n)^{-q}, \qquad n \ge 1.$$
We point out that Corollary 4 provides a tool for giving lower bounds on how well a compact set K can be approximated by a DNN whose weights and biases come from the unit ball of some norm $\|\cdot\|_{Y_{\tilde n}}$. The classical model classes K for multivariate functions can be the unit balls of smoothness spaces, such as classical Lipschitz, Hölder, Sobolev, and Besov spaces. For any model class K, denote the unit ball of K by
$$B_K := \{ f \in K : \| f \|_K \le 1 \}.$$
For any $K, A \subset X$, let
$$\mathrm{dist}(K, A)_X := \sup_{f \in K} \inf_{g \in A} \| f - g \|_X.$$
Many experts have investigated the performance of deep learning approximation for these function classes of the Lebesgue space L p on the cube Ω [20,21,33,34].
First, we determine the exact order of the Hölder width for the classical Sobolev class. For $1 \le p \le \infty$, we denote by $L_p(\Omega)$ the usual Lebesgue space on Ω equipped with the $L_p$ norm
$$\| f \|_p := \| f \|_{L_p(\Omega)} = \begin{cases} \left( \int_\Omega |f(x)|^p \, d\mu(x) \right)^{1/p}, & 1 \le p < \infty, \\ \operatorname{ess\,sup}_{x \in \Omega} |f(x)|, & p = \infty. \end{cases}$$
For $s \in \mathbb{N}$ and $1 \le p \le \infty$, we say that a function f belongs to the Sobolev space $W_p^s(\Omega)$ if $f \in L_p(\Omega)$ and, with $|k| := \sum_{j=1}^d |k_j|$, the norm
$$\| f \|_{W_p^s} := \| f \|_p + \max_{|k| = s} \| D^k f \|_p < \infty.$$
It is known from [35] that when $s > d(1/p - 1/q)_+$ and $1 < p, q < \infty$,
$$\varepsilon_n\big( B W_p^s(\Omega) \big)_{L_q} \asymp n^{-s/d}. \qquad (13)$$
Moreover, the approximation error rate $O(n^{-s/d})$ for Sobolev and Besov classes can be attained by many classical methods of nonlinear approximation, such as adaptive finite elements or n-term wavelet approximation [13,15].
It follows from Theorem 10 and Corollary 4 that for some deep neural networks,
$$\mathrm{dist}\big( B W_p^s(\Omega), \Sigma_{n,\sigma} \big)_{L_q} \ge \delta_{\tilde n}^{\gamma_n,\alpha}\big( B W_p^s(\Omega) \big)_{L_q} \gtrsim n^{-2s/d}, \qquad (14)$$
where $\Sigma_{n,\sigma}$ is produced by neural networks with depth n, fixed width, and $H_\alpha(\gamma)$ activation functions σ, $0 < \alpha \le 1$. Compared with classical methods, the factor 2 in the exponent leaves open the possibility of improved approximation rates when using deep neural networks. Indeed, it is known from [33] that there exists a neural network class $\Sigma_{n,\sigma}$ with depth n, width $25d+31$, and ReLU activation function σ such that, for $f \in B W_p^s(\Omega)$,
$$\inf_{f_n \in \Sigma_{n,\sigma}} \| f - f_n \|_q \lesssim \| f \|_{W_p^s} \cdot n^{-2s/d}. \qquad (15)$$
The author of [33] used a novel bit-extraction technique, which gives an optimal encoding of sparse vectors, to obtain the upper bound $O(n^{-2s/d})$.
Based on the above discussion, we obtain the Hölder widths for the Sobolev classes B W p s ( Ω ) .
Theorem 11.
Let $1 < p, q < \infty$, $0 < \alpha \le 1$, and $s > d\left( \frac{1}{p} - \frac{1}{q} \right)_+$. Then, there exist $M > 0$ and $\lambda > 1$ such that for $\gamma_n = M\lambda^n$,
$$\delta_n^{\gamma_n,\alpha}\big( B W_p^s(\Omega) \big)_{L_q} \asymp \mathrm{dist}\big( B W_p^s(\Omega), \Sigma_{n,\sigma} \big)_{L_q} \asymp n^{-2s/d}.$$
Proof. 
It is known from (13) and Corollary 4 that for any $M > 0$ and $\lambda > 1$,
$$\delta_n^{\gamma_n,\alpha}\big( B W_p^s(\Omega) \big)_{L_q} \gtrsim n^{-2s/d}.$$
Moreover, it is known from (14) that
$$\mathrm{dist}\big( B W_p^s(\Omega), \Sigma_{n,\sigma} \big)_{L_q} \gtrsim n^{-2s/d}.$$
It follows from (15) that
$$\mathrm{dist}\big( B W_p^s(\Omega), \Sigma_{n,\sigma} \big)_{L_q} = \sup_{f \in B W_p^s(\Omega)} \inf_{f_n \in \Sigma_{n,\sigma}} \| f - f_n \|_q \lesssim n^{-2s/d},$$
since for any $f \in B W_p^s(\Omega)$,
$$\inf_{f_n \in \Sigma_{n,\sigma}} \| f - f_n \|_q \lesssim n^{-2s/d}.$$
Therefore, there exist constants $c, c_0 > 0$ such that
$$\delta_{c_0 n}^{c(50d+62)^n,\, 1}\big( B W_p^s(\Omega) \big)_{L_q} \lesssim n^{-2s/d}.$$
Replacing $c_0 n$ with n, it follows from the decreasing property of the Hölder width that there exist $M = 2c > 0$ and $\lambda = (50d+62)^{1/c_0} > 1$ such that
$$\delta_n^{\gamma_n,\alpha}\big( B W_p^s(\Omega) \big)_{L_q} \le \delta_n^{c\lambda^n,\, 1}\big( B W_p^s(\Omega) \big)_{L_q} \lesssim n^{-2s/d}.$$
Thus, we complete the proof of Theorem 11. □
Remark 6.
Theorem 11 implies that the upper bound in inequality (15) is sharp.
For a fixed $s \in \mathbb{N}$ and $d = 1$, Figure 1 compares the approximation error versus the number of elements or layers for different approximation methods: the classical methods represented by n-term wavelets and adaptive finite elements, and the new tools represented by deep neural networks. The blue solid line shows the approximation error decreasing at a rate of $O(n^{-s})$ with the number of elements n for n-term wavelets or adaptive finite elements, while the orange dashed line indicates a faster decay of $O(n^{-2s})$ with depth n for deep neural networks. Overall, deep neural networks significantly outperform classical methods, such as n-term wavelets or adaptive finite elements, offering more rapid convergence and potentially higher accuracy in approximating functions. We call this phenomenon the super-convergence of deep neural networks; the classical Hölder, Sobolev, and Besov classes on $[0,1]^d$ can achieve super-convergence [20,21].
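The comparison in Figure 1 can be reproduced with a few lines of Python (an illustrative plot under the assumption $s = 2$, $d = 1$; the constants are normalized to 1).

```python
# Illustrative re-creation of the rate comparison in Figure 1.
import numpy as np
import matplotlib.pyplot as plt

s = 2
n = np.arange(1, 200)
plt.loglog(n, n ** (-float(s)), "b-", label=r"$n^{-s}$: $n$-term wavelets / adaptive FEM")
plt.loglog(n, n ** (-2.0 * s), "--", color="orange", label=r"$n^{-2s}$: deep neural networks")
plt.xlabel("number of elements / network depth $n$")
plt.ylabel("approximation error")
plt.legend()
plt.show()
```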
The results of Theorem 11 can be extended to Besov spaces, which are much more general than Sobolev spaces. It is well known that functions from Besov spaces have been widely used in approximation theory, statistics, image processing, and machine learning (see [6,36,37,38], and the references therein). Recall that for r N , the modulus of smoothness of order r of f L p ( Ω ) is
$$\omega_r(f, t)_p := \sup_{|h| \le t} \big\| \Delta_h^r(f, \cdot, \Omega) \big\|_p,$$
where $\Delta_h^r(f, x)$ is the r-th order difference of f with step h, and
$$\Delta_h^r(f, x, \Omega) := \begin{cases} \Delta_h^r(f, x), & x, x+h, \ldots, x+rh \in \Omega, \\ 0, & \text{otherwise}. \end{cases}$$
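For readers who want to experiment, the following one-dimensional Python sketch (with the illustrative assumptions Ω = [0, 1], p = 2, r = 2, and a uniform grid) evaluates the modulus of smoothness defined above.

```python
# 1-D numerical sketch of omega_r(f, t)_p, using the r-th order forward
# difference and the convention that it vanishes when x + r*h leaves Omega.
import numpy as np
from math import comb

def modulus_of_smoothness(f, t, r=2, p=2, m=2001, h_steps=50):
    grid = np.linspace(0.0, 1.0, m)
    best = 0.0
    for h in np.linspace(0.0, t, h_steps + 1)[1:]:          # sup over 0 < h <= t
        diff = np.zeros(m)
        inside = grid + r * h <= 1.0                         # x, ..., x + r*h in Omega
        diff[inside] = sum((-1) ** (r - j) * comb(r, j) * f(grid[inside] + j * h)
                           for j in range(r + 1))
        best = max(best, np.mean(np.abs(diff) ** p) ** (1.0 / p))
    return best

# Example: f(x) = |x - 1/2|, a Lipschitz function with a single kink.
print(modulus_of_smoothness(lambda x: np.abs(x - 0.5), t=0.1))
```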
For $0 < s < r$ and $1 \le p, \tau \le \infty$, we say that a function f belongs to the Besov space $B_{p,\tau}^s(\Omega)$ if $f \in L_p(\Omega)$ and the norm
$$\| f \|_{B_{p,\tau}^s} := \begin{cases} \| f \|_p + \left( \displaystyle\int_0^1 \left( \frac{\omega_r(f, t)_p}{t^s} \right)^{\tau} \frac{dt}{t} \right)^{1/\tau}, & 1 \le \tau < \infty, \\[6pt] \| f \|_p + \displaystyle\sup_{t > 0} \frac{\omega_r(f, t)_p}{t^s}, & \tau = \infty, \end{cases}$$
is finite. It is known from [15] that when $s > d(1/p - 1/q)_+$ and $q \le \infty$,
$$\varepsilon_n\big( B B_{p,\tau}^s(\Omega) \big)_{L_q} \asymp n^{-s/d}.$$
It follows from Theorem 10 and Corollary 4 that for some deep neural networks,
$$\mathrm{dist}\big( B B_{p,\tau}^s(\Omega), \Sigma_{n,\sigma} \big)_{L_q} \ge \delta_{\tilde n}^{\gamma_n,\alpha}\big( B B_{p,\tau}^s(\Omega) \big)_{L_q} \gtrsim n^{-2s/d},$$
and it is also known from [33] that there exists a neural network class $\Sigma_{n,\sigma}$ with depth n, width $25d+31$, and ReLU activation function σ such that, for $f \in B B_{p,\tau}^s(\Omega)$,
$$\inf_{f_n \in \Sigma_{n,\sigma}} \| f - f_n \|_q \lesssim \| f \|_{B_{p,\tau}^s} \cdot n^{-2s/d}.$$
Thus, we obtain the Hölder widths for the Besov classes B B p , τ s ( Ω ) .
Theorem 12.
Let $1 < p, q < \infty$, $1 \le \tau \le \infty$, $0 < \alpha \le 1$, and $s > d\left( \frac{1}{p} - \frac{1}{q} \right)_+$. Then, there exist $M > 0$ and $\lambda > 1$ such that for $\gamma_n = M\lambda^n$,
$$\delta_n^{\gamma_n,\alpha}\big( B B_{p,\tau}^s(\Omega) \big)_{L_q} \asymp \mathrm{dist}\big( B B_{p,\tau}^s(\Omega), \Sigma_{n,\sigma} \big)_{L_q} \asymp n^{-2s/d}.$$
The proof of Theorem 12 is similar to that of Theorem 11, so we omit the details here.
Note that any numerical algorithm based on Hölder mappings will have a convergence rate that is not faster than that of the Hölder width. This characterizes the limitation of the approximation power of some deep neural networks.
Meanwhile, it is well known that spherical approximation has been widely applied in many fields, such as cosmic microwave background analysis, global ionospheric prediction for geomagnetic storms, climate change modeling, environmental governance, and other spherical signals [39].
We recall some concepts on the sphere from [40]. Let
$$\mathbb{S}^{d-1} := \Big\{ x = (x_1, \ldots, x_d) \in \mathbb{R}^d : \sum_{j=1}^d x_j^2 = 1 \Big\}$$
be the unit sphere in $\mathbb{R}^d$ equipped with the rotation-invariant measure $d\mu(x)$ normalized by $\int_{\mathbb{S}^{d-1}} d\mu(x) = 1$. For $1 \le p < \infty$, denote by $L_p(\mathbb{S}^{d-1})$ the usual Lebesgue space on $\mathbb{S}^{d-1}$ endowed with the $L_p$ norm
$$\| f \|_{L_p(\mathbb{S}^{d-1})} := \left( \int_{\mathbb{S}^{d-1}} |f(x)|^p \, d\mu(x) \right)^{1/p}, \qquad 1 \le p < \infty.$$
In [41], Feng et al. studied the approximation of the Sobolev class on the sphere $\mathbb{S}^{d-1}$, denoted by $W_p^s(\mathbb{S}^{d-1})$, using convolutional neural networks with J layers. Recall that a function f belongs to the Sobolev class $W_p^s(\mathbb{S}^{d-1})$ if $f \in L_p(\mathbb{S}^{d-1})$ and the norm
$$\| f \|_{W_p^s(\mathbb{S}^{d-1})} := \| f \|_{L_p(\mathbb{S}^{d-1})} + \big\| (-\Delta_0)^{s/2} f \big\|_{L_p(\mathbb{S}^{d-1})} < \infty,$$
where $\Delta_0$ is the Laplace–Beltrami operator on the sphere. The authors obtained the upper bound $O\big( J^{-\frac{s}{d-1}} \big)$ for the error of such an approximation in $L_p(\mathbb{S}^{d-1})$.
However, it is known from [42] that when $s > (d-1)(1/p - 1/q)_+$ and $1 < p, q < \infty$,
$$\varepsilon_n\big( B W_p^s(\mathbb{S}^{d-1}) \big)_{L_q} \asymp n^{-\frac{s}{d-1}}.$$
Then, it follows from Corollary 4 that the Hölder width of $B W_p^s(\mathbb{S}^{d-1})$ admits a lower bound of order $n^{-\frac{2s}{d-1}}$.
Theorem 13.
Let $1 < p, q < \infty$, $0 < \alpha \le 1$, and $s > (d-1)\left( \frac{1}{p} - \frac{1}{q} \right)_+$. Then, there exist $M > 0$ and $\lambda > 1$ such that for $\gamma_n = M\lambda^n$,
$$\mathrm{dist}\big( B W_p^s(\mathbb{S}^{d-1}), \Sigma_{n,\sigma} \big)_{L_q} \ge \delta_n^{\gamma_n,\alpha}\big( B W_p^s(\mathbb{S}^{d-1}) \big)_{L_q} \gtrsim n^{-\frac{2s}{d-1}}.$$
With the development of spherical approximation theory [40,43,44,45] and the relationship between Hölder widths and entropy numbers, we conjecture that the approximation order of $W_p^s(\mathbb{S}^{d-1})$ using some fully connected feed-forward neural networks with the bit-extraction technique may be $O\big( n^{-\frac{2s}{d-1}} \big)$. This conjecture can be formulated as follows.
Conjecture 1.
Let $1 < p, q < \infty$, $0 < \alpha \le 1$, and $s > (d-1)\left( \frac{1}{p} - \frac{1}{q} \right)_+$. Then, there exist $M > 0$ and $\lambda > 1$ such that for $\gamma_n = M\lambda^n$,
$$\delta_n^{\gamma_n,\alpha}\big( B W_p^s(\mathbb{S}^{d-1}) \big)_{L_q} \asymp \mathrm{dist}\big( B W_p^s(\mathbb{S}^{d-1}), \Sigma_{n,\sigma} \big)_{L_q} \asymp n^{-\frac{2s}{d-1}}.$$
Our results show that networks modeled by both Hölder and Lipschitz (e.g., ReLU) activation functions can achieve the approximation error $O(n^{-2s/d})$, which is superior to classical approximation tools such as n-term wavelets and adaptive finite elements. We achieve the same approximation error for networks under the weaker condition $0 < \alpha < 1$, which gives us more options for neural network approximation tools. For example, if we need to numerically solve differential equations with a discontinuous right-hand side modeling neural network dynamics, we can choose networks with non-Lipschitz activation functions, using the fact that non-Lipschitz functions share the peculiar property that even small variations in the neuron state are able to produce significant changes in the neuron output [26,27,28]. The results on Hölder widths would help us select suitable Hölder activation functions. Moreover, it is known from Appendix B.2 that the Hölder width is smaller than the Lipschitz width in the sense that
$$\delta_n^{2\gamma,\alpha}(K)_X \le \delta_n^{\gamma,1}(K)_X.$$
Thus, from the perspective of the approximation error, the Hölder activation function performs better than the Lipschitz activation function. However, it is currently unknown what magnitude this improvement can reach. It would be interesting to study this problem.
Next, we estimate the Hölder widths for discrete Lebesgue spaces. For $1 \le q < \infty$, denote by $\ell_q$ the set of all sequences $\{x_k\}_{k=1}^\infty$ with
$$\| x \|_{\ell_q} := \Big( \sum_{k=1}^\infty |x_k|^q \Big)^{1/q} < \infty.$$
Let $\tau = \{t_k\}_{k=1}^\infty$ be the sequence with $t_k = (\log_2(k+1))^{-1/2}$ for $k \ge 1$. Denote by
$$A_\tau := \Big\{ y \in \ell_2 : y_k = t_k x_k, \ \text{where } \sum_{k=1}^\infty |x_k| \le 1 \Big\},$$
where $x = (x_1, x_2, \ldots) \in \ell_1$ and $y = (y_1, y_2, \ldots) \in \ell_2$.
Theorem 14.
For the space $X = \ell_2$ and the subset $A_\tau$, the Hölder width satisfies, for $\gamma \ge 2\,\mathrm{rad}(A_\tau)$ and $0 < \alpha < 1$,
$$(\log n)^{-1/2}\, n^{-1/2} \lesssim \delta_n^{\gamma,\alpha}(A_\tau)_X \lesssim n^{-1/2}.$$
Proof. 
It is known from [46] that the entropy number of $A_\tau$ satisfies
$$\varepsilon_n(A_\tau)_{\ell_2} \asymp n^{-1/2}.$$
It follows from Corollary 2 that there exists $C_1 > 0$ such that, for $\gamma \ge 2\,\mathrm{rad}(A_\tau)$,
$$\delta_n^{\gamma,\alpha}(A_\tau)_{\ell_2} \le C_1 n^{-1/2}.$$
It follows from Theorem 9 (i) that there exists $C_2 > 0$ such that
$$\delta_n^{\gamma,\alpha}(A_\tau)_{\ell_2} \ge C_2 (\log n)^{-1/2} n^{-1/2}.$$
The proof of Theorem 14 is completed. □
Remark 7.
It is known from [18] that the n-Kolmogorov width of $A_\tau$ satisfies
$$d_n(A_\tau)_{\ell_2} \asymp (\log_2 n)^{-1/2}.$$
Theorem 14 illustrates that the Hölder width of $A_\tau$ is much smaller than its n-Kolmogorov width.
Finally, we obtain the asymptotic order of the Hölder width for $c_0$, the Banach space of all sequences converging to 0, equipped with the supremum norm. Let $\eta = \{\eta_k\}_{k=1}^\infty$ be the sequence with $\eta_k = (\log_2(k+1))^{-1}$, $k \ge 1$. Denote a compact subset of X by
$$K_\eta := \{ \eta_k e_k \}_{k=1}^\infty \cup \{ 0 \},$$
where $\{e_k\}_{k=1}^\infty$ is the standard basis in X. It is known from [18] that its entropy number satisfies
$$\varepsilon_n(K_\eta)_X \asymp \frac{1}{n}.$$
Theorem 15.
For the space $X = c_0$ and the subset $K_\eta$, the Hölder width satisfies, for $\gamma > 2$ and $0 < \alpha < 1$,
$$\delta_n^{\gamma,\alpha}(K_\eta)_X \asymp \frac{1}{n \log_2(n+1)}.$$
Theorem 15 shows the sharpness of Theorem 9 (i).

6. Concluding Remarks

We introduce the Hölder width, which measures the best error performance of some recent nonlinear approximation methods. We investigate the relationship between Hölder widths and other known widths, demonstrating that some Hölder widths are essentially smaller than n-Kolmogorov widths and linear widths. Moreover, we show that as the Hölder constants grow with n, the Hölder widths are much smaller than the entropy numbers. The significance of Hölder widths being smaller than the known widths is that some nonlinear approximations, such as deep neural network approximations, may yield a better approximation order than other known classical approximation methods. In fact, we show that the asymptotic orders of the Hölder widths for the Sobolev classes $B W_p^s(\Omega)$ and the Besov classes $B B_{p,\tau}^s(\Omega)$ induced by deep neural networks are $O(n^{-2s/d})$ for $s > d(1/p - 1/q)_+$, while other known widths might be of order $O(n^{-s/d})$. This result shows that deep neural networks can significantly outperform classical methods of approximation, such as adaptive finite elements and n-term wavelet approximation. Indeed, the Hölder width in neural networks serves two purposes. On the one hand, it demonstrates the superior approximation power of deep neural networks. On the other hand, it reveals the limitation in the approximating ability of some deep neural networks. These features are crucial for a deeper understanding and further exploration of the approximation power of deep neural networks. It would be interesting to calculate the Hölder widths of other important function classes.

Author Contributions

M.L. and P.Y. contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the National Natural Science Foundation of China (Grant No. 11671213).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proofs of Section 2

Appendix A.1. Proof of Theorem 1

To prove Theorem 1, we need the following lemmas.
Lemma A1
(Auerbach lemma [47,48]). Let $X_n$ be an n-dimensional Banach space and $X_n^*$ be its dual space. Then, there exist elements $x_1, \ldots, x_n \in X_n$ and functionals $f_1, \ldots, f_n \in X_n^*$ such that for $1 \le i, j \le n$,
$$\| x_j \|_{X_n} = \| f_j \|_{X_n^*} = 1, \qquad \text{and} \qquad f_j(x_i) = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases}$$
Lemma A2.
Let K be a bounded subset of X. For any $n \in \mathbb{N}$, $\gamma > 0$, and $0 < \alpha < 1$, we can limit the infimum in (2) to normed spaces $(\mathbb{R}^n, \|\cdot\|_{Z_n})$ with the norm $\|\cdot\|_{Z_n}$ satisfying, for $y \in \mathbb{R}^n$,
$$\| y \|_{\ell_\infty^n} \le \| y \|_{Z_n} \le \| y \|_{\ell_1^n}. \qquad (A1)$$
Proof. 
According to Lemma A1, we can find vectors { v j } j = 1 n R n and linear functionals { f j } j = 1 n on the space ( R n , · Y n ) satisfying
v j Y n = f j Y n * = 1 , and f i ( v j ) = 1 , i = j , 0 , i j ,
where Y n * is the dual space of Y n and i , j = 1 . , n .
For y = ( y 1 , y 2 , , y n ) R n , we can consider a new norm · Z n on R n and a mapping ϕ 0 : ( B ( Z n ) , · Z n ) ( B ( Y n ) , · Y n ) as
y Z n : = j = 1 n y j v j Y n , and ϕ 0 ( y ) : = j = 1 n y j v j .
It is clear that ϕ 0 ( B ( Z n ) ) = B ( Y n ) .
In this case, we can construct an H α ( γ ) mapping Φ ˜ n : B ( Z n ) , · Z n X . For any H α ( γ ) mapping Φ n : ( B ( Y n ) , · Y n ) X , we can define Φ ˜ n as Φ ˜ n : = Φ n ϕ 0 . Then,
Φ ˜ n ( y ) Φ ˜ n ( y ) X = Φ n ϕ 0 ( y ) Φ n ϕ 0 ( y ) X γ ϕ 0 ( y ) ϕ 0 ( y ) Y n α .
By (A2) and (A3),
Φ ˜ n ( y ) Φ ˜ n ( y ) X γ j = 1 n y j y j v j Y n α = γ y j y j Z n α .
Therefore, Φ ˜ n satisfies the H α ( γ ) condition, and Φ ˜ n ( B ( Z n ) ) = Φ n ( B ( Y n ) ) .
Moreover, we can verify that the construction (A3) satisfies the condition (A1). Indeed,
y Z n j = 1 n y j v j Y n = j = 1 n y j = y 1 n ,
and for f i Y n * ,
y Z n = j = 1 n y j v j Y n = sup f Y n * = 1 f j = 1 n y j v j f i j = 1 n y j v j = y i , i = 1 , , n .
Then, y Z n max i | y i | = y n . Thus, the proof of Lemma A2 is completed. □
We also need the following α -Hölder version of Ascoli’s theorem.
Lemma A3.
For a separable metric space ( Z , ρ ) and a metric space ( Y , d ) where every closed ball is compact, let Ψ j : Y Z be a sequence of α-Hölder mappings such that there exist y Y and z Z satisfying Ψ j ( y ) = z for j 1 . Then, there exists a subsequence { Ψ j k } k = 1 , which converges pointwise to an α-Hölder function Φ : Y Z . If ( Y , d ) is also compact, then the convergence is uniform.
Proof. 
For f Y , we have
ρ Ψ j ( f ) , z = ρ Ψ j ( f ) , Ψ j ( y ) γ d α f , y .
Fix a countable dense subset M = { f j } j = 1 Y , and define B j : = B z , γ d α ( f j , y ) as the closed ball in Z with radius γ d α ( f j , y ) centered at z for j 1 . It follows that the Cartesian product
B : = B 1 × B 2 × ,
is a compact metric space under the product topology.
Let
F n : = { Ψ n ( f 1 ) , Ψ n ( f 2 ) , , Ψ n ( f j ) , } B .
Then, there exists a subsequence { F n s } s = 1 and an element F B such that
F ( j ) = lim s F n s ( j ) = lim s Ψ n s ( f j ) , j = 1 , 2 , .
Thus, we obtain a function Φ : M Z satisfying
Φ ( f j ) = F ( j ) = lim s Ψ n s ( f j ) ,
where ρ ( Φ ( f j ) , Φ ( f i ) ) γ d α ( f j , f i ) for i, j 1 . Since M is dense, Φ can extend to an α -Hölder function on Y, that is, Φ : Y Z . Furthermore, for every f Y , Φ ( f ) = lim s Ψ n s ( f ) , which implies that Ψ n s is pointwise convergent to Φ .
If ( Y , d ) is compact, then for ε > 0 , we can cover Y by a finite number of ε 1 α -balls with centers g 1 , g k Y . Thus, for sufficiently large s, we have
sup f Y ρ ( Ψ n s ( f ) , Φ ( f ) ) sup f Y ρ ( Ψ n s ( f ) , Ψ n s ( g i ) ) + ρ ( Ψ n s ( g i ) , Φ ( g i ) ) + sup f Y ρ ( Φ ( g i ) , Φ ( f ) ) 2 γ ε + max i = 1 , , n ρ ( Ψ n s ( g i ) , Φ ( g i ) ) ( 2 γ + 1 ) ε ,
which implies that the convergence is uniform. The proof of Lemma A3 is completed. □
Now, we are ready to prove all theorems in this section.
Proof of Theorem 1.
By (1) and (2), it is obvious that for any n -dimensional space Y with a norm · Y on R n ,
δ n γ , α ( K ) X δ n γ , α ( K , Y ) X .
Thus, we only need to prove that
δ n γ , α ( K ) X δ n γ , α ( K , Y ) X .
By Lemma A2, we can find a sequence { Ψ j } j = 1 , where Ψ j : ( B ( Z j ) , · Z j ) X satisfies the H α ( γ ) condition and the norm · Z j on R n satisfies (5), such that
d j : = sup f K inf y B ( Z j ) f Ψ j ( y ) X δ n γ , α ( K ) X , as j .
According to Lemma A3, there exists a subsequence · Z j k of the sequence of norms · Z j that converges pointwise on R n and uniformly on B n to a norm · Y on R n satisfying (5).
Thus, there exists a number j 0 N such that for every j j 0 , there is a corresponding value ε j with 0 < ε j < 1 , lim j ε j = 0 and
y Z j ε j y Y y Z j + ε j , for all y n 1 .
If y ( 1 ε j ) B ( Z j ) , then we have y Z j 1 ε j and y Y 1 ε j + ε j 1 . And if y B ( Y ) , then we have y Z j 1 + ε j . Thus,
( 1 ε j ) B ( Z j ) B ( Y ) ( 1 + ε j ) B ( Z j ) , j j 0 .
Next, we define the mapping Ψ ˜ j : ( 1 + ε j ) B ( Z j ) X as
Ψ ˜ j : = Ψ j ( 1 + ε j ) 1 y .
For any y , y ( 1 + ε j ) B ( Z j ) , we have
Ψ ˜ j ( y ) Ψ ˜ j ( y ) X = Ψ j ( 1 + ε j ) 1 y Ψ j ( 1 + ε j ) 1 y X γ 1 + ε j α y y Z j α < γ y y Z j α ,
so Ψ ˜ j also satisfies the H α ( γ ) condition. We could write Φ j as the restriction of Ψ ˜ j on B ( Y ) .
Let f K and j j 0 . Then, for each θ > 0 , we can find an element y = y ( f , j , θ ) B ( Z j ) such that f Ψ j ( y ) X < d j + θ . Taking
z : = ( 1 ε j ) y ( 1 ε j ) B ( Z j ) B ( Y ) ,
we have
inf x B Y f Φ j ( x ) X f Φ j ( z ) X = f Ψ ˜ j ( z ) X = f Ψ j ( ( 1 + ε j ) 1 z ) X = f Ψ j 1 ε j 1 + ε j · y X f Ψ j ( y ) X + Ψ j ( y ) Ψ j 1 ε j 1 + ε j · y X < d j + θ + γ 1 1 ε j 1 + ε j α y Z j α d j + θ + γ 2 ε j 1 + ε j α .
As θ 0 and by taking the supremum over f K , we have
δ n γ , α ( K , Y ) X sup f K inf x B ( Y ) f Φ j ( x ) X d j + γ 2 ε j 1 + ε j α .
As j , we have d j δ n γ , α ( K ) X and ε j 0 . Thus, δ n γ , α ( K , Y ) X δ n γ , α ( K ) X .
The proof of Theorem 1 is completed. □

Appendix A.2. Proofs of Theorems 2–4

Proof of Theorem 2.
We begin with the continuity of the Hölder widths δ n γ , α ( K ) X at γ = 0 . For any H α ( γ ) mapping Φ : ( B ( Y n ) , · Y n ) X , f K and y B ( Y n ) , we have
f Φ ( y ) X f Φ ( 0 ) X Φ ( 0 ) Φ ( y ) X f Φ ( 0 ) X γ y Y n α f Φ ( 0 ) X γ .
Thus, for any H α ( γ ) mapping Φ ,
sup f K inf y B ( Y n ) f Φ ( y ) X sup f K f Φ ( 0 ) X γ inf g X sup f K f g X γ ,
which implies that
δ n γ , α ( K , Y n ) X inf g X sup f K f g X γ .
It follows from (3), (4), and the decreasing property of δ n γ , α ( K ) X with respect to γ that
rad ( K ) γ δ n γ , α ( K ) X rad ( K ) ,
which implies that the continuity at γ = 0 .
Next, we prove that δ n γ , α ( K ) X is continuous for γ > 0 by contradiction. For convenience, denote by F ( γ ) : = δ n γ , α ( K ) X . We assume that F is not continuous at some γ 0 > 0 . Therefore, there exists ξ 0 > 0 and a sequence of real numbers δ k 0 , k 1 such that
| F ( γ 0 + δ k ) F ( γ 0 δ k ) | ξ 0 .
It is known from the definition of Hölder widths that for fixed δ k , there exists an H α ( γ 0 + δ k ) mapping Φ n : ( B ( Y n ) , · Y n ) X such that
F ( γ 0 + δ k ) sup f K inf y B ( Y n ) f Φ n ( y ) X F ( γ 0 + δ k ) + δ k .
Let λ k : = γ 0 δ k γ 0 + δ k , and Φ ˜ n : = λ k Φ n . Then, Φ ˜ n is an H α ( γ 0 δ k ) mapping, and
F ( γ 0 δ k ) sup f K inf y B ( Y n ) f Φ ˜ n ( y ) X sup f K inf y B ( Y n ) λ k f Φ n ( y ) X + ( 1 λ k ) f X λ k ( F ( γ 0 + δ k ) + δ k ) + ( 1 λ k ) sup f K f X .
It follows from the compactness of K and (A5) that
F ( γ 0 + δ k ) + ξ 0 F ( γ 0 δ k ) λ k ( F ( γ 0 + δ k ) + δ k ) + ( 1 λ k ) C ,
where C : = sup f K f X < . As δ k 0 , we have λ k 0 , and thus F ( γ 0 ) + ξ 0 F ( γ 0 ) , which contradicts the assumption that ξ 0 > 0 . Thus, δ n γ , α ( K ) X is continuous as a function of γ 0 . We complete the proof of Theorem 2. □
Proof of Theorem 3.
If $\lim_{\gamma \to \infty} \delta_n^{\gamma,\alpha}(K)_X = 0$, then for any $\delta > 0$, there exists a norm $\|\cdot\|_{Y_n}$ and an $H_\alpha(\gamma)$ mapping $\Phi_n$ such that
sup f K inf y B ( Y n ) f Φ n ( y ) X < δ 2 .
Thus, for a given f K , we can find y f B ( Y n ) such that
f Φ n ( y f ) X δ 2 .
For a compact set B ( Y n ) , there is a finite collection { y j } j = 1 N B ( Y n ) such that
B Y n j = 1 N B y j , δ 2 γ 1 / α .
Therefore, for y f B ( Y n ) , there exists j 0 { 1 , , N } such that y f B y j 0 , δ 2 γ 1 / α , and
Φ ( y f ) Φ ( y j 0 ) X γ y f y j 0 Y n α δ 2 .
Thus, for any f K , there exists j 0 { 1 , , N } such that
f Φ ( y j 0 ) X f Φ ( y f ) X + Φ ( y f ) Φ ( y j 0 ) X δ ,
which implies that K j = 1 N B ( Φ ( y j ) , δ ) . By the arbitrariness of δ , the set K is totally bounded.
If K is totally bounded, then for ε > 0 , we can find a minimal δ -covering { f j } j = 1 N ε ( K ) and a suitable γ > 0 such that
diam K · ( N ε ( K ) 1 ) 2 γ ,
where diam K : = sup f , g K f g X . We only consider the case n = 1 since δ n γ , α ( K ) X δ 1 γ , α ( K ) X for any n 1 . Set the points in ( [ 1 , 1 ] , | · | ) satisfying
t j : = 1 + 2 j 2 N ε ( K ) 1 1 α , j = 1 , , N ε ( K ) ,
and the continuous piecewise linear function Φ : [ 1 , 1 ] , | · | X satisfying
Φ ( t j ) : = f j , j = 1 , , N ε ( K ) .
Thus, it follows from (A6) that for 0 < α < 1 ,
max j = 1 , , N ε ( K ) 1 f j + 1 f j X | t j + 1 t j | α diam K · ( N ε ( K ) 1 ) 2 · j 1 / α ( j 1 ) 1 / α α γ ,
which implies that Φ satisfies the H α ( γ ) condition. Therefore, we have
$$\sup_{f \in K} \inf_{y \in [-1, 1]} \| f - \Phi(y) \|_X \le \delta,$$
and thus $\lim_{\gamma \to \infty} \delta_1^{\gamma,\alpha}(K)_X = 0$. The proof of Theorem 3 is completed. □
Proof of Theorem 4.
The proof is similar to that of Theorem 3. For any fixed η > 0 , there exist n 0 N and γ 0 > 0 such that we can find a norm · Y n 0 in R n 0 and an H α ( γ 0 ) mapping Φ satisfying
sup f K inf y B ( Y n 0 ) f Φ ( y ) X < η 2 .
Since B ( Y n 0 ) is compact, there is a finite collection { y j } j = 1 N B ( Y n 0 ) , which is a η 2 γ 0 1 / α -covering of B ( Y n 0 ) . Then, for any y B ( Y n 0 ) , we can find y j 0 , j 0 = { 1 , , N } such that
Φ ( y ) Φ ( y j 0 ) X γ 0 y y j 0 Y n 0 α η 2 .
Thus, for any f K , we can find y B ( Y n 0 ) and y j 0 , j = { 1 , , N } such that
f Φ ( y j 0 ) X f Φ ( y ) X + Φ ( y ) Φ ( y j 0 ) X η ,
which implies that K j = 1 N B ( Φ ( y j ) , η ) . The proof of Theorem 4 is completed. □

Appendix B. Proofs of Section 3

Appendix B.1. Proofs of Theorem 5 and Corollary 1

Proof of Theorem 5.
We begin with
$$\gamma > \frac{d_n(K)_X + \mathrm{rad}(K)}{2^{\alpha-1}}, \qquad \text{and} \qquad \eta := \gamma - \frac{d_n(K)_X + \mathrm{rad}(K)}{2^{\alpha-1}} > 0.$$
Choose a number η 1 such that 0 < η 1 < η . Let X n be an n -dimensional linear subspace of X that satisfies the inequality
sup f K inf g X n f g X d n ( K ) X + 2 α 1 η 1 .
Then, for every f K , there is an element h : = h ( f ) in X n such that
f h X d n ( K ) X + 2 α 1 η 1 .
Denote the set of all such elements as
M : = { h ( f ) : f K } X n .
Next, fix f 0 X such that
sup f K f f 0 rad ( K ) + 2 α 1 η 2 α 1 η 1 .
Then, for f K ,
h f 0 X h ( f ) f X + f f 0 X < d n ( K ) X + rad ( K ) + 2 α 1 η = 2 α 1 γ .
Thus, we have
rad ( M ) 2 α 1 γ , and M { g X n : g f 0 X 2 α 1 γ } = : B ( f 0 , 2 α 1 γ ) .
In addition, denote the unit ball in X n by B ( X n ) , which is defined by
B ( X n ) : = { g X n : g X = 1 } .
Define the mapping Φ : ( B ( X n ) , · X ) X such that
Φ ( g ) : = f 0 + 2 α 1 γ g .
Then, for g 1 , g 2 B ( X n ) ,
Φ ( g 1 ) Φ ( g 2 ) X = 2 α 1 γ g 1 g 2 X γ g 1 g 2 X α .
Thus, Φ satisfies the H α ( γ ) condition, and Φ ( B ( X n ) ) = B ( f 0 , 2 α 1 γ ) .
For M Φ ( B ( X n ) ) and (A7), we have
sup f K inf g X n f Φ ( g ) X sup f K inf h M f h X d n ( K ) X + 2 α 1 η 1 .
Thus,
δ n γ , α ( K ) X d n ( K ) X + 2 α 1 η 1 .
Let η 1 0 , then for any γ > d n ( K ) X + rad ( K ) 2 α 1 ,
δ n γ , α ( K ) X d n ( K ) X .
By Theorem 3 as γ d n ( K ) X + rad ( K ) 2 α 1 , we complete the proof of Theorem 5. □
Proof of Corollary 1.
For a compact set K X , it is known from [2] that the sequence { d n ( K ) X } n = 1 is decreasing and tends to zero. Denote
γ 0 = d n 0 ( K ) X + rad ( K ) 2 α 1 .
For γ γ 0 , it follows from Theorem 5 that
δ n γ , α ( K ) X δ n γ 0 , α ( K ) X .
Then, it is clear that Corollary 1 holds true. □

Appendix B.2. Proofs of Theorem 6

To prove Theorem 6, we use a result from [14].
Lemma A4.
For any $n \ge 1$, $N > 1$, and any compact set $K \subset X$ with $\sup_{f \in K} \| f \|_X = 1$, the following inequalities hold:
$$\delta_{n+1}^{2(N+1)^n,\, 1}(K)_X \le d_n(K, N)_X, \qquad \text{and} \qquad \delta_{n+\log_2 N}^{6^n,\, 1}(K)_X \le d_n(K, N)_X.$$
Proof of Theorem 6.
By the definition of δ n γ , 1 ( K ) X , there is an ε > 0 and a L i p γ mapping Ψ : ( B ( Y n ) , · Y n ) X satisfying
f Ψ n ( y ) X δ n γ , 1 ( K , Y n ) X + ε ,
where, for any y , y B ( Y n ) , the mapping Ψ satisfies
Ψ ( y ) Ψ ( y ) X γ y y Y n .
Then, we have
Ψ ( y ) Ψ ( y ) X 2 1 α γ y y Y n α 2 γ y y Y n α .
Thus, Ψ satisfies the α -Hölder condition, and Ψ is an H α ( 2 γ ) mapping.
Therefore, it is known from (A8) and the definition of Hölder widths that
δ n 2 γ , α ( K , Y n ) X sup f K inf y B ( Y n ) f Ψ n ( y ) X δ n γ , 1 ( K , Y n ) X + ε .
Taking ε 0 , we obtain
δ n 2 γ , α ( K ) X δ n γ , 1 ( K ) X .
Thus, by Lemma A4, we complete the proof of Theorem 6. □

Appendix C. Proofs of Section 4

Appendix C.1. Proof of Theorem 7

To prove Theorem 7, we recall a result from [22].
Proposition A1
([22]). If Φ satisfies the $H_{\alpha_1}(\gamma_1)$ condition and Ψ satisfies the $H_{\alpha_2}(\gamma_2)$ condition, then the composition $\Phi \circ \Psi$ satisfies the $H_{\alpha_1 \alpha_2}(\gamma_1 \gamma_2^{\alpha_1})$ condition. In addition, if $\alpha_1 = \alpha_2 = \alpha$, then $\Phi + \Psi$ satisfies the $H_\alpha(\gamma_1 + \gamma_2)$ condition.
Proof of Theorem 7.
Note that entropy numbers and Hölder widths are invariant in terms of translation, that is, for any g X ,
ε n ( K ) X = ε n ( K g ) X , δ n γ , α ( K ) X = δ n γ , α ( K g ) X .
We only need to consider the compact set K B X ( 0 , r ) : = { g X : g X r , where r > 0 } .
For any ε > 0 , it is known from the definition of Hölder widths that there is an element g X satisfying
sup f K f g X < rad ( K ) + ε 2 k .
Thus, we could set r = rad ( K ) + ε 2 k .
Let η > 0 and the set S k n : = { h 1 , h 2 , h 2 k n } K , satisfying, for any f K , that there is an element h j S k n such that
f h j X ε k n ( K ) X + η .
Next, we split the unit ball ( B n , · n ) = [ 1 , 1 ] n R n into 2 k n non-overlapping open balls B j with side length 2 1 k . Denote by y j the center of B j . Let ϕ j : R n R be the mapping such that
ϕ j ( y ) = max { 0 , 1 2 k y j y n α } , j = 1 , , 2 k n .
It is clear that z = max { 0 , x } , x R satisfies the H 1 ( 1 ) condition, and z = x α , x 0 satisfies the H α ( 1 ) condition. Indeed, for a constant 0 < α < 1 , the function ζ ( t ) : = ( 1 + t ) α 1 + t α on t 0 is non-negative and has maximum 1. Set t = x 1 x 2 for any x 1 , x 2 > 0 . Then, we have
1 + x 1 x 2 α 1 + x 1 x 2 α ,
which implies that | x 1 + x 2 | α x 1 α + x 2 α , and thus
| x 1 α x 2 α | | x 1 x 2 | α .
So, z = x α , x 0 satisfies the H α ( 1 ) condition. It follows from Proposition A1 that ϕ j satisfies the H α ( 2 k ) condition. Denote by Y = ( R n , · n ) . Let the mapping Φ : Y X satisfy
Φ ( y ) : = j = 1 2 k n h j ϕ j ( y ) .
Then, we prove that Φ satisfies the H α 2 k r condition and Φ ( y j ) = h j .
It is known from (A9) that if y = y j , then ϕ j ( y ) = 1 , and if y B j , then ϕ j ( y ) = 0 . Thus, Φ ( y j ) = h j . Moreover, for any y , y Y , we consider the following three cases.
Case 1: If y , y B n , then Φ ( y ) Φ ( y ) X = 0 2 k max j h j X y y Y α .
Case 2: If y B n and y B n , then there is a j 0 such that y B j 0 . Thus,
Φ ( y ) Φ ( y ) X = h j 0 ϕ j 0 ( y ) 0 X max j h j X · ϕ j 0 ( y ) ϕ j 0 ( y ) 2 k max j h j X · y y Y α .
Case 3: If y , y B n , then there exist two numbers j 0 , j 1 such that y B j 0 and y B j 1 . Therefore,
Φ ( y ) Φ ( y ) X max j h j X · | ϕ j 0 ( y ) ϕ j 1 ( y ) | .
We divide it into the following cases.
Case 3.1: If j 0 = j 1 , it follows from ϕ j 0 satisfying the H α ( 2 k ) condition that
Φ ( y ) Φ ( y ) X max j h j X · | ϕ j 0 ( y ) ϕ j 0 ( y ) | 2 k max j h j X · y y Y α .
Case 3.2: If j 0 j 1 , we have
| ϕ j 0 ( y ) ϕ j 1 ( y ) | = 2 k · y j 0 y n α y j 1 y n α .
Due to the arbitrariness of y , y and j 0 , j 1 , we can assume that y j 0 y n α y j 1 y n α . Then,
y j 0 y n α y j 1 y n α = y j 0 y n α y j 1 y n α y j 1 y n α y j 1 y n α y y n α ,
where the last inequality uses (A10). Otherwise,
y j 0 y n α y j 1 y n α = y j 1 y n α y j 0 y n α y j 0 y n α y j 0 y n α y y n α .
Thus, by combining (A11) with (A12), we have
Φ ( y ) Φ ( y ) X 2 k max j h j X · y y Y α .
Therefore, Φ satisfies the H α 2 k r condition, where r = rad ( K ) + ε 2 k > max j h j X .
Then, we have
δ n 2 k r , α ( K ) X = δ n 2 k rad ( K ) + ε , α ( K ) X ε k n ( K ) X + η .
Let η 0 and ε 0 . We obtain
δ n 2 k rad ( K ) , α ( K ) X ε k n ( K ) X .
We complete the proof of Theorem 7. □

Appendix C.2. Proof of Theorem 8

To prove Theorem 8, we recall some definitions and lemmas.
An ε-packing of K is a collection $\{ f_1, \ldots, f_l \} \subset K$ such that
$$\min_{i \ne j} \| f_i - f_j \|_X > \varepsilon.$$
The maximal ε-packing number $P_\varepsilon(K)$ is the cardinality of the largest ε-packing of K.
It is known ([3], Chapter 15) that $P_{2\varepsilon}(K) \le N_\varepsilon(K) \le P_\varepsilon(K)$, and Lemma A5 holds.
Lemma A5
([3]). For the ball $B_r := \{ x \in Y_n : \| x \|_{Y_n} \le r \}$ and $0 < \varepsilon \le r$, we have
$$\Big( \frac{r}{2\varepsilon} \Big)^n \le P_\varepsilon(B_r) \le \Big( \frac{3r}{\varepsilon} \Big)^n,$$
and
$$\Big( \frac{r}{2\varepsilon} \Big)^n \le N_\varepsilon(B_r) \le \Big( \frac{3r}{\varepsilon} \Big)^n.$$
Lemma A6.
If $\delta_n^{\gamma,\alpha}(K)_X < \varepsilon$, then
$$\gamma \ge 3^{-\alpha}\, \varepsilon\, N_{2\varepsilon}^{\frac{\alpha}{n}}(K).$$
In particular, if $\delta_n^{\gamma,\alpha}(B_{Z_m})_X \le \varepsilon$, then
$$\gamma \ge 3^{-\alpha}\, 2^{-\frac{2m\alpha}{n}}\, \varepsilon^{1 - \frac{m\alpha}{n}},$$
where $B_{Z_m}$ is an m-dimensional unit ball of X.
Proof. 
For δ n γ , α ( K ) X < ε , there exists an H α ( γ ) mapping Φ and a norm · Y n such that Φ ( B ( Y n ) ) can approximate K with an accuracy of ε . That is, for any y B ( Y n ) and z K ,
z Φ ( y ) X < ε .
Then, we consider a collection { y 1 , y 2 , , y N } B ( Y n ) such that { Φ ( y 1 ) , Φ ( y 2 ) , , Φ ( y N ) } is a maximal ε -packing of Φ B ( Y n ) . By the definition of the maximal packing number and Hölder widths, for any i j , i , j = 1 , , N , we obtain
ε < Φ ( y i ) Φ ( y j ) X γ y i y j Y n α ,
and thus
y i y j Y n > ε γ 1 α .
In addition, if we add an element y B ( Y n ) , then the set { Φ ( y ) , Φ ( y 1 ) , Φ ( y 2 ) , , Φ ( y N ) } is not an ε -packing of Φ B ( Y n ) . Therefore, there exists a j 0 [ 1 , N ] satisfying
Φ ( y ) Φ ( y j 0 ) X < ε ,
which implies that
z Φ ( y j 0 ) X z Φ ( y ) X + Φ ( y ) Φ ( y j 0 )   X < 2 ε .
By the definition of the maximal packing number and (A16),
N P ( ε γ 1 ) 1 / α ( B ( Y n ) ) .
Thus, by (A13), we have
N ( 3 α γ ε 1 ) n α .
It follows from (A17) that { Φ ( y 1 ) , Φ ( y 2 ) , , Φ ( y N ) } is a 2 ε -covering of K. Therefore,
( 3 α γ ε 1 ) n α N N 2 ε ( K ) .
Then, we obtain
γ n α 3 n ε n α N 2 ε ( K ) ,
which means that (A15) holds true.
When K = B Z m , by (A14), we have N 2 ε ( K ) 2 2 m ε m . It follows from (A18) that
γ 3 α ε N 2 ε α n ( K ) 3 α 2 2 m α n ε 1 m α n .
The proof of Lemma A6 is completed. □
Proof of Theorem 8.
By Lemma A6 with ε = c 0 ( log 2 n ) p n q , we have
N 2 ε ( K ) ( 3 α γ ε 1 ) n α = ( 3 α γ c 0 1 ( log 2 n ) p n q ) n α 2 n α { log 2 ( 3 α γ c 0 1 ) p log 2 ( log 2 n ) + q log 2 n } < 2 c 1 n log 2 n ,
where c 1 is a constant depending on γ , α , c 0 , p , and q. It follows from the definitions of the minimal covering number and entropy number that
ε c 1 n log 2 n ( K ) X 2 ε = 2 c 0 ( log 2 n ) p n q .
If we take m = c 1 n log 2 n , then
log 2 m = log 2 c 1 + log 2 n + log 2 ( log 2 n ) .
Thus, for sufficiently large n, we obtain 2 1 log 2 n < log 2 m < 3 log 2 n . By using (A19) and n = m c 1 log 2 n , we obtain
ε m ( K ) X 2 c 0 ( log 2 n ) p m c 1 log 2 n q = 2 c 0 c 1 q ( log 2 n ) p + q m q C ( log 2 m ) p + q m q .
We complete the proof of Theorem 8. □

Appendix C.3. Proof of Theorem 9

To prove Theorem 9, we need the following lemma.
Lemma A7.
Suppose that the sequence { η n } n = 1 of real numbers is decreasing to zero. Moreover, if
ε n ( K ) η n , n 1 ,
and there exist m N and θ > 0 satisfying
δ m γ , α ( K ) X < θ ,
then
η m α log 2 ( 3 α γ θ 1 ) < 2 θ .
Proof. 
By using Lemma A6 with ε = θ , we have
N 2 θ ( K ) ( 3 α γ θ 1 ) m α = 2 m α log 2 ( 3 α γ θ 1 ) .
Then, by the definition of the entropy number and (A20), we obtain
2 θ ε α 1 m log 2 ( 3 α γ θ 1 ) > η α 1 m log 2 ( 3 α γ θ 1 ) .
The proof of Lemma A7 is completed. □
Proof of Theorem 9.
We prove all statements of the theorem by contradiction.
To prove Theorem 9 (i), we assume that (7) is false, meaning there exists a strictly increasing sequence of integers { n k } k = 1 such that
a k : = δ n k γ , α ( K ) X n k q ( log 2 n k ) q p 0 , as k .
Thus, we have
δ n k γ , α ( K ) X = a k ( log 2 n k ) p q n k q 2 a k ( log 2 n k ) p q n k q = : θ k .
Set η n : = c 1 ( log 2 n ) p n q . It is known that η n is decreasing to zero as n . Therefore, by Lemma A7,
η α 1 n k log 2 ( 3 α γ θ k 1 ) 2 θ k .
Thus,
α q c 1 log 2 n k + log 2 log 2 ( 3 α γ θ k 1 ) + log 2 ( α 1 ) p n k q log 2 ( 3 α γ θ k 1 ) q 4 a k ( log 2 n k ) p q n k q ,
that is,
log 2 n k + log 2 log 2 ( 3 α γ θ k 1 ) + log 2 ( α 1 ) p 4 a k c 1 α q · ( log 2 n k ) p q log 2 ( 3 α γ θ k 1 ) q .
It is known that for sufficiently large k, we have
log 2 n k < n k and a k 1 .
Then, by (A21), we have
log 2 ( 3 α γ θ k 1 ) = log 2 3 α γ n k q 2 a k ( log 2 n k ) p q < α log 2 1.5 γ + q log 2 n k + log 2 a k 1 + ( q p ) log 2 ( log 2 n k ) 2 q log 2 n k + 2 log 2 a k 1 .
Thus, by combining (A22) with (A24), we obtain
log 2 n k + log 2 log 2 ( 3 α γ θ k 1 ) + log 2 ( α 1 ) p 2 q + 2 a k c 1 α q ( log 2 n k ) p q log 2 a k 1 + q log 2 n k q = C 0 a k ( log 2 n k ) p log 2 a k 1 log 2 n k + q q ,
where C 0 = 2 q + 2 c 1 1 α q .
Next, we use the property of y ( x ) = x p , x > 0 . When p < 0 , y ( x ) is decreasing on x > 0 ; when p > 0 , y ( x ) is increasing on x > 0 . Then, we divide it into the following three cases.
Case 1: $p = 0$. For sufficiently large $k$, it follows from (A25) and $n_k \to \infty$ that
\[
1 \le C_0 a_k\Bigl[\frac{\log_2 a_k^{-1}}{\log_2 n_k} + q\Bigr]^{q} \le C a_k(\log_2 a_k^{-1})^{q}.
\]
Therefore, $a_k^{-1} \le C(\log_2 a_k^{-1})^{q}$, which implies that $a_k^{-1}$ remains bounded, contradicting (A23).
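To spell out why an estimate of this form rules out $a_k \to 0$ (the same observation applies in the remaining cases and in the proof of Theorem 10 below): since $t(\log_2 t)^{-q} \to \infty$ as $t \to \infty$, there is a constant $t_0 = t_0(C, q)$ such that
\[
t \ge 2 \quad\text{and}\quad t \le C(\log_2 t)^{q} \quad\Longrightarrow\quad t \le t_0.
\]
Applying this with $t = a_k^{-1}$ shows that $a_k^{-1}$ stays bounded, whereas $a_k \to 0$ would force $a_k^{-1} \to \infty$.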
Case 2: $p > 0$. For sufficiently large $k$ and $0 < \alpha < 1$, we have
\[
(\log_2 n_k)^{p} \le \bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma\theta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{p}.
\]
Then, it follows from (A25) that
\[
1 \le C_0 a_k\Bigl[\frac{\log_2 a_k^{-1}}{\log_2 n_k} + q\Bigr]^{q} \le C a_k(\log_2 a_k^{-1})^{q},
\]
which also implies that $a_k^{-1}$ remains bounded, contradicting (A23).
Case 3: $p < 0$. For sufficiently large $k$ and $0 < \alpha < 1$, by (A23) and (A24), we obtain
\[
\bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma\theta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{p}
> \bigl[\log_2 n_k + \log_2(3^{\alpha}\gamma\theta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{p}
\ge \bigl[2\log_2 n_k + \log_2(3^{\alpha}\gamma\theta_k^{-1})\bigr]^{p}
\ge \bigl[(2q+2)\log_2 n_k + 2\log_2 a_k^{-1}\bigr]^{p}.
\]
Thus, by combining these results with (A25) and multiplying both sides by $(\log_2 n_k)^{-p}$, we obtain
\[
\Bigl[(2q+2) + \frac{2\log_2 a_k^{-1}}{\log_2 n_k}\Bigr]^{p} \le C_0 a_k\Bigl[\frac{\log_2 a_k^{-1}}{\log_2 n_k} + q\Bigr]^{q},
\]
which implies that
\[
a_k^{-1} \le C_0\Bigl[\frac{\log_2 a_k^{-1}}{\log_2 n_k} + q\Bigr]^{q}\Bigl[(2q+2) + \frac{2\log_2 a_k^{-1}}{\log_2 n_k}\Bigr]^{-p} \le C(\log_2 a_k^{-1})^{q-p}.
\]
Therefore, $a_k^{-1}$ remains bounded as $k \to \infty$, which contradicts (A23). We complete the proof of Theorem 9 (i).
To prove Theorem 9 (ii), we assume that (8) is false. Then, there is an increasing sequence $\{n_k\}_{k=1}^{\infty}$ satisfying, for $q > 0$,
\[
b_k := \delta_{n_k}^{\gamma,\alpha}(K)_X\,(\log_2 n_k)^{q} \to 0, \quad\text{as } k \to \infty.
\]
Set $\eta_n := c_1(\log_2 n)^{-q}$ and
\[
\delta_{n_k}^{\gamma,\alpha}(K)_X = b_k(\log_2 n_k)^{-q} < 2b_k(\log_2 n_k)^{-q} =: \zeta_k.
\]
Then by Lemma A7, we have
\[
\bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma\zeta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{-q}
\le \frac{2\zeta_k}{c_1} = \frac{4b_k}{c_1}(\log_2 n_k)^{-q}.
\]
When $q > 0$, the function $y = x^{-q}$, $x > 0$, is decreasing. For sufficiently large $k$, it follows from (A23) and $0 < \alpha < 1$ that
\[
\bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma\zeta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{-q}
\ge \bigl[2\log_2 n_k + \log_2(3^{\alpha}\gamma\zeta_k^{-1})\bigr]^{-q}.
\]
Moreover, it is known that
\[
\log_2(3^{\alpha}\gamma\zeta_k^{-1}) = \log_2\frac{3^{\alpha}\gamma(\log_2 n_k)^{q}}{2b_k}
< \alpha\log_2(1.5\gamma) + \log_2 b_k^{-1} + q\log_2(\log_2 n_k)
\le q\log_2 n_k + 2\log_2 b_k^{-1}.
\]
By combining (A27) with (A28), we obtain
\[
\frac{4b_k}{c_1}(\log_2 n_k)^{-q} \ge \bigl[(q+2)\log_2 n_k + 2\log_2 b_k^{-1}\bigr]^{-q}.
\]
Thus,
\[
b_k^{-1} \le \frac{4}{c_1}\Bigl[(q+2) + \frac{2\log_2 b_k^{-1}}{\log_2 n_k}\Bigr]^{q} \le C(\log_2 b_k^{-1})^{q},
\]
which implies that $b_k^{-1}$ remains bounded as $k \to \infty$, contradicting (A26). We complete the proof of Theorem 9 (ii).
Finally, we prove Theorem 9 (iii). The proof is similar to those above. It is known from Corollary 1 that if
\[
\gamma \ge d_{n_0}(K)_X + \operatorname{rad}(K)\, 2^{\alpha - 1},
\]
then
\[
\delta_n := \delta_n^{\gamma,\alpha}(K)_X \to 0, \quad\text{as } n \to \infty.
\]
By using Lemma A7 with $\theta = 2\delta_{n_k}^{\gamma,\alpha}(K)_X$ and $\eta_n = c_1 2^{-c_2 n^{p}}$, we have
\[
c_1 2^{-c_2\bigl(\frac{n_k}{\alpha}\log_2(2^{-1} 3^{\alpha}\gamma\delta_{n_k}^{-1})\bigr)^{p}} \le 4\delta_{n_k}.
\]
Taking the logarithm on both sides of (A29), and using the fact that $\delta_{n_k} \le 1$ for sufficiently large $k$, we have
\[
\log_2\Bigl(\frac{c_1}{4}\,\delta_{n_k}^{-1}\Bigr) \le c_2\, n_k^{p}\,\alpha^{-p}\bigl[\log_2(2^{-1} 3^{\alpha}\gamma\delta_{n_k}^{-1})\bigr]^{p},
\]
which implies that
\[
\log_2(\delta_{n_k}^{-1}) \le C\, n_k^{p}\bigl[\log_2(\delta_{n_k}^{-1})\bigr]^{p}.
\]
Thus,
\[
\log_2(\delta_{n_k}^{-1}) \le C^{\frac{1}{1-p}}\, n_k^{\frac{p}{1-p}}.
\]
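The last step is elementary algebra: assuming $0 < p < 1$ and $\log_2(\delta_{n_k}^{-1}) \ge 1$, dividing both sides of the previous inequality by $\bigl[\log_2(\delta_{n_k}^{-1})\bigr]^{p}$ gives
\[
\bigl[\log_2(\delta_{n_k}^{-1})\bigr]^{1-p} \le C n_k^{p}, \qquad\text{hence}\qquad \log_2(\delta_{n_k}^{-1}) \le C^{\frac{1}{1-p}} n_k^{\frac{p}{1-p}}.
\]
(The restriction $0 < p < 1$ is our reading of the regime considered here; it is the range for which the exponent $p/(1-p)$ is positive.)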
Therefore, we obtain
\[
\delta_{n_k} \ge C_1 2^{-C^{\frac{1}{1-p}} n_k^{\frac{p}{1-p}}} \ge C_1 2^{-C_2 n_k^{\frac{p}{1-p}}},
\]
which contradicts $\delta_{n_k} \to 0$ as $k \to \infty$. The proof of Theorem 9 is completed. □

Appendix D. Proofs of Section 5

Appendix D.1. Proof of Theorem 10

Proof of Theorem 10.
The proofs are similar to those of Theorem 9. To prove Theorem 10 (i), we assume that (11) is false, that is, there is an increasing sequence $\{n_k\}_{k=1}^{\infty}$ satisfying
\[
a_k := \delta_{n_k}^{\gamma_{n_k},\alpha}(K)_X\, n_k^{2q}\,(\log_2 n_k)^{-p} \to 0, \quad\text{as } k \to \infty.
\]
Thus, we have
\[
\delta_{n_k}^{\gamma_{n_k},\alpha}(K)_X = a_k(\log_2 n_k)^{p} n_k^{-2q} < 2a_k(\log_2 n_k)^{p} n_k^{-2q} =: \theta_k.
\]
It follows from Lemma A7 with $\eta_n := c_1(\log_2 n)^{p} n^{-q}$ that
\[
\bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma_{n_k}\theta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{p}
\le \frac{4a_k}{c_1\alpha^{q}}\cdot(\log_2 n_k)^{p}\bigl[\log_2(3^{\alpha}\gamma_{n_k}\theta_k^{-1})\bigr]^{q} n_k^{-q}.
\]
Recall that for sufficiently large $k$,
\[
\log_2 n_k < n_k \quad\text{and}\quad a_k \le 1.
\]
It is known from (A30) and $\gamma_{n_k} = M\lambda^{n_k}$ that there exists a constant $C_0 > 0$ such that
\[
\log_2(3^{\alpha}\gamma_{n_k}\theta_k^{-1}) = \log_2\frac{3^{\alpha}\gamma_{n_k}\, n_k^{2q}}{2a_k(\log_2 n_k)^{p}}
< \alpha\log_2\bigl(1.5\, M\lambda^{n_k}\bigr) + 2q\log_2 n_k + \log_2 a_k^{-1} - p\log_2(\log_2 n_k)
\le 2C_0 n_k + 2\log_2 a_k^{-1}.
\]
Thus, by combining (A31) with (A33), we obtain
\[
\bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma_{n_k}\theta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{p}
\le \frac{2^{q+2} a_k}{c_1\alpha^{q}}\,(\log_2 n_k)^{p}\bigl[\log_2 a_k^{-1} + C_0 n_k\bigr]^{q} n_k^{-q}
= C_1\, a_k(\log_2 n_k)^{p}\Bigl[\frac{\log_2 a_k^{-1}}{n_k} + C_0\Bigr]^{q},
\]
where $C_1 = 2^{q+2} c_1^{-1}\alpha^{-q}$. Then, we divide it into the following two cases.
Case 1: $p \ge 0$. For sufficiently large $k$ and $0 < \alpha < 1$, we have
\[
(\log_2 n_k)^{p} \le \bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma_{n_k}\theta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{p}.
\]
It follows from (A34) and $n_k \to \infty$ that
\[
1 \le C_1 a_k\Bigl[\frac{\log_2 a_k^{-1}}{n_k} + C_0\Bigr]^{q} < C a_k(\log_2 a_k^{-1})^{q}.
\]
Therefore, $a_k^{-1} \le C(\log_2 a_k^{-1})^{q}$, which implies that $a_k^{-1}$ remains bounded, contradicting (A32).
Case 2: $p < 0$. For sufficiently large $k$ and $0 < \alpha < 1$, by (A32) and (A33), we obtain
\[
\bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma_{n_k}\theta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{p}
\ge \bigl[2\log_2 n_k + \log_2\bigl(2C_0 n_k + 2\log_2 a_k^{-1}\bigr)\bigr]^{p}.
\]
Thus, by combining these results with (A34) and multiplying both sides by $(\log_2 n_k)^{-p}$, we obtain
\[
\Bigl[2 + \frac{\log_2\bigl(2C_0 n_k + 2\log_2 a_k^{-1}\bigr)}{\log_2 n_k}\Bigr]^{p} \le C_1 a_k\Bigl[\frac{\log_2 a_k^{-1}}{n_k} + C_0\Bigr]^{q},
\]
which implies that
\[
a_k^{-1} \le C_1\Bigl[\frac{\log_2 a_k^{-1}}{n_k} + C_0\Bigr]^{q}\Bigl[2 + \frac{\log_2\bigl(2C_0 n_k + 2\log_2 a_k^{-1}\bigr)}{\log_2 n_k}\Bigr]^{-p}.
\]
Case 2.1: If $a_k^{-1} \le c^{\,n_k}$, it follows from (A35) that $a_k^{-1} \le C$, where $c$ and $C$ are constants, which contradicts (A32).
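To justify Case 2.1 under our reading $a_k^{-1} \le c^{\,n_k}$ with a constant $c > 1$: in this case $\log_2 a_k^{-1} \le n_k\log_2 c$, so both bracketed factors in (A35) are bounded,
\[
\frac{\log_2 a_k^{-1}}{n_k} + C_0 \le \log_2 c + C_0, \qquad
\frac{\log_2\bigl(2C_0 n_k + 2\log_2 a_k^{-1}\bigr)}{\log_2 n_k} \le \frac{\log_2\bigl((2C_0 + 2\log_2 c)\, n_k\bigr)}{\log_2 n_k} \le C',
\]
for sufficiently large $k$, and (A35) indeed yields $a_k^{-1} \le C$.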
Case 2.2: If $a_k^{-1} > c^{\,n_k}$, it follows from (A35) that
\[
a_k^{-1} \le C\bigl(\log_2 a_k^{-1}\bigr)^{q}\bigl[\log_2(\log_2 a_k^{-1})\bigr]^{-p} \le C\bigl(\log_2 a_k^{-1}\bigr)^{q-p}.
\]
Therefore, $a_k^{-1}$ remains bounded as $k \to \infty$, which contradicts (A32). We complete the proof of Theorem 10 (i).
To prove Theorem 10 (ii), we assume that (12) is false. Then, there is an increasing sequence $\{n_k\}_{k=1}^{\infty}$ satisfying, for $q > 0$,
\[
b_k := \delta_{n_k}^{\gamma_{n_k},\alpha}(K)_X\,(\log_2 n_k)^{q} \to 0, \quad\text{as } k \to \infty.
\]
Set $\eta_n := c_1(\log_2 n)^{-q}$ and
\[
\delta_{n_k}^{\gamma_{n_k},\alpha}(K)_X = b_k(\log_2 n_k)^{-q} < 2b_k(\log_2 n_k)^{-q} =: \zeta_k.
\]
Then, by Lemma A7, we have
\[
\bigl[\log_2 n_k + \log_2\log_2(3^{\alpha}\gamma_{n_k}\zeta_k^{-1}) + \log_2(\alpha^{-1})\bigr]^{-q}
\le \frac{2\zeta_k}{c_1} = \frac{4b_k}{c_1}(\log_2 n_k)^{-q}.
\]
For sufficiently large $k$, it follows from (A32) and $0 < \alpha < 1$ that there exists a constant $C_0 > 0$ such that
\[
\log_2(3^{\alpha}\gamma_{n_k}\zeta_k^{-1}) = \log_2\frac{3^{\alpha} M\lambda^{n_k}(\log_2 n_k)^{q}}{2b_k} < C_0 n_k + 2\log_2 b_k^{-1}.
\]
By combining these results with (A37), we obtain
\[
\frac{4b_k}{c_1}(\log_2 n_k)^{-q} \ge \bigl[2\log_2 n_k + \log_2\bigl(C_0 n_k + 2\log_2 b_k^{-1}\bigr)\bigr]^{-q}.
\]
Thus, we obtain
\[
b_k^{-1} \le \frac{4}{c_1}\Bigl[2 + \frac{\log_2\bigl(C_0 n_k + 2\log_2 b_k^{-1}\bigr)}{\log_2 n_k}\Bigr]^{q}.
\]
Case 2.1: If $b_k^{-1} \le c^{\,n_k}$, then $b_k^{-1} \le C$, where $c$ and $C$ are constants, which contradicts (A36).
Case 2.2: If $b_k^{-1} > c^{\,n_k}$, it follows from (A35) that
\[
b_k^{-1} \le C\bigl(\log_2 b_k^{-1}\bigr)^{q}.
\]
Therefore, $b_k^{-1}$ remains bounded as $k \to \infty$, which contradicts (A36). We complete the proof of Theorem 10 (ii). □

Appendix D.2. Proof of Theorem 15

Proof of Theorem 15.
By Theorem 9 (i), it is clear that
\[
\delta_n^{\gamma,\alpha}(K_\eta)_X \gtrsim \frac{1}{n\log_2(n+1)}.
\]
The proof of the upper bound is similar to the proof in [18]. We give the detailed proof, using the method from the proof of Theorem 7. For $j = 1, \dots, N$, define $l_j \in \mathbb{N}\cup\{0\}$ by
\[
2^{-l_j - 1} < 2\eta_j\gamma^{-1} \le 2^{-l_j}.
\]
Set $N = (n+1)^{n}$. We can divide the unit ball $[-1, 1]^{n} \subset \mathbb{R}^{n}$ into $k_1$ non-overlapping open balls with side length $2^{-l_{k_1}}$, $(k_2 - k_1)$ non-overlapping open balls with side length $2^{-l_{k_2}}$, ⋯, and $(k_s - k_{s-1})$ non-overlapping open balls with side length $2^{-l_{k_s}}$, where $k_s = N$ and
\[
l_1 = \dots = l_{k_1} < l_{k_1 + 1} = \dots = l_{k_2} < \dots < l_{k_{s-1} + 1} = \dots = l_{k_s}.
\]
Hence, there is a sequence of non-overlapping open balls $B_j \subset [-1, 1]^{n}$ with side length $2^{-l_j}$,
\[
B_j := B\bigl(y_j, 2^{-l_j - 1}\bigr), \quad j = 1, \dots, N.
\]
Let $\phi_j : \mathbb{R}^{n} \to \mathbb{R}$ be a mapping such that
\[
\phi_j(y) = \eta_j\max\bigl\{0,\; 1 - 2^{l_j + 1}\|y_j - y\|_{n}^{\alpha}\bigr\}, \quad j = 1, \dots, N,
\]
and the mapping $\Phi$ be such that
\[
\Phi(y) := \sum_{j=1}^{N}\phi_j(y)\cdot e_j.
\]
It is known from the proof of Theorem 7 and (A38) that $\Phi$ satisfies the $H^{\alpha}_{\sup_{j=1,\dots,N}\{2^{l_j+1}\eta_j\}}$ condition; thus, it satisfies the $H^{\alpha}_{\gamma}$ condition. Moreover, $\Phi(y_j) = \eta_j e_j$, $j = 1, \dots, N$.
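For completeness, here is a quick check of the identity $\Phi(y_j) = \eta_j e_j$ under the stated construction (pairwise non-overlapping open balls $B_i = B(y_i, 2^{-l_i-1})$ and $0 < \alpha \le 1$): for $i \ne j$ the point $y_j$ lies outside $B_i$, so
\[
\|y_i - y_j\|_{n} \ge 2^{-l_i - 1}, \qquad\text{hence}\qquad 2^{l_i + 1}\|y_i - y_j\|_{n}^{\alpha} \ge 2^{(l_i + 1)(1-\alpha)} \ge 1,
\]
and therefore $\phi_i(y_j) = 0$, while $\phi_j(y_j) = \eta_j\max\{0, 1 - 0\} = \eta_j$.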
For $K_\eta = \{\eta_k e_k\}_{k=1}^{\infty}\cup\{0\}$, by the decreasing property of $\{\eta_k\}_{k=1}^{\infty}$, we have
\[
\sup_{k\ge 1}\,\inf_{y\in B(Y_n)}\|\eta_k e_k - \Phi(y)\|_X
\le \sup_{k\ge 1}\,\inf_{j=1,\dots,N}\|\eta_k e_k - \Phi(y_j)\|_{n}
= \sup_{k\ge 1}\,\inf_{j=1,\dots,N}\|\eta_k e_k - \eta_j e_j\|_{n}
= \eta_N,
\]
and
\[
\inf_{y\in B(Y_n)}\|0 - \Phi(y)\|_X \le \inf_{j=1,\dots,N}\|\Phi(y_j)\|_{n} = \inf_{j=1,\dots,N}\|\eta_j e_j\|_{n} = \eta_N.
\]
Thus,
\[
\delta_n^{\gamma,\alpha}(K_\eta)_X \le \sup_{f\in K_\eta}\,\inf_{y\in B(Y_n)}\|f - \Phi(y)\|_X \le \eta_N < \frac{1}{n\log_2(n+1)}.
\]
The proof of Theorem 15 is completed. □

References

1. Kolmogoroff, A. Über die beste Annäherung von Funktionen einer gegebenen Funktionenklasse. Ann. Math. 1936, 37, 107–110.
2. Pinkus, A. n-Widths in Approximation Theory; Springer Science & Business Media: Berlin, Germany, 2012.
3. Lorentz, G.G.; Golitschek, M.; Makovoz, Y. Constructive Approximation: Advanced Problems; Springer: Berlin, Germany, 1996.
4. Fang, G.; Ye, P. Probabilistic and average linear widths of Sobolev space with Gaussian measure. J. Complex. 2003, 19, 73–84.
5. Fang, G.; Ye, P. Probabilistic and average linear widths of Sobolev space with Gaussian measure in L∞-Norm. Constr. Approx. 2004, 20, 159–172.
6. Duan, L.; Ye, P. Exact asymptotic orders of various randomized widths on Besov classes. Commun. Pure Appl. Anal. 2020, 19, 3957–3971.
7. Duan, L.; Ye, P. Randomized approximation numbers on Besov classes with mixed smoothness. Int. J. Wavelets Multiresolut. Inf. Process. 2020, 18, 2050023.
8. Liu, Y.; Li, X.; Li, H. n-Widths of Multivariate Sobolev Spaces with Common Smoothness in Probabilistic and Average Settings in the Sq Norm. Axioms 2023, 12, 698.
9. Liu, Y.; Li, H.; Li, X. Approximation Characteristics of Gel’fand Type in Multivariate Sobolev Spaces with Mixed Derivative Equipped with Gaussian Measure. Axioms 2023, 12, 804.
10. Wu, R.; Liu, Y.; Li, H. Probabilistic and Average Gel’fand Widths of Sobolev Space Equipped with Gaussian Measure in the Sq-Norm. Axioms 2024, 13, 492.
11. Liu, Y.; Lu, M. Approximation problems on the smoothness classes. Acta Math. Sci. 2024, 44, 1721–1734.
12. DeVore, R.; Howard, R.; Micchelli, C. Optimal nonlinear approximation. Manuscr. Math. 1989, 63, 469–478.
13. DeVore, R.; Hanin, B.; Petrova, G. Neural network approximation. Acta Numer. 2021, 30, 327–444.
14. Petrova, G.; Wojtaszczyk, P. Limitations on approximation by deep and shallow neural networks. J. Mach. Learn. Res. 2023, 24, 1–38.
15. DeVore, R.; Kyriazis, G.; Leviatan, D.; Tichomirov, V. Wavelet compression and nonlinear-widths. Adv. Comput. Math. 1993, 1, 197–214.
16. Temlyakov, V. Nonlinear Kolmogorov widths. Math. Notes 1998, 63, 785–795.
17. Cohen, A.; DeVore, R.; Petrova, G.; Wojtaszczyk, P. Optimal stable nonlinear approximation. Found. Comput. Math. 2022, 22, 607–648.
18. Petrova, G.; Wojtaszczyk, P. Lipschitz widths. Constr. Approx. 2023, 57, 759–805.
19. Petrova, G.; Wojtaszczyk, P. On the entropy numbers and the Kolmogorov widths. arXiv 2022, arXiv:2203.00605.
20. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114.
21. Shen, Z.; Yang, H.; Zhang, S. Optimal approximation rate of ReLU networks in terms of width and depth. J. Math. Pures Appl. 2022, 157, 101–135.
22. Fiorenza, R. Hölder and Locally Hölder Continuous Functions, and Open Sets of Class C^k, C^{k,λ}; Birkhäuser: Basel, Switzerland, 2017.
23. Opschoor, J.; Schwab, C.; Zech, J. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constr. Approx. 2021, 55, 537–582.
24. Yang, Y.; Zhou, D. Optimal Rates of Approximation by Shallow ReLU^k Neural Networks and Applications to Nonparametric Regression. Constr. Approx. 2024, 1–32.
25. Lee, M. Mathematical Analysis and Performance Evaluation of the GELU Activation Function in Deep Learning. J. Math. 2023, 2023, 4229924.
26. Forti, M.; Grazzini, M.; Nistri, P.; Pancioni, L. Generalized Lyapunov approach for convergence of neural networks with discontinuous or non-Lipschitz activations. Phys. D 2006, 214, 88–99.
27. Gavalda, R.; Siegelmann, H. Discontinuities in recurrent neural networks. Neural Comput. 1999, 11, 715–745.
28. Tatar, N. Hölder continuous activation functions in neural networks. Adv. Differ. Equ. Control Process. 2015, 15, 93–106.
29. Carl, B. Entropy numbers, s-numbers, and eigenvalue problems. J. Funct. Anal. 1981, 41, 290–306.
30. Konyagin, S.; Temlyakov, V. The Entropy in Learning Theory. Error Estimates. Constr. Approx. 2007, 25, 1–27.
31. Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019.
32. Donoho, D.L. Compressed sensing. IEEE Trans. Inform. Theory 2006, 52, 1289–1306.
33. Siegel, J.W. Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. J. Mach. Learn. Res. 2023, 24, 1–52.
34. Lu, J.; Shen, Z.; Yang, H.; Zhang, S. Deep network approximation for smooth functions. SIAM J. Math. Anal. 2021, 53, 5465–5506.
35. Birman, M.; Solomyak, M. Piecewise polynomial approximations of functions of the class W_p^α. Mat. Sb. 1967, 73, 331–355. (In Russian)
36. DeVore, R.; Sharpley, R. Besov spaces on domains in R^d. Trans. Am. Math. Soc. 1993, 335, 843–864.
37. Mazzucato, A. Besov-Morrey spaces: Function space theory and applications to non-linear PDE. Trans. Am. Math. Soc. 2003, 355, 1297–1364.
38. Garnett, J.; Le, T.; Meyer, Y.; Vese, A. Image decompositions using bounded variation and generalized homogeneous Besov spaces. Appl. Comput. Harmon. Anal. 2007, 23, 25–56.
39. Marinucci, D.; Pietrobon, D.; Balbi, A.; Baldi, P.; Cabella, P.; Kerkyacharian, G.; Natoli, P.; Picard, D.; Vittorio, N. Spherical needlets for cosmic microwave background data analysis. Mon. Not. R. Astron. Soc. 2008, 383, 539–545.
40. Dai, F.; Xu, Y. Approximation Theory and Harmonic Analysis on Spheres and Balls; Springer Monographs in Mathematics; Springer: Berlin/Heidelberg, Germany, 2013.
41. Feng, H.; Huang, S.; Zhou, D.X. Generalization analysis of CNNs for classification on spheres. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 6200–6213.
42. Kushpel, A.; Tozoni, S. Entropy numbers of Sobolev and Besov classes on homogeneous spaces. In Advances in Analysis; World Scientific Publishing: Hackensack, NJ, USA, 2005; pp. 89–98.
43. Zhou, D.X. Theory of deep convolutional neural networks: Downsampling. Neural Netw. 2020, 124, 319–327.
44. Zhou, D.X. Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 2020, 48, 787–794.
45. Mao, T.; Shi, Z.; Zhou, D.X. Theory of deep convolutional neural networks III: Approximating radial functions. Neural Netw. 2021, 144, 778–790.
46. Kühn, T. Entropy Numbers of General Diagonal Operators. Rev. Mat. Complut. 2005, 18, 479–491.
47. Carl, B.; Stephani, I. Entropy, Compactness and the Approximation of Operators; Cambridge University Press: Cambridge, UK, 1990.
48. Wojtaszczyk, P. Banach Spaces for Analysts; Cambridge University Press: Cambridge, UK, 1991.
Figure 1. Approximation error and the number of elements n and depth d: classical methods (n-term wavelets and adaptive finite elements) vs. new tools (deep neural networks).