Abstract
We introduce the Hölder width, which measures the best error performance of some recent nonlinear approximation methods, such as deep neural network approximation. We then investigate the relationship between Hölder widths and other widths, showing that some Hölder widths are essentially smaller than n-Kolmogorov widths and linear widths. We also prove that, as the Hölder constants grow with n, the Hölder widths are much smaller than the entropy numbers. The fact that Hölder widths are smaller than the known widths implies that the nonlinear approximation represented by deep neural networks can provide a better approximation order than other existing approximation methods, such as adaptive finite elements and n-term wavelet approximation. In particular, we show that the Hölder widths for Sobolev and Besov classes induced by deep neural networks decay with twice the exponent of other known widths and entropy numbers.
Keywords:
Hölder widths; deep neural networks; entropy numbers; nonlinear approximation; n-Kolmogorov widths; nonlinear (n, N)-widths; Sobolev classes; Besov classes
MSC:
41A46; 41A65; 68T07; 46N10
1. Introduction
Width theory is one of the most important topics in approximation theory because widths can be considered approximation standards that indicate the accuracy achievable for a given function class using an approximation method. They have been extensively studied and applied in various fields, providing a benchmark for the best performance of different approximation techniques. One of the earliest studies on widths was Kolmogorov’s work in 1936, where he introduced the concept of n-Kolmogorov widths []. With the development of modern science and engineering, the theory of widths has also developed rapidly, greatly promoting research into various linear and nonlinear approximation methods. Problems related to width theory have been and continue to be studied by many experts, including Pinkus, Lorentz, DeVore, and Temlyakov [,,,,,,,,,,]. In addition, nonlinear methods play a crucial role in understanding complex phenomena across various applications, such as compressed sensing, signal processing, and neural networks [,,]. Widths such as manifold widths, nonlinear widths, and Lipschitz widths have been utilized as fundamental measures to assess the optimal convergence rate of these nonlinear methods [,,,,].
It is known that neural networks can serve as powerful nonlinear tools. For instance, the ReLU (Rectified Linear Unit) activation function, ReLU(x) = max{0, x},
is characterized as a Lipschitz mapping, which has led to the introduction of stable manifold widths and Lipschitz widths. In [,], Cohen et al. and DeVore et al. investigated stable manifold widths to quantify error performance in nonlinear approximation methods, such as compressed sensing and neural networks. They discussed the fundamental properties of these widths and established their connections with entropy numbers. In [,], Petrova and Wojtaszczyk introduced Lipschitz widths and showed their relationships with other widths and entropy numbers. However, not all mappings are Lipschitz; thus, it is essential to consider weaker conditions to understand the error performance of nonlinear approximation methods. One such condition is the Hölder condition, which we explore in this paper. We will introduce the concept of Hölder widths and investigate their relationship with other widths and entropy numbers. Our results may provide a better understanding of the effects of such nonlinear approximation methods and their potential applications in deep neural networks.
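For illustration only (this numerical sketch is ours and not part of the original argument), the following script checks empirically that the ReLU map never increases distances, i.e., it is 1-Lipschitz and hence 1-Hölder:

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x): a 1-Lipschitz, hence 1-Hölder, map on the reals."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, 100_000)
y = rng.uniform(-5.0, 5.0, 100_000)
mask = x != y
# Empirical Lipschitz quotient |ReLU(x) - ReLU(y)| / |x - y|; it never exceeds 1.
ratio = np.abs(relu(x) - relu(y))[mask] / np.abs(x - y)[mask]
print("largest observed quotient:", ratio.max())  # <= 1.0
```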
Many authors have achieved profound results for the ReLU activation function, which acts as a continuous function in feed-forward deep neural networks (DNNs) [,,,,]. It is known that the mapping in [] with the ReLU activation function is a Lipschitz mapping in DNNs, where the network has a width W and a depth n. Unlike in the Lipschitz width, the terms ‘width’ and ‘depth’ here refer to the scale of the network. The performance of this Lipschitz mapping is discussed in [].
We introduce a more flexible assumption. Let Y and Z be metric spaces. Moreover, we assume that the space Z is separable. We call Φ an α-Hölder mapping with coefficient γ if, for any ,
We also say that Φ satisfies the corresponding Hölder condition [], which is equivalent to
We provide some remarks on Hölder mappings below.
Remark 1.
If α = 1, then Φ is Lipschitz continuous.
Remark 2.
If Y is bounded and Φ is an α-Hölder mapping, then for any 0 < β ≤ α, Φ is a β-Hölder mapping.
Remark 3.
The minimum α-Hölder coefficient γ equals 0 if and only if Φ is constant.
Note that the RePU (Rectified Power Unit) activation function [,], RePU_p(x) = max{0, x}^p with integer p ≥ 2, and the GELU (Gaussian Error Linear Unit) activation function [], which multiplies x by the value at x of the standard Gaussian cumulative distribution function,
can be considered 1-Hölder mappings in bounded spaces. The performance of these mappings can be found in [,]. Moreover, there are various α-Hölder activation functions with α ∈ (0, 1). In [], Forti, Grazzini et al. obtained global convergence results, where the neuron activations were modeled by α-Hölder continuous functions with α ∈ (0, 1), such as
where the parameters involved are defined in []. These activations can significantly increase the computational power [,,]. Motivated by the above results, we mainly focus on the α-Hölder condition with α ∈ (0, 1), which is weaker than the Lipschitz condition.
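As a concrete illustration (ours; we do not claim it is the exact activation used by Forti et al. []), the map σ(x) = sign(x)|x|^α with α ∈ (0, 1) is α-Hölder on [−1, 1] but not Lipschitz, which is precisely the weaker regularity considered in this paper:

```python
import numpy as np

alpha = 0.5  # any exponent in (0, 1) works

def sigma(x):
    """A non-Lipschitz, alpha-Hölder activation: sign(x) * |x|**alpha."""
    return np.sign(x) * np.abs(x) ** alpha

# Near the origin the Lipschitz quotient blows up ...
eps = 1e-12
print(abs(sigma(eps) - sigma(0.0)) / eps)           # ~1e6: no finite Lipschitz constant
# ... while the alpha-Hölder quotient stays bounded.
print(abs(sigma(eps) - sigma(0.0)) / eps ** alpha)  # = 1.0

# Random check of the Hölder quotient over [-1, 1]: it stays below 2**(1 - alpha).
rng = np.random.default_rng(1)
x, y = rng.uniform(-1, 1, 200_000), rng.uniform(-1, 1, 200_000)
mask = x != y
q = np.abs(sigma(x) - sigma(y))[mask] / np.abs(x - y)[mask] ** alpha
print(q.max())
```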
Now, we introduce Hölder widths, which measure the best error performance of some recent nonlinear approximation methods characterized by Hölder mappings. Throughout this paper, let X be a Banach space with a norm , and let Y_n be an n-dimensional Banach space with a norm defined on ℝⁿ. Denote the unit ball of Y_n by B_{Y_n}.
Let K be a bounded subset of X. For , , we define the fixed Hölder widths
where satisfies
Next, we define the Hölder width
where the infimum is taken over all norms on ℝⁿ. From definition (1), we see that the error of any numerical method based on Hölder mappings will not be smaller than the Hölder widths.
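Since the displayed definitions (1) and (2) are not reproduced in this version, the following LaTeX sketch records one plausible reading, modeled on the Lipschitz widths of Petrova and Wojtaszczyk []; the symbols d^{γ,α}, Y_n, and B_{Y_n} are our notation, and the exact normalization of the original displays may differ.

```latex
% Hedged reconstruction of (1)-(2); notation is assumed, not quoted from the paper.
d^{\gamma,\alpha}(K,Y_n)_X \;=\; \inf_{\Phi}\,\sup_{f\in K}\,\inf_{y\in B_{Y_n}} \|f-\Phi(y)\|_X,
\qquad
d_n^{\gamma,\alpha}(K)_X \;=\; \inf_{\|\cdot\|_{Y_n}} d^{\gamma,\alpha}\bigl(K,(\mathbb{R}^n,\|\cdot\|_{Y_n})\bigr)_X,
```

where the first infimum would run over all mappings Φ : B_{Y_n} → X that are α-Hölder with coefficient γ, and the second over all norms on ℝⁿ.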
We propose the Hölder widths, exploring their properties and relationships with other known widths and entropy numbers. In Section 2, we establish the fundamental properties of Hölder widths. In Section 3, we compare Hölder widths with n-Kolmogorov widths, linear widths, and nonlinear (n, N)-widths. In Section 4, we investigate the relationship between Hölder widths and entropy numbers. In Section 5, we provide some specific applications and derive the asymptotic order of Hölder widths for Sobolev classes and Besov classes using deep neural networks. In Section 6, we provide some concluding remarks. All detailed proofs for the results from Section 2, Section 3 and Section 4 are included in Appendix A, Appendix B and Appendix C, and the proofs for Theorems 10 and 15 in Section 5 are provided in Appendix D.
2. Fundamental Properties of Hölder Widths
Recall that the radius of a set is defined as
It is known from Remark 3 that a function that satisfies the condition with coefficient γ = 0 is a constant function. Then, for the n-dimensional space Y_n,
Moreover, for a fixed constant , it is known from (2) that the Hölder width is decreasing with respect to γ and n, that is, (i) if , then , and (ii) if , then .
In addition, it is easy to see that the space (, ) in (1) and (2) can be replaced with any n-dimensional normed space (, ) such that
where
Denote by
the space equipped with the norm, that is, for ,
Recall that an ε-covering of K is a collection such that
The minimal ε-covering number is the minimal cardinality of an ε-covering of K. We say that a set K is totally bounded if, for every ε > 0, its minimal ε-covering number is finite.
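For intuition only (our own sketch, not part of the paper), the script below builds a greedy ε-net for a finite sample of the square; the size of a greedy net upper-bounds the minimal ε-covering number and already exhibits the expected ε^{-2} growth for a two-dimensional set.

```python
import numpy as np

def greedy_epsilon_net(points, eps):
    """Greedy eps-covering of a finite point set; its size upper-bounds the
    minimal eps-covering number of the set."""
    centers, uncovered = [], list(range(len(points)))
    while uncovered:
        c = points[uncovered[0]]
        centers.append(c)
        uncovered = [i for i in uncovered if np.linalg.norm(points[i] - c) > eps]
    return centers

rng = np.random.default_rng(2)
K = rng.uniform(-1.0, 1.0, size=(2000, 2))       # a finite sample of [-1, 1]^2
for eps in (0.5, 0.25, 0.125):
    print(eps, len(greedy_epsilon_net(K, eps)))  # grows roughly like eps**-2
```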
We establish the following fundamental properties of Hölder widths.
Theorem 1.
Let K be a compact subset of X. For any , , and , there exists a norm on satisfying for ,
such that
Theorem 2.
Let K be a compact subset of X. As a function of γ, the Hölder width is continuous.
Theorem 3.
A subset K of X is totally bounded if and only if for every ,
Theorem 4.
Let and . If for every , there exist two numbers and such that , then K is totally bounded.
It is important to delve deeper into Hölder widths since the Hölder condition offers a more flexible framework that can accommodate a broader range of functions, making it particularly suitable for analyzing and approximating complicated functions and datasets. In the forthcoming sections, we will gain deeper insights into the approximation performance measured by Hölder widths.
3. The Relationship Between Hölder Widths and Other Widths
In this section, we will demonstrate that Hölder widths are smaller than other known widths, such as n-Kolmogorov, linear, and nonlinear (n, N)-widths. In the following sections, we assume that K is a compact subset of the Banach space X, which is our main concern.
3.1. The Relationship Between Hölder Widths, n-Kolmogorov Widths, and Linear Widths
We recall the definition of the n-Kolmogorov width of K from [], as follows:
where the infimum is taken over all n-dimensional subspaces of X.
It is known that the n-Kolmogorov width determines the optimal error incurred when approximating the ‘worst’ element of the set K using n-dimensional subspaces of X. We will show that some Hölder widths are essentially smaller than n-Kolmogorov widths.
Theorem 5.
For a compact set and , ,
Corollary 1.
If is compact, then for each and each ,
Remark 4.
Recall that the linear width is defined as
where the infimum is taken over the class of all continuous linear operators from X into itself with rank at most n. It follows from the definitions of the n-Kolmogorov width and linear width that . Thus, Theorem 5 implies that
3.2. The Relationship Between Hölder Widths and Nonlinear (n, N)-Widths
To evaluate the performance of the best n-term approximation with respect to different systems, such as the trigonometric system and wavelet bases, Temlyakov introduced the nonlinear (n, N)-width in [], which is defined as follows: for , ,
where the second infimum is taken over all collections of N linear spaces of dimension n. The nonlinear (n, N)-width reflects the approximation performance of greedy algorithms.
It is clear that . The larger N is, the more flexibility we have in approximating f. Moreover, it is known from (6) and Theorem 5 that
Moreover, we obtain the following inequalities, revealing the relationship between Hölder widths and nonlinear (n, N)-widths.
Theorem 6.
For any , , , and any compact set with , we have
4. Comparison Between Hölder Widths and Entropy Numbers
We first recall the definition of the entropy number from []. The entropy number is defined as
which is the infimum of all ε > 0 for which 2ⁿ balls of radius ε cover the compact set K ⊂ X, n ∈ ℕ.
Entropy numbers have many applications in fields such as compressed sensing, statistics, and learning theory [,,]. They can provide a benchmark for the best error performance of numerical recovery algorithms. Sometimes, estimating the entropy numbers is more accessible than computing other known widths, such as n-Kolmogorov widths and nonlinear (n, N)-widths. For example, for some model classes K, such as unit balls in classical Sobolev and Besov spaces, the entropy numbers are known and can also be used to estimate the lower bound of these widths.
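To connect covering numbers with entropy numbers numerically, the toy script below (ours; the paper's normalization, 2^n versus 2^{n-1} balls, may differ) recovers the entropy number of the unit interval by bisecting on the smallest radius at which 2^n intervals suffice; the answer 2^{-(n+1)} matches the hand computation under the assumed normalization.

```python
import math

def covering_number(eps):
    """Minimal number of closed balls (intervals of length 2*eps) covering [0, 1]."""
    return math.ceil(1.0 / (2.0 * eps))

def entropy_number(n, tol=1e-12):
    """inf{ eps > 0 : 2**n balls of radius eps cover [0, 1] } (assumed normalization)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:          # bisection on the radius
        mid = 0.5 * (lo + hi)
        if covering_number(mid) <= 2 ** n:
            hi = mid
        else:
            lo = mid
    return hi

for n in range(1, 6):
    print(n, entropy_number(n), 2.0 ** -(n + 1))  # numerical vs. exact value
```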
In this section, we compare the convergence rate of the Hölder widths with that of the entropy numbers. We first obtain the following general results.
Theorem 7.
For any and , we have
Specifically, if , then
Remark 5.
It is known from Theorem 7 that if , then
So, by the decreasing property of the Hölder width with respect to γ, Hölder widths are smaller than entropy numbers for .
It follows from Theorem 7 with that the following corollary holds.
Corollary 2.
Let , and .
(i) If the following inequality holds:
where are some constants such that , , then we have
where C is a positive constant.
(ii) If the following inequality holds:
where are two positive constants, then we have
where C is a positive constant.
Theorem 7 and Corollary 2 show that an upper bound on the entropy number yields an upper bound on the Hölder width. Conversely, if we know a lower bound of the entropy number, then we can obtain a lower bound of the Hölder width. To show this, we need the following theorem.
Theorem 8.
Let , . If there exists , where satisfies ,
then for ,
where are constants that depend on , and q.
Based on Theorem 8, we can obtain a lower bound of the Hölder width from that of the entropy number.
Theorem 9.
(i) If the following inequality holds:
where are constants such that , , then for each and , we have
where C is a positive constant.
(ii) If the following inequality holds:
where are some positive constants, then for each and , we have
where C is a positive constant.
(iii) If the following inequality holds:
where are some constants such that , and , then for each
we have
where are two positive constants.
Combining Corollary 2 (ii) with Theorem 9 (ii), we derive the following corollary.
Corollary 3.
For any compact set , , , and , then
5. Some Applications
The importance of the Hölder width lies in its lower bound, which is independent of any specific algorithm. This bound not only reveals the limitations of certain approximation tools but also provides information on the order of the Hölder width without knowing the concrete algorithms. The insights from the lower bound can help us show the optimality of some existing algorithms or prompt us to design optimal algorithms that can achieve such a bound. In essence, the concept of width is independent of any specific algorithm but inspires us to design optimal algorithms.
We apply the above general theoretical results to some important function spaces and obtain the corresponding orders of Hölder widths.
First, we remark that some common neural networks are Hölder mappings. Therefore, we can obtain the asymptotic orders of the Hölder widths characterized by these fully connected feed-forward neural networks. In the following discussion, we mainly consider the Banach spaces X of functions, where is continuously embedded in X. Let σ be an activation function. Denote by and .
A feed-forward neural network with width W, depth n, and activation σ produces a family:
which generates an approximation to a target element . For every there exists a continuous function on
where the affine mappings , , , and , and the function
Here, t is the vector whose coordinates are the entries of the matrices and biases of , . We note that the dimension of any hidden layer can naturally be expanded; thus, any fully connected network can be made to have a fixed width [,]. Our assumption about a fixed width W can simplify the computations and notations.
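A minimal sketch (with our own names `Phi_t` and `unpack`; the paper's layer indexing may differ) of the parameterization t ↦ Φ_t described above: a fully connected ReLU network of fixed width W and depth n whose single parameter vector t stacks all weight matrices and bias vectors.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def unpack(t, d_in, W, depth, d_out):
    """Split the flat parameter vector t into per-layer weight matrices and biases."""
    shapes = [(W, d_in)] + [(W, W)] * (depth - 1) + [(d_out, W)]
    layers, pos = [], 0
    for rows, cols in shapes:
        A = t[pos:pos + rows * cols].reshape(rows, cols); pos += rows * cols
        b = t[pos:pos + rows]; pos += rows
        layers.append((A, b))
    return layers

def Phi_t(t, x, d_in=1, W=8, depth=4, d_out=1):
    """Fixed-width feed-forward ReLU network: x -> A_out(relu(... relu(A_1 x + b_1) ...)) + b_out."""
    h = np.atleast_1d(np.asarray(x, dtype=float))
    layers = unpack(t, d_in, W, depth, d_out)
    for A, b in layers[:-1]:
        h = relu(A @ h + b)
    A, b = layers[-1]
    return A @ h + b

# Number of parameters for d_in = d_out = 1, W = 8, depth = 4 hidden layers.
n_params = (8 * 1 + 8) + (8 * 8 + 8) * 3 + (1 * 8 + 1)
t = np.random.default_rng(3).normal(size=n_params)
print(Phi_t(t, 0.3))
```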
Proposition 1.
If σ is an mapping, then , defined in (10), is an mapping, which means that for ,
where , , and is a constant.
It follows that a specific neural network can be a Hölder mapping with coefficient to approximate the target element f, where and . Then, we consider the lower bound for the Hölder width with coefficient , , , which also yields a lower bound for the DNN approximation.
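Proposition 1 is stated with its constants elided in this version; the toy computation below (ours) only illustrates the qualitative mechanism it describes: composing n layers of an α-Hölder activation yields a map whose Hölder exponent degrades to roughly α^n.

```python
import numpy as np

alpha = 0.5

def sigma(x):
    """An alpha-Hölder map on [-1, 1]."""
    return np.abs(x) ** alpha

def compose(k, x):
    """Apply sigma k times, mimicking k layers of an alpha-Hölder activation."""
    for _ in range(k):
        x = sigma(x)
    return x

# Near 0 the k-fold composition behaves like |x|**(alpha**k): the Hölder quotient
# with exponent alpha**k stays bounded, while the one with exponent alpha does not.
x = 1e-12
for k in (1, 2, 3):
    print(k,
          compose(k, x) / x ** (alpha ** k),   # ~1 (bounded)
          compose(k, x) / x ** alpha)          # explodes for k >= 2
```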
Theorem 10.
Let and , where , , .
(i) If the following inequality holds:
where are constants such that , , then
where C is a positive constant.
(ii) If the following inequality holds:
where are some positive constants, then
where C is a positive constant.
Theorems 7 and 10 imply the following corollary, which means that if we know the asymptotic orders of some entropy numbers, we can obtain the asymptotic orders of their Hölder widths.
Corollary 4.
Let and , where , , .
(i) If the following holds:
where are some constants such that , , then we have
(ii) If the following holds:
where q is a positive constant, then we have
We point out that Corollary 4 provides a tool for giving lower bounds on how well a compact set K can be approximated by a DNN, where all weights and biases are from the unit ball of some norm . The classical model classes K for multivariate functions can be the unit balls of smoothness spaces, such as classical Lipschitz, Hölder, Sobolev, and Besov spaces. For any model class K, denote the unit ball of K by
For any , let
Many experts have investigated the performance of deep learning approximation for these function classes of the Lebesgue space on the cube [,,,].
First, we determine the exact order of the Hölder width for the classical Sobolev class. For , we denote by the usual Lebesgue space on equipped with the norm
For , , we say that function f belongs to the Sobolev space if , and the norm of f is given by, for ,
It is known from [] that when ,
Moreover, the approximation error rates for Sobolev and Besov classes can be obtained using many classical methods of nonlinear approximation, such as adaptive finite elements or n-term wavelet approximation [,].
It follows from Theorem 10 and Corollary 4 that for some deep neural networks,
where is produced by some neural networks with depth n, fixed width, and activation functions , . Compared with classical methods, the factor 2 in the exponent leaves open the possibility of improved approximation rates when using deep neural networks. Indeed, it is known from [] that there exists a neural network with depth n, width , and ReLU activation function such that, for
The author used a novel bit-extraction technique, which gives an optimal encoding of sparse vectors, to obtain the upper bounds .
Based on the above discussion, we obtain the Hölder widths for the Sobolev classes .
Theorem 11.
Let , , and . Then, there exist , such that for ,
Proof.
It follows from (15) that
and for any
Therefore, there exist constants c, such that
Replacing with n, it follows from the decreasing property of the Hölder width that there exist and such that
Thus, we complete the proof of Theorem 11. □
Remark 6.
Theorem 11 implies that the upper bound in inequality (15) is sharp.
For a fixed and , Figure 1 compares the approximation error versus the number of elements or layers for different approximation methods: the classical methods represented by n-term wavelets and adaptive finite elements, and the new tools represented by deep neural networks. The blue solid line shows the approximation error decreasing at a rate of with the number of elements n for n-term wavelets or adaptive finite elements, while the orange dashed line indicates a faster decay of with depth n for deep neural networks. Overall, deep neural networks significantly outperform classical methods, such as n-term wavelets or adaptive finite elements, offering more rapid convergence and potentially higher accuracy in approximating functions. We call this phenomenon the super-convergence of deep neural networks, where the classical Hölder, Sobolev, and Besov classes on can achieve super-convergence [,].
Figure 1.
Approximation error versus the number of elements n and the depth d: classical methods (n-term wavelets and adaptive finite elements) vs. new tools (deep neural networks).
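The plot itself is not reproduced here; the short script below (ours) regenerates the qualitative comparison the caption describes, assuming the classical rate n^{-r/d} and the deep-network rate n^{-2r/d} suggested by the surrounding discussion (the concrete exponents used in the original figure may differ).

```python
import numpy as np

r, d = 2.0, 2.0                      # assumed smoothness r and dimension d, for illustration only
n = np.arange(1, 101)
classical = n ** (-r / d)            # n-term wavelets / adaptive finite elements
deep = n ** (-2.0 * r / d)           # deep networks with depth n ("super-convergence")
for k in (1, 10, 100):
    print(k, classical[k - 1], deep[k - 1])  # the deep-network error decays much faster
```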
The results of Theorem 11 can be extended to Besov spaces, which are much more general than Sobolev spaces. It is well known that functions from Besov spaces have been widely used in approximation theory, statistics, image processing, and machine learning (see [,,,], and the references therein). Recall that for , the modulus of smoothness of order r of is
where is the r-th order difference of f with step h, and
For , , we say that function f belongs to the Besov space if , and the norm of f, given by
is finite. It is known from [] that when ,
It follows from Theorem 10 and Corollary 4 that for some deep neural networks,
and it is also known from [] that there exists a neural network with depth n, width , and ReLU activation function such that for
Thus, we obtain the Hölder widths for the Besov classes .
Theorem 12.
Let , , , and . Then, there exist , such that for ,
The proof of Theorem 12 is similar to that of Theorem 11, so we omit the details here.
Note that any numerical algorithm based on Hölder mappings will have a convergence rate that is not faster than that of the Hölder width. This characterizes the limitation of the approximation power of some deep neural networks.
Meanwhile, it is well known that spherical approximation has been widely applied in many fields, such as cosmic microwave background analysis, global ionospheric prediction for geomagnetic storms, climate change modeling, environmental governance, and the processing of other spherical signals [].
We recall some concepts on the sphere from []. Let
be the unit sphere in equipped with the rotation-invariant measure normalized by . For , denote by the usual Lebesgue space on endowed with the norm
In [], Feng et al. studied the approximation of the Sobolev class on the sphere , denoted by , using convolutional neural networks with J layers. Recall that a function f belongs to the Sobolev class if , and the norm of f is given by
where is the Laplace–Beltrami operator on the sphere. The authors obtained the upper bound of the error of such an approximation in .
However, it is known from [] that when ,
Then, it follows from Corollary 4 that the lower bound of the Hölder width for may be .
Theorem 13.
Let , , and . Then, there exist , such that for ,
With the development of spherical theory [,,,] and the relationship between Hölder widths and entropy numbers, we conjecture that the approximation order of using some fully connected feed-forward neural networks with the bit-extraction technique may be . This conjecture can be formulated as follows.
Conjecture 1.
Let , , and . Then, there exist , such that for ,
Our results show that networks modeled by both Hölder and Lipschitz (e.g., ReLU) functions can achieve the approximation error , which is superior to classical approximation tools such as n-term wavelets and adaptive finite elements. We achieve the same approximation error for networks under a weaker condition, , which gives us more options for neural network approximation tools. For example, if we need to numerically solve differential equations with a discontinuous right-hand side modeling neural network dynamics, we can choose networks with non-Lipschitz activation functions, using the fact that non-Lipschitz functions share the peculiar property that even small variations in the neuron state can produce significant changes in the neuron output [,,]. The results of the Hölder widths would help us select suitable Hölder activation functions. Moreover, it is known from Appendix B.2 that the Hölder width is smaller than the Lipschitz width in the sense that
Thus, from the perspective of the approximation error, the Hölder activation function performs better than the Lipschitz activation function. However, it is currently unknown what magnitude this improvement can reach. It would be interesting to study this problem.
Next, we estimate the Hölder widths for discrete Lebesgue spaces. For , denote by the set of all sequences with
Let be a sequence such that for . Denote by
where , and .
Theorem 14.
For the space and the subset , its Hölder width satisfies and ,
Proof.
It is known from [] that its entropy number satisfies
It follows from Corollary 2 that there exists such that its Hölder width satisfies ,
It follows from Theorem 9 (i) that there exists such that its Hölder width satisfies
The proof of Theorem 14 is completed. □
Remark 7.
It is known from [] that its n-Kolmogorov width satisfies
Theorem 14 illustrates that the Hölder width of is smaller than its n-Kolmogorov width.
Finally, we obtain the asymptotic order of the Hölder width for c₀, the Banach space of all sequences converging to 0, equipped with the supremum norm. Let be a sequence with , . Denote a compact subset of X by
where the sequence is the standard basis in X. It is known from [] that its entropy number satisfies
Theorem 15.
For the space and the subset , we obtain that its Hölder width satisfies , ,
Theorem 15 shows the sharpness of Theorem 9 (i).
6. Concluding Remarks
We introduce the Hölder width, which measures the best error performance of some recent nonlinear approximation methods. We investigate the relationship between Hölder widths and other known widths, demonstrating that some Hölder widths are essentially smaller than n-Kolmogorov widths and linear widths. Moreover, we show that, as the Hölder constants grow with n, the Hölder widths are much smaller than the entropy numbers. The significance of Hölder widths being smaller than known widths is that some nonlinear approximations, such as deep neural network approximations, may yield a better approximation order than other known classical approximation methods. In fact, we show that for the Sobolev classes and the Besov classes, the Hölder widths realized by deep neural networks decay with twice the exponent of other known widths. This result shows that deep neural networks significantly outperform classical methods of approximation, such as adaptive finite elements and n-term wavelet approximation. Indeed, the Hölder width in neural networks serves two purposes. On the one hand, it demonstrates the superior approximation power of deep neural networks. On the other hand, it reveals the limitation in the approximating ability of some deep neural networks. These features are crucial for a deeper understanding and further exploration of the approximation power of deep neural networks. It would be interesting to calculate the Hölder widths for some important function classes.
Author Contributions
M.L. and P.Y. contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (Grant No. 11671213).
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Proofs of Section 2
Appendix A.1. Proof of Theorem 1
To prove Theorem 1, we need the following lemmas.
Lemma A1
(Auerbach lemma [,]). Let be an -dimensional Banach space and be its dual space. Then, there exist elements and functionals such that for ,
Lemma A2.
Let K be a bounded subset of X. For any , , and , we can limit the infimum in (2) to normed spaces with the norm satisfying, for ,
Proof.
According to Lemma A1, we can find vectors and linear functionals on the space (, ) satisfying
where is the dual space of and .
For , we can consider a new norm on and a mapping as
It is clear that .
We also need the following -Hölder version of Ascoli’s theorem.
Lemma A3.
For a separable metric space and a metric space where every closed ball is compact, let be a sequence of α-Hölder mappings such that there exist and satisfying for . Then, there exists a subsequence , which converges pointwise to an α-Hölder function . If is also compact, then the convergence is uniform.
Proof.
For , we have
Fix a countable dense subset , and define as the closed ball in Z with radius centered at z for . It follows that the Cartesian product
is a compact metric space under the product topology.
Let
Then, there exists a subsequence and an element such that
Thus, we obtain a function satisfying
where for i, . Since is dense, it can be extended to an α-Hölder function on Y, that is, . Furthermore, for every , which implies that is pointwise convergent to .
If is compact, then for , we can cover Y by a finite number of -balls with centers . Thus, for sufficiently large s, we have
which implies that the convergence is uniform. The proof of Lemma A3 is completed. □
Now, we are ready to prove all theorems in this section.
Proof of Theorem 1.
By (1) and (2), it is obvious that for any -dimensional space Y with a norm on ,
Thus, we only need to prove that
By Lemma A2, we can find a sequence , where satisfies the condition and the norm on satisfies (5), such that
According to Lemma A3, there exists a subsequence of the sequence of norms that converges pointwise on and uniformly on to a norm on satisfying (5).
Thus, there exists a number such that for every , there is a corresponding value with , and
If , then we have and . And if , then we have . Thus,
Next, we define the mapping as
For any , we have
so also satisfies the condition. We could write as the restriction of on .
Let and . Then, for each , we can find an element such that . Taking
we have
As and by taking the supremum over , we have
As , we have and . Thus, .
The proof of Theorem 1 is completed. □
Appendix A.2. Proofs of Theorems 2–4
Proof of Theorem 2.
We begin with the continuity of the Hölder widths at . For any mapping , and , we have
Thus, for any mapping ,
which implies that
It follows from (3), (4), and the decreasing property of with respect to that
which implies the continuity at .
Next, we prove that is continuous for by contradiction. For convenience, denote by . We assume that F is not continuous at some . Therefore, there exists and a sequence of real numbers , such that
It is known from the definition of Hölder widths that for fixed , there exists an mapping such that
Let , and . Then, is an mapping, and
It follows from the compactness of K and (A5) that
where . As , we have , and thus , which contradicts the assumption that . Thus, is continuous as a function of . We complete the proof of Theorem 2. □
Proof of Theorem 3.
If , then for any , there exists a norm and an mapping such that
Thus, for a given , we can find such that
For a compact set , there is a finite collection such that
Therefore, for , there exists such that , and
Thus, for any , there exists such that
which implies that . By the arbitrariness of , the set K is totally bounded.
If K is totally bounded, then for , we can find a minimal -covering and a suitable such that
where . We only consider the case since for any . Set the points in satisfying
and the continuous piecewise linear function satisfying
Thus, it follows from (A6) that for ,
which implies that satisfies the condition. Therefore, we have
and thus . The proof of Theorem 3 is completed. □
Proof of Theorem 4.
The proof is similar to that of Theorem 3. For any fixed , there exist and such that we can find a norm in and an mapping satisfying
Since is compact, there is a finite collection , which is a -covering of . Then, for any , we can find , such that
Thus, for any , we can find and , such that
which implies that . The proof of Theorem 4 is completed. □
Appendix B. Proofs of Section 3
Appendix B.1. Proofs of Theorem 5 and Corollary 1
Proof of Theorem 5.
We begin with
Choose a number such that . Let be an n-dimensional linear subspace of X that satisfies the inequality
Then, for every , there is an element in such that
Denote the set of all such elements as
Next, fix such that
Then, for ,
Thus, we have
In addition, denote the unit ball in by , which is defined by
Define the mapping such that
Then, for ,
Thus, satisfies the condition, and .
For and (A7), we have
Thus,
Let , then for any ,
By Theorem 3 as , we complete the proof of Theorem 5. □
Proof of Corollary 1.
For a compact set , it is known from [] that the sequence is decreasing and tends to zero. Denote
For , it follows from Theorem 5 that
Then, it is clear that Corollary 1 holds true. □
Appendix B.2. Proofs of Theorem 6
To prove Theorem 6, we use a result from [].
Lemma A4.
For any , , and any compact set with , the following inequalities hold
Proof of Theorem 6.
By the definition of , there is an and a mapping satisfying
where, for any , the mapping satisfies
Then, we have
Thus, satisfies the -Hölder condition, and is an mapping.
Therefore, it is known from (A8) and the definition of Hölder widths that
Taking , we obtain
Thus, by Lemma A4, we complete the proof of Theorem 6. □
Appendix C. Proofs of Section 4
Appendix C.1. Proof of Theorem 7
To prove Theorem 7, we recall a result from [].
Proposition A1
([]). If Φ satisfies the condition and Ψ satisfies the condition, then the composition satisfies the condition. In addition, if , then satisfies the condition.
Proof of Theorem 7.
Note that entropy numbers and Hölder widths are invariant in terms of translation, that is, for any ,
We only need to consider the compact set .
For any , it is known from the definition of Hölder widths that there is an element satisfying
Thus, we could set .
Let and the set , satisfying, for any , that there is an element such that
Next, we split the unit ball into non-overlapping open balls with side length . Denote by the center of . Let be the mapping such that
It is clear that , satisfies the condition, and , satisfies the condition. Indeed, for a constant , the function on is non-negative and has maximum 1. Set for any . Then, we have
which implies that , and thus
So, , satisfies the condition. It follows from Proposition A1 that satisfies the condition. Denote by . Let the mapping satisfy
Then, we prove that satisfies the condition and .
It is known from (A9) that if , then , and if , then . Thus, . Moreover, for any , we consider the following three cases.
Case 1: If , then .
Case 2: If and , then there is a such that . Thus,
Case 3: If , then there exist two numbers , such that and . Therefore,
We divide it into the following cases.
Case 3.1: If , it follows from satisfying the condition that
Case 3.2: If , we have
Due to the arbitrariness of and , we can assume that . Then,
where the last inequality uses (A10). Otherwise,
Thus, by combining (A11) with (A12), we have
Therefore, satisfies the condition, where .
Then, we have
Let and . We obtain
We complete the proof of Theorem 7. □
Appendix C.2. Proof of Theorem 8
To prove Theorem 8, we recall some definitions and lemmas.
An ε-packing of K is a collection such that, for any two distinct elements, there is
The maximal ε-packing number is the cardinality of the largest ε-packing of K.
It is known ([], Chapter 15) that , and Lemma A5 holds.
Lemma A5
([]). For the ball and , we have
and
Lemma A6.
If , then
Specifically, if , then
where is an m-dimensional unit ball of X.
Proof.
For , there exists an mapping and a norm such that can approximate K with an accuracy of . That is, for any and ,
Then, we consider a collection such that is a maximal -packing of . By the definition of the maximal packing number and Hölder widths, for any , we obtain
and thus
In addition, if we add an element , then the set is not an -packing of . Therefore, there exists a satisfying
which implies that
By the definition of the maximal packing number and (A16),
Thus, by (A13), we have
It follows from (A17) that is a -covering of K. Therefore,
Then, we obtain
which means that (A15) holds true.
Proof of Theorem 8.
By Lemma A6 with , we have
where is a constant depending on , and q. It follows from the definitions of the minimal covering number and entropy number that
If we take , then
Thus, for sufficiently large n, we obtain . By using (A19) and , we obtain
We complete the proof of Theorem 8. □
Appendix C.3. Proof of Theorem 9
To prove Theorem 9, we need the following lemma.
Lemma A7.
Suppose that the sequence of real numbers is decreasing to zero. Moreover, if
and there exist and satisfying
then
Proof.
By using Lemma A6 with , we have
Then, by the definition of the entropy number and (A20), we obtain
The proof of Lemma A7 is completed. □
Proof of Theorem 9.
We prove all statements of the theorem by contradiction.
To prove Theorem 9 (i), we assume that (7) is false, meaning there exists a strictly increasing sequence of integers such that
Thus, we have
Set . It is known that is decreasing to zero as . Therefore, by Lemma A7,
Thus,
that is,
It is known that for sufficiently large k, we have
Then, by (A21), we have
Thus, by combining (A22) with (A24), we obtain
where .
Next, we use the property of . When , is decreasing on ; when , is increasing on . Then, we divide it into the following three cases.
Case 1: . For sufficiently large k, it follows from (A25) and that
Therefore, , which implies that , contradicting (A23).
Case 2: . For sufficiently large k and , we have
Then, it follows from (A25) that
which also implies that , contradicting (A23).
Case 3: . For sufficiently large k and , by (A23) and (A24), we obtain
Thus, by combining these results with (A25) and multiplying both sides by , we obtain
which implies that
Therefore, as , which contradicts (A23). We complete the proof of Theorem 9 (i).
To prove Theorem 9 (ii), we assume that (8) is false. Then, there is an increasing sequence of satisfying, for ,
Set and
Then by Lemma A7, we have
When , the function is decreasing. For sufficiently large k, it follows from (A23) and that
Moreover, it is known that
By combining (A27) with (A28), we obtain
Thus,
which implies that as , contradicting (A26). We complete the proof of Theorem 9 (ii).
Finally, we prove Theorem 9 (iii). The proof is similar to those above. It is known from Corollary 1 that if
then
By using Lemma A7 with and , we have
Taking the logarithm on both sides of (A29), and using the fact that as , we have
which implies that
Thus,
Therefore, we obtain
which contradicts as . The proof of Theorem 9 is completed. □
Appendix D. Proofs of Section 5
Appendix D.1. Proof of Theorem 10
Proof of Theorem 10.
The proofs are similar to those of Theorem 9. To prove Theorem 10 (i), we assume that (11) is false, that is, there is an increasing sequence of satisfying
Thus, we have
It follows from Lemma A7 with that
Recall that for sufficiently large k,
It is known from (A30) and that there exists a constant such that
Thus, by combining (A31) with (A33), we obtain
where . Then, we divide it into the following two cases.
Case 1: . For sufficiently large k and , we have
It follows from (A34) and that
Therefore, , which implies that , contradicting (A32).
Case 2: . For sufficiently large k and , by (A32) and (A33), we obtain
Thus, by combining these results with (A34) and multiplying both sides by , we obtain
which implies that
Case 2.2: If , it follows from (A35) that
Therefore, as , which contradicts (A32). We complete the proof of Theorem 10 (i).
To prove Theorem 10 (ii), we assume that (12) is false. Then, there is an increasing sequence of satisfying, for ,
Set and
Then, by Lemma A7, we have
Appendix D.2. Proof of Theorem 15
Proof of Theorem 15.
By Theorem 9 (i), it is clear that
The proof of the upper bound is similar to the proof in []. We give a detailed proof, using the method from the proof of Theorem 7. For , define as
Set . We could divide the unit ball into non-overlapping open balls with side length , non-overlapping open balls with side length , ⋯, or non-overlapping open balls with side length , where and
Hence, there is a sequence of non-overlapping open balls with side length ,
Let be a mapping such that
and the mapping be such that
It is known from the proof of Theorem 7 and (A38) that satisfies the condition; thus, it satisfies the condition. Moreover, , .
For , by the decreasing property of , we have
and
Thus,
The proof of Theorem 15 is completed. □
References
- Kolmogoroff, A. Über die beste Annäherung von Funktionen einer gegebenen Funktionenklasse. Ann. Math. 1936, 37, 107–110. [Google Scholar] [CrossRef]
- Pinkus, A. n-Widths in Approximation Theory; Springer Science & Business Media: Berlin, Germany, 2012. [Google Scholar]
- Lorentz, G.G.; Golitschek, M.; Makovoz, Y. Constructive Approximation: Advanced Problems; Springer: Berlin, Germany, 1996. [Google Scholar]
- Fang, G.; Ye, P. Probabilistic and average linear widths of Sobolev space with Gaussian measure. J. Complex. 2003, 19, 73–84. [Google Scholar]
- Fang, G.; Ye, P. Probabilistic and average linear widths of Sobolev space with Gaussian measure in L∞-Norm. Constr. Approx. 2004, 20, 159–172. [Google Scholar]
- Duan, L.; Ye, P. Exact asymptotic orders of various randomized widths on Besov classes. Commun. Pure Appl. Anal. 2020, 19, 3957–3971. [Google Scholar] [CrossRef]
- Duan, L.; Ye, P. Randomized approximation numbers on Besov classes with mixed smoothness. Int. J. Wavelets Multiresolut. Inf. Process. 2020, 18, 2050023. [Google Scholar] [CrossRef]
- Liu, Y.; Li, X.; Li, H. n-Widths of Multivariate Sobolev Spaces with Common Smoothness in Probabilistic and Average Settings in the Sq Norm. Axioms 2023, 12, 698. [Google Scholar] [CrossRef]
- Liu, Y.; Li, H.; Li, X. Approximation Characteristics of Gel’fand Type in Multivariate Sobolev Spaces with Mixed Derivative Equipped with Gaussian Measure. Axioms 2023, 12, 804. [Google Scholar] [CrossRef]
- Wu, R.; Liu, Y.; Li, H. Probabilistic and Average Gel’fand Widths of Sobolev Space Equipped with Gaussian Measure in the Sq-Norm. Axioms 2024, 13, 492. [Google Scholar] [CrossRef]
- Liu, Y.; Lu, M. Approximation problems on the smoothness classes. Acta Math. Sci. 2024, 44, 1721–1734. [Google Scholar] [CrossRef]
- DeVore, R.; Howard, R.; Micchelli, C. Optimal nonlinear approximation. Manuscr. Math. 1989, 63, 469–478. [Google Scholar] [CrossRef]
- DeVore, R.; Hanin, B.; Petrova, G. Neural network approximation. Acta Numer. 2021, 30, 327–444. [Google Scholar] [CrossRef]
- Petrova, G.; Wojtaszczyk, P. Limitations on approximation by deep and shallow neural networks. J. Mach. Learn. Res. 2023, 24, 1–38. [Google Scholar]
- DeVore, R.; Kyriazis, G.; Leviatan, D.; Tichomirov, V. Wavelet compression and nonlinear-widths. Adv. Comput. Math. 1993, 1, 197–214. [Google Scholar] [CrossRef]
- Temlyakov, V. Nonlinear Kolmogorov widths. Math. Notes 1998, 63, 785–795. [Google Scholar] [CrossRef]
- Cohen, A.; DeVore, R.; Petrova, G.; Wojtaszczyk, P. Optimal stable nonlinear approximation. Found. Comput. Math. 2022, 22, 607–648. [Google Scholar] [CrossRef]
- Petrova, G.; Wojtaszczyk, P. Lipschitz widths. Constr. Approx. 2023, 57, 759–805. [Google Scholar] [CrossRef]
- Petrova, G.; Wojtaszczyk, P. On the entropy numbers and the Kolmogorov widths. arXiv 2022, arXiv:2203.00605. [Google Scholar]
- Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114. [Google Scholar] [CrossRef]
- Shen, Z.; Yang, H.; Zhang, S. Optimal approximation rate of ReLU networks in terms of width and depth. J. Math. Pures Appl. 2022, 157, 101–135. [Google Scholar] [CrossRef]
- Fiorenza, R. Hölder and Locally Hölder Continuous Functions, and Open Sets of Class Ck, Ck,λ; Birkhäuser: Basel, Switzerland, 2017. [Google Scholar]
- Opschoor, J.; Schwab, C.; Zech, J. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constr. Approx. 2021, 55, 537–582. [Google Scholar] [CrossRef]
- Yang, Y.; Zhou, D. Optimal Rates of Approximation by Shallow ReLUk Neural Networks and Applications to Nonparametric Regression. Constr. Approx. 2024, 1–32. [Google Scholar]
- Lee, M. Mathematical Analysis and Performance Evaluation of the GELU Activation Function in Deep Learning. J. Math. 2023, 2023, 4229924. [Google Scholar] [CrossRef]
- Forti, M.; Grazzini, M.; Nistri, P.; Pancioni, L. Generalized Lyapunov approach for convergence of neural networks with discontinuous or non-Lipschitz activations. Phys. D 2006, 214, 88–99. [Google Scholar] [CrossRef]
- Gavalda, R.; Siegelmann, H. Discontinuities in recurrent neural networks. Neural Comput. 1999, 11, 715–745. [Google Scholar] [CrossRef]
- Tatar, N. Hölder continuous activation functions in neural networks. Adv. Differ. Equ. Control Process. 2015, 15, 93–106. [Google Scholar]
- Carl, B. Entropy numbers, s-numbers, and eigenvalue problems. J. Funct. Anal. 1981, 41, 290–306. [Google Scholar] [CrossRef]
- Konyagin, S.; Temlyakov, V. The Entropy in Learning Theory. Error Estimates. Constr. Approx. 2007, 25, 1–27. [Google Scholar] [CrossRef]
- Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019. [Google Scholar]
- Donoho, D.L. Compressed sensing. IEEE Trans. Inform. Theory 2006, 52, 1289–1306. [Google Scholar] [CrossRef]
- Siegel, J.W. Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. J. Mach. Learn. Res. 2023, 24, 1–52. [Google Scholar]
- Lu, J.; Shen, Z.; Yang, H.; Zhang, S. Deep network approximation for smooth functions. SIAM J. Math. Anal. 2021, 53, 5465–5506. [Google Scholar] [CrossRef]
- Birman, M.; Solomyak, M. Piecewise polynomial approximations of functions of the class . Mat. Sb. 1967, 73, 331–355. (In Russian) [Google Scholar]
- DeVore, R.; Sharpley, R. Besov spaces on domains in . Trans. Am. Math. Soc. 1993, 335, 843–864. [Google Scholar]
- Mazzucato, A. Besov-Morrey spaces: Function space theory and applications to non-linear PDE. Trans. Am. Math. Soc. 2003, 355, 1297–1364. [Google Scholar] [CrossRef]
- Garnett, J.; Le, T.; Meyer, Y.; Vese, A. Image decompositions using bounded variation and generalized homogeneous Besov spaces. Appl. Comput. Harmon. Anal. 2007, 23, 25–56. [Google Scholar] [CrossRef]
- Marinucci, D.; Pietrobon, D.; Balbi, A.; Baldi, P.; Cabella, P.; Kerkyacharian, G.; Natoli, P.; Picard, D.; Vittorio, N. Spherical needlets for cosmic microwave background data analysis. Mon. Not. R. Astron. Soc. 2008, 383, 539–545. [Google Scholar] [CrossRef]
- Dai, F.; Xu, Y. Approximation Theory and Harmonic Analysis on Spheres and Balls; Springer Monographs in Mathematics; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
- Feng, H.; Huang, S.; Zhou, D.X. Generalization analysis of CNNs for classification on spheres. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 6200–6213. [Google Scholar] [CrossRef] [PubMed]
- Kushpel, A.; Tozoni, S. Entropy numbers of Sobolev and Besov classes on homogeneous spaces. In Advances in Analysis; World Scientific Publishing: Hackensack, NJ, USA, 2005; pp. 89–98. [Google Scholar]
- Zhou, D.X. Theory of deep convolutional neural networks: Downsampling. Neural Netw. 2020, 124, 319–327. [Google Scholar] [CrossRef]
- Zhou, D.X. Universality of deep convolutional neural networks. Appl. Comput. Harmon. Anal. 2020, 48, 787–794. [Google Scholar] [CrossRef]
- Mao, T.; Shi, Z.; Zhou, D.X. Theory of deep convolutional neural networks III: Approximating radial functions. Neural Netw. 2021, 144, 778–790. [Google Scholar] [CrossRef]
- Kühn, T. Entropy Numbers of General Diagonal Operators. Rev. Mat. Complut. 2005, 18, 479–491. [Google Scholar] [CrossRef]
- Carl, B.; Stephani, I. Entropy, Compactness and the Approximation of Operators; Cambridge University Press: Cambridge, UK, 1990. [Google Scholar]
- Wojtaszczyk, P. Banach Spaces for Analysts; Cambridge University Press: Cambridge, UK, 1991. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).