Abstract
We closely examine a recently introduced notion of average smoothness, which defines weak and strong average-Lipschitz seminorms for real-valued functions on general metric spaces. Specializing to the standard metric on the real line, we compare these notions to bounded variation (BV) and discover that the weak notion is strictly weaker than BV while the strong notion is strictly stronger. Along the way, we discover that the weak average-smoothness class is also considerably larger in a certain combinatorial sense, which is made precise by the fat-shattering dimension.
1. Introduction
A function $f:[0,1]\to\mathbb{R}$ is $L$-Lipschitz if $|f(x)-f(y)|\le L|x-y|$ for all $x,y\in[0,1]$, and $\|f\|_{\operatorname{Lip}}$ is the smallest $L$ for which this holds. If $f$ has an integrable derivative, its variation is given by $V(f)=\int_0^1|f'(x)|\,\mathrm{d}x$ (the more general definition is given in (2)). Since $|f'(x)|\le\|f\|_{\operatorname{Lip}}$ for all $x$, we have the obvious relation $V(f)\le\|f\|_{\operatorname{Lip}}$. No reverse inequality is possible: since for monotone $f$ we have $V(f)=|f(1)-f(0)|$ [1], a function whose value increases from $0$ to a small $\varepsilon>0$ with a sharp "jump" in the middle can have an arbitrarily large $\|f\|_{\operatorname{Lip}}$ and an arbitrarily small $V(f)$.
Motivated by questions in machine learning and statistics, Ashlagi et al. [2] introduced two notions of "average Lipschitz" smoothness in general metric probability spaces: a weak one and a strong one (follow-up works extended these results to average Hölder smoothness [3,4]). For the special case of the metric space $[0,1]$ equipped with the standard metric and the uniform distribution $U$, their definitions are as follows. Both notions rely on the local slope of $f$ at a point $x$, which is defined (and denoted) as follows:
$$\Lambda_f(x)\ :=\ \sup_{y\in[0,1]\setminus\{x\}}\frac{|f(x)-f(y)|}{|x-y|}.\qquad(1)$$
The strong and weak average smoothness of $f$ are defined, respectively, by
$$\overline{\Lambda}_f\ :=\ \mathbb{E}\big[\Lambda_f(X)\big]\qquad\text{and}\qquad\widetilde{\Lambda}_f\ :=\ \big\|\Lambda_f(X)\big\|_{1,\infty},$$
where $X$ is a random variable distributed according to $U$ on $[0,1]$, $\mathbb{E}[\cdot]$ is the usual expectation, and $\|Z\|_{1,\infty}$ is the weak $L_1$ norm of the random variable $Z$:
$$\|Z\|_{1,\infty}\ :=\ \sup_{t>0}\,t\,\mathbb{P}\big(|Z|>t\big).$$
Both $\overline{\Lambda}_{(\cdot)}$ and $\widetilde{\Lambda}_{(\cdot)}$ satisfy the homogeneity axiom of seminorms (meaning that $\overline{\Lambda}_{cf}=|c|\,\overline{\Lambda}_f$ and $\widetilde{\Lambda}_{cf}=|c|\,\widetilde{\Lambda}_f$ for all $c\in\mathbb{R}$), and $\overline{\Lambda}_{(\cdot)}$ additionally satisfies the triangle inequality and hence is a true seminorm. The weak norm satisfies the weaker inequality $\|Y+Z\|_{1,\infty}\le2\big(\|Y\|_{1,\infty}+\|Z\|_{1,\infty}\big)$ [5], which $\widetilde{\Lambda}_{(\cdot)}$ also inherits.
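As a quick illustration of the gap between the two notions (a routine computation, included here for concreteness; the same function reappears in Section 2), consider the step function $f=\mathbb{1}_{[1/2,1]}$. Its local slope is
$$\Lambda_f(x)\ =\ \frac{1}{|x-1/2|}\qquad\text{for }x\ne\tfrac12,$$
so $\overline{\Lambda}_f=\mathbb{E}\big[|X-\tfrac12|^{-1}\big]=\infty$, whereas
$$\mathbb{P}\big(\Lambda_f(X)>t\big)\ =\ \mathbb{P}\big(|X-\tfrac12|<\tfrac1t\big)\ =\ \min\{2/t,\,1\},\qquad\text{so}\qquad\widetilde{\Lambda}_f\ =\ \sup_{t>0}\,t\,\min\{2/t,\,1\}\ =\ 2.$$
Thus this $f$ has finite weak but infinite strong average smoothness.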
We now recall the definition of the variation of $f:[0,1]\to\mathbb{R}$:
$$V_a^b(f)\ :=\ \sup_{n\in\mathbb{N}}\ \sup_{a\le x_0<x_1<\dots<x_n\le b}\ \sum_{i=1}^n\big|f(x_i)-f(x_{i-1})\big|\qquad(2)$$
(when $a=0$ and $b=1$, we omit these and write $V(f):=V_0^1(f)$), as well as the Lipschitz and bounded variation function classes:
$$\operatorname{Lip}\ :=\ \big\{f:[0,1]\to\mathbb{R}\ :\ \|f\|_{\operatorname{Lip}}<\infty\big\},\qquad\operatorname{BV}\ :=\ \big\{f:[0,1]\to\mathbb{R}\ :\ V(f)<\infty\big\}.$$
The discussion above implies the (well-known) strict containment
$$\operatorname{Lip}\ \subsetneq\ \operatorname{BV}.\qquad(3)$$
In addition, we define the strong and weak average smoothness classes
$$\overline{\operatorname{Lip}}\ :=\ \big\{f:[0,1]\to\mathbb{R}\ :\ \overline{\Lambda}_f<\infty\big\},\qquad\widetilde{\operatorname{Lip}}\ :=\ \big\{f:[0,1]\to\mathbb{R}\ :\ \widetilde{\Lambda}_f<\infty\big\}.$$
By Markov's inequality and the fact that the expectation is bounded by the supremum, we have
$$\widetilde{\Lambda}_f\ \le\ \overline{\Lambda}_f\ \le\ \|f\|_{\operatorname{Lip}},$$
whence
$$\operatorname{Lip}\ \subseteq\ \overline{\operatorname{Lip}}\ \subseteq\ \widetilde{\operatorname{Lip}};\qquad(4)$$
all of these containments were shown to be strict in [2]. The containments in (3) and (4) leave open the relation between $\operatorname{BV}$ and the average smoothness classes $\overline{\operatorname{Lip}}$ and $\widetilde{\operatorname{Lip}}$, which we resolve in this work:
Theorem 1.
$\overline{\operatorname{Lip}}\ \subsetneq\ \operatorname{BV}\ \subsetneq\ \widetilde{\operatorname{Lip}}$.
We also provide a quantitative, finitary relation between these classes:
Theorem 2.
For any $f:[0,1]\to\mathbb{R}$, we have
$$\widetilde{\Lambda}_f\ \le\ 2V(f)\ \le\ 2\,\overline{\Lambda}_f.$$
Finally, we recall the definition of the fat-shattering dimension, which is a combinatorial complexity measure of function classes of central importance in statistics, empirical processes, and machine learning [6,7]. Let $F$ be a collection of functions mapping $[0,1]$ to $[0,1]$. For $\gamma>0$, a set $S=\{x_1,\dots,x_m\}\subseteq[0,1]$ is said to be $\gamma$-shattered by $F$ if there is a witness function $r:S\to[0,1]$ such that for every labeling $b\in\{0,1\}^m$ there exists an $f\in F$ with
$$f(x_i)\ \ge\ r(x_i)+\gamma\ \ \text{whenever }b_i=1,\qquad\text{and}\qquad f(x_i)\ \le\ r(x_i)-\gamma\ \ \text{whenever }b_i=0.$$
The $\gamma$-fat-shattering dimension, denoted by $\operatorname{fat}_\gamma(F)$, is the size of the largest $\gamma$-shattered set (possibly $\infty$). It is known [8] that for the class $\operatorname{BV}_1:=\{f:[0,1]\to[0,1]:V(f)\le1\}$, we have $\operatorname{fat}_\gamma(\operatorname{BV}_1)=\Theta(1/\gamma)$. This same bound holds for the class of $1$-Lipschitz functions mapping $[0,1]$ to $[0,1]$.
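To make the definition concrete, here is the routine construction behind the lower bound for the Lipschitz class (included for illustration; it reappears, rescaled, in the proof of Theorem 3). For $\gamma\le\frac12$, take the points $x_i:=2\gamma i$ for $1\le i\le m:=\lfloor1/(2\gamma)\rfloor$ and the witness $r\equiv\frac12$. Given any labeling $b\in\{0,1\}^m$, the piecewise linear interpolation of the values
$$f(x_i)\ :=\ \tfrac12+\gamma\,(2b_i-1),\qquad1\le i\le m,$$
extended constantly outside $[x_1,x_m]$, changes by at most $2\gamma$ over each gap of length $2\gamma$, hence is $1$-Lipschitz and $[0,1]$-valued, and it realizes the labeling $b$ with margin $\gamma$. Thus $m=\Omega(1/\gamma)$ points are $\gamma$-shattered.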
Although the strong smoothness class has the same combinatorial complexity as the BV and Lipschitz classes, for weak average smoothness, this quantity turns out to be considerably greater:
Theorem 3.
For $L>0$, let
$$\widetilde{\mathcal F}_L\ :=\ \big\{f:[0,1]\to[0,1]\ :\ \widetilde{\Lambda}_f\le L\big\}\qquad\text{and}\qquad\overline{\mathcal F}_L\ :=\ \big\{f:[0,1]\to[0,1]\ :\ \overline{\Lambda}_f\le L\big\}.$$
Then, for all $0<\gamma\le\frac12$:
- 1. $\operatorname{fat}_\gamma\big(\widetilde{\mathcal F}_L\big)=\infty$ whenever $\gamma\le L/6$;
- 2. $\operatorname{fat}_\gamma\big(\overline{\mathcal F}_L\big)=\Theta(L/\gamma)$ for $\gamma\le L/2$.
Notation.
We write $[n]:=\{1,\dots,n\}$, $x\wedge y:=\min\{x,y\}$, $x\vee y:=\max\{x,y\}$, and use $\lambda$ to denote the Lebesgue measure (length) of sets in $\mathbb{R}$.
2. Proofs
We begin with a variant of the standard covering lemma.
Lemma 1.
For any sequence $S_1,\dots,S_n$ of closed segments in $\mathbb{R}$, there is a subsequence indexed by $I\subseteq[n]$ such that for all distinct $i,j\in I$ we have $S_i\cap S_j=\emptyset$ and
$$\lambda\Big(\bigcup_{i\in I}S_i\Big)\ \ge\ \frac12\,\lambda\Big(\bigcup_{k\in[n]}S_k\Big).$$
Proof.
We proceed by induction on $n$. Let $G=([n],E)$ denote the intersection graph of the $S_k$: the vertices correspond to the segments and $\{i,j\}\in E$ if $S_i\cap S_j\ne\emptyset$.
Suppose that $G$ contains a cycle, and let $S_{j_1},\dots,S_{j_m}$ be the segments in the cycle sorted by their right endpoint. Since the two cycle-neighbors of $S_{j_1}$ intersect it and have right endpoints at least as large as that of $S_{j_1}$, we have that both of them contain the right endpoint of $S_{j_1}$. If one of these neighbors also has a left endpoint no larger than that of $S_{j_1}$, then $S_{j_1}$ is contained in that neighbor. Otherwise, both neighbors have their left endpoints inside $S_{j_1}$ (and hence intersect each other), and the neighbor with the smaller right endpoint is contained in the union of $S_{j_1}$ and the other neighbor. Either way, we have found a segment that is completely covered by the other vertices of $G$. After removing it, we obtain a family of size $n-1$ with the same union $\bigcup_{k\in[n]}S_k$, so applying the inductive hypothesis on the segments in this family yields the desired result. If $G$ does not contain a cycle, then it is a forest and hence bipartite: $[n]=A\cup B$, where $A,B$ are disjoint and nonempty and no edge of $G$ joins two vertices of the same part. Clearly, $\bigcup_{k\in A}S_k\cup\bigcup_{k\in B}S_k=\bigcup_{k\in[n]}S_k$, and thus $\max\big\{\lambda\big(\bigcup_{k\in A}S_k\big),\lambda\big(\bigcup_{k\in B}S_k\big)\big\}\ge\frac12\lambda\big(\bigcup_{k\in[n]}S_k\big)$, so taking either $I=A$ or $I=B$ (which is possible since the segments inside each part are disjoint) yields the desired result. □
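To illustrate the two cases in the proof, consider an example of our own choosing: the segments $S_1=[0,2]$, $S_2=[1,4]$, $S_3=[3,5]$ have a path as their intersection graph (since $S_1\cap S_3=\emptyset$), so the bipartition $A=\{1,3\}$, $B=\{2\}$ applies, and $I=A$ works because $\lambda(S_1\cup S_3)=4\ge\frac12\,\lambda(S_1\cup S_2\cup S_3)=\frac52$. Adding $S_4=[1.5,3.5]$ creates a cycle; since $S_4\subseteq S_2\cup S_3$, it can be discarded without changing the union, returning us to the previous case.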
Next, we reduce the proof of Theorem 2 to the case of right-continuous monotone functions.
Lemma 2.
If the first inequality of Theorem 2, $\widetilde{\Lambda}_f\le2V(f)$, holds for every right-continuous monotone function $f:[0,1]\to\mathbb{R}$, then it holds for all $f:[0,1]\to\mathbb{R}$. Furthermore, both inequalities in Theorem 2 are tight.
Proof.
We begin by observing that we can restrict our attention to monotone functions, since $g(x):=V_0^x(f)$ is monotone and has the same variation as $f$, but $|f(x)-f(y)|\le V_{x\wedge y}^{x\vee y}(f)=|g(x)-g(y)|$ for all $x,y$, which means $\Lambda_f(x)\le\Lambda_g(x)$ for every $x$ and hence $\widetilde{\Lambda}_f\le\widetilde{\Lambda}_g$.
Thus, the inequality $\widetilde{\Lambda}_g\le2V(g)$ immediately implies that $\widetilde{\Lambda}_f\le\widetilde{\Lambda}_g\le2V(g)=2V(f)$. If $f$ is monotone, it can only have jump discontinuities. Let $I$ denote the set of right discontinuities of $f$, i.e., the points at which $f$ is not right-continuous. Note that since $f$ is monotone, $I$ is at most countable. Define the modified version $\tilde f$ of $f$ to be
$$\tilde f(x)\ :=\ \begin{cases}\lim_{y\downarrow x}f(y), & x\in I,\\ f(x), & \text{otherwise.}\end{cases}$$
Note that $\tilde f$ is monotone and right-continuous. It is not hard to see that $V(\tilde f)\le V(f)$ and that, provided $0\notin I$, we have $\Lambda_{\tilde f}(x)\ge\Lambda_f(x)$ for all $x\notin I$; since $I$ is countable and hence Lebesgue-null, this implies that $\widetilde{\Lambda}_f\le\widetilde{\Lambda}_{\tilde f}\le2V(\tilde f)\le2V(f)$ and allows us to restrict our discussion to right-continuous functions. If $0\in I$, then we can extend the domain of $f$ to $[-\varepsilon,1]$ for all $\varepsilon>0$, where $f(y):=f(0)$ for all $y\in[-\varepsilon,0)$. Denote the extended function by $f_\varepsilon$; then, since $V(f_\varepsilon)=V(f)$ and $\Lambda_{f_\varepsilon}(x)\ge\Lambda_f(x)$ for all $x\in[0,1]$, we can conclude, by applying the previous argument to $f_\varepsilon$ and letting $\varepsilon\to0$, that the bound holds for $f$ as well.
□
2.1. Proof of Theorem 2
We first show that $\widetilde{\Lambda}_f\le2V(f)$. We may assume without loss of generality that $V(f)<\infty$, since otherwise there is nothing to prove. We will use the notation $V_a^b(f)$ for the variation of $f$ when restricted to the segment $[a,b]$. Since $f$ is of bounded variation, the function $g(x):=V_0^x(f)$ is well defined for $x\in[0,1]$. By Lemma 2, we may assume without loss of generality that $f$ is right-continuous. Thus, $g$ is monotone and right-continuous and thus induces a Lebesgue–Stieltjes measure on $[0,1]$, which we denote by $\mu$; note that $\mu\big((a,b]\big)=g(b)-g(a)$ and $\mu([0,1])=V(f)$. We now define the maximal function $M:[0,1]\to[0,\infty]$ as follows:
$$M(x)\ :=\ \sup_{r>0}\,\max\left\{\frac{\mu\big([x,x+r]\big)}{r},\ \frac{\mu\big([x-r,x]\big)}{r}\right\},$$
where the segments are taken to be $[x,1]$ and $[0,x]$, respectively, whenever $x+r>1$ or $x-r<0$. A standard argument shows that $\{M>t\}$ is open for every $t>0$, whence $M$ is measurable.
We now observe that $\Lambda_f(x)\le M(x)$ everywhere in $[0,1]$. Indeed, if $y>x$, then $|f(x)-f(y)|\le V_x^y(f)\le\mu\big([x,y]\big)$ holds for the segment $[x,x+r]$ with $r:=y-x$, and hence $\frac{|f(x)-f(y)|}{|x-y|}\le\frac{\mu([x,x+r])}{r}\le M(x)$. The case of $y<x$ is completely analogous, whence $\Lambda_f(x)\le M(x)$. For $X$ uniformly distributed over $[0,1]$, we have $\mathbb{P}\big(\Lambda_f(X)>t\big)\le\mathbb{P}\big(M(X)>t\big)=\lambda\big(\{M>t\}\big)$, and showing $\widetilde{\Lambda}_f\le2V(f)$ reduces to bounding the latter probability by $2V(f)/t$.
We now closely follow the proof of Theorem 7.4 in [9] and bound $\lambda(\{M>t\})$ by bounding $\lambda(K)$ for arbitrary compact $K\subseteq\{M>t\}$. For $x\in K$, denote by $r_x$ some lengths such that the corresponding segment $S_x$ from the definition of $M$ (having $x$ as an endpoint and length at most $r_x$) satisfies $\mu(S_x)>t\,r_x$. Denote by $I_x$ the open interval obtained by slightly enlarging $S_x$ beyond its endpoints, chosen short enough that $\mu\big(\overline{I_x}\big)\ge\mu(S_x)>t\,\lambda(I_x)$ still holds (possible since the defining inequality is strict). Then, clearly, $K\subseteq\bigcup_{x\in K}I_x$. Since $K$ is compact, a finite cover by intervals $I_{x_1},\dots,I_{x_n}$ exists. By Lemma 1 applied to the closures $\overline{I_{x_1}},\dots,\overline{I_{x_n}}$, there exists $I\subseteq[n]$ such that for all distinct $i,j\in I$, we have $\overline{I_{x_i}}\cap\overline{I_{x_j}}=\emptyset$ and $\lambda\big(\bigcup_{i\in I}\overline{I_{x_i}}\big)\ge\frac12\lambda\big(\bigcup_{k\in[n]}\overline{I_{x_k}}\big)$. Finally, by the definition of the $r_x$'s, for each $x$, it holds that $\lambda(I_x)<\mu\big(\overline{I_x}\big)/t$. We can now write
$$\lambda(K)\ \le\ \lambda\Big(\bigcup_{k\in[n]}I_{x_k}\Big)\ \le\ 2\,\lambda\Big(\bigcup_{i\in I}\overline{I_{x_i}}\Big)\ =\ 2\sum_{i\in I}\lambda\big(I_{x_i}\big)\ \le\ \frac{2}{t}\sum_{i\in I}\mu\big(\overline{I_{x_i}}\big)\ \le\ \frac{2\,\mu([0,1])}{t}\ =\ \frac{2\,V(f)}{t},$$
where the last inequality holds since the intervals in $I$ are disjoint. Since $K$ was an arbitrary compact subset of the open set $\{M>t\}$ and the Lebesgue measure is inner regular, it immediately follows that $\lambda(\{M>t\})\le2V(f)/t$ for every $t>0$, whence $\widetilde{\Lambda}_f\le2V(f)$.
It remains to show that $V(f)\le\overline{\Lambda}_f$. Let us denote by $P=\{0=x_0<x_1<\dots<x_n=1\}$ a partition of $[0,1]$, and let $V_P(f):=\sum_{i=1}^n|f(x_i)-f(x_{i-1})|$ denote the variation of $f$ relative to $P$. It suffices to show that for any such partition $P$, we have $V_P(f)\le\overline{\Lambda}_f$. Now
$$\overline{\Lambda}_f\ =\ \int_0^1\Lambda_f(x)\,\mathrm{d}x\ =\ \sum_{i=1}^n\int_{x_{i-1}}^{x_i}\Lambda_f(x)\,\mathrm{d}x\ \ge\ \sum_{i=1}^n\int_{x_{i-1}}^{x_i}\max\left\{\frac{|f(x)-f(x_{i-1})|}{x-x_{i-1}},\frac{|f(x_i)-f(x)|}{x_i-x}\right\}\mathrm{d}x\ \ge\ \sum_{i=1}^n\big|f(x_i)-f(x_{i-1})\big|\ =\ V_P(f),$$
where the last inequality uses the mediant inequality $\max\{\frac as,\frac bu\}\ge\frac{a+b}{s+u}$ (for $a,b\ge0$ and $s,u>0$) together with the triangle inequality.
Finally, the tightness of the first claimed inequality is witnessed by the step function $\mathbb{1}_{[1/2,1]}$ and that of the second inequality by the identity function $f(x)=x$. □
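For completeness, here are the two tightness computations (both elementary). For the step function $f=\mathbb{1}_{[1/2,1]}$ we have $V(f)=1$ and, as computed in the Introduction, $\widetilde{\Lambda}_f=2$, so the first inequality holds with equality. For the identity $f(x)=x$ we have $\Lambda_f\equiv1$, whence $\overline{\Lambda}_f=1=V(f)$ and the second inequality holds with equality.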
2.2. Proof of Theorem 1
The claimed containments are immediate from Theorem 2 (indeed, $V(f)\le\overline{\Lambda}_f$ gives $\overline{\operatorname{Lip}}\subseteq\operatorname{BV}$, and $\widetilde{\Lambda}_f\le2V(f)$ gives $\operatorname{BV}\subseteq\widetilde{\operatorname{Lip}}$); only the separations remain to be shown. The first of these is obvious: the step function $\mathbb{1}_{[1/2,1]}$ has bounded variation but an infinite strong average Lipschitz seminorm [2] (Appendix I). We proceed with the second separation:
Lemma 3.
There exists an $f:[0,1]\to\mathbb{R}$ such that $\widetilde{\Lambda}_f<\infty$ but $V(f)=\infty$.
Proof.
Let $f$ be the piecewise linear function defined on the points $2^{-k}$, $k=0,1,2,\dots$, by
$$f\big(2^{-k}\big)\ :=\ \frac{1+(-1)^k}{2},$$
and extended to $[0,1]$ by linear interpolation, with $f(0):=0$.
Clearly, $V(f)=\infty$, since $f$ traverses the full interval $[0,1]$ between consecutive points $2^{-k}$ and $2^{-k+1}$. To bound $\widetilde{\Lambda}_f$, note that any $x\ne y$ witnessing $\frac{|f(x)-f(y)|}{|x-y|}>t$ also verify $|x-y|<1/t$, since $|f(x)-f(y)|\le1$. Let $J_k$ denote the interval $[2^{-k},2^{-k+1}]$, on which $f$ is linear with slope $\pm2^k$. If $\Lambda_f(x)>t$, then there is a $y$ such that $\frac{|f(x)-f(y)|}{|x-y|}>t$. Now, either $x$ or $y$ lies in $[0,2/t)$: the slope of the line connecting $x$ and $y$ lies between the smallest and the largest slopes of the segments meeting $[x\wedge y,x\vee y]$, so some $J_k$ with $2^k>t$, and hence with $J_k\subseteq(0,2/t)$, meets this interval. If $x\in[0,2/t)$, then certainly $x<3/t$. If, however, $y\in[0,2/t)$, then since $|x-y|<1/t$, we again have $x<3/t$. We conclude that $\Lambda_f(x)>t$ implies $x<3/t$ and hence $\mathbb{P}(\Lambda_f(X)>t)\le3/t$, i.e., $\widetilde{\Lambda}_f\le3$; this proves the claim. □
Remark.
Another function with this property is $f(x)=x\sin(1/x)$, extended by $f(0):=0$.
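The following sketch, included for the reader's convenience, indicates why $x\sin(1/x)$ has the stated properties. Writing $t_k:=\frac{2}{(2k+1)\pi}$ for the points where $\sin(1/x)=\pm1$, the values $f(t_k)=(-1)^k t_k$ alternate in sign, so
$$V(f)\ \ge\ \sum_{k}\big|f(t_k)-f(t_{k+1})\big|\ =\ \sum_{k}\big(t_k+t_{k+1}\big)\ =\ \infty.$$
On the other hand, $|f(z)|\le z$ and $|f'(z)|\le1+1/z$ for $z>0$; considering separately the cases $y\ge x/2$ and $y<x/2$, one checks that $\Lambda_f(x)\le C/x$ for an absolute constant $C$ (e.g., $C=3$ works), whence $\mathbb{P}\big(\Lambda_f(X)>t\big)\le C/t$ and $\widetilde{\Lambda}_f\le C<\infty$.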
2.3. Proof of Theorem 3
2.3.1. Proof That $\operatorname{fat}_\gamma\big(\widetilde{\mathcal F}_L\big)=\infty$ Whenever $\gamma\le L/6$
Consider the partition of $(0,1]$ into segments $\{J_k\}_{k\ge1}$ where $J_k:=(2^{-k},2^{-k+1}]$. Given an arbitrary labeling $b_0,b_1,b_2,\dots\in\{0,1\}$ of the points $2^{-k}$, $k=0,1,2,\dots$, we define $f(2^{-k}):=\frac12+\gamma(2b_k-1)$ and $f(0):=\frac12$. This specifies $f$ at all endpoints of the $J_k$. For the remaining points, we define $f$ by linear interpolation, i.e., $f$ is piecewise linear with slope of magnitude at most $2\gamma\cdot2^k$ in $J_k$. Similarly to Lemma 3, if $\frac{|f(x)-f(y)|}{|x-y|}>t$, then $|x-y|<2\gamma/t$, since $|f(x)-f(y)|\le2\gamma$. Now, suppose $\Lambda_f(x)>t$, i.e., there exists $y$ with $\frac{|f(x)-f(y)|}{|x-y|}>t$. This implies that either $x$ or $y$ lies in $[0,4\gamma/t)$ (for $x,y>0$, the slope of the line connecting $x$ and $y$ lies between the smallest and the largest slopes of the segments meeting $[x\wedge y,x\vee y]$, so some $J_k$ with $2\gamma\cdot2^k>t$, and hence with $J_k\subseteq(0,4\gamma/t)$, meets this interval; the case $y=0$ is trivial). If $x\in[0,4\gamma/t)$, then certainly $x<6\gamma/t$. If, however, $y\in[0,4\gamma/t)$, then since $|x-y|<2\gamma/t$, we again have $x<6\gamma/t$. Since $\Lambda_f(x)>t$ implies $x<6\gamma/t$, we can conclude that $\widetilde{\Lambda}_f\le6\gamma\le L$, i.e., $f\in\widetilde{\mathcal F}_L$ for every labeling. An immediate corollary is that $\widetilde{\mathcal F}_L$ $\gamma$-shatters the infinite set $\{2^{-k}:k=0,1,2,\dots\}$ with witness $r\equiv\frac12$ for $\gamma\le L/6$; this is even stronger than having arbitrarily large $\gamma$-shattered sets. Note that this is close to tight, since for $\gamma>L/2$, we cannot $\gamma$-shatter even a set of two points. Suppose $x_1<x_2$ and $\{x_1,x_2\}$ is $\gamma$-shattered with witnesses $r_1,r_2$; then for the labeling that requires $f$ to lie above its witness at the point with the larger witness value and below it at the other point, we have $|f(x_1)-f(x_2)|\ge2\gamma$, hence $\Lambda_f(x)\ge\frac{2\gamma}{x_2-x_1}$ for every $x\in(x_1,x_2)$ and therefore $\widetilde{\Lambda}_f\ge2\gamma>L$, which means $\{x_1,x_2\}$ is not $\gamma$-shattered by $\widetilde{\mathcal F}_L$. □
2.3.2. Proof That $\operatorname{fat}_\gamma\big(\overline{\mathcal F}_L\big)=\Theta(L/\gamma)$ for $\gamma\le L/2$
The upper bound follows immediately from Theorem 2: every $f\in\overline{\mathcal F}_L$ satisfies $V(f)\le\overline{\Lambda}_f\le L$, so $\overline{\mathcal F}_L$ is contained in the class of $[0,1]$-valued functions of variation at most $L$, whose $\gamma$-fat-shattering dimension is $O(L/\gamma)$ [8]. For the lower bound, take a $\frac{2\gamma}{L}$-packing of $[0,1]$, say $x_i:=\frac{2\gamma}{L}(i-1)$ for $1\le i\le m:=\lfloor\frac{L}{2\gamma}\rfloor+1$. For any labeling $b\in\{0,1\}^m$, consider the linear interpolation of the values $\frac12+\gamma(2b_i-1)$ at the points $x_i$ (extended constantly outside $[x_1,x_m]$), and observe that the interpolation $f$ satisfies $\Lambda_f(x)\le\frac{2\gamma}{2\gamma/L}=L$ everywhere. Hence $\overline{\Lambda}_f\le L$ and $f\in\overline{\mathcal F}_L$, so the $m=\Omega(L/\gamma)$ points are $\gamma$-shattered with witness $r\equiv\frac12$, yielding $\operatorname{fat}_\gamma\big(\overline{\mathcal F}_L\big)=\Omega(L/\gamma)$. □
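As a concrete instance of the lower bound (with illustrative values of our choosing): for $L=1$ and $\gamma=\frac1{10}$, the six points $0,0.2,0.4,0.6,0.8,1$ are $\frac1{10}$-shattered by $\overline{\mathcal F}_1$ with witness $r\equiv\frac12$, since for any labeling the piecewise linear function taking the values $0.4$ or $0.6$ at these points has slopes of magnitude at most $0.2/0.2=1$, hence $\overline{\Lambda}_f\le1$. This gives $\operatorname{fat}_{1/10}\big(\overline{\mathcal F}_1\big)\ge6$, in line with the $\Theta(L/\gamma)$ rate.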
Author Contributions
Conceptualization, A.E. and A.K.; validation, A.E. and A.K.; formal analysis, A.E. and A.K.; writing—review and editing, A.E. and A.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Appell, J.; Banaś, J.; Merentes, N. Bounded Variation and Around; De Gruyter Series in Nonlinear Analysis and Applications; De Gruyter: Berlin/Heidelberg, Germany, 2014; Volume 17, pp. x+476.
- Ashlagi, Y.; Gottlieb, L.; Kontorovich, A. Functions with average smoothness: Structure, algorithms, and learning. J. Mach. Learn. Res. 2024, 25, 117:1–117:54.
- Hanneke, S.; Kontorovich, A.; Kornowski, G. Efficient Agnostic Learning with Average Smoothness. In Proceedings of the International Conference on Algorithmic Learning Theory, La Jolla, CA, USA, 25–28 February 2024; Volume 237, pp. 719–731. Available online: https://proceedings.mlr.press/v237/hanneke24a.html (accessed on 22 August 2025).
- Kornowski, G.; Hanneke, S.; Kontorovich, A. Near-Optimal Learning with Average Hölder Smoothness. In Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023. Available online: https://proceedings.neurips.cc/paper_files/paper/2023/hash/42afce512806ab874b9f99ed9a08055e-Abstract-Conference.html (accessed on 22 August 2025).
- Hagelstein, P.A. Weak L1 norms of random sums. Proc. Am. Math. Soc. 2005, 133, 2327–2334.
- Alon, N.; Ben-David, S.; Cesa-Bianchi, N.; Haussler, D. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM 1997, 44, 615–631.
- Bartlett, P.L.; Long, P.M. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. Syst. Sci. 1998, 56, 174–190.
- Anthony, M.; Bartlett, P.L. Neural Network Learning: Theoretical Foundations; Cambridge University Press: Cambridge, UK, 1999; pp. xiv+389.
- Rudin, W. Real and Complex Analysis, 3rd ed.; McGraw-Hill, Inc.: Columbus, OH, USA, 1987.
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).