Article

Bounded Variation Separates Weak and Strong Average Lipschitz

Computer Science Department, Ben-Gurion University, Beer Sheva 84105, Israel
* Author to whom correspondence should be addressed.
Entropy 2025, 27(9), 974; https://doi.org/10.3390/e27090974
Submission received: 22 August 2025 / Revised: 16 September 2025 / Accepted: 17 September 2025 / Published: 18 September 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

We closely examine a recently introduced notion of average smoothness, which defines weak and strong average-Lipschitz seminorms for real-valued functions on general metric spaces. Specializing to the standard metric on the real line, we compare these notions to bounded variation (BV) and find that the weak notion is strictly weaker than BV, while the strong notion is strictly stronger. Along the way, we show that the weak average-smooth class is also considerably larger in a certain combinatorial sense, made precise by the fat-shattering dimension.

1. Introduction

A function $f : [0,1] \to \mathbb{R}$ is $L$-Lipschitz if $|f(x) - f(x')| \leq L|x - x'|$ for all $x, x' \in [0,1]$, and $\|f\|_{\mathrm{Lip}}$ is the smallest $L$ for which this holds. If $f$ has an integrable derivative, its variation $V(f)$ is given by $V(f) = \int_0^1 |f'(x)|\,dx$ (the more general definition is given in (2)). Since $|f'(x)| \leq \|f\|_{\mathrm{Lip}}$, we have the obvious relation $V(f) \leq \|f\|_{\mathrm{Lip}}$. No reverse inequality is possible: since for monotone $f$ we have $V(f) = |f(0) - f(1)|$ [1], a function whose value increases from $0$ to $\varepsilon$ with a sharp "jump" in the middle can have an arbitrarily large Lipschitz constant and arbitrarily small variation.
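For a concrete instance of the last point (an illustration of ours): fix $\varepsilon, \delta \in (0, 1/2)$ and let $f_{\varepsilon,\delta}$ be $0$ on $[0, 1/2]$, rise linearly to $\varepsilon$ on $[1/2, 1/2 + \delta]$, and remain constant thereafter. Then
$$V(f_{\varepsilon,\delta}) = \varepsilon, \qquad \|f_{\varepsilon,\delta}\|_{\mathrm{Lip}} = \frac{\varepsilon}{\delta},$$
so letting $\delta \to 0$ with $\varepsilon$ fixed makes the Lipschitz constant blow up while the variation stays $\varepsilon$.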
Motivated by questions in machine learning and statistics, Ashlagi et al. [2] introduced two notions of "average Lipschitz" in general metric probability spaces: a weak one and a strong one (follow-up works extended these results to average Hölder smoothness [3,4]). For the special case of the metric space $\Omega = [0,1]$ equipped with the standard metric $\rho(x, x') = |x - x'|$ and the uniform distribution $U$, their definitions are as follows. Both notions rely on the local slope of $f : [0,1] \to \mathbb{R}$ at a point $x$, which is defined (and denoted) as follows:
$$\Lambda_f(x) = \sup_{x' \in [0,1] \setminus \{x\}} \frac{|f(x) - f(x')|}{|x - x'|}, \qquad x \in [0,1]. \tag{1}$$
The strong and weak average smoothness of f are defined, respectively, by
$$\|f\|_S = \mathbb{E}\,\Lambda_f(X), \qquad \|f\|_W = W[\Lambda_f(X)] = \sup_{t > 0}\, t\, U\big(\{x \in \Omega : \Lambda_f(x) \geq t\}\big),$$
where $X$ is a random variable distributed according to $U$ on $[0,1]$, $\mathbb{E}$ is the usual expectation, and $W[Z]$ is the weak $L^1$ norm of the random variable $Z$:
$$W[Z] = \sup_{t > 0}\, t\, \mathbb{P}(|Z| \geq t).$$
Both $\|\cdot\|_S$ and $\|\cdot\|_W$ satisfy the homogeneity axiom of seminorms (meaning that $\|\alpha f\| = |\alpha| \cdot \|f\|$), and $\|\cdot\|_S$ additionally satisfies the triangle inequality and hence is a true seminorm. The weak $L^1$ norm satisfies only the weaker inequality $W[X + Y] \leq 2(W[X] + W[Y])$ [5], which $\|\cdot\|_W$ inherits.
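To make these definitions concrete, here is a minimal numerical sketch (ours, not from [2]); it approximates $\Lambda_f$, $\|f\|_S$, and $\|f\|_W$ on a uniform grid, under the assumption that the grid is fine enough for the suprema and the uniform measure to be well approximated:

import numpy as np

def local_slope(f_vals, xs):
    # Grid version of Lambda_f: for each x_i, the largest difference
    # quotient |f(x_i) - f(x_j)| / |x_i - x_j| over all j != i.
    num = np.abs(f_vals[:, None] - f_vals[None, :])
    den = np.abs(xs[:, None] - xs[None, :])
    np.fill_diagonal(den, np.inf)  # exclude x_j = x_i
    return (num / den).max(axis=1)

xs = np.linspace(0.0, 1.0, 2001)
f = xs**2                            # test function f(x) = x^2
lam = local_slope(f, xs)

strong = lam.mean()                  # ||f||_S = E[Lambda_f(X)], X ~ U[0,1]
ts = np.linspace(1e-3, lam.max(), 500)
weak = max(t * (lam >= t).mean() for t in ts)  # sup_t t P(Lambda_f >= t)

# For f(x) = x^2 one has Lambda_f(x) = 1 + x, so ||f||_S = 3/2 and
# ||f||_W = sup_t t * min(1, 2 - t) = 1; the printed values approximate these.
print(strong, weak)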
We now recall the definition of the variation of $f : [a,b] \to \mathbb{R}$:
$$V_a^b(f) = \sup_{a = x_0 < x_1 < x_2 < \cdots < x_n \leq b}\; \sum_{i=1}^n |f(x_i) - f(x_{i-1})| \tag{2}$$
(when $a = 0$ and $b = 1$, we omit these), as well as the Lipschitz and bounded variation function classes:
$$\mathrm{Lip} = \{f : [0,1] \to \mathbb{R} \;;\; \|f\|_{\mathrm{Lip}} < \infty\}, \qquad \mathrm{BV} = \{f : [0,1] \to \mathbb{R} \;;\; V(f) < \infty\}.$$
The discussion above implies the (well-known) strict containment
$$\mathrm{Lip} \subsetneq \mathrm{BV}. \tag{3}$$
In addition, we define the strong and weak average smoothness classes
$$\overline{\mathrm{Lip}}^{\,s} = \{f : [0,1] \to \mathbb{R} \;;\; \|f\|_S < \infty\}, \qquad \overline{\mathrm{Lip}}^{\,w} = \{f : [0,1] \to \mathbb{R} \;;\; \|f\|_W < \infty\}.$$
By Markov’s inequality and the fact that the expectation is bounded by the supremum, we have
$$\|f\|_W \leq \|f\|_S \leq \sup_{x \in \Omega} \Lambda_f(x) = \|f\|_{\mathrm{Lip}},$$
whence
$$\mathrm{Lip} \subsetneq \overline{\mathrm{Lip}}^{\,s} \subsetneq \overline{\mathrm{Lip}}^{\,w}; \tag{4}$$
all of these containments were shown to be strict in [2]. The containments in (3) and (4) leave open the relation between $\mathrm{BV}$ and $\overline{\mathrm{Lip}}^{\,s}, \overline{\mathrm{Lip}}^{\,w}$, which we resolve in this work:
Theorem 1. 
$$\overline{\mathrm{Lip}}^{\,s} \subsetneq \mathrm{BV} \subsetneq \overline{\mathrm{Lip}}^{\,w}.$$
We also provide a quantitative, finitary relation between these classes:
Theorem 2. 
For any $f : [0,1] \to \mathbb{R}$, we have $\frac{1}{2}\|f\|_W \leq V(f) \leq \|f\|_S$.
Finally, we recall the definition of the fat-shattering dimension, a combinatorial complexity measure of function classes of central importance in statistics, empirical processes, and machine learning [6,7]. Let $\mathcal{F}$ be a collection of functions mapping $[0,1]$ to $\mathbb{R}$. For $\gamma > 0$, a set $S = \{x_1, \ldots, x_m\} \subseteq [0,1]$ is said to be $\gamma$-shattered by $\mathcal{F}$ if
$$\sup_{r \in \mathbb{R}^m}\; \min_{y \in \{-1,1\}^m}\; \sup_{f \in \mathcal{F}}\; \min_{i \in [m]}\; y_i\,(f(x_i) - r_i) \;\geq\; \gamma.$$
The $\gamma$-fat-shattering dimension, denoted by $\mathrm{fat}_\gamma(\mathcal{F})$, is the size of the largest $\gamma$-shattered set (possibly $\infty$). It is known [8] that for $\mathcal{F} = \{f : [0,1] \to \mathbb{R} \;;\; V(f) \leq L\}$, we have $\mathrm{fat}_\gamma(\mathcal{F}) = 1 + \left\lfloor \frac{L}{2\gamma} \right\rfloor$. The same bound holds for $\mathcal{F} = \{f : [0,1] \to \mathbb{R} \;;\; \|f\|_{\mathrm{Lip}} \leq L\}$.
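To unpack the definition in the smallest nontrivial case (an illustration of ours): a pair $S = \{x_1, x_2\}$ is $\gamma$-shattered if there are thresholds $r_1, r_2 \in \mathbb{R}$ such that for each of the four sign patterns $(y_1, y_2) \in \{-1,1\}^2$, some $f \in \mathcal{F}$ satisfies $f(x_i) \geq r_i + \gamma$ whenever $y_i = 1$ and $f(x_i) \leq r_i - \gamma$ whenever $y_i = -1$.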
Although the strong smoothness class has the same combinatorial complexity as the BV and Lipschitz classes, for weak average smoothness, this quantity turns out to be considerably greater:
Theorem 3. 
For $L > 0$, let $\mathcal{F}_W = \{f : [0,1] \to \mathbb{R} \;;\; \|f\|_W \leq L\}$ and $\mathcal{F}_S = \{f : [0,1] \to \mathbb{R} \;;\; \|f\|_S \leq L\}$. Then,
1. $\mathrm{fat}_\gamma(\mathcal{F}_W) = \infty$ whenever $\gamma \leq L/6$;
2. $\mathrm{fat}_\gamma(\mathcal{F}_S) = 1 + \left\lfloor \frac{L}{2\gamma} \right\rfloor$ for $\gamma > 0$.
Notation.
We write $[n] := \{1, \ldots, n\}$ and use $m(\cdot)$ to denote the Lebesgue measure (length) of subsets of $\mathbb{R}$.

2. Proofs

We begin with a variant of the standard covering lemma.
Lemma 1. 
For any sequence $s_1, \ldots, s_n$ of closed segments in $\mathbb{R}$, there is a subsequence indexed by $I \subseteq [n]$ such that for all distinct $i, j \in I$ we have $s_i \cap s_j = \emptyset$ and $\sum_{i \in I} m(s_i) \geq \frac{1}{2}\, m\!\left(\bigcup_{i=1}^n s_i\right)$.
Proof. 
We proceed by induction on $n$. Let $G = ([n], E)$ denote the intersection graph of the $s_i$: the vertices correspond to the segments, and $(i, j) \in E$ if $s_i \cap s_j \neq \emptyset$.
Suppose that $G$ contains a cycle, and let $s_1 = [a_1, b_1], \ldots, s_k = [a_k, b_k]$ be the segments in the cycle, sorted by their right endpoints so that $b_1 \leq \cdots \leq b_k$. Since $s_1 \cap s_k \neq \emptyset$, we have $a_k \leq b_1$. If $a_{k-1} \geq a_1$, then $s_{k-1} \subseteq s_1 \cup s_k$. Otherwise, $a_{k-1} < a_1$ and $s_1 \subseteq s_{k-1}$. Either way, we have found a segment that is completely covered by the other vertices of $G$. After removing it, we obtain $I' \subseteq [n]$ of size $n - 1$ with $\bigcup_{i \in I'} s_i = \bigcup_{i=1}^n s_i$, so applying the inductive hypothesis to the segments indexed by $I'$ yields the desired result. If $G$ does not contain a cycle, then $G$ is a forest and hence bipartite, with parts $A, B \subseteq [n]$ disjoint and nonempty. Clearly, $m\!\left(\bigcup_{i=1}^n s_i\right) \leq m\!\left(\bigcup_{i \in A} s_i\right) + m\!\left(\bigcup_{i \in B} s_i\right)$, and thus $\max\!\left\{ m\!\left(\bigcup_{i \in A} s_i\right),\; m\!\left(\bigcup_{i \in B} s_i\right) \right\} \geq \frac{1}{2}\, m\!\left(\bigcup_{i=1}^n s_i\right)$, so taking either $I = A$ or $I = B$ (which is possible since the segments inside each part are pairwise disjoint) yields the desired result. □
Next, we reduce the proof of Theorem 2 to the case of right-continuous monotone functions.
Lemma 2. 
If $\|f\|_W \leq 2V(f)$ holds for every right-continuous monotone function $f : [0,1] \to \mathbb{R}$, then the bound holds for all $f : [0,1] \to \mathbb{R}$. Furthermore, both inequalities of Theorem 2 are tight.
Proof. 
We begin by observing that we can restrict our attention to monotone functions, since $T_f(x) = V_f([0, x])$ is monotone and has the same variation as $f$, but
$$\Lambda_{T_f}(x) = \sup_{x' \neq x} \frac{|T_f(x) - T_f(x')|}{|x - x'|} \;\geq\; \sup_{x' \neq x} \frac{|f(x) - f(x')|}{|x - x'|} = \Lambda_f(x),$$
which means $\|T_f\|_W \geq \|f\|_W$. Thus, the inequality $\|T_f\|_W \leq 2V(T_f)$ immediately implies that $\|f\|_W \leq \|T_f\|_W \leq 2V(T_f) = 2V(f)$. If $f$ is monotone, it can only have jump discontinuities. Let $I \subseteq [0,1]$ denote the set of points at which $f$ is not right-continuous; since $f$ is monotone, $I$ is at most countable. Define the modified version of $f$ by
$$\tilde{f}(x) = \begin{cases} f(x), & x \notin I, \\ \lim_{\varepsilon \to 0^+} f(x + \varepsilon), & x \in I. \end{cases}$$
Note that $\tilde{f}$ is monotone and right-continuous. It is not hard to see that if $0 \notin I$, then $V(\tilde{f}) = V(f)$ and $\Lambda_{\tilde{f}}(x) = \Lambda_f(x)$ for all $x \notin I$, which implies that $\|f\|_W = \|\tilde{f}\|_W$ and allows us to restrict our discussion to right-continuous functions. If $0 \in I$, then we can extend the domain of $\tilde{f}$ to $[-\varepsilon, 1]$ for all $\varepsilon > 0$ by setting $\tilde{f}(x) = f(0)$ for all $x < 0$. Denoting the extended function by $\tilde{f}_\varepsilon$, we have $\|f\|_W = \lim_{\varepsilon \to 0} \|\tilde{f}_\varepsilon\|_W$ and $V(\tilde{f}_\varepsilon) = V(f)$ for all $\varepsilon > 0$, and we conclude that
$$\|f\|_W = \lim_{\varepsilon \to 0} \|\tilde{f}_\varepsilon\|_W \;\leq\; \lim_{\varepsilon \to 0} 2V(\tilde{f}_\varepsilon) = 2V(f). \qquad \square$$

2.1. Proof of Theorem 2

We first show that $\|f\|_W \leq 2V(f)$. We may assume without loss of generality that $V(f) < \infty$. We will use the notation $V_f([a, b])$ for the variation of $f$ restricted to the segment $[a, b]$. Since $f$ is of bounded variation, the function $T_f(x) = V_f([0, x])$ is well defined for $x > 0$. By Lemma 2, we may assume without loss of generality that $f$ is right-continuous. Thus, $T_f : [0,1] \to \mathbb{R}$ is monotone and right-continuous, and therefore induces a Lebesgue–Stieltjes measure on $[0,1]$, which we denote by $\mu_f$. We now define the maximal function $M_f : [0,1] \to \mathbb{R}$ as follows:
$$M_f(x) = \sup_{r_1, r_2 > 0} \frac{\mu_f([x - r_1,\, x + r_2])}{r_1 + r_2} = \sup_{r_1, r_2 > 0} \frac{V_f([x - r_1,\, x + r_2])}{r_1 + r_2},$$
where the segments $[a, x], [x, b]$ are taken to be $[0, x], [x, 1]$, respectively, whenever $a = x - r_1 < 0$ or $b = x + r_2 > 1$. A standard argument shows that $M_f^{-1}((t, \infty))$ is open, whence $M_f$ is measurable.
We now observe that $M_f \geq \Lambda_f$ everywhere on $[0,1]$. Indeed, if $x' > x$, then
$$M_f(x) \;\geq\; \frac{V_f([x - \varepsilon,\, x'])}{\varepsilon + (x' - x)} \;\geq\; \frac{|f(x) - f(x')|}{x' - x + \varepsilon}$$
holds for every $\varepsilon > 0$, and hence $M_f(x) \geq \sup_{x' > x} \frac{|f(x) - f(x')|}{x' - x}$. The case $x' < x$ is completely analogous, whence $M_f(x) \geq \sup_{x' \neq x} \frac{|f(x) - f(x')|}{|x - x'|} = \Lambda_f(x)$. For $X$ uniformly distributed over $[0,1]$, we have $\mathbb{P}(\Lambda_f(X) \geq t) \leq \mathbb{P}(M_f(X) \geq t)$, and showing $\|f\|_W \leq 2V(f)$ reduces to bounding the latter probability by $2V(f)/t$.
We now closely follow the proof of Theorem 7.4 in [9] and bound $m(M_f^{-1}((t, \infty)))$ by bounding $m(K)$ for an arbitrary compact $K \subseteq M_f^{-1}((t, \infty))$. For $x \in K$, denote by $r_1(x), r_2(x)$ lengths such that $\frac{\mu_f([x - r_1(x),\, x + r_2(x)])}{r_1(x) + r_2(x)} \geq t$, and denote by $S_x$ the open interval $(x - r_1(x),\, x + r_2(x))$. Then, clearly, $K \subseteq \bigcup_{x \in K} S_x$. Since $K$ is compact, a finite subcover $S_{x_1}, \ldots, S_{x_n}$ exists. By Lemma 1, there exists $I \subseteq [n]$ such that for all distinct $i, j \in I$, we have $S_{x_i} \cap S_{x_j} = \emptyset$ and $\sum_{i \in I} m(S_{x_i}) \geq \frac{1}{2}\, m\!\left(\bigcup_{i=1}^n S_{x_i}\right)$. Finally, by the definition of the $S_x$'s, for each $i \in [n]$, it holds that $m(S_{x_i}) \leq \frac{\mu_f(S_{x_i})}{t}$. We can now write
$$m(K) \;\leq\; m\!\left(\bigcup_{i=1}^n S_{x_i}\right) \;\leq\; 2 \sum_{i \in I} m(S_{x_i}) \;\leq\; \frac{2}{t} \sum_{i \in I} \mu_f(S_{x_i}) \;\leq\; \frac{2}{t}\, \mu_f([0,1]),$$
where the last inequality holds since the intervals indexed by $I$ are pairwise disjoint. Since $\mu_f([0,1]) = V(f)$, it immediately follows that $\|f\|_W \leq 2V(f)$.
It remains to show that $V(f) \leq \|f\|_S$. Let us denote by $P_n$ a partition $0 \leq x_1 < x_2 < \cdots < x_n \leq 1$ of $[0,1]$, and let $V(P_n) = \sum_{i=1}^{n-1} |f(x_{i+1}) - f(x_i)|$ denote the variation of $f$ relative to $P_n$. It suffices to show that for any such partition $P_n$, we have $\|f\|_S \geq V(P_n)$. Now,
$$\|f\|_S = \mathbb{E}\,\Lambda_f(X) \;\geq\; \sum_{i=1}^{n-1} |x_{i+1} - x_i| \cdot \mathbb{E}\big[\Lambda_f(X) \,\big|\, X \in [x_i, x_{i+1}]\big]. \tag{9}$$
Note that for all $x \in [x_i, x_{i+1}]$, we have
$$\Lambda_f(x) \;\geq\; \max\left\{ \frac{|f(x) - f(x_i)|}{x - x_i},\; \frac{|f(x_{i+1}) - f(x)|}{x_{i+1} - x} \right\} \;\geq\; \frac{|f(x_{i+1}) - f(x_i)|}{x_{i+1} - x_i}.$$
Applying this to (9) yields
$$\mathbb{E}\,\Lambda_f(X) \;\geq\; \sum_{i=1}^{n-1} |x_{i+1} - x_i| \cdot \frac{|f(x_{i+1}) - f(x_i)|}{x_{i+1} - x_i} \;=\; V(P_n).$$
Finally, the tightness of the first claimed inequality is witnessed by the step function $f(x) = \mathbb{1}[x > 1/2]$, and that of the second by $f(x) = x$. □
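To spell out the verification (a routine computation we add for completeness): for $f = \mathbb{1}[x > 1/2]$, one has $\Lambda_f(x) = \frac{1}{|x - 1/2|}$, so $\mathbb{P}(\Lambda_f(X) \geq t) = \min\{1, 2/t\}$ and $\|f\|_W = \sup_{t > 0} t \min\{1, 2/t\} = 2 = 2V(f)$; for $f(x) = x$, $\Lambda_f \equiv 1$, whence $\|f\|_S = 1 = V(f)$.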

2.2. Proof of Theorem 1

The claimed containments are immediate from Theorem 2; only the separations remain to be shown. The first of these is obvious: the step function has bounded variation but infinite strong average Lipschitz seminorm [2] (Appendix I). We proceed with the second separation:
Lemma 3. 
There exists an $f : [0,1] \to [0,1]$ such that $V(f) = \infty$ but $\|f\|_W \leq 2$.
Proof. 
Let $f : [0,1] \to [0,1]$ be the piecewise linear function defined at the points $x_n = \frac{1}{n}$, $n \geq 1$, by
$$f\!\left(\frac{1}{n}\right) = \sum_{k=1}^n \frac{(-1)^{k+1}}{k}$$
and extended to $[0,1]$ by linear interpolation.
Clearly, $V(f) = \sum_{n=1}^{\infty} \frac{1}{n+1} = \infty$. To bound $\|f\|_W$, note that any $x, x' \in [0,1]$ witnessing $\frac{|f(x) - f(x')|}{|x - x'|} \geq t$ must also satisfy $|x - x'| \leq \frac{1}{t}$, since $|f| \leq 1$. Let $I_n$ denote the interval $\left[\frac{1}{n+1}, \frac{1}{n}\right]$; the slope of $f$ on $I_n$ has magnitude $\frac{1/(n+1)}{1/n - 1/(n+1)} = n$. If $\Lambda_f(x) \geq t$, then there is an $x'$ such that $\frac{|f(x) - f(x')|}{|x - x'|} \geq t$, and hence either $x$ or $x'$ lies in $I_n$ for some $n \geq t$. If $x \in I_n$ with $n \geq t$, then $x \leq \frac{1}{t}$. If, however, $x' \in I_n$ for some $n \geq t$, then $x' \leq \frac{1}{t}$, and since $|x - x'| \leq \frac{1}{t}$, we have $x \leq \frac{2}{t}$. We conclude that $\Lambda_f(x) \geq t$ implies $x \leq \frac{2}{t}$, and hence $\mathbb{P}(\Lambda_f(X) \geq t) \leq \frac{2}{t}$; this proves the claim. □
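A quick numerical sanity check of this construction (ours; the grid sizes are arbitrary choices) approximates the variation restricted to $[1/N, 1]$, which should grow like $\log N$, and $\sup_t t\,\mathbb{P}(\Lambda_f(X) \geq t)$, which should stay below 2:

import numpy as np

N = 400
n = np.arange(1, N + 1)
knots = 1.0 / n                              # x_n = 1/n, decreasing
vals = np.cumsum((-1.0) ** (n + 1) / n)      # f(1/n): partial alternating sums

xs = np.linspace(1.0 / N, 1.0, 1500)
f = np.interp(xs, knots[::-1], vals[::-1])   # np.interp needs increasing knots

q = np.abs(f[:, None] - f[None, :]) / np.maximum(
    np.abs(xs[:, None] - xs[None, :]), 1e-12)
lam = q.max(axis=1)                          # grid version of Lambda_f

var = np.abs(np.diff(vals)).sum()            # ~ log N: diverges as N grows
weak = max(t * (lam >= t).mean()
           for t in np.linspace(0.1, lam.max(), 400))
print(f"V(f) on [1/N,1] ~ {var:.2f}, sup_t t*P(Lambda_f >= t) ~ {weak:.2f}")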
Remark.
Another function with this property is $x \mapsto \sin\frac{1}{x}$.

2.3. Proof of Theorem 3

2.3.1. Proof That $\mathrm{fat}_\gamma(\mathcal{F}_W) = \infty$ Whenever $\gamma \leq L/6$

Consider the partition of $[0,1]$ into segments $I_n = [x_{n+1}, x_n]$, where $x_n = 2^{-n}$. We define $f(x_n) = (-1)^n \gamma$; this specifies $f$ at all endpoints of the $I_n$. For $x \in (x_{n+1}, x_n)$, we define
$$f(x) = \frac{(-1)^n \gamma}{|I_n|}(x - x_{n+1}) + \frac{(-1)^n \gamma}{|I_n|}(x - x_n),$$
i.e., $f$ is piecewise linear with slope $(-1)^n 4\gamma 2^n$ in $I_n$ (indeed, $|I_n| = 2^{-n} - 2^{-(n+1)} = 2^{-(n+1)}$, so the slope is $2(-1)^n \gamma / |I_n| = (-1)^n 4\gamma 2^n$). Similarly to Lemma 3, if $\frac{|f(x) - f(x')|}{|x - x'|} \geq t$, then $|x - x'| \leq \frac{2\gamma}{t}$. Now, suppose $\Lambda_f(x) \geq t$, i.e., there exists $x'$ with $\frac{|f(x) - f(x')|}{|x - x'|} \geq t$. This implies that either $x$ or $x'$ lies in $I_n$ for some $n \geq \log_2 \frac{t}{4\gamma}$ (the slope of the chord connecting $x$ and $x'$ lies between the slopes of the segments containing $x$ and $x'$). If $x \in I_n$ for some $n \geq \log_2 \frac{t}{4\gamma}$, then $x \leq \frac{4\gamma}{t}$. If, however, $x' \in I_n$ for some $n \geq \log_2 \frac{t}{4\gamma}$, then $x' \leq \frac{4\gamma}{t}$, and since $|x - x'| \leq \frac{2\gamma}{t}$, we have $x \leq \frac{6\gamma}{t}$. Since $\Lambda_f(x) \geq t$ implies $x \leq \frac{6\gamma}{t}$, we conclude that $\|f\|_W \leq 6\gamma$. An immediate corollary is that $\overline{\mathrm{Lip}}^{\,w}_L$ $\gamma$-shatters the infinite set $\{x_n\}_{n=1}^{\infty}$ for $\gamma \leq \frac{L}{6}$ (for an arbitrary labeling, setting $f(x_n) = y_n \gamma$ can only decrease the slopes, and the same argument applies), which is even stronger than having arbitrarily large $\gamma$-shattered sets. Note that this is close to tight, since for $\gamma > \frac{L}{2}$, we cannot $\gamma$-shatter even a set of two points: suppose $f(x_1) > \frac{L}{2}$ and $f(x_2) < -\frac{L}{2}$; then for $x \in [x_1, x_2]$ we have $\Lambda_f(x) > \frac{L}{|x_2 - x_1|}$, hence $\|f\|_W > L$, which means $\{x_1, x_2\}$ is not $\gamma$-shattered by $\overline{\mathrm{Lip}}^{\,w}_L$. □
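Remark (our elaboration). To make the last two-point step explicit: for $x \in (x_1, x_2)$,
$$\Lambda_f(x) \;\geq\; \max\left\{\frac{|f(x) - f(x_1)|}{x - x_1},\; \frac{|f(x_2) - f(x)|}{x_2 - x}\right\} \;\geq\; \frac{|f(x_1) - f(x_2)|}{x_2 - x_1} \;>\; \frac{L}{x_2 - x_1},$$
so taking $t = \frac{L}{x_2 - x_1}$ gives $t \cdot \mathbb{P}(\Lambda_f(X) \geq t) \geq \frac{L}{x_2 - x_1} \cdot (x_2 - x_1) = L$.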

2.3.2. Proof That $\mathrm{fat}_\gamma(\mathcal{F}_S) = 1 + \lfloor \frac{L}{2\gamma} \rfloor$ for $\gamma > 0$

The upper bound follows immediately from $V(f) \leq \|f\|_S$ (Theorem 2) together with the fat-shattering bound for the bounded variation class quoted in the Introduction [8]. For the lower bound, take a $2\gamma/L$-packing $x_1, \ldots, x_{1 + \lfloor L/(2\gamma) \rfloor}$ of $[0,1]$. For any labeling $y \in \{-1, 1\}^{1 + \lfloor L/(2\gamma) \rfloor}$, consider the piecewise linear interpolation $f$ of the points $(x_i, y_i \gamma)$, and observe that it satisfies $\Lambda_f(x) \leq \frac{2\gamma}{2\gamma/L} = L$ everywhere, whence $\|f\|_S \leq L$. □
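For a concrete instance (ours): with $L = 1$ and $\gamma = 1/4$, we get $1 + \lfloor L/(2\gamma) \rfloor = 3$, and the points $x_1 = 0$, $x_2 = 1/2$, $x_3 = 1$ form a $1/2$-packing; for any signs $y \in \{-1,1\}^3$, the piecewise linear interpolation of $(x_i, y_i/4)$ has all slopes of magnitude at most $(2 \cdot \frac{1}{4})/\frac{1}{2} = 1$, hence $\|f\|_S \leq \|f\|_{\mathrm{Lip}} \leq 1 = L$.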

Author Contributions

Conceptualization, A.E. and A.K.; validation, A.E. and A.K.; formal analysis, A.E. and A.K.; writing—review and editing, A.E. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Appell, J.; Banaś, J.; Merentes, N. Bounded Variation and Around; De Gruyter Series in Nonlinear Analysis and Applications; De Gruyter: Berlin/Heidelberg, Germany, 2014; Volume 17, pp. x+476. [Google Scholar]
  2. Ashlagi, Y.; Gottlieb, L.; Kontorovich, A. Functions with average smoothness: Structure, algorithms, and learning. J. Mach. Learn. Res. 2024, 25, 117:1–117:54. [Google Scholar]
  3. Hanneke, S.; Kontorovich, A.; Kornowski, G. Efficient Agnostic Learning with Average Smoothness. In Proceedings of the International Conference on Algorithmic Learning Theory, La Jolla, CA, USA, 25–28 February 2024; Volume 237, pp. 719–731. Available online: https://proceedings.mlr.press/v237/hanneke24a.html (accessed on 22 August 2025).
  4. Kornowski, G.; Hanneke, S.; Kontorovich, A. Near-Optimal Learning with Average Hölder Smoothness. In Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023; Available online: https://proceedings.neurips.cc/paper_files/paper/2023/hash/42afce512806ab874b9f99ed9a08055e-Abstract-Conference.html (accessed on 22 August 2025).
  5. Hagelstein, P.A. Weak $L^1$ norms of random sums. Proc. Am. Math. Soc. 2005, 133, 2327–2334. [Google Scholar] [CrossRef]
  6. Alon, N.; Ben-David, S.; Cesa-Bianchi, N.; Haussler, D. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM 1997, 44, 615–631. [Google Scholar] [CrossRef]
  7. Bartlett, P.L.; Long, P.M. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. Syst. Sci. 1998, 56, 174–190. [Google Scholar] [CrossRef]
  8. Anthony, M.; Bartlett, P.L. Neural Network Learning: Theoretical Foundations; Cambridge University Press: Cambridge, UK, 1999; pp. xiv+389. [Google Scholar] [CrossRef]
  9. Rudin, W. Real and Complex Analysis, 3rd ed.; McGraw-Hill, Inc.: Columbus, OH, USA, 1987. [Google Scholar]
