1. Introduction
A function $f : [0,1] \to \mathbb{R}$ is $L$-Lipschitz if $|f(x) - f(y)| \le L\,|x - y|$ for all $x, y \in [0,1]$, and $\|f\|_{\mathrm{Lip}}$ is the smallest $L$ for which this holds. If $f$ has an integrable derivative, its variation is given by $V(f) = \int_0^1 |f'(x)|\,dx$ (the more general definition is given in (2)). Since $|f'| \le \|f\|_{\mathrm{Lip}}$ pointwise, we have the obvious relation $V(f) \le \|f\|_{\mathrm{Lip}}$. No reverse inequality is possible: since for monotone $f$, we have $V(f) = |f(1) - f(0)|$ [1], a function whose value increases from $0$ to $\varepsilon$ with a sharp “jump” in the middle can have an arbitrarily large Lipschitz constant $L$ and an arbitrarily small variation $V$.
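This blow-up is easy to verify numerically. The sketch below is my own illustration (the jump height $\varepsilon$ and width $\delta$ are arbitrary choices, not from the text): a monotone ramp of height $\varepsilon$ concentrated on a width-$\delta$ window has variation exactly $\varepsilon$, while its Lipschitz constant is $\varepsilon/\delta$, which grows without bound as $\delta \to 0$.

```python
import numpy as np

eps, delta = 0.01, 1e-4   # jump height and jump width (arbitrary choices)

def f(x):
    # monotone ramp: 0 on [0, 1/2 - delta/2], eps on [1/2 + delta/2, 1]
    return eps * np.clip((x - (0.5 - delta / 2)) / delta, 0.0, 1.0)

x = np.linspace(0.0, 1.0, 2_000_001)
fx = f(x)
variation = float(np.abs(np.diff(fx)).sum())               # V(f); equals eps for monotone f
lipschitz = float(np.abs(np.diff(fx) / np.diff(x)).max())  # grid estimate of ||f||_Lip
print(variation, lipschitz)
```

Shrinking `delta` further leaves `variation` at $\varepsilon$ while `lipschitz` grows like $\varepsilon/\delta$.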
Motivated by questions in machine learning and statistics, Ashlagi et al. [2] introduced two notions of “average Lipschitz” smoothness in general metric probability spaces: a weak one and a strong one (follow-up works extended these results to average Hölder smoothness [3,4]). For the special case of the metric space $[0,1]$ equipped with the standard metric $\rho(x,y) = |x - y|$ and the uniform distribution $U$, their definitions are as follows. Both notions rely on the local slope of $f$ at a point $x$, which is defined (and denoted) as follows:
$$\Lambda_f(x) = \sup_{y \in [0,1] \setminus \{x\}} \frac{|f(x) - f(y)|}{|x - y|}. \tag{1}$$
The strong and weak average smoothness of $f$ are defined, respectively, by
$$\overline{\Lambda}_f = \mathbb{E}\,\Lambda_f(X) \qquad \text{and} \qquad \widetilde{\Lambda}_f = \big\|\Lambda_f(X)\big\|_{\mathrm{weak}},$$
where $X$ is a random variable distributed according to $U$ on $[0,1]$, $\mathbb{E}$ is the usual expectation, and $\|Z\|_{\mathrm{weak}}$ is the weak norm of the random variable $Z$:
$$\|Z\|_{\mathrm{weak}} = \sup_{t > 0}\; t \cdot \Pr\big(|Z| > t\big).$$
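By Markov's inequality, the weak norm never exceeds the expectation, but the gap can be infinite. The following self-contained sketch (my own; the “doubling” distribution $\Pr(Z = 2^j) = 2^{-j}$ is a standard example, not taken from the text) computes both quantities exactly for a truncation at $J$ levels: the weak norm stays below $2$ while the truncated expectation equals $J$.

```python
from fractions import Fraction

J = 40
# atoms of the distribution P(Z = 2^j) = 2^-j, truncated at level J
atoms = [(Fraction(2) ** j, Fraction(1, 2 ** j)) for j in range(1, J + 1)]

def tail(t):
    """P(Z > t) under the truncated distribution."""
    return sum((p for v, p in atoms if v > t), Fraction(0))

# sup_t t*P(Z > t) is approached just below each atom, so test only those levels
weak_norm = max((v - Fraction(1, 10 ** 9)) * tail(v - Fraction(1, 10 ** 9))
                for v, _ in atoms)
partial_expectation = sum((v * p for v, p in atoms), Fraction(0))  # grows like J
print(float(weak_norm), float(partial_expectation))
```

Letting $J \to \infty$ gives a random variable with $\|Z\|_{\mathrm{weak}} = 2$ but $\mathbb{E}\,Z = \infty$.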
Both $\overline{\Lambda}_{(\cdot)}$ and $\widetilde{\Lambda}_{(\cdot)}$ satisfy the homogeneity axiom of seminorms (meaning that $\overline{\Lambda}_{cf} = |c|\,\overline{\Lambda}_f$ and $\widetilde{\Lambda}_{cf} = |c|\,\widetilde{\Lambda}_f$ for all $c \in \mathbb{R}$), and $\overline{\Lambda}_{(\cdot)}$ additionally satisfies the triangle inequality and hence is a true seminorm. The weak norm satisfies the weaker inequality $\|Y + Z\|_{\mathrm{weak}} \le 2\big(\|Y\|_{\mathrm{weak}} + \|Z\|_{\mathrm{weak}}\big)$ [5], which $\widetilde{\Lambda}_{(\cdot)}$ also inherits.
We now recall the definition of the variation of $f : [a,b] \to \mathbb{R}$:
$$V_a^b(f) = \sup_{n \in \mathbb{N}}\ \sup_{a \le x_0 < x_1 < \dots < x_n \le b}\ \sum_{i=1}^{n} \big|f(x_i) - f(x_{i-1})\big| \tag{2}$$
(when $a = 0$ and $b = 1$, we omit these), as well as the Lipschitz and bounded variation function classes:
$$\mathrm{Lip}(L) = \big\{f : \|f\|_{\mathrm{Lip}} \le L\big\}, \qquad \mathrm{BV}(V) = \big\{f : V(f) \le V\big\},$$
together with $\mathrm{Lip} = \bigcup_{L < \infty} \mathrm{Lip}(L)$ and $\mathrm{BV} = \bigcup_{V < \infty} \mathrm{BV}(V)$.
The discussion above implies the (well-known) strict containment
$$\mathrm{Lip} \subsetneq \mathrm{BV}. \tag{3}$$
In addition, we define the strong and weak average smoothness classes
$$\overline{\mathrm{Lip}}(L) = \big\{f : \overline{\Lambda}_f \le L\big\}, \qquad \widetilde{\mathrm{Lip}}(L) = \big\{f : \widetilde{\Lambda}_f \le L\big\},$$
and, analogously, $\overline{\mathrm{Lip}} = \bigcup_{L < \infty} \overline{\mathrm{Lip}}(L)$ and $\widetilde{\mathrm{Lip}} = \bigcup_{L < \infty} \widetilde{\mathrm{Lip}}(L)$.
By Markov’s inequality and the fact that the expectation is bounded by the supremum, we have $\widetilde{\Lambda}_f \le \overline{\Lambda}_f \le \|f\|_{\mathrm{Lip}}$, whence
$$\mathrm{Lip}(L) \subseteq \overline{\mathrm{Lip}}(L) \subseteq \widetilde{\mathrm{Lip}}(L); \tag{4}$$
all of these containments were shown to be strict in [2]. The containments in (3) and (4) leave open the relation between $\mathrm{BV}$ and the average smoothness classes, which we resolve in this work:
Theorem 1. $\overline{\mathrm{Lip}} \subsetneq \mathrm{BV} \subsetneq \widetilde{\mathrm{Lip}}$.
We also provide a quantitative, finitary relation between these classes:
Theorem 2. For any $f : [0,1] \to \mathbb{R}$, we have
$$\tfrac{1}{2}\,\widetilde{\Lambda}_f \;\le\; V(f) \;\le\; \overline{\Lambda}_f.$$
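As a numeric sanity check of the first inequality (my own; the grid resolution is an arbitrary choice), consider the step function $f = \mathbb{1}_{[1/2,1]}$: it has $V(f) = 1$ and local slope $\Lambda_f(x) = 1/|x - 1/2|$, and its weak average smoothness evaluates to $2 = 2V(f)$, so the factor $\tfrac12$ cannot be improved.

```python
import numpy as np

n = 1_000_000
x = (np.arange(n) + 0.5) / n              # midpoint grid for X ~ Uniform[0,1]
lam = 1.0 / np.abs(x - 0.5)               # local slope of f = 1_{[1/2,1]}
ts = np.geomspace(2.0, 1e3, 100)          # candidate levels t
weak = max(float(t * np.mean(lam > t)) for t in ts)  # ~ sup_t t*P(Lambda > t)
V = 1.0                                    # variation of the step function
print(weak)
```

The computed `weak` is (up to grid error) exactly $2 = 2V(f)$, matching the tightness claim proved in Section 2.1.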
Finally, we recall the definition of the fat-shattering dimension, which is a combinatorial complexity measure of function classes of central importance in statistics, empirical processes, and machine learning [6,7]. Let $F$ be a collection of functions mapping $[0,1]$ to $\mathbb{R}$. For $\gamma > 0$, a set $S \subseteq [0,1]$ is said to be $\gamma$-shattered by $F$ if there exists a witness $r : S \to \mathbb{R}$ such that for every labeling $b : S \to \{-1, 1\}$, some $f \in F$ satisfies
$$b(s)\,\big(f(s) - r(s)\big) \ge \gamma \qquad \text{for all } s \in S.$$
The $\gamma$-fat-shattering dimension, denoted by $\mathrm{fat}_\gamma(F)$, is the size of the largest $\gamma$-shattered set (possibly ∞). It is known [8] that for $F = \mathrm{Lip}(L)$, we have $\mathrm{fat}_\gamma(F) = \Theta(L/\gamma)$. This same bound holds for $F = \mathrm{BV}(L)$.
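The lower-bound construction behind the $\Omega(L/\gamma)$ bound can be checked by brute force. The sketch below is mine ($L = 8$, $\gamma = 1/2$ are arbitrary parameters): every labeling of $d = L/(2\gamma)$ equally spaced points is realized, with witness $r \equiv 0$, by a piecewise linear function whose slope never exceeds $L$.

```python
import itertools
import numpy as np

L, gamma = 8.0, 0.5                    # arbitrary illustrative parameters
d = int(L / (2 * gamma))               # number of shattered points: 8
s = np.linspace(0.0, 1.0, d)           # spacing 1/(d-1), at least 2*gamma/L
max_slope = 0.0
for b in itertools.product([-1.0, 1.0], repeat=d):
    vals = gamma * np.array(b)         # target values gamma*b_i, witness r = 0
    max_slope = max(max_slope, float(np.abs(np.diff(vals) / np.diff(s)).max()))
print(d, max_slope)
```

Since chord slopes of a piecewise linear function are bounded by its maximal segment slope, every interpolant above lies in $\mathrm{Lip}(L)$, so the $d$ points are $\gamma$-shattered.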
Although the strong smoothness class has the same combinatorial complexity as the BV and Lipschitz classes, for weak average smoothness, this quantity turns out to be considerably greater:
Theorem 3. For $L, \gamma > 0$, let $\overline{F} = \overline{\mathrm{Lip}}(L)$ and $\widetilde{F} = \widetilde{\mathrm{Lip}}(L)$. Then,
- 1. $\mathrm{fat}_\gamma(\widetilde{F}) = \infty$ whenever $\gamma \le L/6$;
- 2. $\mathrm{fat}_\gamma(\overline{F}) = \Theta(L/\gamma)$ for $\gamma \le L$.
Notation.
We write $[n] = \{1, \dots, n\}$ and use $m(\cdot)$ to denote the Lebesgue measure (length) of sets in $\mathbb{R}$.
2. Proofs
We begin with a variant of the standard covering lemma.
Lemma 1. For any sequence $J_1, \dots, J_n$ of closed segments in $\mathbb{R}$, there is a subsequence indexed by $I \subseteq [n]$ such that for all distinct $i, j \in I$ we have $J_i \cap J_j = \emptyset$ and $m\big(\bigcup_{i \in I} J_i\big) \ge \frac{1}{2}\, m\big(\bigcup_{i \in [n]} J_i\big)$.
Proof. We proceed by induction on $n$ (the case $n = 1$ being trivial). Let $G$ denote the intersection graph of the $J_i$: the vertices correspond to the segments and $\{i, j\} \in E(G)$ if $J_i \cap J_j \ne \emptyset$.
Suppose that $G$ contains a cycle; since intersection graphs of segments are chordal, $G$ then contains a cycle of length three. Let $J_1 = [l_1, r_1]$, $J_2 = [l_2, r_2]$, $J_3 = [l_3, r_3]$ be the segments in the cycle, sorted by their right endpoint, so that $r_1 \le r_2 \le r_3$. Since $J_1 \cap J_3 \ne \emptyset$, we have $l_3 \le r_1$. If $l_2 \le l_1$, then $J_1 \subseteq J_2$. Otherwise, $l_1 < l_2$ and $J_2 \subseteq J_1 \cup J_3$ (the latter union is an interval containing $[l_1, r_3]$). Either way, we have found a segment that is completely covered by the other vertices of $G$. After removing it, we obtain a sequence of size $n - 1$ with the same union, so applying the inductive hypothesis on the remaining segments yields the desired result. If $G$ does not contain a cycle, then it is a forest and hence bipartite: $[n] = I_1 \cup I_2$, where $I_1, I_2$ are disjoint, nonempty independent sets. Clearly, $\bigcup_{i \in [n]} J_i = \bigcup_{i \in I_1} J_i \cup \bigcup_{i \in I_2} J_i$, and thus $m\big(\bigcup_{i \in I_1} J_i\big) + m\big(\bigcup_{i \in I_2} J_i\big) \ge m\big(\bigcup_{i \in [n]} J_i\big)$, so taking $I$ to be either $I_1$ or $I_2$, whichever covers the larger measure (which is possible since the segments inside each part are disjoint), yields the desired result. □
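Lemma 1 also admits a simple constructive variant: scan the segments by left endpoint, keep a greedy minimal cover of the union (in which every point lies in at most two chosen segments), and return the better of its two alternating halves, which are each pairwise disjoint. The code below is my own sketch of this greedy idea, not the paper's induction.

```python
def greedy_cover(segments):
    """Subfamily covering the same union, each point in at most two chosen segments."""
    segs = sorted(segments)
    chosen, i, n = [], 0, len(segs)
    covered_upto = float("-inf")
    while i < n:
        if segs[i][0] > covered_upto:        # gap: restart coverage at this segment
            covered_upto = segs[i][0]
        best = None
        while i < n and segs[i][0] <= covered_upto:
            if best is None or segs[i][1] > best[1]:
                best = segs[i]               # reaches furthest right among candidates
            i += 1
        if best is not None and best[1] > covered_upto:
            chosen.append(best)
            covered_upto = best[1]
    return chosen

def disjoint_half(segments):
    """Pairwise disjoint subfamily of total length >= half the union's measure."""
    cover = greedy_cover(segments)
    half_a, half_b = cover[0::2], cover[1::2]   # alternating picks are disjoint
    length = lambda fam: sum(r - l for l, r in fam)
    return half_a if length(half_a) >= length(half_b) else half_b

segs = [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (5.0, 6.0)]
picked = disjoint_half(segs)
print(picked)
```

Here the union has measure $5$, and the selected disjoint subfamily covers measure $3 \ge 5/2$.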
Next, we reduce the proof of Theorem 2 to the case of right-continuous monotone functions.
Lemma 2. If for every right-continuous monotone function $f$ we have $\widetilde{\Lambda}_f \le 2 V(f)$, then the bound holds for all $f \in \mathrm{BV}$. Furthermore, both inequalities are tight.
Proof. We begin by observing that we can restrict our attention to monotone functions, since $g(x) := V_0^x(f)$ is monotone and has the same variation as $f$, but $\Lambda_f(x) \le \Lambda_g(x)$ for all $x$ (indeed, $|f(y) - f(x)| \le V_x^y(f) = g(y) - g(x)$ for $y > x$, and similarly for $y < x$), which means $\widetilde{\Lambda}_f \le \widetilde{\Lambda}_g$.
Thus, the inequality
$\widetilde{\Lambda}_g \le 2 V(g)$
immediately implies that
$\widetilde{\Lambda}_f \le 2 V(f)$. If
$f$ is monotone, it can only have jump discontinuities. Let
$I \subseteq [0, 1)$ denote the set of right discontinuities of
$f$. Note that since
$f$ is monotone,
$I$ is at most countable. Define the modified version of
$f$ to be
$$\bar f(x) = f(x^+) := \lim_{y \downarrow x} f(y) \quad \text{for } x \in [0, 1), \qquad \bar f(1) = f(1).$$
Note that
$\bar f$ is monotone and right-continuous. It is not hard to see that if
$x \notin I$, then
$\bar f(x) = f(x)$ and
$\Lambda_{\bar f}(x) \ge \Lambda_f(x)$ for all such
$x$, which implies that
$\widetilde{\Lambda}_f \le \widetilde{\Lambda}_{\bar f}$ (as $I$ has measure zero) and $V(\bar f) \le V(f)$, and allows us to restrict our discussion to right-continuous functions. If the right limit is needed at the endpoint $x = 1$, then we can extend the domain of
$f$ to
$[0, 1 + \delta]$ for all
$\delta > 0$, where
$f(y) = f(1)$ for all $y \in (1, 1 + \delta]$. Denote the extended function by
$\hat f$; then, since
$V(\hat f) = V(f)$
and
$\Lambda_{\hat f}(y) \ge \Lambda_f(y)$ for all $y \in [0, 1]$, we can conclude that
$$\widetilde{\Lambda}_f \;\le\; \widetilde{\Lambda}_{\bar f} \;\le\; 2\,V(\bar f) \;\le\; 2\,V(f).$$
□
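The first reduction in the proof can be probed numerically. The sketch below is mine (the test function $\sin(8\pi x)$ and the grid size are arbitrary choices): it computes grid local slopes of $f$ and of its running variation $g(x) = V_0^x(f)$ and confirms the pointwise domination $\Lambda_f \le \Lambda_g$.

```python
import numpy as np

n = 1500
x = (np.arange(n) + 0.5) / n
f = np.sin(8 * np.pi * x)
g = np.concatenate([[0.0], np.cumsum(np.abs(np.diff(f)))])   # running variation V_0^x(f)

def local_slopes(v):
    """Grid analogue of Lambda_v(x_i) = max_j |v_i - v_j| / |x_i - x_j|."""
    dist = np.abs(x[:, None] - x[None, :]) + np.eye(n)       # +eye avoids dividing by zero
    slopes = np.abs(v[:, None] - v[None, :]) / dist
    return slopes.max(axis=1)

lf, lg = local_slopes(f), local_slopes(g)
dominated = bool((lf <= lg + 1e-9).all())
print(dominated)
```

The domination holds exactly on any grid, since $|f(x_j) - f(x_i)|$ is bounded by the variation accumulated between $x_i$ and $x_j$, which is $g(x_j) - g(x_i)$.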
2.1. Proof of Theorem 2
We first show that
$\widetilde{\Lambda}_f \le 2 V(f)$. We may assume without loss of generality that
$V(f) < \infty$. We will use the notation
$V_a^b(f)$ for the variation of
$f$ when restricted to the segment
$[a, b]$. Since
$f$ is of bounded variation, the function
$g(x) := V_0^x(f)$ is well defined for
$x \in [0,1]$. By Lemma 2, we may assume without loss of generality that
$f$ is monotone and right-continuous. Thus,
$g$ is monotone and right-continuous and thus induces a Lebesgue–Stieltjes measure on
$[0,1]$, which we denote by
$\mu$; in particular, $\mu((a, b]) = V_a^b(f)$ and $\mu([0,1]) = V(f)$. We now define the
maximal function
$M_\mu : [0,1] \to [0, \infty]$
as follows:
$$M_\mu(x) = \sup_{h > 0}\ \max\left\{ \frac{\mu([x - h,\, x])}{h},\ \frac{\mu([x,\, x + h])}{h} \right\},$$
where the segments
$[x - h, x]$ and $[x, x + h]$ are taken to be
$[0, x]$ and $[x, 1]$, respectively, whenever
$x - h < 0$ or
$x + h > 1$. A standard argument shows that
$\{x : M_\mu(x) > t\}$ is open, whence
$M_\mu$ is measurable.
We now observe that $\Lambda_f(x) \le M_\mu(x)$ everywhere in $[0,1]$. Indeed, if $y > x$, then $|f(y) - f(x)| \le V_x^y(f) \le \mu([x, y])$ holds for all such $y$, and hence $\frac{|f(y) - f(x)|}{y - x} \le \frac{\mu([x, y])}{y - x} \le M_\mu(x)$. The case of $y < x$ is completely analogous, whence $\Lambda_f(x) \le M_\mu(x)$. For $X$ uniformly distributed over $[0,1]$, we have
$$\Pr\big(\Lambda_f(X) > t\big) \le \Pr\big(M_\mu(X) > t\big),$$
and showing $\widetilde{\Lambda}_f \le 2 V(f)$ reduces to bounding the latter probability by $2 V(f) / t$.
We now closely follow the proof of Theorem 7.4 in [
9] and bound
$\Pr(M_\mu(X) > t)$
by bounding
$m(K)$
for arbitrary compact
$K \subseteq \{M_\mu > t\}$. For
$x \in K$, denote by
$h_x, \varepsilon_x > 0$
some lengths such that the segment $S_x$ with endpoint $x$ and length $h_x$, furnished by the definition of $M_\mu$, satisfies
$\mu(S_x) > t\,(h_x + 2\varepsilon_x)$. Denote by
$J_x$
the open interval of length $h_x + 2\varepsilon_x$ obtained by extending $S_x$ by $\varepsilon_x$ beyond each endpoint. Then, clearly,
$K \subseteq \bigcup_{x \in K} J_x$, and $\mu(\overline{J_x}) \ge \mu(S_x) > t\, m(J_x)$. Since
$K$ is compact, a finite cover by intervals
$J_{x_1}, \dots, J_{x_n}$
exists. By Lemma 1, applied to the closures $\overline{J_{x_1}}, \dots, \overline{J_{x_n}}$, there exists
$I \subseteq [n]$
such that for all distinct
$i, j \in I$, we have
$\overline{J_{x_i}} \cap \overline{J_{x_j}} = \emptyset$
and
$m\big(\bigcup_{i \in I} \overline{J_{x_i}}\big) \ge \frac{1}{2}\, m\big(\bigcup_{i \in [n]} \overline{J_{x_i}}\big)$. Finally, by the definition of the
$J_x$'s, for each
$i \in I$, it holds that
$\mu(\overline{J_{x_i}}) > t\, m(J_{x_i})$. We can now write
$$m(K) \;\le\; m\Big(\bigcup_{i \in [n]} J_{x_i}\Big) \;\le\; 2\, m\Big(\bigcup_{i \in I} \overline{J_{x_i}}\Big) \;=\; 2 \sum_{i \in I} m(J_{x_i}) \;<\; \frac{2}{t} \sum_{i \in I} \mu\big(\overline{J_{x_i}}\big) \;\le\; \frac{2}{t}\,\mu([0,1]) \;=\; \frac{2\,V(f)}{t},$$
where the last inequality holds since the intervals in
$I$ are disjoint. Since
$m(\{M_\mu > t\}) = \sup\{ m(K) : K \subseteq \{M_\mu > t\} \text{ compact} \}$ by inner regularity, it immediately follows that
$\Pr(M_\mu(X) > t) \le 2 V(f) / t$.
It remains to show that
$V(f) \le \overline{\Lambda}_f$. Let us denote by
$P$ the partition
$0 = x_0 < x_1 < \dots < x_n = 1$
of
$[0,1]$, and let
$V_P(f)$ denote the variation of
$f$ relative to
$P$. It suffices to show that for any such partition
$P$, we have
$V_P(f) \le \overline{\Lambda}_f$. Now
$$V_P(f) = \sum_{i=1}^{n} \big|f(x_i) - f(x_{i-1})\big|. \tag{9}$$
Note that for all
$i \in [n]$ and all $x \in [x_{i-1}, x_i]$, we have
$$|f(x_i) - f(x_{i-1})| \;\le\; |f(x_i) - f(x)| + |f(x) - f(x_{i-1})| \;\le\; \Lambda_f(x)\,(x_i - x_{i-1}),$$
and averaging over $x \in [x_{i-1}, x_i]$ yields $|f(x_i) - f(x_{i-1})| \le \int_{x_{i-1}}^{x_i} \Lambda_f(x)\,dx$. Applying this to (
9) yields
$$V_P(f) \;\le\; \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} \Lambda_f(x)\,dx \;=\; \int_0^1 \Lambda_f(x)\,dx \;=\; \overline{\Lambda}_f.$$
Finally, the tightness of the first claimed inequality is witnessed by the step function $\mathbb{1}_{[1/2, 1]}$ and of the second inequality by the identity function $x \mapsto x$. □
2.2. Proof of Theorem 1
The claimed containments are immediate from Theorem 2; only the separations remain to be shown. The first of these is obvious: the step function $\mathbb{1}_{[1/2, 1]}$ has bounded variation but an infinite strong average Lipschitz constant [
2] (Appendix I). We proceed with the second separation:
Lemma 3. There exists an $f : [0,1] \to \mathbb{R}$ such that $\widetilde{\Lambda}_f < \infty$ but $V(f) = \infty$.
Proof. Let
$f$
be the piecewise linear function defined on the points
$x_k = 2^{-k}$,
$k = 0, 1, 2, \dots$, by
$f(2^{-k}) = k \bmod 2$
and extended to
$(0, 1]$ by linear interpolation (the value at $0$ is immaterial).
Clearly, $V(f) = \infty$, since each segment $J_m := [2^{-m}, 2^{-m+1}]$, $m \ge 1$, contributes $1$ to the variation; on $J_m$, the slope of $f$ has magnitude $2^m$. To bound $\widetilde{\Lambda}_f$, note that any $x, y$ witnessing $\Lambda_f(x) > t$ also verify $|x - y| < 1/t$, since $|f(x) - f(y)| \le 1$. If $\Lambda_f(x) > t$, then there is a $y$ such that $\frac{|f(x) - f(y)|}{|x - y|} > t$. Now, either $x$ or $y$ lies in $J_m$ for some $m$ with $2^m > t$ (the slope of the line connecting $x$ and $y$ is a weighted average of the slopes of the segments it crosses, and these magnitudes increase toward $0$, so the steepest crossed segment contains $\min\{x, y\}$). If $x \in J_m$ with $2^m > t$, then $x \le 2^{-m+1} < 2/t$. If, however, $y \in J_m$ for such an $m$, then $y < 2/t$, and since $|x - y| < 1/t$, we have $x < 3/t$. We conclude that $\Lambda_f(x) > t$ implies $x < 3/t$ and hence $\Pr(\Lambda_f(X) > t) \le 3/t$, so that $\widetilde{\Lambda}_f \le 3$; this proves the claim. □
Remark.
Another function with this property is $x \mapsto x \sin(1/x)$.
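The dyadic construction can be checked exactly, level by level. The code below is my own sketch (it tracks the derivative $|f'|$, which lower-bounds the local slope, rather than $\Lambda_f$ itself): the variation of the truncated construction grows linearly in the truncation level $K$, while the weak norm of the slope stays bounded by $2$.

```python
def variation_and_weak(K):
    """Exact V and sup_t t*P(|f'| > t) for the construction truncated at level K."""
    pieces = [(2.0 ** -k, 2.0 ** k) for k in range(1, K + 1)]    # (length, |slope|) on J_k
    variation = sum(length * slope for length, slope in pieces)  # each piece contributes 1
    # the sup over t is approached just below each attained slope value
    weak = max(
        (slope - 1e-9) * sum(l for l, s in pieces if s >= slope)
        for _, slope in pieces
    )
    return variation, weak

v, w = variation_and_weak(20)
print(v, w)
```

Raising `K` drives the variation to infinity while the weak norm of the slope remains below $2$, in line with the bound $\widetilde{\Lambda}_f \le 3$ in the lemma.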
2.3. Proof of Theorem 3
2.3.1. Proof That $\mathrm{fat}_\gamma(\widetilde{F}) = \infty$ Whenever $\gamma \le L/6$
Consider the partition of $(0, 1]$ into segments $I_k = (2^{-k}, 2^{-k+1}]$, $k \ge 1$, where $m(I_k) = 2^{-k}$. Given a labeling $b \in \{-1, 1\}^{\mathbb{N}}$, we define $f(2^{-k}) = \gamma\,(1 + b_k)$. This specifies $f$ at all endpoints of the $I_k$. For $x$ in the interior of $I_k$, we define $f$ by linear interpolation, i.e., $f$ is piecewise linear with slope of magnitude at most $\gamma 2^{k+1}$ in $I_k$. Similarly to Lemma 3, if $\Lambda_f(x) > t$, then $x < 6\gamma/t$. Indeed, suppose $\Lambda_f(x) > t$, i.e., there exists $y$ with $\frac{|f(x) - f(y)|}{|x - y|} > t$; since $|f(x) - f(y)| \le 2\gamma$, also $|x - y| < 2\gamma/t$. This implies that the interval between $x$ and $y$ meets $I_m$ for some $m$ with $\gamma 2^{m+1} > t$ (the slope of the line connecting $x$ and $y$ is a weighted average of the slopes of the segments it crosses). If $x \in I_m$ for such an $m$, then $x \le 2^{-m+1} < 4\gamma/t$. If, however, only $y$ or an intermediate point lies in such an $I_m$, then $\min\{x, y\} < 4\gamma/t$, and since $|x - y| < 2\gamma/t$, we have $x < 6\gamma/t$. Since $\Lambda_f(x) > t$ implies $x < 6\gamma/t$, we can conclude that $\widetilde{\Lambda}_f \le 6\gamma \le L$ whenever $\gamma \le L/6$. An immediate corollary is that $\widetilde{F}$ $\gamma$-shatters the infinite set $\{2^{-k} : k \ge 1\}$ (with witness $r \equiv \gamma$) for $\gamma \le L/6$; this is even stronger than having arbitrarily large $\gamma$-shattered sets. Note that this is close to tight, since for $\gamma > L/2$, we cannot $\gamma$-shatter even a set of two points. Suppose $x_1 < x_2$ and $\{x_1, x_2\}$ is $\gamma$-shattered with witness $r$; then, for one of the two alternating labelings, the realizing $f$ satisfies $|f(x_1) - f(x_2)| \ge 2\gamma$, hence $\Lambda_f(x) \ge \frac{2\gamma}{x_2 - x_1}$ for every $x \in [x_1, x_2]$ and $\widetilde{\Lambda}_f \ge 2\gamma > L$, which means $\{x_1, x_2\}$ is not $\gamma$-shattered by $\widetilde{F}$. □
2.3.2. Proof That $\mathrm{fat}_\gamma(\overline{F}) = \Theta(L/\gamma)$ for $\gamma \le L$
The upper bound follows immediately from $\overline{F} = \overline{\mathrm{Lip}}(L) \subseteq \mathrm{BV}(L)$ (by Theorem 2) and $\mathrm{fat}_\gamma(\mathrm{BV}(L)) = O(L/\gamma)$. For the lower bound, take a packing $s_1 < s_2 < \dots < s_d$ of $[0,1]$ with spacing $2\gamma/L$, so that $d = \Theta(L/\gamma)$. For a labeling $b \in \{-1, 1\}^d$, consider the linear interpolation $f$ of the points $(s_i, \gamma b_i)$, and observe that the interpolation $f$ satisfies $\Lambda_f(x) \le L$ everywhere; hence $\overline{\Lambda}_f \le L$, so $f \in \overline{F}$, and the witness $r \equiv 0$ shows that $\{s_1, \dots, s_d\}$ is $\gamma$-shattered. □