1. Introduction
Let m be a positive integer and denote by the set of all probability measures on the measurable space , with being the Borel σ-algebra on . Given a Borel measurable function , consider the functional
The functional in (1) is defined when both integrals exist and are finite. Hence, is the variation of the expectation of the measurable function h due to a change in the probability measure from to . These changes in the probability measure are often referred to as probability distribution drifts in some application areas. See, for instance, [1,2,3] and references therein.
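For intuition, in the finite (counting-measure) case the functional in (1) is a weighted average and the variation in (2) is a difference of two such averages. The following minimal Python sketch, with hypothetical vectors h, p1, and p2 that do not appear in the paper, only illustrates the quantity under study; the sign convention is assumed.

```python
import numpy as np

# Hypothetical finite setting: h is a function on {0, 1, 2, 3},
# and p1, p2 are two probability measures (pmfs) on that set.
h = np.array([1.0, 2.0, 0.5, 3.0])
p1 = np.array([0.1, 0.4, 0.3, 0.2])
p2 = np.array([0.25, 0.25, 0.25, 0.25])

# Variation of the expectation of h when the measure changes from p2 to p1,
# in the spirit of the functional in (1)-(2) (sign convention assumed here).
variation = h @ p1 - h @ p2
print(f"E_p1[h] - E_p2[h] = {variation:.4f}")
```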
In order to define the expectation of in (2) when x is obtained by sampling a probability measure in , the structure formalized below is required.
Definition 1. A family of elements of indexed by is said to be a conditional probability measure if, for all sets , the map is Borel measurable. The set of all such conditional probability measures is denoted by . In this setting, consider the functional
Hence, where the functional is defined in (1), is the variation of the integral (expectation) of the function h when the probability measure changes from the joint probability measure to another joint probability measure , both in .
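In the finite case, a conditional probability measure in the sense of Definition 1 can be represented by a row-stochastic matrix, and the expected variation in (3)-(4) is a weighted sum of per-x variations. The sketch below uses hypothetical kernels K1 and K2 and a marginal px, none of which come from the paper; the sign convention is again assumed.

```python
import numpy as np

h = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 1.0]])        # h(x, y) on a 2 x 3 alphabet
K1 = np.array([[0.2, 0.5, 0.3],
               [0.6, 0.1, 0.3]])       # conditional pmf P1(y | x), row-stochastic
K2 = np.array([[1/3, 1/3, 1/3],
               [1/3, 1/3, 1/3]])       # conditional pmf P2(y | x)
px = np.array([0.4, 0.6])              # probability measure on x

# Per-x variation of the expectation of h(x, .), averaged over x,
# mimicking the expected variation in (4).
per_x = (K1 * h).sum(axis=1) - (K2 * h).sum(axis=1)
print(f"expected variation = {px @ per_x:.4f}")
```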
Special attention is given to the quantity , for some , with being the marginal on of the joint probability measure on . That is, for all sets ,
The relevance of the quantity stems from the fact that it captures the variation of the expectation of the function h when the probability measure changes from the joint probability measure to the product of its marginals . That is, where the functional is defined in (1).
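A finite-alphabet sketch of this particular quantity: build a joint pmf from a marginal and a conditional kernel, form the product of its marginals, and compare the two expectations of h. All names below (px, K, h) are hypothetical illustration choices rather than objects defined in the paper.

```python
import numpy as np

h = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 1.0]])        # h(x, y)
px = np.array([0.4, 0.6])              # marginal pmf on x
K = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3]])        # conditional pmf of y given x

P_joint = px[:, None] * K              # joint pmf P(x, y)
py = P_joint.sum(axis=0)               # marginal pmf on y
P_prod = px[:, None] * py[None, :]     # product of the marginals

# Variation of E[h] when the measure changes from the joint to the
# product of its marginals (the quantity in (10), up to sign convention).
print(f"E_joint[h] - E_product[h] = {(P_joint * h).sum() - (P_prod * h).sum():.4f}")
```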
1.1. Novelty and Contributions
This work makes two key contributions: First, it provides a closed-form expression for the variation in (2) for a fixed and two arbitrary probability measures and , formulated explicitly in terms of relative entropies. Second, it derives a closed-form expression for the expected variation in (4), again in terms of information measures, for arbitrary conditional probability measures , , and an arbitrary probability measure .
A further contribution of this work is the derivation of specific closed-form expressions for in (10), which reveal deep connections with both mutual information [4] and lautum information [5]. Notably, when is a Gibbs conditional probability measure, this variation simplifies (up to a constant factor) to the sum of the mutual and lautum information induced by the joint distribution .
These results were originally discovered in the analysis of the generalization error of machine-learning algorithms; see, for instance, [6,7,8,9,10]. Therein, the function h in (2) was assumed to represent an empirical risk. This paper presents these results in a comprehensive and general setting that is no longer tied to that assumption. Strong connections with information projections and Pythagorean identities [11,12] are also discussed. This general presentation not only unifies previously scattered insights but also makes the results applicable across a broad range of domains in which probability distribution shifts are relevant.
1.2. Applications
The study of the variation of the integral (expectation) of h (for some fixed ) due to a measure change from to , i.e., the value in (2), plays a central role in the definition of integral probability metrics (IPMs) [13,14]. Using the notation in (2), an IPM results from the optimization problem for some fixed and a particular class of functions . Note, for instance, that the maximum mean discrepancy [15] and the Wasserstein distance of order one [16,17,18,19] are both IPMs.
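To make the IPM optimization concrete, the sketch below solves it for the simplest hypothetical function class, namely functions bounded by one on a finite alphabet; in that case the supremum is attained by the sign of the difference of the two pmfs and equals twice the total variation distance. This is a standard textbook instance, not a construction taken from the paper.

```python
import numpy as np

p1 = np.array([0.1, 0.4, 0.3, 0.2])
p2 = np.array([0.25, 0.25, 0.25, 0.25])

# IPM over the class {f : |f| <= 1} on a finite alphabet:
# sup_f E_p1[f] - E_p2[f] is attained at f = sign(p1 - p2).
f_opt = np.sign(p1 - p2)
ipm = f_opt @ (p1 - p2)

# Sanity check: this equals twice the total variation distance.
tv = 0.5 * np.abs(p1 - p2).sum()
print(f"IPM = {ipm:.4f}, 2 * TV = {2 * tv:.4f}")
```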
Other areas of mathematics in which the variation in (2) plays a key role are distributionally robust optimization (DRO) [20,21] and optimization with relative entropy regularization [7,8]. See, for instance, [6,22] and references therein.
Variations of the form in (2) have also been studied in [9,10] in the particular case of statistical machine learning, for the analysis of the generalization error. The central observation is that the generalization error of machine-learning algorithms can be written in the form in (10). This observation is the main building block of the method of gaps introduced in [10], which leads to a number of closed-form expressions for the generalization error involving mutual information and lautum information, among other information measures.
2. Preliminaries
The main results presented in this work involve Gibbs conditional probability measures. Such measures are parametrized by a Borel measurable function ; a σ-finite measure Q on ; and a real . Often, the measure Q is called the reference measure, and the real parameter is called the temperature parameter. The use of a σ-finite reference measure was first introduced in [7]. This class of measures includes the Lebesgue measure and the counting measure, enabling a unified treatment of random variables with either probability density functions or probability mass functions. The standard case in which the reference measure is a probability measure representing a prior is also covered. Finally, note that the variable x remains inactive until Section 4; it is introduced now for consistency of notation.
Consider the following function:
Under the assumption that Q is a probability measure, the function in (12) is the cumulant generating function of the random variable , for some fixed and . Using this notation, the definition of the Gibbs conditional probability measure is presented hereunder.
Definition 2 (Gibbs Conditional Probability Measure). Given a Borel measurable function ; a σ-finite measure Q on ; and a , the probability measure is said to be an -Gibbs conditional probability measure if for some set ; and for all , where the function is defined in (12); the set is the support of the σ-finite measure Q; and the function is the Radon-Nikodym derivative [23,24] of the probability measure with respect to Q. Note that, while is an -Gibbs conditional probability measure, the measure , obtained by conditioning it upon a given vector , is referred to as an -Gibbs probability measure.
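As a concrete, purely illustrative instance of Definition 2, take the counting measure on a finite set as the reference measure: the resulting Gibbs probability measure has a pmf proportional to an exponential of the loss. The sketch below assumes the convention exp(-h/λ) with a positive temperature λ; the paper's exact parameterization may differ in sign or scaling.

```python
import numpy as np

def gibbs_pmf(h_vals, q_pmf, lam):
    """Gibbs pmf with respect to a reference pmf q_pmf (counting-measure case),
    assuming the convention dP/dQ proportional to exp(-h / lam)."""
    w = q_pmf * np.exp(-h_vals / lam)
    return w / w.sum()

# Hypothetical loss h(x, y) for a fixed x, over four candidate values of y.
h_vals = np.array([1.0, 2.0, 0.5, 3.0])
q_pmf = np.full(4, 0.25)               # uniform reference measure
lam = 0.7                              # temperature parameter

p_gibbs = gibbs_pmf(h_vals, q_pmf, lam)
print("Gibbs pmf:", np.round(p_gibbs, 4), "| sums to", p_gibbs.sum())
```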
The condition in (13) is easily met under certain assumptions. For instance, if h is a nonnegative function and Q is a finite measure, then it holds for all . Let , with standing for “P absolutely continuous with respect to Q”. The relevance of -Gibbs probability measures stems from the fact that, under certain conditions, they are the unique solutions to problems of the form, where , , and denotes the relative entropy (or KL divergence) of P with respect to Q.
Definition 3 (Relative Entropy). Given two σ-finite measures P and Q on the same measurable space, such that P is absolutely continuous with respect to Q, the relative entropy of P with respect to Q is where the function is the Radon-Nikodym derivative of P with respect to Q.
The key observation is that when , the objective function in (15) is convex in P. Alternatively, when , the objective function in (16) is concave. The connection between the optimization problems (15) and (16) and the Gibbs probability measure in (14) has been pointed out by several authors; see, for instance, Theorem 3 in [7] and [6,25,26,27,28,29,30,31,32,33] for the former, and Theorem 1 in [9], together with [34,35,36], for the latter. In these references, a variety of assumptions and proof techniques have been used to highlight such connections. A general and unified statement of these observations is presented hereunder.
Lemma 1. Assume that the optimization problem in (15) (respectively, in (16)) admits a solution. Then, if (respectively, if ), the probability measure in (14) is the unique solution.
Proof. For the case in which , the proof follows the same approach as the proof of Theorem 3 in [7]. Alternatively, for the case in which , the proof follows along the lines of the proof of Theorem 1 in [9]. □
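Lemma 1 can be probed numerically in the finite case: for a positive temperature, the Gibbs pmf should attain the smallest value of the relative-entropy-regularized objective among arbitrary candidate pmfs. The sketch below assumes the objective E_P[h] + λ D(P||Q) and the convention exp(-h/λ), which may differ from the paper's exact parameterization of (15).

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Relative entropy D(p || q) for strictly positive finite pmfs."""
    return float(np.sum(p * np.log(p / q)))

h_vals = np.array([1.0, 2.0, 0.5, 3.0])
q_pmf = np.full(4, 0.25)
lam = 0.7

def objective(p):
    # Free-energy-type objective: expectation of h plus scaled relative entropy.
    return float(p @ h_vals) + lam * kl(p, q_pmf)

p_gibbs = q_pmf * np.exp(-h_vals / lam)
p_gibbs /= p_gibbs.sum()

# The Gibbs pmf should not be beaten by random candidate pmfs (Lemma 1, lam > 0).
candidates = rng.dirichlet(np.ones(4), size=1000)
print(f"objective at the Gibbs pmf : {objective(p_gibbs):.6f}")
print(f"best over 1000 random pmfs : {min(objective(p) for p in candidates):.6f}")
```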
The following lemma highlights a key property of -Gibbs probability measures.
Lemma 2. Given an -Gibbs probability measure, denoted by , with , moreover, if , alternatively, if , where the function is defined in (12).
Proof. The proof of (19) follows from taking the logarithm of both sides of (14) and integrating with respect to . As for the proof of (18), it follows by noticing that for all , the Radon-Nikodym derivative in (14) is strictly positive. Thus, from Theorem 5 in [37], it holds that . Hence, taking the negative logarithm on both sides of (14) and integrating with respect to Q leads to (18). Finally, the equalities in (20) and (21) follow from Lemma 1 and (19). □
The equalities (19)–(21) in Lemma 2 can be seen as an immediate restatement of the Donsker–Varadhan variational representation of the relative entropy [38]. Alternative interesting proofs of (18) have been presented by several authors, including [9,33]. A proof of (19) appears in [6] (Lemma 3), in the specific case of .
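In the finite case, and under the assumed convention exp(-h/λ) with a probability reference measure Q and λ > 0, the relations in Lemma 2 reduce to an identity tying the log-partition function, the expectation of h under the Gibbs pmf, and the relative entropy to Q, in the spirit of the Donsker–Varadhan representation. A hypothetical numerical check:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

h_vals = np.array([1.0, 2.0, 0.5, 3.0])
q_pmf = np.full(4, 0.25)
lam = 0.7

Z = float(q_pmf @ np.exp(-h_vals / lam))      # partition function w.r.t. Q
p_gibbs = q_pmf * np.exp(-h_vals / lam) / Z

# Under the assumed convention: E_P*[h] + lam * D(P* || Q) = -lam * log Z,
# a finite-alphabet shadow of the identities stated in Lemma 2.
lhs = float(p_gibbs @ h_vals) + lam * kl(p_gibbs, q_pmf)
print(f"lhs = {lhs:.6f}, rhs = {-lam * np.log(Z):.6f}")
```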
The following lemma introduces the main building block of this work, which is a characterization of the variation of the expectation of the function when the probability measure changes from the probability measure in (14) to an arbitrary measure , i.e., , for some fixed . Such a result appeared for the first time in [6] (Theorem 1) for the case in which ; and in [9] (Theorem 6) for the case in which , in different contexts of statistical machine learning. A general and unified statement of such results is presented hereunder.
Lemma 3. Consider an -Gibbs probability measure, denoted by , with and . For all ,
Proof. The proof follows along the lines of the proofs of Theorem 1 in [6] and Theorem 6 in [9] for the cases in which and , respectively. A unified proof is presented hereunder by noticing that for all , where (24) follows from Theorem 4 in [37]; (26) follows from Theorem 5 in [37] and (14); and (27) follows from (19). □
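Lemma 3 can be checked numerically in the counting-measure case. Under the assumed convention dP*/dQ proportional to exp(-h/λ) with λ > 0, the identity tested below reads E_P[h] - E_P*[h] = λ ( D(P||P*) + D(P*||Q) - D(P||Q) ); the equation in (22) may use a different sign or temperature convention, so this sketch illustrates the structure of the result rather than a literal transcription.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

h_vals = np.array([1.0, 2.0, 0.5, 3.0])
q_pmf = np.full(4, 0.25)
lam = 0.7

p_star = q_pmf * np.exp(-h_vals / lam)
p_star /= p_star.sum()                        # Gibbs pmf

p = rng.dirichlet(np.ones(4))                 # an arbitrary pmf with p << Q

lhs = float(p @ h_vals - p_star @ h_vals)
rhs = lam * (kl(p, p_star) + kl(p_star, q_pmf) - kl(p, q_pmf))
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")    # the two should coincide
```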
It is interesting to highlight that in (22) characterizes the variation of the expectation of the function , when (respectively, when ) and the probability measure changes from the solution to the optimization problem in (15) (respectively, in (16)) to an alternative measure P. This result gains another perspective when seen in the context of information projections [12]. Let Q be a probability measure and be a convex set. From Theorem 1 in [12], it holds that for all measures , where satisfies
In the particular case in which the set in (29) satisfies for some real c, with the vector x and the function h defined in Lemma 3, the optimal measure in (29) is the Gibbs probability measure in (14), with chosen to satisfy
The case in which the measure Q in (29) is a σ-finite measure, for instance, either the Lebesgue measure or the counting measure, leads to the classical frameworks of differential entropy and discrete entropy maximization, respectively, which have been studied under particular assumptions on the set in [34,35,36].
When the reference measure Q is a probability measure, under the assumption that (31) holds, it follows from Theorem 3 in [12] that for all , with in (30), which is known as the Pythagorean theorem for relative entropy. Such a geometric interpretation follows from regarding relative entropy as an analog of the squared Euclidean distance. The first appearance of such a “Pythagorean theorem” was in [11], and it was later revisited in [12]. Interestingly, the same result can be obtained from Lemma 3 by noticing that for all , with in (30),
The converse of the Pythagorean theorem, e.g., Proposition 48 in [39], together with Lemma 3, leads to the geometric construction shown in Figure 1. A similar interpretation was also presented in [10] in the context of the generalization error of machine-learning algorithms. Nonetheless, the interpretation in Figure 1 is general and independent of that application.
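The Pythagorean identity can be illustrated on a finite alphabet: project a reference pmf Q onto the linear family of pmfs with a prescribed mean of h (the projection is a Gibbs pmf), pick another pmf in the same family, and check that the relative entropy to Q splits additively. All numerical choices below are hypothetical, and the exp(-h/λ) convention is assumed as before.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

h = np.array([1.0, 2.0, 0.5, 3.0])
q = np.full(4, 0.25)
lam = 0.7

p_star = q * np.exp(-h / lam)
p_star /= p_star.sum()                 # I-projection of Q onto {P : E_P[h] = c}

# Another pmf with the same mean of h: perturb p_star along a direction v
# with sum(v) = 0 and v . h = 0, so that E_P[h] remains equal to E_P*[h].
v = np.array([h[1] - h[2], h[2] - h[0], h[0] - h[1], 0.0])
p = p_star + 0.01 * v
assert np.all(p > 0) and np.isclose(p @ h, p_star @ h)

# Pythagorean identity for the linear family: D(P||Q) = D(P||P*) + D(P*||Q).
print(f"D(P||Q)             = {kl(p, q):.6f}")
print(f"D(P||P*) + D(P*||Q) = {kl(p, p_star) + kl(p_star, q):.6f}")
```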
The relevance of Lemma 3 in the context of information projections follows from the fact that Q might be a σ-finite measure. The class of σ-finite measures includes the class of probability measures, and thus Lemma 3 unifies results separately obtained in the realm of maximum entropy methods and information-projection methods.
The following lemma highlights that -Gibbs conditional probability measures are related to another class of optimization problems.
Lemma 4. Assume that the following optimization problems possess at least one solution for some , and
Consider the -Gibbs probability measure in (14), with such that . Then, the -Gibbs probability measure is a solution to (34) if ; or to (35) if .
Proof. Note that if , then, . Hence, from Lemma 3, it holds that for all probability measures P such that , with equality if . This implies that is a solution to (34). Note also that if , from Lemma 3, it holds that for all probability measures P such that , with equality if . This implies that is a solution to (35). □
3. Characterization of in (2)
The main result of this section is the following theorem.
Theorem 1. For all probability measures and , both absolutely continuous with respect to a given σ-finite measure Q on , the variation in (2) satisfies,
where the probability measure , with , is an -Gibbs probability measure.
Proof. The proof follows from Lemma 3 by observing that
which completes the proof. □
Theorem 1 simplifies considerably when the reference measure Q is a probability measure. Consider, for instance, the case in which (or ). In such a case, the reference measure can be chosen as (or ), as shown hereunder.
Corollary 1. Consider the variation in (2). If the probability measure is absolutely continuous with respect to , then,
Alternatively, if the probability measure is absolutely continuous with respect to , then,
where the probability measures and are, respectively, - and -Gibbs probability measures, with .
In the case in which neither is absolutely continuous with respect to , nor is absolutely continuous with respect to , the reference measure Q in Theorem 1 can always be chosen as a convex combination of and . That is, for all Borel sets , , with .
Theorem 1 can be specialized to the cases in which Q is either the Lebesgue measure or the counting measure.
If Q is the Lebesgue measure, then the probability measures and in (42) admit probability density functions and , respectively. Moreover, the terms and are Shannon’s differential entropies [4] induced by and , denoted by and , respectively. That is, for all ,
The probability measure , with , , and Q the Lebesgue measure, possesses a probability density function, denoted by , which satisfies
If Q is the counting measure, then the probability measures and in (42) admit probability mass functions and , with a countable subset of . Moreover, and are Shannon’s discrete entropies [4] induced by and , denoted by and , respectively. That is, for all ,
The probability measure , with and Q the counting measure, possesses a probability mass function, denoted by , which satisfies
These observations lead to the following corollary of Theorem 1.
Corollary 2. Given two probability measures and , with probability density functions and , respectively, the variation in (2) satisfies,
where the probability density function of the measure , with and Q the Lebesgue measure, is defined in (46); and the entropy functional is defined in (45). Alternatively, given two probability measures and , with probability mass functions and , respectively, the variation in (2) satisfies,
where the probability mass function of the measure , with and Q the counting measure, is defined in (48); and the entropy functional is defined in (47).
4. Characterizations of in (4)
The main result of this section is a characterization of in (4).
Theorem 2. Consider the variation in (4) and assume that for all , the probability measures and are both absolutely continuous with respect to a σ-finite measure Q. Then,
where the probability measure , with , is an -Gibbs conditional probability measure.
Proof. The proof follows from (4) and Theorem 1. □
Two special cases are particularly noteworthy. When the reference measure Q is the Lebesgue measure, both and in (51) become Shannon’s differential conditional entropies, denoted by and , respectively. That is, for all ,
where is the entropy functional in (45).
When the reference measure Q is the counting measure, both and in (51) become Shannon’s discrete conditional entropies, denoted by and , respectively. That is, for all ,
where is the entropy functional in (47).
These observations lead to the following corollary of Theorem 2.
Corollary 3. Consider the variation in (4) and assume that for all , the probability measures and possess probability density functions. Then,
where the probability density function of the measure , with and Q the Lebesgue measure, is defined in (46); and for all , the conditional entropy is defined in (52). Alternatively, assume that for all , the probability measures and possess probability mass functions. Then,
where the probability mass function of the measure , with and Q the counting measure, is defined in (48); and for all , the conditional entropy is defined in (53).
The general expression for the expected variation in (4) can be simplified according to Corollary 1. For instance, if for all , the probability measure is absolutely continuous with respect to , the measure can be chosen to be the reference measure in the calculation of , with the functional in (1). This observation leads to the following corollary of Theorem 2.
Corollary 4. Consider the variation in (4) and assume that for all , . Then,
Alternatively, if for all , the probability measure is absolutely continuous with respect to , then,
where the measures and are, respectively, - and , -Gibbs probability measures.
The Gibbs probability measures and in Corollary 4 are particularly interesting, as their reference measures depend on x. Gibbs measures of this form appear, for instance, in Corollary 10 in [7].
5. Characterizations of in (10)
The main result of this section is a characterization of in (10), which describes the variation of the expectation of the function h when the probability measure changes from the joint probability measure to the product of its marginals .
This result is presented hereunder and involves the mutual information and lautum information , defined as follows:
Theorem 3. Consider the expected variation in (10) and assume that, for all :
- 1. The probability measures and are both absolutely continuous with respect to a given σ-finite measure Q; and
- 2. The probability measures and are mutually absolutely continuous.
Then,
where the probability measure , with , is an -Gibbs conditional probability measure.
An alternative expression for in (10) involving only relative entropies is presented by the following theorem.
Theorem 4. Consider the expected variation in (10) and assume that, for all , the probability measure is absolutely continuous with respect to a given σ-finite measure Q. Then, it follows that
where , with , is an -Gibbs conditional probability measure.
Theorem 4 expresses the variation in (10) as the difference of two relative entropies. The former compares with , where are independently sampled from the same probability measure . The latter compares the same two conditional measures when both are conditioned on the same element of . That is, it compares with .
An interesting observation from Theorems 3 and 4 is that the last two terms on the right-hand side of (60) are both zero in the case in which is an -Gibbs conditional probability measure. Similarly, in such a case, the second term on the right-hand side of (61) is also zero. This observation is highlighted by the following corollary.
Corollary 5. Consider an -Gibbs conditional probability measure, denoted by , with ; and a probability measure . Let the measure be such that for all sets ,
Then,
Note that mutual information and lautum information are both nonnegative information measures, which, together with Corollary 5, implies that in (64) might be either positive or negative depending exclusively on the sign of the regularization factor . The following corollary exploits this observation to present a property of Gibbs conditional probability measures and their corresponding marginal probability measures.
Corollary 6. Given a probability measure , the -Gibbs conditional probability measure in (14) and the probability measure in (62) satisfy
or
Corollary 6 highlights the fact that a deviation from the joint probability measure to the product of its marginals might increase or decrease the expectation of the function h depending on the sign of .
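The discrete case makes the role of the two information measures transparent. In the sketch below, a hypothetical Gibbs conditional pmf with positive temperature λ (convention exp(-h/λ), reference pmf independent of x) is paired with a marginal pmf on x; the variation of the expectation of h from the product of the marginals to the joint measure is then compared with λ times the sum of the mutual and lautum information, which is the behavior described by Corollary 5 up to the paper's sign and temperature conventions.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

h = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 1.0]])                 # h(x, y)
px = np.array([0.4, 0.6])                       # marginal pmf on x
q = np.array([0.2, 0.3, 0.5])                   # reference pmf on y
lam = 0.7

# Gibbs conditional pmf of y given x (assumed convention exp(-h / lam)).
W = q[None, :] * np.exp(-h / lam)
K = W / W.sum(axis=1, keepdims=True)

P_joint = px[:, None] * K                       # joint pmf
py = P_joint.sum(axis=0)                        # marginal pmf on y
P_prod = px[:, None] * py[None, :]              # product of the marginals

mi = kl(P_joint.ravel(), P_prod.ravel())        # mutual information I
lautum = kl(P_prod.ravel(), P_joint.ravel())    # lautum information L

gap = (P_prod * h).sum() - (P_joint * h).sum()  # E_prod[h] - E_joint[h]
print(f"gap           = {gap:.6f}")
print(f"lam * (I + L) = {lam * (mi + lautum):.6f}")   # should coincide
```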
6. Examples
An immediate application of the results presented above is the analysis of the generalization error of machine-learning algorithms [10], which was the scenario in which these results were originally discovered. In the remainder of this section, those results are recovered as consequences of the more general statements presented in this work.
Let , and , with , be sets of models, patterns, and labels, respectively. A pair is referred to as a data point. A dataset is a tuple of n data points of the form:
Consider the function
where is the risk or loss induced by the model with respect to the data point .
Given a fixed dataset of the form in (67), consider also the functional
where the function is defined in (68). Using this notation, the empirical risk induced by the model with respect to the dataset is . The expectation of the empirical risk with respect to a fixed dataset when models are sampled from a probability measure is .
A machine-learning algorithm is represented by a conditional probability measure . The instance of such an algorithm generated by training it upon the dataset in (67) is represented by the probability measure . The generalization error induced by the algorithm is defined as follows.
Definition 4 (Generalization Error). The generalization error induced by the algorithm , under the assumption that training and test datasets are independently sampled from a probability measure , is denoted by , and
where the functionals and are defined in (69).
Often, the term is recognized to be the test error induced by the algorithm with respect to a test dataset when it is trained upon the dataset . Alternatively, the term is recognized to be the training error induced by the algorithm when it is trained upon the dataset . From this perspective, the generalization error in (70) is the expectation of the difference between the test error and the training error when the test and training datasets are independently sampled from the same probability measure . The key observation is that this generalization error can be written as a variation of the expectation of the empirical risk function in (68), as shown hereunder.
Lemma 5 (Lemma 3 in [10]). Consider the generalization error in (70) and assume that for all , the probability measure is absolutely continuous with respect to the probability measure , which satisfies for all measurable subsets of ,
Then,
where the functional and the function are defined in (3) and (68), respectively.
From Theorem 3 and Lemma 5, the following holds.
Theorem 5 (Theorem 14 in [10]). Consider the generalization error in (70) and assume that for all :
- (a) The probability measures in (71) and are both absolutely continuous with respect to some σ-finite measure ;
- (b) The measure Q is absolutely continuous with respect to ; and
- (c) The measure is absolutely continuous with respect to .
Then,
where the measure is an -Gibbs conditional probability measure, with the function defined in (68).
Theorem 5 shows one of many closed-form expressions that can be obtained for the generalization error in (70) in terms of information measures. A complete exposition of several equivalent alternative expressions, as well as a discussion of their relevance, is presented in [10].
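A toy numerical rendering of Lemma 5 and Theorem 5, under stated assumptions: datasets and models live in small finite sets, the algorithm is a Gibbs algorithm with posterior proportional to exp(-risk/λ) relative to a prior pmf (a hypothetical convention), and the generalization error is computed both directly, as expected test error minus expected training error, and through the information-measure expression λ(I + L) that holds for this algorithm under the assumed convention.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)

# Hypothetical finite setup: 3 possible datasets, 4 possible models.
risk = rng.uniform(0.0, 1.0, size=(3, 4))   # empirical risk of model w on dataset z
p_data = np.array([0.5, 0.3, 0.2])          # distribution of the dataset
prior = np.full(4, 0.25)                    # prior (reference) pmf on models
lam = 0.5

# Gibbs algorithm: posterior over models given the training dataset.
W = prior[None, :] * np.exp(-risk / lam)
post = W / W.sum(axis=1, keepdims=True)

P_joint = p_data[:, None] * post            # joint pmf of (dataset, model)
p_model = P_joint.sum(axis=0)               # marginal pmf of the model
P_prod = p_data[:, None] * p_model[None, :]

# Generalization error: expected test error minus expected training error;
# the test dataset is independent of the trained model (Lemma 5 viewpoint).
gen_direct = (P_prod * risk).sum() - (P_joint * risk).sum()

# Information-measure expression for the Gibbs algorithm: lam * (I + L).
mi = kl(P_joint.ravel(), P_prod.ravel())
lautum = kl(P_prod.ravel(), P_joint.ravel())
print(f"gen (direct)  = {gen_direct:.6f}")
print(f"lam * (I + L) = {lam * (mi + lautum):.6f}")
```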
The important observation in this example is that the measure in (73) is an -Gibbs conditional probability measure, which represents the celebrated Gibbs algorithm in statistical machine learning [40]. Thus, the term in (73) can be interpreted as a log-likelihood ratio in a hypothesis test in which the objective is to distinguish the probability measures and based on the observation of the model . The former represents the algorithm under study trained upon , whereas the latter represents a Gibbs algorithm trained upon the same dataset .
From this perspective, the difference between the last two terms in (73), i.e., can be interpreted as the variation of the expectation of the log-likelihood ratio when the probability measure from which the model and the dataset are drawn changes from the ground-truth distribution to the product of the corresponding marginals . As originally suggested in [10], Theorem 5 establishes an interesting connection between hypothesis testing, information measures, and generalization error. Nonetheless, this connection goes beyond this particular application in statistical machine learning, since it can be established directly from Theorem 3, thereby linking variations of expectations due to changes in the probability measure, information measures, and hypothesis testing.