1. Introduction
Let m be a positive integer and denote by the set of all probability measures on the measurable space , with being the Borel σ-algebra on . Given a Borel measurable function , consider the functional
The functional in (1) is defined when both integrals exist and are finite. Hence, is the variation of the expectation of the measurable function h due to a change in the probability measure from to . These changes in the probability measure are often referred to as probability distribution drifts in some application areas. See, for instance, [1,2,3] and references therein.
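For intuition, in the finite (counting-measure) case the functional in (1) is a weighted average and the variation in (2) is a difference of two such averages. The following minimal Python sketch, with hypothetical vectors h, p1, and p2 that do not appear in the paper, only illustrates the quantity under study; the sign convention is assumed.

```python
import numpy as np

# Hypothetical finite setting: h is a function on {0, 1, 2, 3},
# and p1, p2 are two probability measures (pmfs) on that set.
h = np.array([1.0, 2.0, 0.5, 3.0])
p1 = np.array([0.1, 0.4, 0.3, 0.2])
p2 = np.array([0.25, 0.25, 0.25, 0.25])

# Variation of the expectation of h when the measure changes from p2 to p1,
# in the spirit of the functional in (1)-(2) (sign convention assumed here).
variation = h @ p1 - h @ p2
print(f"E_p1[h] - E_p2[h] = {variation:.4f}")
```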
In order to define the expectation of in (2) when x is obtained by sampling a probability measure in , the structure formalized below is required.
Definition 1. A family of elements of indexed by is said to be a conditional probability measure if, for all sets , the map is Borel measurable. The set of all such conditional probability measures is denoted by . In this setting, consider the functional
Hence, where the functional is defined in (1), is the variation of the integral (expectation) of the function h when the probability measure changes from the joint probability measure to another joint probability measure , both in .
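In the finite case, a conditional probability measure in the sense of Definition 1 can be represented by a row-stochastic matrix, and the expected variation in (3)-(4) is a weighted sum of per-x variations. The sketch below uses hypothetical kernels K1 and K2 and a marginal px, none of which come from the paper; the sign convention is again assumed.

```python
import numpy as np

h = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 1.0]])        # h(x, y) on a 2 x 3 alphabet
K1 = np.array([[0.2, 0.5, 0.3],
               [0.6, 0.1, 0.3]])       # conditional pmf P1(y | x), row-stochastic
K2 = np.array([[1/3, 1/3, 1/3],
               [1/3, 1/3, 1/3]])       # conditional pmf P2(y | x)
px = np.array([0.4, 0.6])              # probability measure on x

# Per-x variation of the expectation of h(x, .), averaged over x,
# mimicking the expected variation in (4).
per_x = (K1 * h).sum(axis=1) - (K2 * h).sum(axis=1)
print(f"expected variation = {px @ per_x:.4f}")
```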
Special attention is given to the quantity , for some , with being the marginal on of the joint probability measure on . That is, for all sets ,
The relevance of the quantity stems from the fact that it captures the variation of the expectation of the function h when the probability measure changes from the joint probability measure to the product of its marginals . That is, where the functional is defined in (1).
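A finite-alphabet sketch of this particular quantity: build a joint pmf from a marginal and a conditional kernel, form the product of its marginals, and compare the two expectations of h. All names below (px, K, h) are hypothetical illustration choices rather than objects defined in the paper.

```python
import numpy as np

h = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 1.0]])        # h(x, y)
px = np.array([0.4, 0.6])              # marginal pmf on x
K = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3]])        # conditional pmf of y given x

P_joint = px[:, None] * K              # joint pmf P(x, y)
py = P_joint.sum(axis=0)               # marginal pmf on y
P_prod = px[:, None] * py[None, :]     # product of the marginals

# Variation of E[h] when the measure changes from the joint to the
# product of its marginals (the quantity in (10), up to sign convention).
print(f"E_joint[h] - E_product[h] = {(P_joint * h).sum() - (P_prod * h).sum():.4f}")
```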
1.1. Novelty and Contributions
This work makes two key contributions: First, it provides a closed-form expression for the variation in (2) for a fixed and two arbitrary probability measures and , formulated explicitly in terms of relative entropies. Second, it derives a closed-form expression for the expected variation in (4), again in terms of information measures, for arbitrary conditional probability measures , , and an arbitrary probability measure .
A further contribution of this work is the derivation of specific closed-form expressions for in (10), which reveal deep connections with both mutual information [4] and lautum information [5]. Notably, when is a Gibbs conditional probability measure, this variation simplifies (up to a constant factor) to the sum of the mutual and lautum information induced by the joint distribution .
These results were originally discovered in the analysis of the generalization error of machine-learning algorithms; see, for instance, [6,7,8,9,10]. Therein, the function h in (2) was assumed to represent an empirical risk. This paper presents these results in a comprehensive and general setting that is no longer tied to that assumption. Strong connections with information projections and Pythagorean identities [11,12] are also discussed. This general presentation not only unifies previously scattered insights but also makes the results applicable across a broad range of domains in which probability distribution shifts are relevant.
1.2. Applications
The study of the variation of the integral (expectation) of h (for some fixed ) due to a measure change from to , i.e., the value in (2), plays a central role in the definition of integral probability metrics (IPMs) [13,14]. Using the notation in (2), an IPM results from the optimization problem for some fixed and a particular class of functions . Note, for instance, that the maximum mean discrepancy [15] and the Wasserstein distance of order one [16,17,18,19] are both IPMs.
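To make the IPM optimization concrete, the sketch below solves it for the simplest hypothetical function class, namely functions bounded by one on a finite alphabet; in that case the supremum is attained by the sign of the difference of the two pmfs and equals twice the total variation distance. This is a standard textbook instance, not a construction taken from the paper.

```python
import numpy as np

p1 = np.array([0.1, 0.4, 0.3, 0.2])
p2 = np.array([0.25, 0.25, 0.25, 0.25])

# IPM over the class {f : |f| <= 1} on a finite alphabet:
# sup_f E_p1[f] - E_p2[f] is attained at f = sign(p1 - p2).
f_opt = np.sign(p1 - p2)
ipm = f_opt @ (p1 - p2)

# Sanity check: this equals twice the total variation distance.
tv = 0.5 * np.abs(p1 - p2).sum()
print(f"IPM = {ipm:.4f}, 2 * TV = {2 * tv:.4f}")
```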
Other areas of mathematics in which the variation in (2) plays a key role are distributionally robust optimization (DRO) [20,21] and optimization with relative entropy regularization [7,8]. See, for instance, [6,22] and references therein.
Variations of the form in (2) have also been studied in [9,10] in the particular case of statistical machine learning, for the analysis of the generalization error. The central observation is that the generalization error of machine-learning algorithms can be written in the form in (10). This observation is the main building block of the method of gaps introduced in [10], which leads to a number of closed-form expressions for the generalization error involving mutual information and lautum information, among other information measures.
2. Preliminaries
The main results presented in this work involve Gibbs conditional probability measures. Such measures are parametrized by a Borel measurable function ; a σ-finite measure Q on ; and a real . Often, the measure Q is called the reference measure, and the real parameter is called the temperature parameter. The use of a σ-finite reference measure was first introduced in [7]. This class of measures includes the Lebesgue measure and the counting measure, enabling a unified treatment of random variables with either probability density functions or probability mass functions. The standard case in which the reference measure is a probability measure representing a prior is also covered. Finally, note that the variable x remains inactive until Section 4; it is introduced now for consistency of notation.
Consider the following function:
Under the assumption that Q is a probability measure, the function in (12) is the cumulant generating function of the random variable , for some fixed and . Using this notation, the definition of the Gibbs conditional probability measure is presented hereunder.
Definition 2 (Gibbs Conditional Probability Measure). Given a Borel measurable function ; a σ-finite measure Q on ; and a , the probability measure is said to be an -Gibbs conditional probability measure if for some set ; and for all , where the function is defined in (12); the set is the support of the σ-finite measure Q; and the function is the Radon-Nikodym derivative [23,24] of the probability measure with respect to Q. Note that, while is an -Gibbs conditional probability measure, the measure , obtained by conditioning it upon a given vector , is referred to as an -Gibbs probability measure.
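As a concrete, purely illustrative instance of Definition 2, take the counting measure on a finite set as the reference measure: the resulting Gibbs probability measure has a pmf proportional to an exponential of the loss. The sketch below assumes the convention exp(-h/λ) with a positive temperature λ; the paper's exact parameterization may differ in sign or scaling.

```python
import numpy as np

def gibbs_pmf(h_vals, q_pmf, lam):
    """Gibbs pmf with respect to a reference pmf q_pmf (counting-measure case),
    assuming the convention dP/dQ proportional to exp(-h / lam)."""
    w = q_pmf * np.exp(-h_vals / lam)
    return w / w.sum()

# Hypothetical loss h(x, y) for a fixed x, over four candidate values of y.
h_vals = np.array([1.0, 2.0, 0.5, 3.0])
q_pmf = np.full(4, 0.25)               # uniform reference measure
lam = 0.7                              # temperature parameter

p_gibbs = gibbs_pmf(h_vals, q_pmf, lam)
print("Gibbs pmf:", np.round(p_gibbs, 4), "| sums to", p_gibbs.sum())
```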
The condition in (13) is easily met under certain assumptions. For instance, if h is a nonnegative function and Q is a finite measure, then it holds for all . Let , with standing for “P absolutely continuous with respect to Q”. The relevance of -Gibbs probability measures stems from the fact that, under certain conditions, they are the unique solutions to problems of the form, where , , and denotes the relative entropy (or KL divergence) of P with respect to Q.
Definition 3 (Relative Entropy). Given two σ-finite measures P and Q on the same measurable space, such that P is absolutely continuous with respect to Q, the relative entropy of P with respect to Q is where the function is the Radon-Nikodym derivative of P with respect to Q.
The key observation is that when , the objective function in (15) is convex in P. Alternatively, when , the objective function in (16) is concave. The connection between the optimization problems (15) and (16) and the Gibbs probability measure in (14) has been pointed out by several authors; see, for instance, Theorem 3 in [7] and [6,25,26,27,28,29,30,31,32,33] for the former, and Theorem 1 in [9], together with [34,35,36], for the latter. In these references, a variety of assumptions and proof techniques have been used to highlight such connections. A general and unified statement of these observations is presented hereunder.
Lemma 1. Assume that the optimization problem in (15) (respectively, in (16)) admits a solution. Then, if (respectively, if ), the probability measure in (14) is the unique solution.
Proof. For the case in which , the proof follows the same approach as the proof of Theorem 3 in [7]. Alternatively, for the case in which , the proof follows along the lines of the proof of Theorem 1 in [9]. □
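Lemma 1 can be probed numerically in the finite case: for a positive temperature, the Gibbs pmf should attain the smallest value of the relative-entropy-regularized objective among arbitrary candidate pmfs. The sketch below assumes the objective E_P[h] + λ D(P||Q) and the convention exp(-h/λ), which may differ from the paper's exact parameterization of (15).

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Relative entropy D(p || q) for strictly positive finite pmfs."""
    return float(np.sum(p * np.log(p / q)))

h_vals = np.array([1.0, 2.0, 0.5, 3.0])
q_pmf = np.full(4, 0.25)
lam = 0.7

def objective(p):
    # Free-energy-type objective: expectation of h plus scaled relative entropy.
    return float(p @ h_vals) + lam * kl(p, q_pmf)

p_gibbs = q_pmf * np.exp(-h_vals / lam)
p_gibbs /= p_gibbs.sum()

# The Gibbs pmf should not be beaten by random candidate pmfs (Lemma 1, lam > 0).
candidates = rng.dirichlet(np.ones(4), size=1000)
print(f"objective at the Gibbs pmf : {objective(p_gibbs):.6f}")
print(f"best over 1000 random pmfs : {min(objective(p) for p in candidates):.6f}")
```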
The following lemma highlights a key property of -Gibbs probability measures.
Lemma 2. Given an -Gibbs probability measure, denoted by , with , moreover, if , alternatively, if , where the function is defined in (12).
Proof. The proof of (19) follows from taking the logarithm of both sides of (14) and integrating with respect to . As for the proof of (18), it follows by noticing that for all , the Radon-Nikodym derivative in (14) is strictly positive. Thus, from Theorem 5 in [37], it holds that . Hence, taking the negative logarithm on both sides of (14) and integrating with respect to Q leads to (18). Finally, the equalities in (20) and (21) follow from Lemma 1 and (19). □
The equalities (19)–(21) in Lemma 2 can be seen as an immediate restatement of the Donsker–Varadhan variational representation of the relative entropy [38]. Alternative interesting proofs of (18) have been presented by several authors, including [9,33]. A proof of (19) appears in [6] (Lemma 3), in the specific case of .
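In the finite case, and under the assumed convention exp(-h/λ) with a probability reference measure Q and λ > 0, the relations in Lemma 2 reduce to an identity tying the log-partition function, the expectation of h under the Gibbs pmf, and the relative entropy to Q, in the spirit of the Donsker–Varadhan representation. A hypothetical numerical check:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

h_vals = np.array([1.0, 2.0, 0.5, 3.0])
q_pmf = np.full(4, 0.25)
lam = 0.7

Z = float(q_pmf @ np.exp(-h_vals / lam))      # partition function w.r.t. Q
p_gibbs = q_pmf * np.exp(-h_vals / lam) / Z

# Under the assumed convention: E_P*[h] + lam * D(P* || Q) = -lam * log Z,
# a finite-alphabet shadow of the identities stated in Lemma 2.
lhs = float(p_gibbs @ h_vals) + lam * kl(p_gibbs, q_pmf)
print(f"lhs = {lhs:.6f}, rhs = {-lam * np.log(Z):.6f}")
```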
The following lemma introduces the main building block of this work, which is a characterization of the variation of the expectation of the function when the probability measure changes from the probability measure in (14) to an arbitrary measure , i.e., , for some fixed . Such a result appeared for the first time in [6] (Theorem 1) for the case in which ; and in [9] (Theorem 6) for the case in which , in different contexts of statistical machine learning. A general and unified statement of such results is presented hereunder.
Lemma 3. Consider an -Gibbs probability measure, denoted by , with and . For all ,
Proof. The proof follows along the lines of the proofs of Theorem 1 in [6] and Theorem 6 in [9] for the cases in which and , respectively. A unified proof is presented hereunder by noticing that for all , where (24) follows from Theorem 4 in [37]; (26) follows from Theorem 5 in [37] and (14); and (27) follows from (19). □
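Lemma 3 can be checked numerically in the counting-measure case. Under the assumed convention dP*/dQ proportional to exp(-h/λ) with λ > 0, the identity tested below reads E_P[h] - E_P*[h] = λ ( D(P||P*) + D(P*||Q) - D(P||Q) ); the equation in (22) may use a different sign or temperature convention, so this sketch illustrates the structure of the result rather than a literal transcription.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

h_vals = np.array([1.0, 2.0, 0.5, 3.0])
q_pmf = np.full(4, 0.25)
lam = 0.7

p_star = q_pmf * np.exp(-h_vals / lam)
p_star /= p_star.sum()                        # Gibbs pmf

p = rng.dirichlet(np.ones(4))                 # an arbitrary pmf with p << Q

lhs = float(p @ h_vals - p_star @ h_vals)
rhs = lam * (kl(p, p_star) + kl(p_star, q_pmf) - kl(p, q_pmf))
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")    # the two should coincide
```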
It is interesting to highlight that in (22) characterizes the variation of the expectation of the function , when (respectively, when ) and the probability measure changes from the solution to the optimization problem in (15) (respectively, in (16)) to an alternative measure P. This result gains another perspective when seen in the context of information projections [12]. Let Q be a probability measure and be a convex set. From Theorem 1 in [12], it holds that for all measures , where satisfies
In the particular case in which the set in (29) satisfies for some real c, with the vector x and the function h defined in Lemma 3, the optimal measure in (29) is the Gibbs probability measure in (14), with chosen to satisfy
The case in which the measure Q in (29) is a σ-finite measure, for instance, either the Lebesgue measure or the counting measure, leads to the classical frameworks of differential entropy and discrete entropy maximization, respectively, which have been studied under particular assumptions on the set in [34,35,36].
When the reference measure Q is a probability measure, under the assumption that (31) holds, it follows from Theorem 3 in [12] that for all , with in (30), which is known as the Pythagorean theorem for relative entropy. Such a geometric interpretation follows from regarding relative entropy as an analog of the squared Euclidean distance. The first appearance of such a “Pythagorean theorem” was in [11], and it was later revisited in [12]. Interestingly, the same result can be obtained from Lemma 3 by noticing that for all , with in (30),
The converse of the Pythagorean theorem, e.g., Proposition 48 in [39], together with Lemma 3, leads to the geometric construction shown in Figure 1. A similar interpretation was also presented in [10] in the context of the generalization error of machine-learning algorithms. Nonetheless, the interpretation in Figure 1 is general and independent of that application.
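The Pythagorean identity can be illustrated on a finite alphabet: project a reference pmf Q onto the linear family of pmfs with a prescribed mean of h (the projection is a Gibbs pmf), pick another pmf in the same family, and check that the relative entropy to Q splits additively. All numerical choices below are hypothetical, and the exp(-h/λ) convention is assumed as before.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

h = np.array([1.0, 2.0, 0.5, 3.0])
q = np.full(4, 0.25)
lam = 0.7

p_star = q * np.exp(-h / lam)
p_star /= p_star.sum()                 # I-projection of Q onto {P : E_P[h] = c}

# Another pmf with the same mean of h: perturb p_star along a direction v
# with sum(v) = 0 and v . h = 0, so that E_P[h] remains equal to E_P*[h].
v = np.array([h[1] - h[2], h[2] - h[0], h[0] - h[1], 0.0])
p = p_star + 0.01 * v
assert np.all(p > 0) and np.isclose(p @ h, p_star @ h)

# Pythagorean identity for the linear family: D(P||Q) = D(P||P*) + D(P*||Q).
print(f"D(P||Q)             = {kl(p, q):.6f}")
print(f"D(P||P*) + D(P*||Q) = {kl(p, p_star) + kl(p_star, q):.6f}")
```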
The relevance of Lemma 3 in the context of information projections follows from the fact that Q might be a σ-finite measure. The class of σ-finite measures includes the class of probability measures, and thus Lemma 3 unifies results separately obtained in the realm of maximum entropy methods and information-projection methods.
The following lemma highlights that -Gibbs conditional probability measures are related to another class of optimization problems.
Lemma 4. Assume that the following optimization problems possess at least one solution for some , and
Consider the -Gibbs probability measure in (14), with such that . Then, the -Gibbs probability measure is a solution to (34) if ; or to (35) if .
Proof. Note that if , then, . Hence, from Lemma 3, it holds that for all probability measures P such that , with equality if . This implies that is a solution to (34). Note also that if , from Lemma 3, it holds that for all probability measures P such that , with equality if . This implies that is a solution to (35). □
3. Characterization of in (2)
The main result of this section is the following theorem.
Theorem 1. For all probability measures and , both absolutely continuous with respect to a given σ-finite measure Q on , the variation in (2) satisfies,
where the probability measure , with , is an -Gibbs probability measure.
Proof. The proof follows from Lemma 3 by observing that
which completes the proof. □
Theorem 1 simplifies considerably when the reference measure Q is a probability measure. Consider, for instance, the case in which (or ). In such a case, the reference measure can be chosen as (or ), as shown hereunder.
Corollary 1. Consider the variation in (2). If the probability measure is absolutely continuous with respect to , then,
Alternatively, if the probability measure is absolutely continuous with respect to , then,
where the probability measures and are, respectively, - and -Gibbs probability measures, with .
In the case in which neither is absolutely continuous with respect to , nor is absolutely continuous with respect to , the reference measure Q in Theorem 1 can always be chosen as a convex combination of and . That is, for all Borel sets , , with .
Theorem 1 can be specialized to the cases in which Q is either the Lebesgue measure or the counting measure.
If Q is the Lebesgue measure, then the probability measures and in (42) admit probability density functions and , respectively. Moreover, the terms and are Shannon’s differential entropies [4] induced by and , denoted by and , respectively. That is, for all ,
The probability measure , with , , and Q the Lebesgue measure, possesses a probability density function, denoted by , which satisfies
If Q is the counting measure, then the probability measures and in (42) admit probability mass functions and , with a countable subset of . Moreover, and are Shannon’s discrete entropies [4] induced by and , denoted by and , respectively. That is, for all ,
The probability measure , with and Q the counting measure, possesses a probability mass function, denoted by , which satisfies
These observations lead to the following corollary of Theorem 1.
Corollary 2. Given two probability measures and , with probability density functions and , respectively, the variation in (2) satisfies,
where the probability density function of the measure , with and Q the Lebesgue measure, is defined in (46); and the entropy functional is defined in (45). Alternatively, given two probability measures and , with probability mass functions and , respectively, the variation in (2) satisfies,
where the probability mass function of the measure , with and Q the counting measure, is defined in (48); and the entropy functional is defined in (47).
4. Characterizations of in (4)
The main result of this section is a characterization of in (4).
Theorem 2. Consider the variation in (4) and assume that for all , the probability measures and are both absolutely continuous with respect to a σ-finite measure Q. Then,
where the probability measure , with , is an -Gibbs conditional probability measure.
Proof. The proof follows from (4) and Theorem 1. □
Two special cases are particularly noteworthy. When the reference measure Q is the Lebesgue measure, both and in (51) become Shannon’s differential conditional entropies, denoted by and , respectively. That is, for all ,
where is the entropy functional in (45).
When the reference measure Q is the counting measure, both and in (51) become Shannon’s discrete conditional entropies, denoted by and , respectively. That is, for all ,
where is the entropy functional in (47).
These observations lead to the following corollary of Theorem 2.
Corollary 3. Consider the variation in (4) and assume that for all , the probability measures and possess probability density functions. Then,
where the probability density function of the measure , with and Q the Lebesgue measure, is defined in (46); and for all , the conditional entropy is defined in (52). Alternatively, assume that for all , the probability measures and possess probability mass functions. Then,
where the probability mass function of the measure , with and Q the counting measure, is defined in (48); and for all , the conditional entropy is defined in (53).
The general expression for the expected variation in (4) can be simplified according to Corollary 1. For instance, if for all , the probability measure is absolutely continuous with respect to , the measure can be chosen to be the reference measure in the calculation of , with the functional in (1). This observation leads to the following corollary of Theorem 2.
Corollary 4. Consider the variation in (4) and assume that for all , . Then,
Alternatively, if for all , the probability measure is absolutely continuous with respect to , then,
where the measures and are, respectively, - and , -Gibbs probability measures.
The Gibbs probability measures and in Corollary 4 are particularly interesting, as their reference measures depend on x. Gibbs measures of this form appear, for instance, in Corollary 10 in [7].
5. Characterizations of in (10)
The main result of this section is a characterization of in (10), which describes the variation of the expectation of the function h when the probability measure changes from the joint probability measure to the product of its marginals .
This result is presented hereunder and involves the mutual information and lautum information , defined as follows:
Theorem 3. Consider the expected variation in (10) and assume that, for all :
- 1. The probability measures and are both absolutely continuous with respect to a given σ-finite measure Q; and
- 2. The probability measures and are mutually absolutely continuous.
Then,
where the probability measure , with , is an -Gibbs conditional probability measure.
An alternative expression for in (10) involving only relative entropies is presented by the following theorem.
Theorem 4. Consider the expected variation in (10) and assume that, for all , the probability measure is absolutely continuous with respect to a given σ-finite measure Q. Then, it follows that
where , with , is an -Gibbs conditional probability measure.
Theorem 4 expresses the variation in (10) as the difference of two relative entropies. The former compares with , where are independently sampled from the same probability measure . The latter compares the same two conditional measures when both are conditioned on the same element of . That is, it compares with .
An interesting observation from Theorems 3 and 4 is that the last two terms on the right-hand side of (60) are both zero in the case in which is an -Gibbs conditional probability measure. Similarly, in such a case, the second term on the right-hand side of (61) is also zero. This observation is highlighted by the following corollary.
Corollary 5. Consider an -Gibbs conditional probability measure, denoted by , with ; and a probability measure . Let the measure be such that for all sets ,
Then,
Note that mutual information and lautum information are both nonnegative information measures, which, together with Corollary 5, implies that in (64) might be either positive or negative depending exclusively on the sign of the regularization factor . The following corollary exploits this observation to present a property of Gibbs conditional probability measures and their corresponding marginal probability measures.
Corollary 6. Given a probability measure , the -Gibbs conditional probability measure in (14) and the probability measure in (62) satisfy
or
Corollary 6 highlights the fact that a deviation from the joint probability measure to the product of its marginals might increase or decrease the expectation of the function h depending on the sign of .
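The discrete case makes the role of the two information measures transparent. In the sketch below, a hypothetical Gibbs conditional pmf with positive temperature λ (convention exp(-h/λ), reference pmf independent of x) is paired with a marginal pmf on x; the variation of the expectation of h from the product of the marginals to the joint measure is then compared with λ times the sum of the mutual and lautum information, which is the behavior described by Corollary 5 up to the paper's sign and temperature conventions.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

h = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.5, 1.0]])                 # h(x, y)
px = np.array([0.4, 0.6])                       # marginal pmf on x
q = np.array([0.2, 0.3, 0.5])                   # reference pmf on y
lam = 0.7

# Gibbs conditional pmf of y given x (assumed convention exp(-h / lam)).
W = q[None, :] * np.exp(-h / lam)
K = W / W.sum(axis=1, keepdims=True)

P_joint = px[:, None] * K                       # joint pmf
py = P_joint.sum(axis=0)                        # marginal pmf on y
P_prod = px[:, None] * py[None, :]              # product of the marginals

mi = kl(P_joint.ravel(), P_prod.ravel())        # mutual information I
lautum = kl(P_prod.ravel(), P_joint.ravel())    # lautum information L

gap = (P_prod * h).sum() - (P_joint * h).sum()  # E_prod[h] - E_joint[h]
print(f"gap           = {gap:.6f}")
print(f"lam * (I + L) = {lam * (mi + lautum):.6f}")   # should coincide
```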
6. Examples
An immediate application of the results presented above is the analysis of the generalization error of machine-learning algorithms [10], which was the scenario in which these results were originally discovered. In the remainder of this section, those results are recovered as consequences of the more general statements presented in this work.
Let , and , with , be sets of models, patterns, and labels, respectively. A pair is referred to as a data point. A dataset is a tuple of n data points of the form:
Consider the function
where is the risk or loss induced by the model with respect to the data point .
Given a fixed dataset of the form in (67), consider also the functional
where the function is defined in (68). Using this notation, the empirical risk induced by the model with respect to the dataset is . The expectation of the empirical risk with respect to a fixed dataset when models are sampled from a probability measure is .
A machine-learning algorithm is represented by a conditional probability measure . The instance of such an algorithm generated by training it upon the dataset in (67) is represented by the probability measure . The generalization error induced by the algorithm is defined as follows.
Definition 4 (Generalization Error). The generalization error induced by the algorithm , under the assumption that training and test datasets are independently sampled from a probability measure , is denoted by , and
where the functionals and are defined in (69).
Often, the term is recognized to be the test error induced by the algorithm with respect to a test dataset when it is trained upon the dataset . Alternatively, the term is recognized to be the training error induced by the algorithm when it is trained upon the dataset . From this perspective, the generalization error in (70) is the expectation of the difference between the test error and the training error when the test and training datasets are independently sampled from the same probability measure . The key observation is that this generalization error can be written as a variation of the expectation of the empirical risk function in (68), as shown hereunder.
Lemma 5 (Lemma 3 in [10]). Consider the generalization error in (70) and assume that for all , the probability measure is absolutely continuous with respect to the probability measure , which satisfies for all measurable subsets of ,
Then,
where the functional and the function are defined in (3) and (68), respectively.
From Theorem 3 and Lemma 5, the following holds.
Theorem 5 (Theorem 14 in [10]). Consider the generalization error in (70) and assume that for all :
- (a) The probability measures in (71) and are both absolutely continuous with respect to some σ-finite measure ;
- (b) The measure Q is absolutely continuous with respect to ; and
- (c) The measure is absolutely continuous with respect to .
Then,
where the measure is an -Gibbs conditional probability measure, with the function defined in (68).
Theorem 5 shows one of many closed-form expressions that can be obtained for the generalization error in (70) in terms of information measures. A complete exposition of several equivalent alternative expressions, as well as a discussion of their relevance, is presented in [10].
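A toy numerical rendering of Lemma 5 and Theorem 5, under stated assumptions: datasets and models live in small finite sets, the algorithm is a Gibbs algorithm with posterior proportional to exp(-risk/λ) relative to a prior pmf (a hypothetical convention), and the generalization error is computed both directly, as expected test error minus expected training error, and through the information-measure expression λ(I + L) that holds for this algorithm under the assumed convention.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(2)

# Hypothetical finite setup: 3 possible datasets, 4 possible models.
risk = rng.uniform(0.0, 1.0, size=(3, 4))   # empirical risk of model w on dataset z
p_data = np.array([0.5, 0.3, 0.2])          # distribution of the dataset
prior = np.full(4, 0.25)                    # prior (reference) pmf on models
lam = 0.5

# Gibbs algorithm: posterior over models given the training dataset.
W = prior[None, :] * np.exp(-risk / lam)
post = W / W.sum(axis=1, keepdims=True)

P_joint = p_data[:, None] * post            # joint pmf of (dataset, model)
p_model = P_joint.sum(axis=0)               # marginal pmf of the model
P_prod = p_data[:, None] * p_model[None, :]

# Generalization error: expected test error minus expected training error;
# the test dataset is independent of the trained model (Lemma 5 viewpoint).
gen_direct = (P_prod * risk).sum() - (P_joint * risk).sum()

# Information-measure expression for the Gibbs algorithm: lam * (I + L).
mi = kl(P_joint.ravel(), P_prod.ravel())
lautum = kl(P_prod.ravel(), P_joint.ravel())
print(f"gen (direct)  = {gen_direct:.6f}")
print(f"lam * (I + L) = {lam * (mi + lautum):.6f}")
```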
The important observation in this example is that the measure in (73) is an -Gibbs conditional probability measure, which represents the celebrated Gibbs algorithm in statistical machine learning [40]. Thus, the term in (73) can be interpreted as a log-likelihood ratio in a hypothesis test in which the objective is to distinguish the probability measures and based on the observation of the model . The former represents the algorithm under study trained upon , whereas the latter represents a Gibbs algorithm trained upon the same dataset .
From this perspective, the difference between the last two terms in (73), i.e., can be interpreted as the variation of the expectation of the log-likelihood ratio when the probability measure from which the model and the dataset are drawn changes from the ground-truth distribution to the product of the corresponding marginals . As originally suggested in [10], Theorem 5 establishes an interesting connection between hypothesis testing, information measures, and generalization error. Nonetheless, this connection goes beyond this particular application in statistical machine learning, since it can be established directly from Theorem 3, thereby linking variations of expectations due to changes in the probability measure, information measures, and hypothesis testing.